3 DB2 Critical Design Factors for Big Data Analytics Scalability

Faster, bigger, and better analytical insights. These are the goals management lays out as the new big data analytics process is discussed during the startup meetings. Looking around the room and listening to the CIO’s comments, it becomes abundantly clear that they know only the big data salesmen’s marketing-material version, not the hard work necessary for success. The reality is that you and your colleagues are going to be stuck designing their pet big data analytics project. Remember these three critical DB2 design factors, and your system and analytic applications will have a chance at scalability.

  1. Design separation of active data.
    Any application, no matter how big or small, uses data within its analytics. With today’s big data analytical projects, the working set of the data continues to expand as customer buying habits are compared against national habits to capture an extra 2% marketing response or purchase rate. Billions of clicks across the Internet are continually evaluated to see what is trending, who is polling better, and which wedge topics make you want to support one candidate over another. The topic analytics become more relevant, important, and valuable as the hypothesis is verified against more data.

    Big data within a DB2 table space definition requires a good partitioning scheme to separate the new data from the old. Separating active data from old data is usually a good first design criterion because it naturally segregates the data needed for the analytics from other, previously saved data.

    Separating the new and old data through temporal or range-based table space partition designs helps limit the working data set size. The partitioning design can also spread a large table over different DB2 storage groups, a large number of data sets, and many different storage devices. This is especially beneficial when specialized storage groups are used to increase the I/O rate and place the most recent data on the fastest volumes. The separation also helps optimize buffer pool memory allocations so that as much memory as possible is dedicated to the working analytics data set. By separating storage devices, storage groups, and buffer pool allocations, your design can improve I/O efficiency and overall elapsed processing time dramatically; a DDL sketch of this approach appears after this list.

  2. Align the application flow with the clustering direction.
    DB2 table space partitioning provides big data definition flexibility because the partition ranges can be cut on the values of one or more columns while clustering is aligned with a different column or columns. This bi-directional flexibility can be tailored to any big data situation, dividing the data in two directions for quick retrieval.

    A good example of this type of bi-directional data split is an old phone book. In a phone directory, the listings are split by city or area code and then ordered by last name so you can easily find your person of interest. While some millennials may never have used a phone book, the example shows how separating your big data bi-directionally, by city and then by last name, focuses your search more quickly in any big data situation.

    Big data table space partitioning ranges are also ideal for parallelism, since any insert, update, or delete locking is isolated within a partition. Big data performance can be improved exponentially through tens, hundreds, or thousands of parallel processing flows across the partition ranges within the database design.

    By syncing the clustering direction with the application processing flow, all the I/O and buffer pool memory can be leveraged for the active application data. As table space pages are brought into the buffer pool, the clustering brings in the next group of data pages the application will need. This clustered data sequence is leveraged further as sequential prefetch brings in a block of pages with a single I/O, dramatically improving system performance and elapsed processing time for your application; the phone book design is sketched in DDL after this list.

  3. Minimize all data movement for application I/O efficiency.
    With big data analytics, your processes are most likely to be I/O bound. The database and overall performance of every aspect of your system and application, from data delivery to data deletion, needs to be designed to minimize data movement. The way data is added to the system should be as streamlined and complete as possible. Minimize the big data I/O by avoiding a staging area and validating and inserting your data into the database in a single process. A single validate-and-insert step can avoid millions or even billions of I/Os on a daily or weekly basis and can save a huge amount of storage system work.

    Avoid empty column data when the row is initially inserted into the database table; in particular, avoid empty large VARCHAR or XML columns within your data rows. All columns are best filled in at insert time, because a column that is empty and filled in later can cause the row to expand and be relocated to another database page when updated. This row relocation is highly disruptive, adding relocation overhead, and it can also disorganize your database tables. Trust me, no one ever wants to have to reorg a table because too many rows were relocated. If your analytics processing doesn’t know a value before insert, fill in the column anyway with a default of the same size. This keeps the row on its original DB2 page and helps eliminate relocating rows after update processing; the third sketch after this list shows one way to define such defaults.
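
To make the first factor concrete, here is a minimal DDL sketch of separating active from old data with range partitions and tiered storage groups. All of the names (DBCLICK, TSCLICK, CLICK_FACT, FASTSG, SLOWSG, BP2) are hypothetical, and the exact clauses vary by DB2 version and environment, so treat this as an illustration rather than production DDL.

    -- Hypothetical storage groups: FASTSG maps to faster volumes,
    -- SLOWSG to older, slower volumes.
    CREATE STOGROUP FASTSG VOLUMES ('*') VCAT DSNCAT;
    CREATE STOGROUP SLOWSG VOLUMES ('*') VCAT DSNCAT;

    -- Each partition of the table space can use its own storage
    -- group, so the most recent partitions land on the fast volumes.
    CREATE TABLESPACE TSCLICK IN DBCLICK
      NUMPARTS 4
      (PARTITION 1 USING STOGROUP SLOWSG,
       PARTITION 2 USING STOGROUP SLOWSG,
       PARTITION 3 USING STOGROUP FASTSG,
       PARTITION 4 USING STOGROUP FASTSG)
      BUFFERPOOL BP2;

    -- Range-partitioning on the date column separates the active
    -- data from the previously saved data.
    CREATE TABLE CLICK_FACT
      (CLICK_DATE DATE    NOT NULL,
       CUST_ID    BIGINT  NOT NULL,
       PAGE_ID    INTEGER NOT NULL)
      IN DBCLICK.TSCLICK
      PARTITION BY RANGE (CLICK_DATE)
       (PARTITION 1 ENDING AT ('2015-06-30'),
        PARTITION 2 ENDING AT ('2015-12-31'),
        PARTITION 3 ENDING AT ('2016-06-30'),
        PARTITION 4 ENDING AT ('2016-12-31'));

Because the table space names its own buffer pool, the pages of the active partitions are less likely to be pushed out of memory by scans of historical data.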
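
The second factor, bi-directional partitioning and clustering, can be sketched with the phone book example itself. The table and index names are made up and the area code ranges are arbitrary; the point is that the partitioning column and the clustering index run in different directions.

    -- Partitioned in one direction: by area code.
    CREATE TABLE PHONE_DIR
      (AREA_CODE  CHAR(3)     NOT NULL,
       LAST_NAME  VARCHAR(40) NOT NULL,
       FIRST_NAME VARCHAR(40) NOT NULL,
       PHONE_NBR  CHAR(10)    NOT NULL)
      PARTITION BY RANGE (AREA_CODE)
       (PARTITION 1 ENDING AT ('399'),
        PARTITION 2 ENDING AT ('699'),
        PARTITION 3 ENDING AT ('999'));

    -- Clustered in the other direction: by last name. Application
    -- flows that process names in order follow the clustering
    -- sequence and benefit from sequential prefetch.
    CREATE INDEX XDIR_NAME
      ON PHONE_DIR (LAST_NAME, FIRST_NAME)
      CLUSTER;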
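
For the third factor, here is a sketch of a table designed so rows never expand after insert. The table and column names are illustrative; the idea is fixed-length columns defined NOT NULL WITH DEFAULT, so a value filled in by a later update is the same size as the default and the row stays on its original page.

    -- Fixed-length CHAR columns with same-size defaults keep the
    -- row at its final length from the moment it is inserted.
    CREATE TABLE CUST_SCORE
      (CUST_ID    BIGINT       NOT NULL,
       SCORE_DATE DATE         NOT NULL WITH DEFAULT,
       SEGMENT_CD CHAR(4)      NOT NULL WITH DEFAULT 'NONE',
       SCORE_AMT  DECIMAL(9,2) NOT NULL WITH DEFAULT 0,
       NOTES      CHAR(100)    NOT NULL WITH DEFAULT);

The same principle applies to the load path: a single program or statement that validates and inserts in one pass avoids landing the data in a staging area and re-reading it, which is where the daily millions or billions of avoidable I/Os come from.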

The performance key to efficient big data analytics is to bring as much of the computing analytical process as possible to the data, so the data doesn’t need to be moved or manipulated in I/O-intensive processes. These three critical scalability performance factors will serve your project design well and help with your success, and your CIO’s.


Dave Beulke is a system strategist, application architect, and performance expert specializing in big data, data warehouses, and high-performance Internet business solutions. He is an IBM Gold Consultant, Information Champion, President of DAMA-NCR, former President of the International DB2 Users Group, and a frequent speaker at national and international conferences.
