SQL Big Data Performance Tips

After data validation, in most instances SQL performance is the most important big data performance factor in big database applications. Understanding the fine details of DB2 SQL performance tips can make a huge difference in processing time, especially true when going against a big data database. Since every design decision is magnified by tens of billions of rows, one extra column, a miscoded WHERE clause or a table JOIN definition that is out of order will cause overhead to dramatically extend application elapsed time and cause huge CPU SQL and application consumption.

The following SQL performance tips for big data are very similar to the other SQL performance tips blogs I have written; these other SQL Tip blog links are at the bottom of this page. Check them for additional DB2 SQL tip ideas, because every little bit of DB2 SQL performance tuning done against a big data database is always beneficial as it impacts retrieval of billions of rows. In addition, SQL tips are a bit different when dealing with big data, with added significance of very detailed database table knowledge, emphasis given to minimizing the I/O portion of the SQL access path and letting the DB2 engine do the work.

  • The first and most important SQL tip in all the SQL tip blogs I have written, continues to be the most common Let the DB2 database engine do all or as much of the work as possible! Doing the work inside the DB2 database engine continues to be the best performance advantage because DB2 does it faster than your application code. By supplying as much WHERE clause criteria statements as possible and using the correct WHERE clause criteria, you achieve faster performance than executing a SQL statement and then applying application code for filtering, comparing and quantifying the SQL result set data.

    When doing data qualification through the proper SQL WHERE clauses, the I/O of data from the database engine to the application program is minimized for the most efficient performance. Also, through DB2 and the processor chip technology, the SIMD operations can more efficiently skip over unneeded data. With database tables in the tens of billions of rows and SQL statements which need to be executed tens of millions of times to process daily transactions, every extra byte of data transferred can make a huge performance impact on overall I/Os, CPU and SQL performance.

  • The next highest priority SQL big data performance tip is to start your SQL access with the table that is qualified the most and will return the least number of rows. By using good SQL WHERE clauses that minimize the qualifying result data from the first table, any subsequent SQL INNER or OUTER JOINs matching data will be minimized also.

    Starting your SQL JOIN by minimizing the eligible data returned, causes your working set of data to be as small as possible and reduces the SQL JOIN rows matched from the subsequent tables. Since DB2 works with a maximum of two sets of data at a time, minimizing the DB2 SQL JOIN working set size as each additional SQL JOIN tables’ data is evaluated, minimizes data matching or sorting performance impact. Whether you are accessing and qualifying the largest table in your big data system or not, calculate which of the SQL JOINed tables can use the most qualified WHERE clauses to produce the smallest result working set of data as possible. Limiting your DB2 SQL working/result set big data results, improves performance and optimization. Keep your SQL JOIN result/working set small to start and stay as small as possible through all your JOIN operations.

  • Order of the SQL tables in JOINs matters. The DB2 SQL optimizer is very smart, leveraging the database table and index statistics to determine the best and most efficient complex DB2 SQL data access path. Most of the time the DB2 optimizer has the SQL DB2 correct and references the tables and indexes in the most efficient order possible. In almost all cases this DB2 SQL access minimizes the CPU and I/Os consumed to produce the desired result set information.

    Sometimes when the DB2 SQL optimizer is working with the very large big data table and index statistics, the DB2 SQL optimizer does not chose the best order to efficiently reference the tables. When your DB2 SQL is JOINing data tables with tens of billions of rows, it is always a DB2 SQL best practice to review the table access order. For example, a recent review of DB2 SQL performance of a complex seven table DB2 SQL JOIN, showed a table access order that resulted in the optimizer carrying a large number of rows from the third query block to the fourth query block. By reordering the DB2 SQL and moving that INNER JOIN to that table up in the sequence, the access order and overall DB2 SQL performance of the processing was improved. DB2 SQL optimizer usually knows the best, most efficient table access order, but it never hurts to review the DB2 SQL table access order, especially with DB2 SQL for big data tables.

  • Every DB2 SQL SELECTed column should be absolutely necessary for the processing. When dealing with big data tables and huge DB2 SQL result sets, every byte is critical. Any extra columns and their bytes are amplified, potentially taking up space in the DB2 sort area, in DB2 interim SQL query blocks and within the DB2 SQL query result set transmitted across and through the network.

    Choosing only the necessary DB2 SELECT columns, preferably columns that are in an index, can dramatically improve your DB2 SQL performance. By SELECTing only the columns that are in an index, table access can be avoided and only index access needed to retrieve the result set. In most cases the data is in the data pages so both index and data page access are required during an SQL DB2 query. In some cases, only a single additional column beyond the indexed columns is necessary, so this still requires the data pages to be scanned. Analyze your DB2 SQL, especially in your complex SQL JOINs, and determine whether using the INCLUDE index definition option would be beneficial. Of course, your index size increases from the newly INCLUDEd columns, but some crucial DB2 SQL access could be dramatically improved by not having to reference the table data pages.

    Sometimes by adding and INCLUDEing a column within an existing index definition, access to millions of extra data page can be avoided which dramatically improves the DB2 SQL performance. Analyze the columns SELECTed from your high performance, most frequency used or daily executed DB2 SQL statements to determine if INCLUDEing extra columns in your indexes may provide a huge performance improvement opportunity.

  • Clustering and cluster order of the data can make a huge difference for DB2 SQL performance. When your database tables and their clustering indexes are defined, the database INSERT activities try to store the physical data rows in the clustering index order. The DB2 Catalog SQL queries below can help you understand the clustering defined for each of your DB2 z/OS and DB2 LUW tables. You may need security DB2 authorization to execute the SQL below against the DB2 Catalog.

    –For DB2 z/OS
    SELECT NAME
    FROM SYSIBM.SYSINDEXES
    WHERE TBCREATOR   = ‘BEULKE’
    AND TBNAME          = ‘CUSTOMER’
    AND CLUSTERING = ‘Y’–For DB2 LUW
    SELECT *
    FROM SYSCAT.INDEXES
    WHERE TABSCHEMA = ‘BEULKE’
    AND TABNAME             = ‘CUSTOMER’
    AND INDEXTYPE          = ‘CLUS’

When writing your DB2 SQL JOINs use the table data ordering information to reference and JOIN tables with like data clustering orders. JOINing like clustered tables together helps DB2 scan and combine the tables’ data more efficiently. Understanding your tables clustering order and aligning it within your SQL clauses such as GROUP BY, ORDER BY and any WHERE comparison criteria statements can improve DB2 SQL performance.

Understanding, leveraging and implementing these five DB2 SQL performance rules against only a few DB2 SQL big data queries recently improved performance significantly, reducing CPU by 2.7% and about 143.5M GetPages per day. Since this particular performance evaluation produced such positive results the team believes more DB2 SQL queries will benefit from these concepts and a wider review of the most frequently DB2 SQL workload is scheduled. Review your existing DB2 SQL performance workload, apply these concepts and let me know how much you save within your DB2 SQL critical performance workload.

5 More DB2 SQL Performance Tips
Another 5 More DB2 SQL Performance Tips
5 Big Data SQL Performance Tips – Fixing Generated SQL
DB2 11 SQL Performance Improvements
5 More SQL Performance Tips for your Big Data


Dave Beulke is a business strategist, systems architect and performance expert specializing in big data, data warehouses and high-performance internet solutions. He is an IBM Gold Consultant, Information Champion, President of DAMA-NCR, former President of International DB2 User Group (IDUG) and frequent speaker at national and international conferences. His strategies, architectures and performance tuning techniques enhance analytics, security and performance so organizations can better leverage their information assets and save millions of dollars in time to market, CPU, development and overall costs. Follow his blog at davebeulke.com or on Twitter here (@DBeulke) or connect through LinkedIn here (https://www.linkedin.com/in/davebeulke).

Leave a Reply

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>