5 Big Data SQL Performance Tips – Fixing Generated SQL

After implementing a big database with billions of rows, it’s time to run your applications and reports.  Hopefully, the database is partitioned so the applications can utilize parallelism and complete your application and reports quickly.  When you begin to write your SQL against these tables, take into consideration the first set of DB2 SQL performance tips that I wrote up that can be found here.

Sometimes you’ll discover the database is working beautifully in production and the ad hoc report requests are flowing in with no problem, but soon enough you’ll start to have ad hoc SQL report writing tools power users generating complex SQL against some of your largest tables with tens of billions of rows.  Unfortunately, sometimes report-writer generated SQL is not optimally coded and leads to poor DB2 SQL performance.

Even if your database doesn’t have a billion rows, the following five big data SQL performance tips help everyone deal with GUI-generated SQL, or overly complex SQL created by basic reporting tools that needs to be debugged and optimized.


  1. Examine the data columns and tables SELECTed in the SQL.  Many of these new big data report writers have nice GUI interfaces that allow the users to point and click to develop their reports and underlying SQL.  Unfortunately, mistakes or clicking on unneeded columns sometimes generates extra unnecessary table or data column access in the SQL processing.  It may seem very basic, but including an extra table or some extra data columns in the SQL SELECT list can truly kill SQL performance.  Verify that no extra tables are in the SELECT statement and no extra columns or unneeded columns (e.g. columns are used in the WHERE= clause) are in the SELECT result set. 

    When you have the complete list of data columns and tables needed, review the DB2 objects to verify the statistics are current on the base tables and all their indexes.  With big data tables it is important to insure current statistics on the database objects are available.  With big data tables use RUNSTATs Sampling to improve the utility performance and reduce its CPU.  Again this is basic for DB2 SQL performance, but vital to verify when beginning to debug and improve your big data SQL performance.

  2. Verify the DB2 table joins for improving your DB2 SQL performance.  With many GUI SQL report generators the SQL table joins are not fully optimized due to improper index choice or lack of inclusion of possible index columns.  Sometimes GUI tools don’t evaluate table indexes’ uniqueness, cardinality, relationship to the table partitioning or column frequencies when generating the SQL join.  All of these index considerations are vital for the GUI tool to generate the best SQL joins using WHERE clause predicates to promote index usage.

    Validate that the SQL join is using as many index columns as possible within the table SQL join WHERE clause.  This is especially important for the SQL join matching the two tables’ data, but also because any extra data column filtering can be done at the same time as the join.  This extra filtering can help make the join SQL performance faster and reduce the size of the resulting data set.

    Finally, make sure that the SQL join the best join type possible.  Since these GUI tools sometimes default to always generating only one join type or are not aware of all the DB2 join type possibilities, make sure that the join type used is optimized to improve SQL performance.  While join choice is dependent on your table join requirements, a Left Outer Join or a Nested Loop Join or another join type may be a better choice to improve table access.

  3. Leverage DB2 functions appropriately.  Analyze the SQL to determine the need for the various DB2 functions.  Sometimes the report writer generates SQL with multiple, duplicate or unnecessary DB2 functions.  DB2 DATE functions such as DATE, MONTH, YEAR, DB2 numeric functions like MIN, MAX, and others may not be necessary for the answer set and can be eliminated from the SQL.  These extra functions can cause extra processing and hurt your DB2 SQL performance.

  4. Eliminate data translations.  Sometime the report writer generated SQL uses many CAST, SUBSTR, CASE statements, constants, or formulas in the SQL.  The generated SQL may also have math formulas compared to DB2 table columns within the SQL.  All of these SQL translations are likely to slow down your DB2 SQL performance and lengthen your access path because they may need to be materialized.  Once the DB2 data is translated or materialized, the result may no longer be a good index candidate because it no longer matches the DB2 index column definitions.  Since it does not match the index column definition, these translations hurt the DB2 SQL performance and prevent the DB2 access path from doing an index access.

  5. Understand the ORDER.  This very important issue of ORDER is really two important DB2 SQL performance tips in one.

    First the order in which the big data tables and the smaller tables are JOINed can have a huge impact on DB2 SQL performance.  By making sure the JOIN of the smaller table drives the data SELECTed from the big data DB2 table, your DB2 SQL performance will be improved by minimizing the amount of data referenced by the JOIN activities.  Going after the smaller table first optimizes the amount of data DB2 referenced, using a smaller number of index and data pages to reference a limited set of index entries and their corresponding data pages in big data table.  Make sure to review your DB2 access path query blocks and their order to make sure your SQL references the tables in the order you expect to get the best DB2 SQL performance available.

    Second, the report writer generated SQL may produce a variety of SQL phrases, such as ORDER BY, GROUP BY, DISTINCT, UNION, or JOIN predicates that do not take the clustering order of the table, that will require a DB2 sort.  These and other DB2 statements can require DB2 sorts. If your DB2 access is referencing the massive big data table first the sort requirements for the data could sort billions of rows.  So as I said above, understanding the order of the tables referenced is crucial, especially important when the generated SQL has any of the phrases that required a DB2 sort. This issue alone can create a huge big data DB2 SQL performance problem.

Big data DB2 SQL performance can be created by complex SQL queries are generated by report writer GUI tools.  Use these tips to comb through the generated SQL to improve your big data DB2 SQL performance.

Here’s a list with links to other posts that may help you improve your DB2 performance, security, and ease-of-use:

5 More DB2 SQL Performance Tips
Another 5 More DB2 SQL Performance Tips
5 Big Data SQL Performance Tips – Fixing Generated SQL
DB2 11 SQL Performance Improvements
5 More SQL Performance Tips for your Big Data
Hadoop SQL: 4 Reasons Why BigInsights Is the Best
Vital Java DB2 SQL Performance Considerations
More DB2 Family Security Best Practices Part 7: Preventing SQL Injection
DB2 SQL Security Audits


Also the new DB2 11 for z/OS: An IDUG User Perspective white paper is now available for your e-reader.  Check out the IDUG blog by Julian Stuhler here and remember to add comments to white paper announcement to get a chance to win a free IDUG registration!

Dave Beulke is a system strategist, application architect, and performance expert specializing in Big Data, data warehouses, and high performance internet business solutions.  He is an IBM Gold Consultant, Information Champion, and President of DAMA-NCR, former President of International DB2 User Group, and frequent speaker at national and international conferences.  His architectures, designs, and performance tuning techniques help organization better leverage their information assets, saving millions in processing costs.

Leave a Reply

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>