Big Data Analytics: More Maddening Basic Issues: NULLS, VARCHARS and DEFAULTS

Last week’s blog post upset some readers with its basic advice on Big Data.  I agree these basic database issues should be common knowledge for business and database professionals and should not have to be discussed or even considered.  Unfortunately, many new developers think they are prodigies on the level of Leonardo da Vinci and that their designs are perfect.  These grand new visions built on Big Data, NoSQL or DB2 databases lead to interesting designs with many issues.

Some of the basic database design habits that we in the DB2 community take for granted simply do not exist in some of these new NoSQL database designs.  So I want to reinforce the big data database basics one more time with three more issues: NULLs, VARCHARs and DEFAULTs.

While looking at a new big data database table design, I became a bit curious: every column was nullable, several columns were VARCHARs, and one had a long default value of “NOT APPLICABLE.”  While these NULLs, VARCHARs and default values may be valid and appropriate, each of them requires extra bytes of internal storage.  Storing hundreds or thousands of such rows is easily handled.  The problem comes when they are replicated across tens or hundreds of billions of rows of big data: they become a wasteful storage problem, especially when the data must be retained for 30+ years.  These factors make it imperative that the design is optimized correctly.

For example, I found one table definition that was very troubling because it was projected to hold tens or hundreds of billions of rows from various infrastructure sensors within the first two years of the new big data project.  After running for only a few months, the NoSQL database ran into difficulties, and the company is now starting to migrate it to a DB2 for z/OS system to leverage storage compression and to compare performance and scalability.

CREATE TABLE MST_PROD_MTR_DATA (
  PROD_ID     INTEGER ,
  DEPT_KEY    INTEGER ,
  METER_NBR   INTEGER ,
  METER_VALU  INTEGER ,
  PRS_VALU    VARCHAR(3) ,
  DGE_VALU    VARCHAR(2) ,
  SFT_VALU    VARCHAR(14) DEFAULT 'NOT APPLICABLE' ,
  CRT_TS      TIMESTAMP NOT NULL
              WITH DEFAULT '1900-01-01 01:01:01.000000' )

Looking at the definition listed above showed that there was a lot of wasted space within the table.  First, analysis of the data in the three INTEGER columns PROD_ID, DEPT_KEY and METER_NBR showed all of the values were within the range (-32,768 to +32,767) of a SMALLINT.  However, because the project is expected to keep expanding with more monitors, DEPT_KEY is the only column that can safely be changed to a SMALLINT, which saves two bytes in each row.
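
A quick range check along these lines is enough to see whether a column’s values actually fit within SMALLINT; this is only a sketch using the table and column names from the definition above:

-- Compare the smallest and largest stored values against the SMALLINT range (-32,768 to +32,767)
SELECT MIN(PROD_ID)   AS MIN_PROD_ID,   MAX(PROD_ID)   AS MAX_PROD_ID,
       MIN(DEPT_KEY)  AS MIN_DEPT_KEY,  MAX(DEPT_KEY)  AS MAX_DEPT_KEY,
       MIN(METER_NBR) AS MIN_METER_NBR, MAX(METER_NBR) AS MAX_METER_NBR
  FROM MST_PROD_MTR_DATA;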

Further analysis showed that none of the columns were ever set to NULL.  After analyzing the use cases for the data and the expansion plans for the monitoring data, I discovered that a NULL or unknown value was never valid.  I have had many discussions over the years about NULLs defined in database table columns, and in most use cases NULLs should not be used.  If an unknown value is genuinely valid, or if the column is involved in calculations or accounting, there may be a need for NULLs, but in most cases NOT NULL should be specified for all table column definitions.
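
Verifying that a nullable column never actually holds a NULL is a one-statement check.  Something along these lines (again just a sketch against the column names above) returns zero for every column that is never NULL:

-- COUNT(*) counts all rows; COUNT(column) skips NULLs, so the difference is the NULL count
SELECT COUNT(*) - COUNT(PRS_VALU) AS PRS_VALU_NULLS,
       COUNT(*) - COUNT(DGE_VALU) AS DGE_VALU_NULLS,
       COUNT(*) - COUNT(SFT_VALU) AS SFT_VALU_NULLS
  FROM MST_PROD_MTR_DATA;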

Next, evaluating the length of the values within each of the VARCHAR columns showed that only 1-byte codes were stored in all three of the VARCHAR columns (see the quick length check below).  The only longer value was the ‘NOT APPLICABLE’ default within the SFT_VALU column.  That default was ridiculous and was quickly replaced with application programming that displays NOT APPLICABLE instead.  Also, defining small columns as VARCHARs is always suspect because of the extra 2-byte length field every VARCHAR requires.  If your database is using compression, it is often better to define the column as a fixed-length character (CHAR) column and let compression reduce the row size.  Optimizing these column definitions reduced the internal row size by roughly 53%, from 58 bytes to 27 bytes of internal storage.  Having tens or hundreds of billions of these rows just got considerably cheaper in terms of storage and I/O performance, and the application gained more scalability potential.
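
The length analysis itself is just as simple; a probe like this one (a sketch using the column names above) reports the longest value actually stored in each VARCHAR column:

-- LENGTH on a VARCHAR returns the actual stored length, not the defined maximum
SELECT MAX(LENGTH(PRS_VALU)) AS MAX_PRS_VALU_LEN,
       MAX(LENGTH(DGE_VALU)) AS MAX_DGE_VALU_LEN,
       MAX(LENGTH(SFT_VALU)) AS MAX_SFT_VALU_LEN
  FROM MST_PROD_MTR_DATA;

With those answers in hand, the revised table definition looks like this: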

CREATE TABLE MST_PROD_MTR_DATA (                   -- NEW
  PROD_ID     INTEGER   NOT NULL WITH DEFAULT 0 ,  -- was 5, now 4 bytes
  DEPT_KEY    SMALLINT  NOT NULL WITH DEFAULT 0 ,  -- was 5, now 2 bytes
  METER_NBR   INTEGER   NOT NULL WITH DEFAULT 0 ,  -- was 5, now 4 bytes
  METER_VALU  INTEGER   NOT NULL WITH DEFAULT 0 ,  -- was 5, now 4 bytes
  PRS_VALU    CHAR(1)   NOT NULL WITH DEFAULT '*' ,-- was 6, now 1 byte
  DGE_VALU    CHAR(1)   NOT NULL WITH DEFAULT '*' ,-- was 5, now 1 byte
  SFT_VALU    CHAR(1)   NOT NULL WITH DEFAULT '*' ,-- was 17, now 1 byte
  CRT_TS      TIMESTAMP NOT NULL                   -- same 10 bytes
              WITH DEFAULT '1900-01-01 01:01:01.000000' )

Changing VARCHAR columns to CHAR, getting the DEFAULTs correct and resolving the NULL definitions is basic database design.  Today, with so many Big Data systems being built by management-by-magazine, make sure to optimize them before they get too big and out of control.

_______________________________________________________

Dave Beulke is an internationally recognized DB2 consultant, DB2 trainer and education instructor.  Dave helps his clients improve their strategic direction, dramatically improve DB2 performance and reduce their CPU demand, saving millions across their systems, databases and application areas within their mainframe, UNIX and Windows environments.
