3 Critical Factors for Big Data Analytics Performance

With the start of the New Year come new and revised project schedules, endless lists of agile deliverables, new and maintenance projects, and the roulette of implementations.  All of these efforts challenge the IT staff, while no one talks about the business's insatiable appetite for big data sliced and diced, or about analytics spread out across the corporation in a billion different Excel spreadsheets.

Lost in all this demanding big data analytics work are two constant pressures: preserving the analytical context of the data while delivering it in scalable solutions that are always accessible, on a variety of devices, through personalized interfaces that may be used at any time of day.

These efforts face huge obstacles, in spite of the availability of the elastic web and cloud computing that will solve any and all problems, just as the salesperson assured the CIO on the golf course over the holidays.  So 2016 should be another easy year to implement big data analytics, keeping in mind these three critical factors for big data analytics performance.

  1. Achieving 99.99% analytics availability is hard.
    Anyone who has built systems knows that achieving 99.99% availability takes work and planning.  Achieving 99.99% uptime means your application gets at most one second of downtime for every 10,000 seconds of operation.  Since a day is 60 seconds * 60 minutes * 24 hours = 86,400 seconds, your system can be unavailable for only about 8.64 seconds a day, or 3,153.6 seconds (less than an hour) a year.

    Many internet sites such as Facebook, Google, and Yahoo don't achieve "four nines" (99.99%) availability, so your big data analytics plan needs to implement redundancy and failover to ensure functionality.  These failovers and redundancies give the end user the illusion of continuous availability while providing for the inevitable application and system issues.

    That is why the DB2 family of products, such as DB2 LUW pureScale and DB2 for z/OS Data Sharing, are so popular.  I have built, and witnessed at clients, DB2 LUW and mainframe systems that stayed up for several quarters, and have heard of some DB2 instances up for over a calendar year.  Even when using these DB2 big data analytic solutions with their built-in data access redundancy, the application design must have transaction integrity practices, remediation for redundancy, and application logic for failovers in place to mask and handle outages and failover processing.

    Real-time data feeds, refreshes of reference data, and normal software maintenance all impact your system's analytic answers.  Plan your big data analytics system and associated applications to accommodate these situations wisely so your big data analytics answers are available more than 99.99% of the time.
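The downtime arithmetic above can be sketched in a few lines of Python (a minimal illustration of the 99.99% budget, using the same constants as the calculation above):

```python
# Compute the maximum allowable downtime for a given availability target.

SECONDS_PER_DAY = 60 * 60 * 24           # 86,400 seconds
SECONDS_PER_YEAR = SECONDS_PER_DAY * 365

def max_downtime(availability: float, period_seconds: int) -> float:
    """Seconds of allowable downtime in a period at the given availability."""
    return period_seconds * (1.0 - availability)

daily = max_downtime(0.9999, SECONDS_PER_DAY)    # ~8.64 seconds per day
yearly = max_downtime(0.9999, SECONDS_PER_YEAR)  # ~3,153.6 seconds per year

print(f"Daily downtime budget:  {daily:.2f} seconds")
print(f"Yearly downtime budget: {yearly:.1f} seconds ({yearly / 60:.1f} minutes)")
```

The same function makes it easy to see how much tighter "five nines" (99.999%) would be: roughly 315 seconds a year.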

  2. Data context is more important than the data itself.

    “Fantasies from figures” was an early big data analytics user’s quote when she first started using one of my systems in the early days (1990s) of big data warehouse systems.  Believing and trusting the data can only happen if users understand the context of the big data and believe the analytical results.  Gaining her trust in the results from the big data analytics system was critical for her group, then its department, and finally the company-wide roll-out and implementation.

    Education about the data's context was the key to getting her to understand the system.  Discussions and documentation on how the big data was captured, when the various data sources were accessed, and the data validation checklists used against the data are critical for everyone to gain confidence in the data and understand the context of the results.  Only by understanding the complete context of the big data details can its results gain acceptance.

    Gaining in-depth context of the big data analytics processes that build the results is also critical.  Whether the process is a simple addition scan, a simple SQL scan, a complex correlated SQL query, or a multi-step SQL aggregation, understanding all the context factors behind the result set is critical.  By understanding the context of the calculations, the result set and big data analytics answers can be believed, used with confidence, and become beneficial to everyone.
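To make the multi-step SQL aggregation idea concrete, here is a small hypothetical example: a tiny invented sales table, with SQLite standing in for any SQL engine (the table, columns, and figures are illustrations, not from any real system):

```python
import sqlite3

# Build a tiny in-memory sales table (hypothetical data for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100.0), ("East", 200.0), ("West", 50.0), ("West", 150.0)],
)

# Step 1 (inner query): total sales per region.
# Step 2 (outer query): average those per-region totals.
avg_regional_total = conn.execute(
    """
    SELECT AVG(region_total)
    FROM (SELECT region, SUM(amount) AS region_total
          FROM sales
          GROUP BY region)
    """
).fetchone()[0]
conn.close()

print(f"Average regional total: {avg_regional_total:.1f}")  # East=300, West=200 -> 250.0
```

A user who only sees the final 250.0 has no way to judge it; knowing that it is an average of per-region sums, and where those rows came from, is exactly the context the section describes.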

  3. Performance always matters.

    We all want things now, or as fast as possible.  Big data analytics is just a bit more difficult because of its I/O requirements.  Time for data extraction, time for validation, time to realize end-state structures: it all takes a bit longer simply because of the extra amount of I/O processing.  No matter how much memory, solid-state disk, or storage array capacity is available, there are logical and physical I/Os within the processing.  Big data adds time to every process within big data analytics.

    With the new “load and go” nature of Hadoop and other non-relational data stores, the end-state structures for analytics are being simplified.  These simplified analytical data stores try to minimize the extra I/O impact so the analytical platform can be built more quickly.  Sometimes extra design analysis is needed to make sure these simpler big data structures don’t sacrifice the data relationships, or the data’s intrinsic value and meaning, for the faster build of the data structures.

    Simplified big data designs need to be evaluated against more complex designs.  Since every additional data store can mean potentially billions of extra I/Os, evaluating each store for its value and benefit to the in-depth nature of the analytics is critical.  This extra big data structure analysis needs to be done not only for the primary data stores, but also for any supporting reference data stores or indexes, weighing their analytical benefit against the potential performance impact on analytical queries or result set creation.

    Performance in big data should be evaluated not only for the creation of the original big data analytical data stores, but also for the on-going refresh and maintenance of the complete analytics environment.  Periodic review of the performance lifecycle of the on-going big data analytics database is critical, as the first benefits of the results and insights are quickly forgotten.  The big data analytical environment and its data refresh lifecycle must be a quickly repeatable process that can be sustained and maintained for the next iteration of the big data analytical processing, especially since we all want everything now.
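The "billions of extra I/Os" trade-off lends itself to a back-of-envelope estimate before committing to an additional data store.  A minimal sketch (every row count and I/O figure here is a hypothetical placeholder, to be replaced with your own measurements):

```python
# Rough yearly I/O cost of one additional big data store, counting both the
# initial build and each refresh.  All figures below are invented examples.

def extra_ios(rows: int, ios_per_row: float, refreshes_per_year: int) -> float:
    """Estimated yearly I/Os to build and refresh one additional data store."""
    return rows * ios_per_row * refreshes_per_year

# Example: a billion-row store, ~2 I/Os per row, refreshed weekly.
yearly = extra_ios(rows=1_000_000_000, ios_per_row=2.0, refreshes_per_year=52)
print(f"Estimated extra I/Os per year: {yearly:,.0f}")  # 104,000,000,000
```

Even crude numbers like these make it easier to compare a simplified "load and go" structure against a richer design whose refresh lifecycle must be repeated every period.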

Start developing your next big data analytics project quickly, before the CIO’s next golfing excursion, and keep these factors in mind so that your analytics results are positioned for on-going performance and success.

Dave Beulke is a system strategist, application architect, and performance expert specializing in Big Data, data warehouses, and high performance internet business solutions. He is an IBM Gold Consultant, Information Champion, President of DAMA-NCR, former President of the International DB2 User Group, and a frequent speaker at national and international conferences. His architectures, designs, and performance tuning techniques help organizations better leverage their information assets, saving millions in processing costs.
