Three Success Factors for Handling Big Data and Data Warehouse Performance, Part 1

Was 22 billion rows Big Data? As I began the new data warehouse project, I knew that performance needed to be designed in from the very first data element. The project was very interesting, with complex calculations, cross-referencing reports, Big Data implications, and big data warehouse performance challenges. Many standard reports that affected whole industries and government policies depended on the data warehouse for timely delivery and for 100% accuracy of every bit of the Big Data; even 99.999% was not good enough.

Another consideration: the implementation deadline for this high-performance Big Data warehouse system was only 120 days away and wasn’t movable. All of these factors made this Big Data system quite the challenge. Our small team of business users, SQL analysts, and technical staff was committed, and we were successful in completing the implementation. The data warehouse performance was so improved that some processes went from 37 hours to under 7 minutes, and some reports came back in seconds. Complex calculations and report verification done during our data quality testing even found bugs in the old system’s existing reports that had been trusted for years, so we knew our results were 100% accurate. Our overall performance, 100% accuracy, and quality implementation were a testament to a winning team effort.

Many success factors drove the team, the design, the development, and the successful implementation of this 22 billion row Big Data system. Three in particular drove the data warehouse performance.

First: Keep it simple. Yes, I know everyone talks about keeping it simple in every development effort and system design. Actually designing and building it simply is much harder. Instead of endless discussions and meetings, design the big picture first and then see how all the reports, every type of interface, and the users interact with the warehouse. Data warehouse performance is achieved through computing power. Simple designs that leverage parallelism, driving both the I/O and the calculation computing power to produce reports quickly, are key; a sketch of what I mean follows below.
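To make that concrete, here is a minimal sketch of a simple, partitioned design. The SALES_FACT table, its columns, and the partition boundaries are hypothetical, purely for illustration; the DDL uses standard DB2 for z/OS range-partitioning syntax so that each partition can be scanned, loaded, and reorganized independently and in parallel.

    -- Hypothetical fact table, range-partitioned by date so utilities
    -- and queries can drive I/O against all partitions at the same time.
    CREATE TABLE SALES_FACT
          (SALE_DATE   DATE          NOT NULL,
           STORE_ID    INTEGER       NOT NULL,
           PRODUCT_ID  INTEGER       NOT NULL,
           AMOUNT      DECIMAL(11,2) NOT NULL)
      PARTITION BY RANGE (SALE_DATE)
          (PARTITION 1 ENDING AT ('2011-03-31'),
           PARTITION 2 ENDING AT ('2011-06-30'),
           PARTITION 3 ENDING AT ('2011-09-30'),
           PARTITION 4 ENDING AT ('2011-12-31'));

With the rows spread across partitions, the system is not funneling billions of rows through a single data set, and that is where the parallel computing power comes from.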

Second: Leverage all existing infrastructure and processes. The tight 120-day schedule from start to finish left no room for lingering issues, only for solutions, however time-delayed. Since we used an agile scrum style of development methodology, the scrum stories and the data warehouse performance of the Big Data reports were tested several times during the short development period, all using the same standard corporate testing procedures, documentation, and approval processes that had been in place for years. The team contacted everyone needed to guarantee the data warehouse performance, got them involved early, and gathered their requirements, feedback, improvements, and eventual approval for all the components and the implementation. Everyone knew the Big Data project schedule ahead of time, so all infrastructure and processes were ready. Issues were worked until solutions were approved and implemented.

Third: Don’t be intimidated by Big Data. I have been fortunate to design, develop, deploy, and study data warehouse performance since 1988, when I implemented my first “Big Data” system for the financial firm E.F. Hutton. At least back then it was considered Big Data, and it seems that every data warehouse system I have dealt with over the years has had bigger data than the previous one. Big Data will keep getting bigger; it is just another couple of zeros on the end of your table population and your design considerations. Data warehouse performance is built on the same principles it was many years ago. As I mentioned with the first success factor, design for both I/O and CPU parallelism in every single item of your system. Only then can your Big Data system have the data warehouse performance it needs to speed up your processes and reports from hours to just minutes or even seconds, making everyone, especially your users, happy.
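As a small illustration of that principle, and again only a sketch built on the hypothetical SALES_FACT table above, this is how a session can invite DB2 to consider CPU parallelism for a report query. SET CURRENT DEGREE is standard DB2 syntax; the query itself is illustrative only.

    -- Ask the optimizer to consider query parallelism for dynamic SQL
    -- in this session (DB2 for z/OS syntax).
    SET CURRENT DEGREE = 'ANY';

    -- A report-style aggregate; with parallelism enabled and the table
    -- partitioned by SALE_DATE, DB2 can process partitions in parallel.
    SELECT STORE_ID,
           SUM(AMOUNT) AS TOTAL_AMOUNT
      FROM SALES_FACT
     WHERE SALE_DATE BETWEEN '2011-01-01' AND '2011-12-31'
     GROUP BY STORE_ID
     ORDER BY TOTAL_AMOUNT DESC;

The point is that the parallelism has to be designed in, in the table layout and in how the queries are written, before the optimizer can exploit it.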

I am speaking at the upcoming Baltimore/Washington DC DB2 Users Group meeting on September 14th, where I will be presenting “DB2 Performance Database Discoveries and Recommendations Part 2”. The new meeting agenda information should be posted soon, but more information can be found at: http://www.bwdb2ug.org/index.htm

Coming up Sept. 29th and 30th, two great DB2 classes are being offered in Washington DC: DB2 10 for z/OS Performance Training on Sept. 29th and SQL Performance Training on Sept. 30th. Get more information here (http://davebeulke.com/db2-performance-tuning-and-sql-training/).

Also, I will be presenting at the International DB2 User Group (IDUG) European Conference in Prague. I will be presenting “DB2 10 Temporal Database Designs for Performance” on Monday, November 14, 1:30 PM – 2:30 PM. For more details on my presentation, click here (http://www.idug.org/e/in/eid=2&req=info&s=97&all=1). I will also be involved in the data warehousing Special Interest Group (SIG) and the final discussion panel, so please join all the presentations, networking, and discussions on DB2. Sign up now at www.idug.org.

Come see me at any of these venues and ask me your data warehouse performance questions.
