Why Hadoop Is Not the Big Data Solution

The big data Hadoop hype is running into reality, and some companies do not like Hadoop’s options and features. Following are some of the Hadoop facts that a friend of mine noted to explain why their company was migrating its big data project off Hadoop and into the DB2 z/OS environment.

Hadoop is not free. Hadoop is open source, and the software itself is a free download. Unfortunately, everyone who supports any type of technology knows that nothing is truly free. That realization usually arrives after all the evaluation of the open source code base, the patching, and the operating system coordination. The software may be free, but consider the functionality issues that may never be addressed by the open source community, the extra software coordination, and the wait for features; any of these can be project killing and painfully expensive. Also, a number of software support companies offer their own derivatives or versions of Hadoop with software support contracts.

Many Hadoop components have not even reached Version 1 yet. Along with the Hadoop software itself, several other components are typically needed to get your big data project going. For example, two other open source components, Pig and Hive, languages that provide data flows and MapReduce processing, are only at version 0.12.0. These two components are typically core to a Hadoop infrastructure and work within it, but they are hardly easy to use and can be cumbersome for developers to learn. This kind of early code, below Version 1 with limited functionality and only elementary ease of use, is typical of the open source companion software used with Hadoop.

Hadoop integration can be challenging. Integrating Hadoop into any environment is difficult because of its reliance on command scripts, its lack of standard database application interfaces, and its batch mode of processing. These three factors make Hadoop cumbersome to integrate with other projects, products, and applications: direct access is almost impossible, automating scripts can be complex, and standard interfaces are not available. Because integration is difficult, any data loaded and analyzed, or any answers produced, become yet another silo of data that requires extra effort to re-integrate with the other enterprise system databases for big picture analysis.

Hadoop’s SQL interface is not quite there. While many vendors are contributing to the Hadoop open source effort, a SQL interface enhancement is still being developed. This SQL interface will be far from the sophisticated SQL interfaces available in the standard DB2 Family implementations. It may also be constrained by the underlying Hadoop infrastructure, which limits the recursive and join processing that we have come to rely on in DB2 Family SQL. Until the Hadoop SQL interface is fully delivered, many vendors are providing only SQL “like” functionality.
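
To make "recursive or join activities" concrete, here is a minimal sketch of the kind of query meant: a recursive common table expression that joins a table to itself to walk a reporting hierarchy. It is only an illustration under stated assumptions. SQLite, through Python's built-in sqlite3 module, stands in for DB2 so the example runs anywhere; the employee table, its columns, and the reporting_chain helper are hypothetical; and DB2's own syntax for recursive SQL differs in some details.

```python
import sqlite3

# Assumption: SQLite stands in for DB2 purely so this sketch is runnable.
# The employee table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Ann", None), (2, "Bob", 1), (3, "Cara", 2), (4, "Dev", 1)],
)

# Recursive common table expression: start from the employee with no manager,
# then repeatedly join the table back to the rows found so far.
REPORTING_CHAIN = """
WITH RECURSIVE chain (id, name, depth) AS (
    SELECT id, name, 0 FROM employee WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, c.depth + 1
    FROM employee e JOIN chain c ON e.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth, name
"""


def reporting_chain(connection):
    """Return (name, depth) pairs for every employee, top of the hierarchy first."""
    return connection.execute(REPORTING_CHAIN).fetchall()


rows = reporting_chain(conn)
for name, depth in rows:
    print("  " * depth + name)
```

A single declarative statement like this replaces what would otherwise be several chained batch passes over the same data.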

DB2 can handle your big data; Hadoop isn’t required. Just because someone said "big data" doesn’t mean that your project needs Hadoop. DB2 can process, stream, and analyze billions of rows easily. DB2 also offers massive free parallelism through the z/OS zIIP processors. Handling one more database within the DB2 environment is easier than handling a new operating system, new open source software, and limited capabilities.

The incremental cost of another DB2 database, together with the maturity, reliability, and quality of the DB2 software, its available application language interfaces, and its robust big data capabilities, makes DB2 the best choice for your big data project.
Dave Beulke is a system strategist, application architect, and performance expert specializing in Big Data, data warehouses, and high performance internet business solutions. He is an IBM Gold Consultant, Information Champion, and President of DAMA-NCR, former President of International DB2 User Group, and frequent speaker at national and international conferences. His architectures, designs, and performance tuning techniques help organizations better leverage their information assets, saving millions in processing costs.
