3 Big Data Issues: Security, Governance and Archiving

When companies and projects are considering their Big Data applications, it seems everyone forgets the decades of advancement that computing platforms and database technologies have gone through.  The platform and database are critical to the success of your company's Big Data initiative, and choosing a mature technology can help your project succeed and give your processing optimal performance.

First, the platform is important for your Big Data environment because it must supply the proper security for user and processing access.  The Hadoop Distributed File System (HDFS), which provides Hadoop with its file access, has been available for a while but is still in its early years.  Unfortunately, HDFS lacks sophisticated access controls, so some projects layer the open source Apache Accumulo on top of it for cell-level security and server-side programming capabilities.  Even with separate packages such as Accumulo on top of Hadoop, you get only a fraction of the functionality that has been built into the z/OS, UNIX and even Windows platforms for decades.  As for encryption, it can be done within Hadoop, but it must be applied across the entire HDFS or through an application-level encryption scheme, either of which requires additional processing steps and can degrade analysis performance.
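To make the cell-level idea concrete, here is a minimal sketch of Accumulo-style visibility checking.  It is an illustration only: the flat "&"/"|" expressions and the sample field names are assumptions, and real Accumulo visibility labels support fully nested boolean expressions.

```python
# Sketch of cell-level security in the style of Accumulo visibility labels.
# ASSUMPTION: only flat "a&b" or "a|b" expressions; real labels allow nesting.

def visible(expression: str, authorizations: set) -> bool:
    """Return True if the user's authorizations satisfy the cell's label."""
    if "&" in expression:
        return all(term in authorizations for term in expression.split("&"))
    if "|" in expression:
        return any(term in authorizations for term in expression.split("|"))
    return expression in authorizations

# Each cell carries its own label, so access is decided per value, not per file.
cells = [
    ("patient_name", "Jane Doe", "medical&admin"),   # needs both auths
    ("diagnosis",    "flu",      "medical"),         # needs one auth
    ("billing_code", "B-123",    "billing|admin"),   # needs either auth
]

user_auths = {"medical"}
readable = [(k, v) for k, v, label in cells if visible(label, user_auths)]
```

With only the "medical" authorization, the user sees the diagnosis cell but neither the name nor the billing code, even though all three live in the same row.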

Putting together and rolling your own integrated security controls and encryption is time consuming and carries a huge performance overhead.  Many platforms today have time-tested, integrated platform and database solutions that provide configurable, custom security.  DB2, with its many levels of integrated authorizations, provides operating system level, database level, object level, content-based, and label-based row and column security access controls.  All of these integrated capabilities provide customizable security profiles and selective encryption while minimizing performance overhead.
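The label-based row security mentioned above can be sketched as follows.  This is a simplified illustration of the concept, assuming a purely hierarchical label set; real DB2 LBAC policies also support set and tree label components and column-level labels.

```python
# Hedged sketch of label-based row access control (LBAC-style).
# ASSUMPTION: a simple three-level hierarchy; real policies are richer.

LEVELS = {"PUBLIC": 0, "CONFIDENTIAL": 1, "SECRET": 2}

def can_read(user_label: str, row_label: str) -> bool:
    # A user may read a row when their label dominates the row's label.
    return LEVELS[user_label] >= LEVELS[row_label]

rows = [
    {"account": "A-1", "balance": 100, "label": "PUBLIC"},
    {"account": "A-2", "balance": 250, "label": "SECRET"},
]

# In the database this filtering is automatic and transparent to queries;
# the application never sees rows its label does not dominate.
visible_rows = [r for r in rows if can_read("CONFIDENTIAL", r["label"])]
```

The key design point is that the filter lives in the data layer, so every query path gets the same enforcement without per-application security code.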

Second, when some Big Data projects get started, data governance is the last thing on everyone's mind, yet it is critical to the long-term success of the effort.  The data governance practices of data security, data standards, data quality and metadata, which give context to Big Data answers, are vital.  To truly understand the answers produced through Big Data processes and analytics, data governance needs to be applied directly to all of your data.

Some Big Data projects analyze only social data, such as tweets or likes, within the scope of their brand or product.  Measuring customer sentiment in those contexts may not need much governance.  The analysis of more critical Big Data, such as medical records or pharmaceutical drug interactions, needs governance data standards to make correct life-and-death analysis decisions.  Within the established platforms and databases there are a variety of tools and vendors to support and assist governance.  Within the new Hadoop environment you are required to roll your own data governance processes to ensure the data is truly being handled properly.
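Rolling your own governance process often starts with validating incoming records against a declared data standard before they enter the analytics store.  A minimal sketch is below; the field names and rules are illustrative assumptions, not a real medical or pharmaceutical standard.

```python
# Sketch of a data-quality gate for a governance process.
# ASSUMPTION: the fields and validation rules below are invented examples.

STANDARD = {
    "patient_id": lambda v: isinstance(v, str) and v.startswith("P"),
    "drug_code":  lambda v: isinstance(v, str) and len(v) == 5,
    "dose_mg":    lambda v: isinstance(v, (int, float)) and v > 0,
}

def validate(record: dict) -> list:
    """Return a list of governance violations for one record."""
    errors = []
    for field, rule in STANDARD.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not rule(record[field]):
            errors.append(f"bad value for {field}: {record[field]!r}")
    return errors

good = {"patient_id": "P1001", "drug_code": "AB123", "dose_mg": 50}
bad  = {"patient_id": "1001", "dose_mg": -5}
```

Records that fail validation can be quarantined with their error list as metadata, which preserves the context needed to trust later analytic answers.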

Third, archiving Big Data is another aspect of these projects that becomes critical when you are creating processes whose analysis must be verified and whose answers must be audited.  When Big Data keeps coming into your system from sensors, monitors or stock trades, your database needs a temporal design.  By understanding the Big Data sample result set that produced previous answers, your system can quickly compare, validate and audit new answer sets.  Quickly on-boarding and offloading old Big Data is also critical to minimize your storage requirements and keep the active working set as manageable as possible.
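The on-boarding/offloading step can be sketched as a simple temporal split: readings older than a retention window move out of the active working set into an archive.  The sensor names and 30-day window here are assumptions for illustration.

```python
# Sketch of temporal offloading for a sensor-data working set.
# ASSUMPTION: each reading carries a datetime timestamp under key "ts".
from datetime import datetime, timedelta

def offload(readings: list, now: datetime, retention: timedelta):
    """Split readings into (still_active, archived) by timestamp."""
    cutoff = now - retention
    still_active = [r for r in readings if r["ts"] >= cutoff]
    archived = [r for r in readings if r["ts"] < cutoff]
    return still_active, archived

now = datetime(2013, 6, 1)
readings = [
    {"sensor": "s1", "ts": datetime(2013, 5, 31), "value": 7.2},  # recent
    {"sensor": "s1", "ts": datetime(2013, 1, 15), "value": 6.8},  # old
]
active, archived = offload(readings, now, timedelta(days=30))
```

A temporal database does this automatically via system- or business-time periods; the sketch only shows the partitioning decision that keeps the active set small while preserving old answer sets for audit.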

Hadoop has issues managing its data storage requirements because it defaults to making three copies of the data across its node architecture for processing parallelism and fault tolerance.  While Hadoop can leverage commodity hardware, the extra data copies and I/Os add performance overhead to its batch processes.  This is a huge contrast to the DB2 database platform, which incorporates adaptive compression that can compress a single copy of the data, requiring only 10% to 20% of the Hadoop storage requirement.  The DB2 compression capabilities, along with the temporal data stores within DB2, provide a great way to automatically archive old data out of the active Big Data set.
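The storage comparison is simple arithmetic, sketched below.  The 3x factor is HDFS's default replication; the 10%-20% figure is the one cited above, used here illustratively rather than as a measured benchmark, and the 100 TB raw size is an invented example.

```python
# Rough arithmetic behind the storage comparison.
# ASSUMPTION: 100 TB of raw data; default HDFS replication factor of 3;
# compressed DB2 copy at 10%-20% of the Hadoop total, per the text.

raw_tb = 100
hadoop_tb = raw_tb * 3                      # three replicas across the nodes
db2_low = hadoop_tb * 0.10                  # optimistic compression outcome
db2_high = hadoop_tb * 0.20                 # conservative compression outcome

# 100 TB raw becomes 300 TB under HDFS, versus roughly 30-60 TB in DB2.
```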

So these are another three key issues to think about for your Big Data project as your company considers its platform and database.  While the new kid on the block, Hadoop, may be getting a lot of press, it needs more capabilities to catch up with the mature platforms and databases for data security, governance, archiving and temporal designs.


Dave Beulke is an internationally recognized DB2 consultant, DB2 trainer and education instructor.  Dave helps his clients improve their strategic direction, dramatically improve DB2 performance and reduce their CPU demand, saving millions across the systems, databases and application areas of their mainframe, UNIX and Windows environments.
