3 Considerations for Enjoying the Data Lake

Summer vacations are always wonderful, with all the outside activities with friends and family. Being outside at the lake, enjoying the warm weather and cooling off in the water, makes for a wonderfully relaxing time. This is the safe, contented image that everyone has in mind when discussing the new “Data Lake” concept. Unfortunately, there are a few considerations when diving into the lake that can ruin your summer very quickly.

Data in One Place

Having a data lake as the main repository of your analytical data is a great concept. A single “Database as a Service” (DBaaS) that provides one-stop shopping for all analytics efforts can be very convenient for end users. Unfortunately, just like the regular summer vacation lake, the Data Lake shoreline can get crowded and noisy very quickly and become overburdened by too many people in the lake at once. Traffic issues, long waits, and long lines can drag out the Data Lake experience simply because the lake is too popular and overwhelmed with activity.

Analytics these days is vital for discovering new revenue growth and cost reduction areas. Having a Data Lake is a wonderful concept; just plan for more capacity, more performance issues, and many more users than you first expect. Too many people and too many processes degrade performance immediately, so make sure to do extensive capacity planning for Data Lake usage. All the new crowds of users have high expectations for performance, response time, and data standards within their new Data Lake service.
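As a rough starting point for that capacity planning, Little's Law says the average number of queries running at once equals the arrival rate multiplied by the average query duration. The sketch below applies it with a growth multiplier; the function names and all the numbers are hypothetical and should be replaced with measurements from your own workload.

```python
def expected_concurrency(queries_per_hour, avg_query_seconds):
    """Little's Law: average queries in flight = arrival rate x average duration."""
    arrivals_per_second = queries_per_hour / 3600.0
    return arrivals_per_second * avg_query_seconds


def planned_concurrency(queries_per_hour, avg_query_seconds, growth_factor):
    """Same estimate after scaling today's arrival rate by an assumed growth factor."""
    return expected_concurrency(queries_per_hour * growth_factor, avg_query_seconds)


# Hypothetical workload: 3,600 queries per hour at 10 seconds each today,
# with usage assumed to triple once the lake becomes popular.
today = expected_concurrency(3600, 10)
after_growth = planned_concurrency(3600, 10, 3)
print(f"average concurrent queries today: {today:.1f}")
print(f"average concurrent queries after growth: {after_growth:.1f}")
```

This is an average only. Peak periods will run well above it, so treat the result as a floor for sizing rather than a target.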

Unknown Data Contexts

With the new Data Lake DBaaS service, all the analytical data is in one spot. Pushing data from a variety of sources into the Data Lake is the new normal, encouraged by management and by potential users who want to mix and match data. Unfortunately, the original source data can get confused with new derivative data combinations and data slices built from the other diverse sources within the Data Lake.

Mixing and matching Data Lake data can become dangerous because the context of the various sources populating the data lake may differ. Context includes when the data was captured in the purchase process, the time periods the data covers, and the frequency or grain of the data. Because that context can vary, data from two different sources can end up combined or related through wrong or misleading keys. These issues can lead to bad analytical discoveries and, worse, wrong business decisions that move the company in the wrong direction and cost extensive time and money.
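A minimal sketch of one way to guard against a grain mismatch: roll the finer-grained source up to the coarser grain before joining, and refuse to join if the coarser source turns out to have duplicate keys. The data sets, field names, and function names here are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical sources: sales captured daily, budgets captured monthly.
daily_sales = [
    {"date": "2024-01-05", "region": "east", "amount": 100.0},
    {"date": "2024-01-20", "region": "east", "amount": 150.0},
    {"date": "2024-02-03", "region": "east", "amount": 200.0},
]
monthly_budget = [
    {"month": "2024-01", "region": "east", "budget": 500.0},
    {"month": "2024-02", "region": "east", "budget": 400.0},
]


def roll_up_to_month(rows):
    """Aggregate daily rows to the (month, region) grain."""
    totals = defaultdict(float)
    for row in rows:
        month = row["date"][:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        totals[(month, row["region"])] += row["amount"]
    return dict(totals)


def join_at_month_grain(sales_rows, budget_rows):
    """Join sales to budgets only after both are at the (month, region) grain."""
    sales = roll_up_to_month(sales_rows)
    budgets = {}
    for row in budget_rows:
        key = (row["month"], row["region"])
        if key in budgets:
            # A repeated key means the budget source is not at the grain we assumed.
            raise ValueError(f"duplicate budget key {key}")
        budgets[key] = row["budget"]
    return {
        key: {"sales": sales[key], "budget": budgets[key]}
        for key in sales.keys() & budgets.keys()
    }


result = join_at_month_grain(daily_sales, monthly_budget)
for key in sorted(result):
    print(key, result[key])
```

Joining the daily rows directly to the monthly budget on region alone would repeat each budget once per sale and inflate any budget total, which is exactly the kind of misleading key the paragraph above warns about.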

Repeatable Processes

Like all information technology processes, Data Lake analytical processes and their conclusions should be repeatable. Repeatable analytics results offer an opportunity to have significant bottom-line impact and bring new insights into the business, customers, or product features. Having a repeatable process with repeatable results needs to be standard practice for any analytical process, especially one that can result in considerable effort and expense downstream within the business.

With Data Lake DBaaS configurations, combinations and mixes of a variety of original data sources are common. Source data definitions, relationship mappings, and data lineage can be difficult to trace. So once your Data Lake analytics comes up with a profitable conclusion, repeating the process over the next set or timeframe of data can become difficult. Getting reliable ETL processes for all the Data Lake data combinations used as your analytics source data can become a nightmare when sources are no longer available, the data cannot be fully traced to its original source, or the data was a one-time outsourced purchase.
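One lightweight way to keep lineage traceable is to store, alongside every derived data set, a record of which inputs and which version of the transformation produced it. The sketch below fingerprints each input with a SHA-256 hash; the record layout, data set names, and version string are hypothetical, and a real lake would more likely hash files or table snapshots than in-memory rows.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 digest of the rows in a canonical JSON form."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lineage_record(output_name, sources, transform_version):
    """Describe which inputs and which transform version produced an output."""
    return {
        "output": output_name,
        "transform_version": transform_version,
        "inputs": {name: fingerprint(rows) for name, rows in sorted(sources.items())},
    }


# Hypothetical inputs: an internal source and a one-time purchased data set.
sources = {
    "crm_customers": [{"id": 1, "segment": "retail"}],
    "purchased_demographics": [{"id": 1, "income_band": "B"}],
}
record = lineage_record("customer_profile_v1", sources, "2024.1")
print(json.dumps(record, indent=2))
```

When the analysis is rerun on the next timeframe of data, comparing the new record with the stored one shows at a glance which inputs changed or went missing.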

Cloud and Lake Continue to Grow

At the recent Gartner conference, the talk focused on how big and small organizations are moving their data to the cloud. The “Cloud First” groupthink is taking over at companies deploying new applications. The cloud may be a new strategic initiative for business, but given recent experiences with the earlier Data Lake trend, I remain skeptical.

The bedrocks of data management, governance, and security continue to be critical underpinnings of any Data Lake or cloud storage philosophy. Make sure those are in place before your company makes the jump to these new cloud and Data Lake DBaaS strategies because there is no going back from bad data or hacking situations, only dealing with the aftermath and cleanup.

Dave Beulke is a system strategist, application architect, and performance expert specializing in Big Data, data warehouses, and high-performance internet business solutions. He is an IBM Gold Consultant, Information Champion, President of DAMA-NCR, former President of International DB2 User Group, and frequent speaker at national and international conferences. His architectures, designs, and performance tuning techniques help organizations better leverage their information assets, saving millions in processing costs.

