Big Data: Big Lies, Big Damn Lies and Statistics

By Dave Beulke, on May 21st, 2013

Setting up the correct metrics is vital for measuring anything. Mean, median and a variety of more complex formulas are used for an assortment of purposes in Big Data projects. The heart of the issue is whether these formulas accurately convey the measurements for the particular situation.

For example, the simple mean or median of an event can be represented incorrectly or tilted to convey a particular point of view. Politicians, car salesmen, and anyone else with an agenda continue to tilt their answers to their advantage.

The best way to see how statistics can be made to lie is with the simple example of an anomaly or outlier. For example, customer profiling is popular for Big Data projects. Let’s start with seven people in a room, where the goal is to determine information about their incomes. So you get some data and discover their salaries are:

$37,000
$34,000
$28,000
$35,000
$35,000
$12,000
$6,000,000

Now, the median is the middle value, or $35,000. But the mean or average is $883,000. The $12,000 and the $6 million salaries are the “outliers.”
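The arithmetic above is easy to check. The following is a minimal sketch using Python’s standard `statistics` module; the variable names are illustrative only, and the data is the seven salaries listed above.

```python
from statistics import mean, median

# The seven salaries from the example above.
salaries = [37_000, 34_000, 28_000, 35_000, 35_000, 12_000, 6_000_000]

avg = mean(salaries)    # sum of all salaries divided by the count
mid = median(salaries)  # middle value of the sorted list

print(f"mean:   ${avg:,.0f}")
print(f"median: ${mid:,.0f}")
```

The single $6 million salary pulls the mean far above what anyone else in the room earns, while the median stays at the salary of the person in the middle.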

Depending on your agenda, a chart could claim that the average person in this group has an extraordinary income of $883,000 when in reality the typical income is $35,000. Regardless of the agenda or situation, it’s vital to put your Big Data analytics in the proper context. The following are three things to do when setting up your Big Data analytics.

First, develop, prototype and publish your Big Data analytics formulas so that the findings can be discussed and approved. Understanding the truth and the proper talking points for your Big Data analytics results is critical for the success of your Big Data project and for management buy-in.

Second, make your Big Data analytics process completely transparent, repeatable and as simple as possible. Try to minimize or isolate the complex algorithms that are sometimes necessary for squeezing extra value out of the Big Data. Make sure that your process is as understandable as possible so that everyone sees each input and understands each element’s context, how it is used and how sensitive the Big Data results are to it.

Third, leverage existing formulas, industry and company processes, and the company’s existing subject matter experts who know the data. Sometimes there is no substitute for knowing the data. Subtle data anomalies, the context of the data and regional impacts need to be factored into your analysis. Only a company subject matter expert will know the important details that will make or break the usefulness of your Big Data analytics results. Working with and getting to know a company’s data takes time, and there is no substitute for core data analysis and researching what-if scenarios for testing Big Data results.
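One simple way to show how sensitive a result is to each input, and to run a basic what-if scenario, is to drop one value at a time and see how far the summary numbers move. The sketch below is only an illustration of that idea, not a prescribed method; the `leave_one_out` helper is a hypothetical name, and the data is the salary list from the earlier example.

```python
from statistics import mean, median

def leave_one_out(values):
    """For each input, report how the mean and median shift when it is dropped."""
    base_mean, base_median = mean(values), median(values)
    rows = []
    for i, value in enumerate(values):
        rest = values[:i] + values[i + 1:]  # every value except the i-th
        rows.append((value, mean(rest) - base_mean, median(rest) - base_median))
    return rows

# The seven salaries from the example earlier in the article.
salaries = [37_000, 34_000, 28_000, 35_000, 35_000, 12_000, 6_000_000]

report = leave_one_out(salaries)
for value, mean_shift, median_shift in report:
    print(f"drop ${value:>9,}: mean shifts {mean_shift:>+12,.0f}, "
          f"median shifts {median_shift:>+6,.0f}")
```

Dropping the $6 million salary moves the mean by hundreds of thousands of dollars, while the median never moves by more than $500 no matter which salary is removed. A table like this makes it obvious to everyone which inputs drive the result.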

As you start your Big Data analytics project, remember Campbell’s law, an adage developed by Donald T. Campbell. According to Wikipedia, Campbell’s law is:

“The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”

This simple adage can put all your Big Data efforts in perspective, so make sure to create Big Data analytics that everyone understands and that produce value for your company and its bottom line.

Or make sure there are shiny, colorful statistical charts that dazzle everyone.

____________________________________________________

My article entitled “Five Imperatives for Superior Java Application Performance” was published during the IDUG conference in the latest Enterprise Systems Journal. You can link to the Java performance article here.

____________________________________________________

Dave Beulke is an internationally recognized DB2 consultant, DB2 trainer and education instructor. Dave helps his clients improve their strategic direction, dramatically improve DB2 performance and reduce their CPU demand saving millions in their systems, databases and application areas within their mainframe, UNIX and Windows environments.
