Lies, Damn Lies, and Statistics—Do They Apply to Big Data, Too?

“There are lies, damn lies, and then there are statistics!” This data-oriented quip was popularized by the legendary author Mark Twain, who, we can only presume, had a bad experience with statistics at some point.

Who knows? Perhaps Twain’s baseball batting average was below the infamous Mendoza Line, and he was therefore sidelined from a sport he loved into a much less exciting career as a writer of books and letters. I am only guessing here. For the uninitiated, the Mendoza Line is named after Mario Mendoza, whose mediocre batting average came to define the threshold of really awful hitting. But I digress.

Let’s consider this “lies, damn lies, and statistics” sentiment for a moment. The phrase describes how statistics (yes, data no less) can be skewed to support the most idiotic of ideas and justify the weakest of arguments. It embodies man’s worst effort to pull the wool over the proverbial eyes of his fellow man. As practitioners of data analysis, we now have to ask ourselves the same question about big data: could such a stinging sentiment be thrown in the face of the big data analyst?

The problem is that statistics can easily be taken out of context. Statistics can be lobbed by anyone willing to toss the data grenade. All too often, pundits do just that. They throw data around as if it were growing on a database tree or shining down from a data sun or something. But we, as data-gators, value every itty-bitty bit and byte of data, from tree, sun, or otherwise. We don’t dillydally with data. We take data seriously. Therefore, we must be mindful of how data is received and perceived, lest a customer of our insights feel, well, deceived.

Let’s face it. Folks can be intimidated when they get hit over the head with a numbers dump. Data can catch folks off guard. This is typically because bogus insights can be presented under flimflam logic, built on assumed premises and other specious artifices. Insights built this way can be packaged by a data panderer and delivered to the unsuspecting customer as true insights, much like the wares of the reviled snake oil salesman of the 19th century. It happens a lot with simplistic sentiment analysis that might tell one who smiled the most at the party.

For true big data practitioners, establishing the veracity of our insights is not an easy task. Customers are inherently poor at drawing insights from massive data sets without computational help, so when another so-called learned person presents Exhibit A, a big data model of how people are feeling right now, well, how do they question that? How can they tell whether it is all merely damn lies and statistics, or not?

I believe you help them do so by clearly explaining the methodology and process behind big data. You convey to the customer that big data requires a principled and disciplined approach rooted in scientific methodology, business processes, and human ethics. It is more than hitting the switch on beta software. It is about leveraging distributed expertise across a far-reaching network of experts. It is about data curation. It is all this and more.

The true big data scientist is skilled at removing the noise and clutter from a customer’s big data repositories, and knows how to develop categories of conversations that capture what is being said, in real time, about a customer’s products. A purveyor of lies, damn lies, and statistics, well, can’t explain any of that. There is no science to back up their snake oil claims. We, as big data practitioners, need to explain this to our customers, even if they don’t ask.

In the end, perhaps Mark Twain was right: there may be lies, damn lies, and statistics. However, big data is not part of that equation, in my book at least. Numbers can be taken out of context, true. But it is hard to take big data out of context when the context itself is often the basis for the analysis. If you explain these things to your customers, they should see that too. Lastly, some statistics simply can’t be argued with. For example, if your batting average is under the Mendoza Line, you simply suck at baseball.
