Many see big data as a panacea, while others are skeptical of its ability to solve our most pressing problems. What is the real problem with big data? Blind empiricism. All science worth its salt is semi-inductive, or better put, a mix of deduction and induction. Without precursor deductive models, all interpretation is a fishing expedition for tidbits of “interesting” results, which end up filtered by common sense, itself typically rife with prejudice and fallacious thinking. This New York Times editorial hits the nail on the head.
Is big data really all it’s cracked up to be? There is no doubt that big data is a valuable tool that has already had a critical impact in certain areas. For instance, almost every successful artificial intelligence computer program in the last 20 years, from Google’s search engine to the I.B.M. “Jeopardy!” champion Watson, has involved the substantial crunching of large bodies of data. But precisely because of its newfound popularity and growing use, we need to be levelheaded about what big data can — and can’t — do.
The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.
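To make the editorial's point concrete, here is a minimal sketch in Python. The numbers are made up for illustration (they are not the actual murder-rate, browser-share, autism, or organic-food figures); the only point is that any two series trending in the same direction over a short window will show a Pearson correlation near 1, whether or not any causal link exists.

```python
# Illustrative sketch: two unrelated but similarly trending series correlate strongly.
# The values below are invented for demonstration purposes only.
import numpy as np

years = np.arange(2006, 2012)

# Hypothetical series A: some rate declining steadily over the period.
series_a = np.array([5.8, 5.7, 5.4, 5.0, 4.8, 4.7])

# Hypothetical series B: an unrelated quantity also declining steadily (e.g., a percentage share).
series_b = np.array([68.0, 62.0, 55.0, 48.0, 43.0, 39.0])

# Pearson correlation coefficient between the two series.
r = np.corrcoef(series_a, series_b)[0, 1]
print(f"Pearson r = {r:.3f}")  # close to 1.0, despite no causal relationship
```

The high correlation here is a property of the shared trend, not of any mechanism connecting the two quantities; that is exactly why correlation detection alone cannot tell us which relationships are meaningful.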