Guest post by Kaiser Fung, author of Numbersense.
When you hear about Big Data, you almost always hear about the supply side: Behold the data in unpronounceable units of bytes! Admire the new science inspired by all the data! Missing from this narrative is the consumption side. A direct consequence of Big Data will be an explosion of data analyses: more people will produce more analyses, more quickly. This will be a world of confusing and contradictory findings.
In my new book, Numbersense, I argue that the ability to analyze and interpret these data analyses will give one a competitive edge in this world of Big Data.
Numbersense is the noise you hear in your head when you see bad data or bad analysis. After years of managing teams of data analysts, I've learned that what distinguishes the best from the merely good is not math degrees or computer skills; it is numbersense.
Numbersense is an intangible quality that you can't teach in a classroom. The best way to pick it up is by learning from people who have it. For this blog post, I selected two analyses of data analyses that have impressed me recently. Both are highly instructive examples.
- Eating red meat makes us die sooner! Zoë Harcombe didn't think so.
In March 2013, nutritional epidemiologists from Harvard University circulated new research linking red meat consumption with increased risk of death. All major mass media outlets ran the story, with headlines such as “Risks: More Red Meat, More Mortality”. This high-class treatment is typical, given Harvard's brand, the reputation of the research team, and the pending publication in a peer-reviewed journal. Readers were told that the finding came from large studies with hundreds of thousands of subjects, and that the researchers controlled for other potential causes of death.
Zoë Harcombe, an author of multiple books on obesity, was one of the readers who disputed the story. She heard that noise in her head when she reviewed the Harvard study. In a blog post titled “Red meat & Mortality & the Usual Bad Science”, Harcombe outlined how she decided the research was junk science.
She knows this type of research methodology rarely if ever delivers conclusive evidence of causation. Then, she found support from a data table included in the research paper. The table shows that the cohort of people who report eating more red meat also report higher levels of unhealthy behaviors, including more smoking, more drinking, and less exercise. Thus, the increased risk of death observed in the study could have been explained by factors other than red meat consumption.
Chapter 2 of Numbersense looks at the quality of data analyses of the obesity crisis.
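The confounding argument above can be made concrete with a small sketch. The counts below are invented purely for illustration and are not taken from the Harvard paper; the names `cohort` and `death_rate` are likewise hypothetical. The sketch assumes a world in which red meat has no effect at all on mortality, but heavy red meat eaters are more likely to smoke.

```python
# Hypothetical counts, invented for illustration only (NOT from the Harvard study).
# Key: (is_smoker, eats_much_red_meat) -> (number of people, number of deaths)
cohort = {
    (False, False): (8000, 80),   # non-smokers, low red meat:  1% die
    (False, True):  (2000, 20),   # non-smokers, high red meat: 1% die
    (True,  False): (2000, 100),  # smokers, low red meat:      5% die
    (True,  True):  (8000, 400),  # smokers, high red meat:     5% die
}

def death_rate(groups):
    """Pooled death rate across the listed (is_smoker, eats_much_red_meat) groups."""
    people = sum(cohort[g][0] for g in groups)
    deaths = sum(cohort[g][1] for g in groups)
    return deaths / people

# Crude comparison: ignore smoking and compare low vs. high red meat eaters.
crude_low = death_rate([(False, False), (True, False)])
crude_high = death_rate([(False, True), (True, True)])

# Stratified comparison: compare low vs. high red meat within each smoking group.
nonsmoker_low = death_rate([(False, False)])
nonsmoker_high = death_rate([(False, True)])
smoker_low = death_rate([(True, False)])
smoker_high = death_rate([(True, True)])

print(f"Crude:       low {crude_low:.1%} vs. high {crude_high:.1%}")
print(f"Non-smokers: low {nonsmoker_low:.1%} vs. high {nonsmoker_high:.1%}")
print(f"Smokers:     low {smoker_low:.1%} vs. high {smoker_high:.1%}")
```

In this made-up cohort, the crude comparison shows heavy red meat eaters dying at more than twice the rate of light eaters, yet within each smoking group the rates are identical. A table showing that unhealthy behaviors cluster with red meat consumption is a warning that something similar may be happening in real data, even when the researchers adjust for the factors they measured.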
- Netflix turns viewers into puppets! Felix Salmon declined to become one.
In February 2013, Netflix, ever the media darling, premiered House of Cards, a remake of the successful British television show and its second foray into producing original content for its tens of millions of subscribers. Netflix executives regaled the press with stories of how Big Data analysis took the risk out of their $100 million decision.
Andrew Leonard, the technology reporter for Salon.com, gobbled up the Netflix story, even interpreting it as a symptom of a society-wide shift. Like other news analysts, Leonard was convinced by the pure geek wizardry used to analyze mountains of data collected from Netflix customers. The machine, we're told, decided that David Fincher should be the director and Kevin Spacey the star. From here, it is a short trip to the la-la land of viewers as puppets with machines as overlords.
This analysis aroused the skeptic in Felix Salmon, the finance blogger for Reuters. In his blog post, “Why the Quants Won’t Take Over Hollywood”, Salmon raised other factors that affect the box office, including the billions spent on marketing and publicity, the quality of the writing, the sociopolitical climate, the complex relationship between originals and remakes, and the poor track record of predictive modeling in Hollywood. On this last point, Salmon exhibits a keen sense of the limitations of science, speaking of an impossible-to-formulate cocktail of creativity, inspiration, teamwork, and luck.
Chapters 4 and 5 of Numbersense explain how you should judge predictive models used by marketers.
When their respective blog posts surfaced, Harcombe and Salmon were lone voices carefully vetting claims based on other people's data analyses. Their well-honed numbersense allowed them to stand firm in the face of mountains of data, worship of high science, formidable-sounding technical jargon, and academic reputations. The problems with the original research are far from obvious. The point is not to debunk these studies (no data analysis is ever infallible) but to figure out for yourself what is credible and what is junk.