Two dogmas of big data: Understanding the power of analytics for predicting human behavior
The vogue for big data obscures the fact that the economic value of analytics projects often has as much to do with the psychology of de-biasing decisions and the sociology of corporate culture change as with the volumes and varieties of data involved.
“Society became statistical. A new type of law came into being, analogous to the laws of nature, but pertaining to people. These new laws were expressed in terms of probability.” —Ian Hacking
Roughly ten years ago, The Economist magazine quoted the science fiction author William Gibson as saying, “The future is already here—it's just not very evenly distributed.”1 Gibson’s comment is not a bad description of the varying degrees to which analytics and data-driven decision-making have been adopted in the public and private spheres. Much has been done, much remains to be done.
Today few doubt that, properly planned and executed, data analytic methods enable organizations to make more effective decisions. Anecdotal evidence abounds. The city of New York recently began deploying building inspectors using the indications of a predictive model that flags problematic sites. Before the model was implemented, roughly 13 percent of building inspections resulted in a vacate order. Using the model, this figure rose to 70 percent.2 During the 2012 United States presidential election, the data journalist Nate Silver exemplified with considerable flair the superiority of rigorous data analysis and statistical thinking over unaided expert judgment in forecasting election results.3 Netflix based its decision to produce its hit series House of Cards, and in part its choice of the creative team for the series, on an analysis of fine-grained subscriber viewing patterns.4 This cursory list could be extended for pages.5
Academic research corroborates the abundant anecdotal evidence. For example, Erik Brynjolfsson and his collaborators studied a sample of publicly traded firms. They concluded that the firms in the sample that had adopted a data-driven decision-making approach enjoyed 5–6 percent higher output and productivity than would be expected given their other investments and level of information technology usage.6
This story, itself hardly over a decade old, has lately been complicated by the emergence of “big data” as a dominant theme of discussion. Big data is routinely discussed in transformative terms as a source for innovation. “Data is the new oil,” the saying goes, and it will enable scientific breakthroughs, new business models, and societal transformations. A zeitgeist-capturing book title declares that it is a “revolution that will transform the way we live, work, and think.”7 The Cornell computer scientist Jon Kleinberg judiciously declared, “The term itself is vague, but it is getting at something that is real… big data is a tagline for a process that has the potential to transform everything.”8
While there is little doubt that the topic is important, its newness and the term’s vagueness have led to misconceptions that, if left unchecked, can lead to expensive strategic errors. One major misconception is that big data is necessary for analytics to provide big value. Not only is this false, it obscures the fact that the economic value of analytics projects often has as much to do with the psychology of de-biasing decisions and the sociology of corporate culture change as with the volumes and varieties of data involved.
The second misconception is the epistemological fallacy that more bytes yield more benefits. This is an example of what philosophers call a “category error.” Decisions are not based on raw data; they are based on relevant information. And data volume is at best a rough proxy for the value and relevance of the underlying information.
This essay will tackle each of these points in turn, focusing on applications involving the prediction of human behavior in such contexts as students at university, employees on the job, voters at the polls, shoppers in the store, drivers behind the wheel, physicians in the emergency room, and individuals trying to stick to health and medical regimens. An implication of the first point is that rather than wait for the mastery of big data, it is typically possible—and indeed advisable—to pursue near-term applications of analytics that involve readily available data sources. An implication of the second point is that big data is important in predicting behavior, but perhaps not for the reasons most commonly discussed.
An example from the domain of university admissions illustrates how analytics can enable better decisions through more granular and disciplined use of traditional data sources. We recently had the opportunity to work with the University of Toronto, a globally ranked Canadian university, to assess the value of incorporating predictive analytics into the undergraduate admissions process. The specific goal was to build a predictive model capable of distinguishing likely high-achieving students from the rest of the pack. Such a model would enable the university to make offers to students most likely to succeed, earn high marks, and go on to graduate. The data at our disposal contained millions, not billions, of records in a structured form. This is “big data” in the colloquial sense that programming and statistical science, not just spreadsheet analysis, are needed to make sense of it. But it is not “big data” in the more formal “3V” sense of having such high volume, variety, and velocity as to create problems for traditional data processing and analysis technologies.
The potential benefits of this application to students, the university, and society as a whole are apparent. A predictive model provides the admissions officer with a tool that can be used to support making decisions more accurately, consistently, and economically.
Working in close collaboration with the university’s admissions team, we decided early on to build a transparent and easily interpretable predictive model that uses readily available high school transcript information to predict a particular indicator of academic success at university. This planning phase of the project is analogous to an architect discussing with the client the overall vision for a new dwelling being commissioned. Just as form follows function in architecture, the technical specifics of a model (the mathematical form, the input data) are often affected by its intended use.
With these design elements in place, the hard work began. A well-kept secret of analytics is that, even when the data being analyzed are readily available (in this case it was high school transcript data), considerable effort is needed to prepare the data in a form required for the fun part—data exploration and statistical analysis. This process is called “data scrubbing,” connoting the idea that “messy” (raw, transactional, incomplete, or inconsistently formatted) data must be converted into “clean” (rows and columns) data amenable to data analysis.
While it sounds (and indeed can be) tedious, data scrubbing is counterintuitively the project phase where the greatest value is created. Working with detailed high school transcript data, we constructed hundreds of descriptors for each student applicant. This variable creation (or “feature engineering,” in the vernacular) stage is a major point at which domain expertise, the tacit knowledge of experienced data analysts, and creativity can be introduced into the process. Extending the “data is the new oil” metaphor, this is the process of refining the oil into usable form. Notably, this is the aspect of data science that is most difficult to convey in textbooks and university courses.
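The scrubbing and variable-creation steps described above can be sketched in a few lines. This is a minimal illustration only, not the project's actual code: the record layout, the field names (`student`, `grade_level`, `subject`, `mark`), and the four descriptors are all hypothetical, chosen to show how course-level transcript rows might become one row of descriptors per applicant.

```python
from statistics import mean

# Hypothetical raw transcript rows: one row per (student, course).
# Field names and values are invented for illustration only.
raw_rows = [
    {"student": "A", "grade_level": 11, "subject": "math", "mark": "82"},
    {"student": "A", "grade_level": 12, "subject": "math", "mark": "90"},
    {"student": "A", "grade_level": 12, "subject": "english", "mark": " 75 "},
    {"student": "B", "grade_level": 11, "subject": "math", "mark": "68"},
    {"student": "B", "grade_level": 12, "subject": "math", "mark": "N/A"},
    {"student": "B", "grade_level": 12, "subject": "english", "mark": "88"},
]


def parse_mark(text):
    """Scrubbing step: return the mark as a float, or None if unusable."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def build_features(rows):
    """Turn course-level rows into one descriptor row per student."""
    by_student = {}
    for row in rows:
        mark = parse_mark(row["mark"])
        if mark is None:
            continue  # drop records whose mark cannot be parsed
        by_student.setdefault(row["student"], []).append({**row, "mark": mark})

    features = {}
    for student, recs in by_student.items():
        marks = [r["mark"] for r in recs]
        math = {r["grade_level"]: r["mark"] for r in recs if r["subject"] == "math"}
        features[student] = {
            "mean_mark": mean(marks),
            "min_mark": min(marks),
            "n_courses": len(marks),
            # Change in math mark from grade 11 to 12; None if either is missing.
            "math_trend": (math[12] - math[11]) if 11 in math and 12 in math else None,
        }
    return features


features = build_features(raw_rows)
```

Even in this toy version, most of the judgment sits in the choices rather than the code: whether an unparseable mark should be dropped or flagged, and which descriptors (a trend, a minimum, a course count) are worth constructing at all.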
At this point, the stage was set for the centerpiece of the project: We used an iterative process, guided in equal measures by statistical science and common sense, to select a predictive model containing a small subset of the hundreds of variables created for consideration. Each of the model variables, somewhat predictive on their own, contributed to a model whose predictive power is greater than the sum of its parts. The model can be viewed as a more granular—and more accurate—alternative to a tried-and-tested predictor: high school grade point average.
Based on an analysis of the model’s predictive accuracy, we estimate that the university can use the model to boost the number of high-achieving students admitted by 5 to 10 percent. Additional data sources and future projects could be considered to further improve the model’s predictive accuracy and/or build analogous models to support other types of decisions. But for the purpose of this discussion, the major point is that the university has the means to improve key admissions decisions using a transparent, interpretable model constructed from an uncontroversial data source using common sense, standard statistical methodology, and a dash of inspired creativity.
By now there are hundreds of examples, structurally similar to our case study, in which analytics involving the most traditional of data sources outperform traditional modes of decision-making. It is perhaps surprising that, while such examples have appeared in the business press for hardly a decade, they have been known in the academic psychology community for 60 years. Furthermore, they are explained by advances in the behavioral sciences from the past 30 years. And this explanation has nothing to do with big data.9
Consider a few other examples:
Each case (as well as any number of analogous cases) involves “sorting” or “prioritization” decisions that (a) are central to an organization’s operations (medical triage, student retention, hiring); (b) are made repeatedly, typically by experts relying on professional judgment in varying degrees; and (c) incorporate quantifiable information that is readily available, yet commonly used only in informal or limited ways.
Furthermore, it turns out that in each case a fairly simple predictive scoring equation can be counted on to outperform unaided professional judgment.
The finding in fact dates back to the 1954 publication of the psychologist Paul Meehl’s book Clinical Versus Statistical Prediction. Meehl’s “disturbing little book,” as he later called it, documented 20 studies comparing the predictions of human experts with those of simple models. The types of predictions ranged from how well schizophrenic patients would respond to electroshock to how well prisoners would respond to parole. Meehl concluded that in none of the 20 cases could human experts outperform the models.
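To make “simple model” concrete, here is a minimal sketch of one very plain scoring equation: each predictor is standardized, given a sign, and summed with equal weights. This is an illustrative technique chosen for its simplicity, not a reconstruction of the models in the studies Meehl reviewed; the case names, the two predictors, and their signs are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical cases with two readily available predictors each.
# Names, values, and the equal weights are assumptions for illustration.
cases = {
    "case_1": {"prior_incidents": 0, "years_stable": 6},
    "case_2": {"prior_incidents": 3, "years_stable": 1},
    "case_3": {"prior_incidents": 1, "years_stable": 4},
    "case_4": {"prior_incidents": 4, "years_stable": 0},
}

# +1 means higher values raise the risk score; -1 means they lower it.
signs = {"prior_incidents": +1, "years_stable": -1}


def zscores(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def score_cases(cases, signs):
    """Unit-weighted score: signed sum of standardized predictors."""
    names = list(cases)
    scores = dict.fromkeys(names, 0.0)
    for predictor, sign in signs.items():
        z = zscores([cases[n][predictor] for n in names])
        for name, value in zip(names, z):
            scores[name] += sign * value
    return scores


scores = score_cases(cases, signs)
ranking = sorted(scores, key=scores.get, reverse=True)  # highest risk first
```

The resulting `ranking` is the kind of prioritized list that a sorting decision needs. In practice the weights would usually be estimated from historical outcomes rather than fixed at one, but the equation keeps the same additive form.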
If this reminds the reader of Michael Lewis’ Moneyball, it is for a very good reason. Lewis’ book recounted the story of a cash-strapped baseball team that out of necessity began to analyze, and act upon, readily available data sources when making scouting decisions. Because the scouting industry was largely judgment-driven at the time, the market for talent was, literally speaking, inefficient: The “price” (salary) of the “asset” (players) simply did not reflect important publicly available information. Because of this market inefficiency, “better management was able to run circles around taller piles of cash.”10 In a recent Vanity Fair profile of Daniel Kahneman, Lewis reported that while writing his book, he was unaware that Meehl’s findings and subsequent findings in behavioral economics (see the sidebar “Go ask Linda”) explained the market inefficiency he had “stumbled upon.”11
Near the end of his career, surveying the field he initiated three decades earlier, Meehl wrote:
There is no controversy in social science which shows such a large body of quantitatively diverse studies coming out so uniformly in the same direction as this one. When you are pushing over 100 investigations, predicting everything from the outcome of football games to the diagnosis of liver disease, and when you can hardly come up with half a dozen studies showing even a weak tendency in favor of the clinician, it is time to draw a practical conclusion.
It is hard to overstate the importance of Meehl’s “practical conclusion” in an age of cheap computing power and open-source statistical analysis software. Decision-making is central to all aspects of business, public administration, medicine, and education. Meehl’s lesson—routinely echoed in case studies ranging from baseball scouting to evidence-based medicine to university admissions—is that in virtually any domain, statistical analysis can be used to drive better expert decisions. The reason has nothing to do with data volume and everything to do with human psychology.
In Thinking, Fast and Slow, the Nobel Prize-winning founder of behavioral economics Daniel Kahneman wrote that during his student days, Paul Meehl was one of his heroes.12 So it’s perhaps no coincidence that the subsequent work of Kahneman and his collaborators has done much to clarify both our understanding of Meehl’s “disturbing” findings and the widespread applicability of business analytics.
Kahneman writes of two fictitious mental processes that he calls System 1 (“thinking fast”) and System 2 (“thinking slow”). System 1 mental operations are rapid and automatic; they are biased toward belief and confirmation rather than analysis and skepticism; they tend to jump to conclusions and infer causal relations based on thin, “cognitively available” evidence. They tend to neglect the importance of evidence that is neither emotionally vivid nor in plain sight. In contrast, System 2 mental operations are slow, deliberate, and seek logical coherence rather than “narrative” or “associative” coherence.
The bulk of our mental operations are System 1 in nature. And the rub is that System 1 thinking turns out to be terrible at statistics. Without time, effort, and either tools or special training, the human mind will reliably make novice statistical errors. Surprisingly, this often applies to trained mathematicians and laypeople alike.
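One such novice error is to judge a conjunction of two events as more probable than one of the events on its own. The sketch below checks the underlying rule by simple counting in a simulated population. The two traits, labeled only `A` and `B`, and their frequencies are arbitrary assumptions; nothing here is drawn from Kahneman's experiments.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: each person has two yes/no traits, A and B.
# The trait frequencies (5% and 60%) are arbitrary assumptions.
people = [
    {"A": random.random() < 0.05, "B": random.random() < 0.60}
    for _ in range(10_000)
]

n_a = sum(p["A"] for p in people)
n_a_and_b = sum(p["A"] and p["B"] for p in people)

share_a = n_a / len(people)
share_a_and_b = n_a_and_b / len(people)

# Everyone counted in "A and B" is also counted in "A", so the
# conjunction can never be the more probable of the two.
assert share_a_and_b <= share_a
```

The inequality holds for any seed and any trait frequencies, because it follows from counting alone: adding a second condition can only shrink the group, however plausible the second condition makes the story sound.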
So far are we from being natural statistical thinkers that Kahneman calls the human mind “a machine for jumping to conclusions.” This central theme of behavioral economics is famously illustrated with “the Linda story.” A fictional character named Linda is described as a highly intelligent political activist. Now don’t think, blink: Is it more likely that Linda is a bank teller, or a feminist who happens to work as a bank teller? Most people answer the latter even though a moment’s thought reveals that this can’t possibly be right.13 Narrative coherence trumps logical coherence in a surprising way.
Predictive models, while fast in a literal sense, are “slow thinkers” par excellence. They can accurately weigh together 5, 50, or 5,000 pieces of information with equal ease; they never suffer from low blood sugar; and they are immune to cognitive biases and narrative fallacies. Perhaps in hindsight, Paul Meehl’s disturbing finding isn’t so surprising after all.
Hopefully the view of business analytics outlined so far lets a bit of air out of the big data bubble. A timely implication of the decades-old work of Paul Meehl, Daniel Kahneman, and their followers is that analytics projects need not be predicated on big data (in the “3V” sense of the term) to yield economic value and even transform industries. Even in cases where only traditional data sources are brought to the table, predictive models and analytically derived business decision rules provide value by warding off inefficient or biased decisions. In such applications, models have a prosthetic character: They serve as “eyeglasses” for myopic human minds.
But none of this bursts the big data bubble entirely. Once again the realm of “people analytics” applied to professional sports provides a bellwether example. Sports analytics has rapidly evolved in the decade since Moneyball appeared. For example, the National Basketball Association employs player tracking software that feeds real time data into proprietary software so that the data can be analyzed to assess player and team performance.14 Returning to William Gibson’s image, professional sports analytics is a domain where “the future is already here.”
Given the time and expense involved in gathering and using big data, it pays to ask when, why, and how big data yields commensurately big value. Discussions of the issue typically focus on various aspects of size or the questionable premise that big data means analyzing entire populations (“N=all” as one slogan has it), rather than mere samples. In reality, data volume, variety, and velocity are only a few of many considerations. The paramount issue is gathering the right data that carries the most useful information for the problem at hand.
In the context of predicting or analyzing human behavior, the relevant aspect is the behavioral content of emerging data sources. Anyone who has worked with large volumes of behavioral data knows that past behavior often does predict future behavior, and often in surprising ways. For example, personal credit information not only predicts who is likely to default on a loan; it is also strongly predictive of who is more or less likely to experience an auto accident. Marketing and lifestyle data can be used to predict not only future purchase behavior, but also the presence of such lifestyle diseases as diabetes and hypertension.
The computational social scientist Alex “Sandy” Pentland forcefully articulates this point:
I believe that the power of big data is that it is information about people's behavior instead of information about their beliefs. It's about the behavior of customers, employees, and prospects for your new business. It's not about the things you post on Facebook, and it's not about your searches on Google, which is what most people think about, and it's not data from internal company processes and RFIDs. This sort of big data comes from things like location data off of your cell phone or credit card: It's the little data breadcrumbs that you leave behind you as you move around in the world.
What those breadcrumbs tell is the story of your life... Who you actually are is determined by where you spend time, and which things you buy. Big data is increasingly about real behavior, and by analyzing this sort of data, scientists can tell an enormous amount about you. They can tell whether you are the sort of person who will pay back loans. They can tell you if you're likely to get diabetes.15
A recent study conducted at the University of Cambridge Psychometrics Centre dramatically illustrates the power of such “digital breadcrumbs.” The researchers focused on the social network “likes” (positive attitudes about various pieces of online content) of a sample of 58,000 users. They found that using only this information, they were able to predict ethnic origin with 95 percent accuracy; male sexual orientation with 88 percent accuracy; political leanings (Democrat or Republican) with 85 percent accuracy; religion (Christian or Muslim) with 82 percent accuracy; and so on. The researchers also found weaker, but still significant, correlations between this information and such latent psychological traits as intelligence, openness, extraversion, and emotional stability. For example, the researchers found that the information gleaned from social network “likes” is nearly as informative as a personality test score measuring an individual’s openness to change.16
Such results raise major ethical and privacy issues that are far from being resolved. Many organizations will want to avoid such sources of data altogether. Still, such data are already changing business and societal landscapes. Furthermore, it is useful to consider whether or how such information could be used in innovative and societally useful, as opposed to invasive, ways.17
Human resources is one promising domain for behavioral data-inspired innovation. Important aspects of the value that individuals bring to their organizations (healthy habits of informal group engagement, communication, and team participation) are currently measured inconsistently or approximately at best. As a result, their contributions to organizational success are often understood only murkily and rewarded inconsistently. For example, the leaders of Google’s “Project Oxygen” leadership analytics study were surprised to find that technical ability ranked least important on the list of eight attributes they found characteristic of effective managers.18 Less quantifiable attributes such as being results-oriented and caring for the career development of team members were found to be more important than the technical abilities initially assumed to be most important for technical managers. Such findings call to mind Woody Allen’s quip, “80 percent of success is showing up.”
Pentland’s own work illustrates the emerging possibilities for creating digital proxies of traditionally unquantifiable traits. Pentland has developed a device known as the “sociometer,” a wearable electronic badge that captures second-by-second information about people’s tones of voice, body language, and communication patterns. He calls the sort of “digital breadcrumbs” collected by such devices “honest signals” because, unlike survey responses or social media posts, they are not consciously edited. Sociometric data captures aspects of non-verbal communication and social network relationships that can be surprisingly predictive.
For example, sociometric data are predictive of dating behavior and the outcomes of job interviews and salary negotiations.19 They also shine a light on the dynamics of effective teams. Pentland reports being able to predict which team will win a business plan contest using only sociometric data captured about the interactions of the team members at a cocktail reception. Analysis of sociometric data suggests the recipe for winning teams’ success: Successful teams are characterized by people talking and listening in equal measure, emanating helpful body language, speaking directly with one another rather than through a domineering leader, and so on.20
Indeed, it turns out that there is a measurable concept of the “collective intelligence” of groups, highly analogous to individual IQ, which can be partially characterized through the use of sociometric data. Anita Woolley of Carnegie Mellon University and her collaborators constructed a measure of collective intelligence and found that it is roughly as predictive of group performance as IQ is of individual performance. Surprisingly, collective intelligence is not explained by factors such as group satisfaction, cohesion, or motivation. Instead, the strongest predictors of collective intelligence (and of group success) are equality of conversational turn-taking (measured using sociometric data) as well as the ability of the group’s members to read social signals (measured using more traditional psychometric data).21 It is likely that these traits contribute to group performance by enabling better flows of ideas.22
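As a rough illustration of how “equality of conversational turn-taking” might be turned into a number, the sketch below scores a log of speaking turns by its normalized Shannon entropy. This particular measure is an assumption made for illustration, not the one used in the studies cited above, and the speaker labels and turn logs are invented.

```python
from collections import Counter
from math import log


def turn_taking_evenness(speaker_turns):
    """Normalized Shannon entropy of speaking turns.

    Returns 1.0 when every speaker takes the same number of turns and
    values near 0.0 when one speaker dominates.
    """
    counts = Counter(speaker_turns)
    if len(counts) < 2:
        return 0.0  # a single speaker means no turn-taking to balance
    total = sum(counts.values())
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return entropy / log(len(counts))


# Hypothetical turn logs: one speaker label per conversational turn.
balanced_team = ["ann", "bo", "cy", "ann", "bo", "cy"]
dominated_team = ["ann", "ann", "ann", "ann", "ann", "bo", "ann", "cy"]

balanced_score = turn_taking_evenness(balanced_team)
dominated_score = turn_taking_evenness(dominated_team)
```

Here `balanced_score` comes out at the maximum of 1 and `dominated_score` falls well below it. One limitation of this sketch is that it only sees people who speak at least once, so a team member who stays silent throughout would need to be added to the counts explicitly.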
Because they have traditionally been hard to measure quantitatively, behavioral traits such as openness and social intelligence are often viewed as ephemeral or unreliable. However, the steadily increasing availability of computational social science tools and methods suggests the practical possibility of harnessing behavioral data to create more effective teams and systematically reward beneficial behaviors and personality traits that are currently recognized only sporadically.
It is, to say the least, unclear whether real-time monitoring devices will gain widespread acceptance in the business world or society at large. Still, between the scenario of equipping all employees with sociometric devices and the opposite extreme of basing human resource decisions on limited, judgmentally interpreted data, many possibilities can be explored. For example, Salesforce.com is taking steps to hire and cultivate employees based partly on their social intelligence. The data they gather to measure this psychological trait are gleaned from such methods as team-structured workshop-style interview days, personality modeling exercises, and participation on the organization’s online social collaboration tool. The data are gathered transparently and shared with job candidates.23 Similarly, the measure of social sensitivity that Woolley and her collaborators used (alongside sociometric data) to predict collective intelligence is a psychometric test that can be voluntarily taken on-line in approximately 10 minutes.24
We have told a two-part story to counter the two dogmas of big data. The first half of the story is the more straightforward: In domains ranging from the admissions office to the emergency room to the baseball diamond, measurably improved decisions will, more often than not, result from a disciplined, analytically driven use of uncontroversial, currently available data sources. While more data often enables better predictions, it is not necessary for organizations to master “big data” in order to realize near-term economic benefits. Behavioral science teaches us that this has as much to do with the idiosyncrasies of human cognition as with the power of data and statistics.
The second half of the story relates to emerging sources of behavioral data, and therefore has an ending yet to be written. Clearly the capture and use of data emanating from and pertaining to people’s behaviors is rife with ethical issues that must be faced. At the same time, if the required privacy safeguards can be established, one can envision opt-in uses of behavioral data that serve everyone’s interests.
For example, data from massive open on-line courses (MOOCs) can be used to design better courses of study. Richer behavioral data sources can be used in human resources contexts to select and reward such skills as teamwork and social intelligence in addition to more readily measured technical abilities. Behavioral data could be used to quantify physician “bedside manner” and to improve patient satisfaction and reduce the frequency of malpractice claims. Telematics data from automobiles can be used to help older drivers stay behind the wheel longer; supermarket club-card data can be used to provide early warnings of lifestyle disease risks; and self-tracking data can be used to help people maintain their health.
A sign hanging on Albert Einstein’s door in Princeton’s Institute for Advanced Study read, “Not everything that can be counted counts, and not everything that counts can be counted.” While Einstein’s motto is timeless, emerging behavioral data sources and computational social sciences methods are expanding the domain of what we can count.