The study of history teaches you many key lessons, such as the unreliability of the first-person narrative, the inability to understand the scope of something while it is happening, and, most importantly, that history is written by the victors. Howard Zinn made a living out of showing people just how true this is and how little we understand our own world because of it. The same mistake is made not just in historical analysis but all the time in data analysis, when we look only at the attributes of the “winning” side and forget to analyze the non-winning side of the data. While closely related to the Halo Effect, this effect shows itself in what people like deGrasse Tyson and Taleb refer to as the graveyard of knowledge.
Neil deGrasse Tyson best explains the graveyard of knowledge with one of his stories:
You read a study that says that 80% of people who survived a plane crash studied the exit routes before the plane took off. Comfortable with this knowledge, the next time you board a plane, you quickly study the exit routes. As you do this, you start to analyze that data and come to a sudden realization: what if 100% of the people who did not survive the crash also studied the exit routes?
We don’t know the other side of the story, because there is no one there to report it. We only know the people who came back and what they can tell us; we have no clue what was going on with those who did not come back. Knowledge is lost all the time when people only look at the winners. Winners are the only ones left to tell their stories, so we only look to them for details. The reality is that the important details are rarely only on the winning side, and all the people who never returned hold knowledge that is just as vital. We lose all of that really important information in the graveyard of the people who never returned. In fact, we can only start to understand anything if we have both pieces, because only then do we have context for our information. We focus so much on those who “survived” that we completely ignore the context of the people who did not. We look only at the behaviors we want and then extract qualities about that group of people, without looking at the population as a whole or, more importantly, at what would have happened if we had done nothing.
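To make the plane example concrete, here is a minimal Python sketch. Only the 80% figure comes from the story above. The crash size, the number of survivors, and the share of non-survivors who studied the exits are invented purely for illustration, and the function name `survival_rates` is hypothetical.

```python
def survival_rates(survivors, victims, studied_among_survivors, studied_among_victims):
    """Return (P(survive | studied exits), P(survive | did not study exits))."""
    studied_alive = survivors * studied_among_survivors
    studied_dead = victims * studied_among_victims
    ignored_alive = survivors - studied_alive
    ignored_dead = victims - studied_dead

    def rate(alive, dead):
        total = alive + dead
        return alive / total if total else float("nan")

    return rate(studied_alive, studied_dead), rate(ignored_alive, ignored_dead)


# Hypothetical crash: 100 survivors and 100 people who did not survive.
# 80% of survivors studied the exits (the only number the study reports).
# The values below are guesses about the graveyard, not data.
SCENARIOS = {
    "graveyard studied exits at 100%": 1.0,
    "graveyard studied exits at 80%": 0.8,
    "graveyard studied exits at 20%": 0.2,
}

results = {}
for label, victim_rate in SCENARIOS.items():
    results[label] = survival_rates(100, 100, 0.8, victim_rate)
    studied, ignored = results[label]
    print(f"{label}: survive if studied = {studied:.2f}, if not = {ignored:.2f}")
```

Under these made-up numbers, the same "80% of survivors studied the exits" is consistent with studying the exits hurting your odds, doing nothing at all, or helping a great deal. Which one is true depends entirely on the group nobody interviewed.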
We group people by how their behavior ended, not by how they were defined beforehand. We love to know the characteristics of people who made a purchase, or of people who come to our site more than 3 times. We look backwards from that winning behavior because that is all we think we have available to us. We love to describe past behavior through correlated behavior, and then attribute “value” to those actions. People who purchase use internal search 2 times on average, therefore internal search must be the cause of the purchase. People who come from social sources spend $4.56 on average, therefore social is worth $4.56. We don’t know what would have happened if the same person hadn’t used search or come from social: would they have spent more or less? All of these types of analysis attribute end value to past behavior, missing the point that we don’t know what those people would have done otherwise. Is looking at the exit routes helping or hurting your ability to survive? We also don’t know whether more is better; instead we assume a linear relationship. If campaign X is generating value Y, then we assume that doubling spend on X will of course generate 2Y.
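One way to see the gap between "people from social spend $4.56" and "social is worth $4.56" is to compare against a group that never got the treatment. The sketch below is only an illustration and assumes you could randomly hold the social campaign back from part of your audience. The $4.56 figure is from the text; the hold-out averages and the function name `incremental_value` are hypothetical.

```python
def incremental_value(avg_spend_exposed, avg_spend_holdout):
    """Spend attributable to the action: the exposed group's average minus
    what a comparable held-out group spent without it."""
    return avg_spend_exposed - avg_spend_holdout


OBSERVED_SOCIAL_SPEND = 4.56  # from the text: average spend of social visitors

# Hypothetical hold-out averages; none of these are real measurements.
holdout_scenarios = {
    "holdout spends nothing": 0.00,
    "holdout spends almost as much": 4.50,
    "holdout spends more": 5.10,
}

lifts = {
    label: incremental_value(OBSERVED_SOCIAL_SPEND, holdout)
    for label, holdout in holdout_scenarios.items()
}

for label, lift in lifts.items():
    print(f"{label}: incremental value per visitor = ${lift:+.2f}")
```

The observed $4.56 is identical in every scenario. Whether social is worth the full amount, a few cents, or is actually costing you money depends on the number you never collected.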
Looking only at the data from one group, or only at the data that defines the “winners,” means that you have completely lost any value from that data. We cannot express how much better or worse an action made things, only that we had X amount of search spend and ended up with Y revenue. Even worse, pretending that you can derive cause and effect without the larger context means that you are not getting value from the actual data itself, but instead propagating your own world view and using the data only to support it. As in the Texas sharpshooter fallacy, you are creating a story to explain what is most likely random noise in the data. Rates of action, such as 80% of people looking at the exit routes, tell you nothing unless you know both that increasing that number increases your ability to survive, and what it costs, and whether it is even possible, to influence people to take that action. I can tell you that 100% of people who are determined to spend $1000 on your site will spend at least $1000, but that doesn’t tell me how to get those people in the first place, or whether it is worth my time to spend resources on that small population as opposed to the multitude of other alternatives.
People make this mistake all the time in the world of data analysis when they get caught up on a set path or on looking backwards from an event. They want to know what all the people who purchased did, or what all the people who come to your site 4 times have in common. There is even a whole world of statistical analysis, focused on clustering and personas and making a large push in our industry, that feeds this tendency. The mistake is that looking at only a small part of your population fails to give you the context of that information. As with the plane, knowing the attributes of one group doesn’t tell you the attributes of the population as a whole. Even worse, it assumes that those attributes have anything to do with that behavior. We have no way of knowing whether people who survived just happened to look at the exit routes, or whether people who look at exit routes are more likely to survive.
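A quick check before reading anything into "what purchasers have in common" is to compare the attribute's share among purchasers with its share in the whole population. This is a minimal sketch on invented visitor records; the counts and the helper name `share_with_attribute` are hypothetical.

```python
# Hypothetical visitor records: (used_internal_search, purchased)
visitors = (
    [(True, True)] * 8        # searched and purchased
    + [(False, True)] * 2     # purchased without searching
    + [(True, False)] * 72    # searched, did not purchase
    + [(False, False)] * 18   # neither
)


def share_with_attribute(rows):
    """Fraction of the given rows where the visitor used internal search."""
    return sum(1 for used_search, _ in rows if used_search) / len(rows)


purchasers = [row for row in visitors if row[1]]

share_among_purchasers = share_with_attribute(purchasers)
share_among_everyone = share_with_attribute(visitors)
lift = share_among_purchasers / share_among_everyone

print(f"purchasers who searched: {share_among_purchasers:.0%}")
print(f"all visitors who searched: {share_among_everyone:.0%}")
print(f"lift: {lift:.2f}")
```

In this invented data set, "80% of purchasers used search" sounds meaningful until you see that 80% of everyone used search. Even when the two shares differ, the comparison only shows a correlation; it still says nothing about whether pushing more people to search would change what they buy.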
In the world of testing, this bias shows up in people who want to know the actions between steps. They want to know whether the people who purchased went to a product page or to a search results page. They want to know what path people took or what they clicked on. Even if this knowledge were not ignoring the graveyard of knowledge, what would it tell you? More people went to the search results page; is that a good thing or a bad thing? You are accomplishing nothing with this data except adding cost and slowing down your ability to make the correct decision. It is easy to get lost in the world of data if you are trying to tell a story or if you want to confirm a preconceived point, but as soon as you are trying to use the data to find an answer and not just support your point of view, the discipline of choosing what you look at and knowing what it can tell you becomes paramount.
So the question is, 40 years from now, will all the analysis you do be part of the “winning” group, or will it be lost in the graveyard? Stop pretending that data tells you more than it really does and stop looking only at the winning side, and you will be able to derive orders of magnitude more value from your data. The discipline of looking at the whole context and of discovering the value of actions is what will grant you results, not just finding stories. Remember that patterns are only patterns; they are neither good nor bad. It is incredibly easy to forget that even if they are perfect, they tell you nothing about your ability to change them, or the cost of doing so. Data can be the most powerful tool in your arsenal, but it can also be abused to no end, providing negative value and a blanket justification for poor decisions.