How Analysis goes wrong is a new weekly series focused on evaluating common forms of business analysis. All evaluation of the analysis is done with one goal in mind: Does the analysis present a solid case that spending resources in the manner recommended will generate more revenue than any other action the company could take with the same resources? The goal here is not to knock down analytics; it is to help highlight those who are unknowingly damaging the credibility of the rational use of data. What you don’t do is often more important than what you do choose to do. All names and figures have been altered where appropriate to mask the “guilt”.
There are many different ways to abuse statistics to pull meaning out of data that it really doesn’t have. I often find myself quoting George Box, “All models are wrong, but some are useful.” Since “big data” seems to be driving people to more and more “advanced” ways of slicing data, I wanted to start looking at some of the most common ways people misunderstand statistics, or at least misunderstand what they can do with the information that various modeling techniques produce.
The first technique I want to tackle is clustering, most commonly done with K-means. This type of modeling divides users into groups that share common characteristics. The analysis looks at various dimensions of users who end up doing some task, usually measured by purchases or total revenue, and then statistically derives the characteristics that differentiate one group from another. This is similar to decision tree methods, but tends to be much more open about how many groups are created and which dimensions are used.
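For readers who have not seen the mechanics, here is a minimal sketch of K-means in plain Python. It is an illustration, not the method any particular tool uses: the user records are invented, the feature names (revenue, Facebook clicks, internal searches) are assumptions borrowed from the example in this post, and the starting centroids are picked with a simple farthest-point rule so the result is reproducible. The features are also left unscaled, so revenue dominates the distance; a real analysis would normally standardize them first.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=50):
    """Tiny K-means sketch. Returns (centroids, labels)."""
    # Assumption: deterministic farthest-point initialization, chosen only
    # to keep this example reproducible.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each user joins the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # nothing moved, so we have converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, labels

# Hypothetical converters: (revenue, facebook_clicks, internal_searches).
users = [
    (20, 0, 1), (25, 1, 0), (22, 0, 0),
    (60, 2, 2), (65, 1, 3), (58, 2, 2),
    (150, 6, 8), (160, 7, 9), (155, 5, 7),
]
centroids, labels = kmeans(users, k=3)
for centroid in sorted(centroids):
    print(tuple(round(v, 2) for v in centroid))
```

Note what the output is: an average profile for each group. Nothing in the algorithm knows why the users in a group behave the way they do, which is the root of the problems discussed below.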
A typical use of this data would be:
Analysis: Looking at our data set of converters, we were able to build 3 different personas of our users: low value, medium value, and high value. Based on the fact that high value users interact with Facebook campaigns and internal search far more than others, we are going to spend resources to improve our Facebook campaigns to get more people there, and look to make internal search more prominent for users in the other 2 persona groups.
Before we dive in, I feel it is necessary to remind everyone that the point here is not that optimizing Facebook campaigns or internal search is right or wrong, only that the data presented in no way leads to that conclusion.
1) Fundamentally, one of the problems with clustering is that it only tells you the common traits of people in a group. If you push people who do not normally interact with your Facebook campaigns into doing so, you have no way of knowing whether they will behave anything like those who currently do. Most likely they won’t, and you have no way of knowing whether the change will generate any revenue at all.
2) There is a major graveyard effect (a form of survivorship bias) going on. Looking only at those who convert and searching for differences among them ignores the total population, and with it the differences between those who convert and those who don’t. There is a pretty good chance that people who don’t convert also use internal search.
3) Even if you assume that everything is correct and that the two areas are highly valuable, you still don’t have a read on what to change, what it would cost to do so, or even how much ability you have to influence anything about that group. You still have no way of saying that this is more valuable than any other action that you could take (including not doing anything).
4) It assumes that because those are the defining areas, they are also the place to interact with people. It is just as likely that, for example, getting people to sign up for email also gets them to look at content on Facebook.
5) Personas as a whole can be very dangerous, since they create tunnel vision and can lead groups to stop looking at the exploitability of other dimensions of the same population.
6) At the highest level, it confuses statistical confidence with colloquial confidence. The statistics tell you that these characteristics differ enough to separate the groups reliably. They in no way tell you that these differences are important to your business, or how to change your users’ behavior.
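Point 2 is easy to check once non-converters are kept in the data. The following is a minimal Python sketch with visitor counts invented purely for illustration; the field layout (used internal search, converted) is an assumption, not something taken from a real data set. It compares what a converters-only view shows against what the full population shows.

```python
# Hypothetical visitor log: (used_internal_search, converted).
# All counts are made up for illustration.
visitors = (
    [(True, True)] * 30        # searched and converted
    + [(False, True)] * 10     # did not search, converted
    + [(True, False)] * 720    # searched, did not convert
    + [(False, False)] * 240   # did neither
)

def share(rows, predicate):
    """Fraction of rows for which predicate(row) is true."""
    return sum(1 for row in rows if predicate(row)) / len(rows)

converters = [v for v in visitors if v[1]]
non_converters = [v for v in visitors if not v[1]]

# The converters-only view: internal search looks like a defining trait.
search_share_converters = share(converters, lambda v: v[0])

# The view the analysis skipped: non-converters search just as often.
search_share_non_converters = share(non_converters, lambda v: v[0])

# Conversion rate with and without internal search.
conv_rate_searchers = share([v for v in visitors if v[0]], lambda v: v[1])
conv_rate_non_searchers = share([v for v in visitors if not v[0]], lambda v: v[1])

print(f"Converters using search:     {search_share_converters:.0%}")
print(f"Non-converters using search: {search_share_non_converters:.0%}")
print(f"Conversion rate, searchers:     {conv_rate_searchers:.1%}")
print(f"Conversion rate, non-searchers: {conv_rate_non_searchers:.1%}")
```

In this made-up data, 75% of converters use internal search, which sounds like a finding, but 75% of non-converters do too, and the conversion rate is 4% either way. Real data will rarely be this tidy, but the comparison is the one that has to be made before internal search can be called a differentiator.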
I am actually a huge fan of statistics and data modeling, but only as a way to think about populations. I get very worried when groups follow the results blindly, or do not actively interrogate the data for important information like cost and influence. If you have an analysis like this and you know the exploitability of the different dimensions, then you can take another step and look at the size of each population and your ability to change it, based on what you know of the exploitability of its defining characteristics. If you do not have that, then the data is interesting but ultimately almost useless.
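As a rough sketch of what that extra step might look like, the Python below ranks candidate actions by size, influence, and cost. Everything in it is an assumption made for illustration: the action names, the reach figures, the revenue lift per user (which in practice would have to come from testing, not from the clustering), the costs, and the simple scoring formula itself.

```python
# Hypothetical candidate actions. Every number here is invented.
# Fields: name, users reachable, expected revenue lift per user, cost.
actions = [
    ("Improve Facebook campaigns", 2_000, 0.50, 1_500),
    ("Promote internal search", 20_000, 0.25, 800),
    ("Do nothing", 0, 0.0, 0),
]

def expected_net_gain(reach, lift_per_user, cost):
    """Assumed scoring rule: size times influence, minus cost."""
    return reach * lift_per_user - cost

# Rank actions from best to worst under the assumed scoring rule.
ranked = sorted(actions, key=lambda a: expected_net_gain(*a[1:]), reverse=True)

for name, reach, lift, cost in ranked:
    print(f"{name}: {expected_net_gain(reach, lift, cost):.0f}")
```

With these invented inputs, one action comes out worse than doing nothing at all. The specific numbers mean nothing; the point is that the ranking depends entirely on reach, lift, and cost, none of which the cluster analysis supplies.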