Category: Statistics

Why we do what we do: When the Sum is Less than the Parts – Simpson’s Paradox

Some of the greatest mistakes people make come from having complete faith in numbers, or in their own ability to use them to get a desired result. While there are already a great many biases and logical fallacies built into human cognition, sometimes factors in the real world conspire to make it even more difficult to act in a meaningful and positive way. One of the more interesting phenomena in the world of data is the statistical bias known as Simpson’s Paradox. Simpson’s Paradox is a great reminder that a single look at data can often lead to a very wrong conclusion. Even worse, it can allow for claims of success for actions that are actually negative in the context of the real world.

Simpson’s Paradox is a pretty straightforward bias: a correlation is present in two different groups individually, but when the groups are combined, the exact opposite effect appears.

Here is a real world example:

We have run an analysis showing that a variation on the site produces a distinct winner for both our organic and our paid traffic.

But when we combine the two, the exact inverse pattern plays out. Version A won by a large margin for both organic and paid traffic, but combined it dramatically underperforms B.
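Since the reversal lives entirely in the underlying counts, here is a minimal sketch with hypothetical conversion numbers (not real site data) that reproduces it: A wins in each segment, yet loses badly once the segments are pooled, because most of A’s traffic sits in the low-converting segment.

```python
# Hypothetical conversion counts illustrating Simpson's Paradox (not real data).
data = {
    "Organic": {"A": (50, 100),     "B": (4500, 10000)},  # (conversions, visitors)
    "Paid":    {"A": (1000, 10000), "B": (8, 100)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for segment, versions in data.items():
    for version, (conv, visits) in versions.items():
        print(f"{segment:8s} {version}: {conv / visits:.1%}")
        totals[version][0] += conv
        totals[version][1] += visits

for version, (conv, visits) in totals.items():
    print(f"Combined {version}: {conv / visits:.1%}")
```

Within each segment A converts better, but combined B looks far stronger, purely because of how the traffic is distributed between the segments.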

This seems counterintuitive, but it plays out in many real world situations. You may also find the inverse pattern: one where you see no difference in the distinct groups, but combined you see a meaningful difference.

In both cases, we would logically presume that A was better than B, and it is not until we add the larger context that we understand the true value.

While this is a trick of numbers, it presents itself far more often than you might expect, especially as groups dive into segmentation and personalization. The more people leap directly into personalization with vigor, the more they leave themselves open to biases like Simpson’s Paradox. Teams get so excited when they are able to create a targeted message, and so desperate to show its value and prove their “mettle,” that they don’t take the time to evaluate things on a holistic scale. Even worse, they don’t compare it with other segments or account for the cost of maintaining the system. They are so excited by their ability to present “relevant” content to a group that they think needs it that they fail to measure whether it adds value or whether it is the best option. Worse still, they then go around telling the world about their great finding while causing massive harm to the site as a whole.

One of the key rules to understand is that the deeper you keep diving to find something “useful,” either from analytics or from causal feedback after the fact, the more likely this is to play out. You can use numbers to reach any conclusion with creative enough “discovery.” If you keep diving, if you keep parsing, you are exponentially increasing the chances that you will arrive at a false or misleading conclusion. Deciding how you are going to use data after the fact will always lead to biased results. It is easy to prove a point whenever you forget the context of the information or lose the discipline of using it to find the best answer.

So how do you combat this? The fundamental way is to make sure that you are taking everything to the highest common denominator. Here is a really easy process if you are not sure how to proceed:

1) Decide what and how you are going to use your data BEFORE you act.

2) Test out the content – Serve it randomly to all groups. Even if you design the content specifically for one group, test it with everyone. If you are right, the data will tell you.

3) Measure the impact of every type of interaction against the same common denominator. Convert everything to the same fiscal scale, and use that to evaluate alternatives against each other. Converting to the same scale ensures that you know the actual value of the change, not just the impact on specific segments (see the sketch after this list).

4) Further modify your scale to account for the maintenance cost of serving to that group. If it takes a whole new data system, two APIs, cookie interaction, and IT support to target that group, then you have to get a massively higher return than from a group you can target in a few seconds.
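As a rough sketch of steps 3 and 4, here is what putting everything on the same fiscal scale and netting out maintenance cost can look like; all the revenue and cost figures are hypothetical placeholders, not benchmarks.

```python
# Put each option on the same fiscal scale, then subtract what it costs to
# build and maintain the targeting required to serve it.
options = [
    # (name,                      annual_revenue_lift, annual_maintenance_cost)
    ("Serve winner to everyone",            120_000,        1_000),
    ("Target returning visitors",           150_000,       90_000),
    ("Target one niche segment",             40_000,       55_000),
]

for name, lift, cost in options:
    print(f"{name:30s} net annual value: {lift - cost:>10,}")

best = max(options, key=lambda o: o[1] - o[2])
print("Best option on a common scale:", best[0])
```

The segment with the most impressive lift in isolation is not automatically the most valuable option once everything sits on the same scale.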

What you will discover as you go down this path is that you are often wrong, in some cases dramatically so, about the value of targeting a preconceived group. You will find not only that many of the groups you think are valuable are not, but also that many groups you would never normally consider turn out to be highly valuable (especially in terms of efficiency). If you do this with discipline over time, you will also learn completely new ways to optimize your site: the types of changes that matter, the groups that are actually exploitable, the cost of infrastructure, and the best ways to move forward with real, unbiased data.

As always, it is the moments when you prove yourself wrong that produce dramatic results. Just trying to prove yourself right does nothing but give you the right to make yourself look good.

I always differentiate a dynamic user experience from a “targeted experience.” In the first case, you are following a process, feeding a system, not dictating the outcome, and then measuring the possible outcomes and choosing the most efficient option. In the second, you are deciding that something is good based on conjecture, biases, and internal politics, serving to that group, and then justifying that action. Simpson’s Paradox is just one of many ways that you can go wrong, so I challenge you to evaluate what you are doing. Is it valuable, or are you just claiming it is? Are you looking at the whole picture, or only the parts that support what you are doing? Are you really improving things, or just talking about how great you are at improving things?

Bridging the Gap: Dealing with Variance between Data Systems

One of the problems that never seems to be eliminated from the world of data is education and understanding on the nature of comparing data between systems. When faced with the issue, too many companies find the variance between their different data solutions to be a major sign of a problem with their reporting, but in reality variance between systems is expected. One of the hardest lessons that groups can learn is to focus on the value and the usage of information over the exact measure of the data. This plays itself out now more than ever as more and more groups find themselves with a multitude of tools, all offering reporting and other features about their sites and their users. As more and more users are dealing with the reality of multiple reporting solutions, they are discovering that all the tools report different numbers, be it visits, visitors, conversion rates, or just about anything else. There can be a startling realization that there is no single measure of what you are or what you are doing, and for some groups this can strip them of their faith in their data. This variance problem is nothing new, but if not understood correctly, it can lead to some massive internal confusion and distrust of the data.

I had to learn this lesson the hard way. I worked for a large group of websites that used six different systems for basic analytics reporting alone. I led a team to dive into the different systems, understand why they reported different things, and figure out which one was “right.” After losing months of time and almost losing complete faith in our data, we came away with some important, hard-won lessons. We learned that the use of the data is paramount, that there is no one view or right answer, that variance is almost completely predictable once you learn the systems, and that we would have been far better served spending that time on how to use the data instead of on why the numbers were different.

I want to help your organization avoid the mistakes that we made. The truth is that no matter how deep you go, you will never find all the reasons for the differences. The largest lesson we learned was that an organization can get so caught up in the quest for perfect data that it forgets about the actual value of that data. To make sure you don’t get caught in this trap, I want to help establish when and if you actually have a problem, the most common reasons for variance between systems, and some suggestions on how to think about and use the new data challenge that multiple reporting systems present.

Do you have a problem?

First, we must set some guidelines for when you have a variance problem and when you do not. Systems designed for different purposes will leverage their data in very different ways. No two systems will match, and in a lot of cases being too close is a sign of artificial constraints on the data that actually hinder its usability. At the same time, if the systems are too far apart, that is a sign that there might be a reporting issue with one or both of the solutions.

Here are two simple questions to evaluate if you do have a variance “problem”:

1) What is the variance percentage?

Normal variance between similar data systems is almost always between 15-20%.
For non-similar data systems the range is much larger, and is usually between 35-50%.

If the gap is too small or too large, then you may have a problem. A 2% variance is actually a worse sign than a 28% variance on similar data systems.

Many groups run into the issue of trying too hard to constrain variance. The result is that they put artificial constraints on their data, causing the representative nature of the data to be severely hampered. Just because you believe that variance should be lower does not mean that it really should be or that lower is always a good thing.

This analysis should be done on non-targeted groups of the same population (e.g., all users to a unique page). The variance for dependent tracking (segments) is always going to be higher.

2) Is the variance consistent in a small range?

You may see the variance come in as a series like 13, 17, 20, 14, 16, 21, 12 over a few days, but you should not see 5, 40, 22, 3, 78, 12.
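If you want to operationalize those two checks, here is a minimal sketch using hypothetical daily visit counts from two similar systems; the 15-20% band and the consistency threshold are simply the guidelines above, not universal constants.

```python
# Hypothetical daily visit counts from two similar data systems.
system_a = [10000, 11000, 9800, 10500, 11200, 9900, 10700]
system_b = [8300, 9000, 8200, 8700, 9100, 8400, 8600]

# Question 1: what is the daily variance percentage?
daily_variance = [abs(a - b) / a * 100 for a, b in zip(system_a, system_b)]
print("Daily variance %:", [round(v, 1) for v in daily_variance])

# Question 2: is it consistent and within the expected band for similar systems?
in_band = all(15 <= v <= 20 for v in daily_variance)
consistent = max(daily_variance) - min(daily_variance) <= 10
print("Within 15-20% band:", in_band, "| Consistent day to day:", consistent)
```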

If the variance is within the normal range and consistent day to day, then congratulations: you are dealing with perfectly normal behavior, and I could not more strongly suggest that you spend your time and energy on how best to use the different data.

Data is only as valuable as how you use it, and while we love the idea of one perfect measure of the online world, we have to remember that each system is designed for a purpose, and that making one universal system comes with the cost of losing specialized function and value.

Always keep in mind these two questions when it comes to your data:

1) Do I feel confident that my data accurately reflects my users’ digital behavior?

2) Do I feel that things are tracked in a consistent and actionable fashion?

If you can’t answer those questions with a yes, then variance is not your issue. Variance is the measure of the differences between systems; if you are not confident in a single system, then there is no point in comparing it to another. Equally, if you are comfortable with both systems, then the differences between them should mean very little.

The most important thing I can suggest is that you pick a single data system as the system of record for each action you take. Every system is designed for a different purpose, and with that purpose in mind, each one has advantages and disadvantages. You can definitely look at each system for similar items, but when it comes time to act or report, you need to be consistent and have all concerned parties aligned on which system everyone looks at. Choosing how and why you are going to act before you get to that part of the process is the easiest, fastest way to ensure the reduction of organizational barriers. Getting this agreement is far more important for moving forward than a dive into the causes behind normal variance.

Why do systems always have variance?

For those of you who are still not completely sold or who need to at least have some quick answers for senior management, I want to make sure you are prepared.
Here are the most common reasons for variance between systems:

1) The rules of the system – Visit based systems track things very differently than visitor based systems. They are meant for very different purposes. In most cases, a visit based system is used for incremental daily counting, while a visitor based system is designed to measure action over time.

2) Cookies – Each system has different rules about tracking and storing cookie information over time. These rules will dramatically impact what is or is not tracked. This is even more true for 1st versus 3rd party cookie solutions.

3) Rules of inclusion vs. Rules of exclusion – For the most part, all analytics solutions are rules of exclusion, meaning that you really have to do something (IP filter, data scrubbing, etc.) to not be tracked. A lot of other systems, especially testing, are rules of inclusion, meaning you have to meet very specific criteria to be tracked. This will dramatically impact the populations, and also any tracked metrics from those populations.

4) Definitions – What something means can be very specific to a system, be it a conversion, a segment, a referrer, or even a site action. The very definition can be different. An example of this would be a paid keyword segment. If I land on the site and then see a second page, what is the referrer for that page? Is it attributed to the visit or to the referring page? Is it something I did on an earlier visit?

5) Mechanical Variance – There are mechanical differences in how systems track things. Are you tracking the click of a button with an onclick handler? Or by arrival on the following page? Or by the server request? Do you use a log file system or a beacon system? Is that a unique request or added on to the next page tag? Do you rely on cookies, or are all actions independent? What are the different timing mechanisms for each system? Do they collide with each other or with other site functions?

Every system does things differently, and these smaller differences can build up over time, especially when combined with some of the other reasons listed above. There are hundreds of reasons beyond those listed, and the reality is that each situation is unique and each is the culmination of a hundred different contributing factors. You will never get to the point where you can describe with 100% certainty why you get the variance you do.

Variance is not a new issue, but it is one that can be the death of programs if not dealt with in a proactive manner. Armed with this information, I would strongly suggest that you hold conversations with your data stakeholders before you run into the questions that inevitably come. Establishing what is normal, how you act, and a few reasons why you are dealing with the issue should help cut all of these problems off at the pass.

Understand the math behind it all: The N-Armed Bandit Problem

One of the great struggles marketers have when they enter new realms, especially those of analytics and testing, is trying to apply the disciplines of math to what they are doing. They are amazed by the promise of models and of applying a much more stringent discipline than the normal qualitative discussions they are used to. The problem is that most marketers are not PhDs in statistics, nor have they really worked with the math as applied to their real world issues. We have all this data and all this promise of power before us, but most lack the discipline to interact with the data and really derive value from it. In this series, I want to explain some of the math concepts that impact daily analysis, especially those that a majority of people do not realize they are struggling with, and show you how and where to use them, as well as their pragmatic limitations.

In the first of these, I want to introduce the N-Armed Bandit problem, as it is really at the heart of all testing programs and is a fundamental evaluation of the proper use of resources.

The N-Armed Bandit problem, also called the one-armed bandit or multi-armed bandit problem, is the fundamental concept of balancing the acquisition of new knowledge against exploiting that knowledge for gain. The concept goes like this:

You walk into a casino with N number of slot machines. Each machine has a different payoff. If the goal is to walk away with the most money, then you need to go through a process of figuring out the slot machine with the highest payout, yet keep as much money back as possible in order to exploit that machine. How do you balance the need to test out the payouts from the different machines while reserving as much money as possible to put into the machine with the greatest payout?

Which one do you choose?

Exploring the casino

As we dive into the real world application of this concept, it is important that you walk away with some key understandings of why it matters to you. An evaluation of the N-Armed bandit problem and how we interact with it in the real world leads to two main goals:

1) Discovery of relative value of actions

2) The most efficient use of resources for this discovery and for exploitation

The N-Armed Bandit problem is at the core of machine learning and of testing programs, and it does not have a one-size-fits-all answer. There is no perfect way to learn and to exploit, but there are a number of well-known strategies. In the real world, where the system is constantly shifting and the values are constantly moving, it gets even more difficult, but that does not make it any less valuable. All organizations face the fundamental struggle of how best to apply resources, especially between doing what they are already doing and exploring new avenues or functional alternatives. Do you put resources where you feel safe, where you think you know the values? Or do you use them to explore and find out the value of other alternatives? The tactics used to solve the N-Armed Bandit problem come down to how greedy you try to be, and they give you ways to think about applying those resources. Where most groups falter is when they fail to balance those two goals, becoming lost in their own fear, egos, or biases; either diving too deep into “trusted” outlets, or going too far down the path of discovery. The challenge is trying to keep to the rules of value and of bounded loss.
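To make the trade-off concrete, here is a minimal sketch of one well-known strategy, epsilon-greedy, in which a small share of pulls keeps exploring while the rest exploit the current best known machine; the payout rates are hypothetical and would, of course, be unknown to you in practice.

```python
import random

# Hypothetical true payout rates for three slot machines (unknown to the player).
true_payouts = [0.02, 0.05, 0.035]

counts = [0] * len(true_payouts)    # pulls per machine
values = [0.0] * len(true_payouts)  # running average payout per machine
epsilon = 0.1                       # share of pulls spent exploring

def pull(machine):
    """Simulate one pull: 1 if the machine pays out, 0 otherwise."""
    return 1 if random.random() < true_payouts[machine] else 0

total_reward = 0
for _ in range(10_000):
    if random.random() < epsilon:
        machine = random.randrange(len(true_payouts))  # explore a random machine
    else:
        machine = values.index(max(values))            # exploit the current best
    reward = pull(machine)
    counts[machine] += 1
    values[machine] += (reward - values[machine]) / counts[machine]  # update average
    total_reward += reward

print("Estimated payouts:", [round(v, 3) for v in values])
print("Pulls per machine:", counts)
print("Total payout:", total_reward)
```

In a testing program, epsilon is simply the share of traffic you are willing to spend on discovery rather than on the option you currently believe is best; set it too low and you never learn the other payouts, set it too high and you never exploit what you have learned.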

The reason this problem comes into play for all testing programs is that the entire need for testing is the discovery of the various values of each variant, or each concept, relative to one another. If you are not allowing this question to enter your testing, then you are only ever throwing resources toward what you assume is the value of a change. Knowing just one outcome can never help you be efficient; how do you know what value you could have gotten when you just threw all your money into one slot machine? While it is easy to convince yourself that because you did get a payout you did the right thing, the evaluation of the different payouts is the heart of improving your performance. You have to focus on applying resources, which are finite for every group, to achieve the highest possible return.

In an ideal world, you would already know all possible values, be able to intrinsically call the value of each action, and then apply all your resources toward the one action that delivers the greatest return (a greedy action). Unfortunately, that is not the world we live in, and the problem arises when we allow ourselves that delusion. We do not know the value of each outcome, and as such we need to maximize our ability to discover it.

If the goal is to discover the value of each action and then exploit it, then fundamentally the challenge is how best to apply the least amount of resources, in this case time and work, to the discovery of the greatest number of relative values. The challenge becomes purely one of efficiency. We have to create a meaningful testing system and efficiencies in our organization, whether political, infrastructural, or technical, in order to minimize the resources we spend and maximize the number of variations that we can evaluate. Every time we get sidetracked, or we do not run a test that has this goal of exploring at its heart, or we pretend we have a better understanding of the value of things via the abuse of data, we are being inefficient and are failing in this quest for the highest possible value. The goal is to create a system that allows you to facilitate this need, to measure each value against the others, to discover and to exploit, in the shortest time and with the least amount of resources.

An example of a suboptimal test design, based on this, is any single-recipe “challenger” test. Ultimately, any “better” test is going to limit your ability to see the relative values. You want to test out the banner on your front door, but how do you know that it is more important than your other promos? Or your navigation, or your call to action? Just because you have found an anomaly or pattern in the data, what does that mean relative to other alternatives? If you only test or evaluate one thing by itself, or don’t test feasible options against each other, then you will never know the relative value of those actions. You are just putting all your money into one slot machine, not knowing if it has a higher payout than the others near it.

This means that any action that is taken by a system that limits the ability to measure values against each other, or that does not allow you to measure values in context, or that does not acknowledge the cost of that evaluation, is inefficient and is limiting the value of the data. Anything that is not directly allowing you the fastest way to figure out the payouts of the different slot machines is losing you value. It also means that any action that requires additional resources for that discovery is suboptimal.

If we have accepted that we have to be efficient in our testing program, we still have to deal with the greatest limiter of impact: the people in the system. Every time we are limited to “best practices” or by a HiPPO, we have lowered the possible value we can receive. Some of the great work by students of probability, especially Nassim Nicholas Taleb, has shown that for systems, over time, the more human-level interference there is, or the less organic the system is allowed to be, the lower the value and the higher the pain we create.

Comparing organic versus inorganic systems (after Taleb’s view of the value of a system over time):

We can see that for any inorganic system, one that has all of those rules forced onto it, over time there is a lot less unpredictability than people think, and there is almost a guarantee of a loss of value for each rule and each assumption that is entered into the system. One of the fastest ways to improve your ability to discover the various payouts is to understand just how many slot machines are in front of you. Every time that you think you are smarter than the system, or you get caught up in “best practices” or popular opinion, you have forced a non-organic limit onto the system. You have artificially said that there are fewer machines available to you. This means that for the discovery part of the system, and for the good of our program and the value we gain, we must limit human subjectivity and imposed rules in order to ensure the highest amount of value.

An example of these constraints is any purely hypothesis-based test. If you limit your possible outcomes to only what you “think” will win, you will never be able to test everything that is feasible. Just because you hear a “best practice” or someone has a golden idea, you still have to test it relative to other possibilities, and you cannot let it bias your evaluation of the data. It is fine to have an idea of what you think will win going in, but you cannot limit your testing to that. That is the same as walking up to the slot machine with the flashiest lights, just because the guy next to you said to, and only putting your money in that machine.

Everyone always says the house wins, and in Vegas that is how it works. In the real world, the deck may be stacked against you, but that does not mean that you are going to lose. Once you understand the rules of the game and can think in terms of efficiency and exploiting, then you have the advantage. If you can agree that at the end of the day your goal is to walk out of that casino with the largest stack of bills possible, then you have to focus on learning and exploiting. The odds really aren’t stacked against you here, but the only way to really win this game is to be willing to play it the right way. Do you choose the first and most flashy machine? Or do you want to make money?

Confidence and Vanity – How Statistical measures can lead you astray

In dealing with the best ways to change a site to maximize ROI, one of the most common refrains I hear is “is the change statistically confident?” or “what is the confidence interval?”, which often leads to a long discussion about what those measures really mean. One of the funniest things in our industry is the over-reliance on statistical measures to prove that someone is “right.” Whether it is a Z-score, a T-test, Chi-squared, or another measure, people love to throw them out and use them as the end-all be-all of confirmation that they, and they alone, are correct. Reliance on any one tool, discipline, or action to “prove” value does nothing to improve performance or to allow you to make better decisions. These statistical measures can be extremely valuable when used in the right context and without blind reliance on them to answer any and all questions.

Confidence-based calculations are often used in a way that makes them one of the least effective measures of the change and importance of data (or of “who is correct”) when applied to real world situations. They work great in a controlled setting with unlimited data, but in the real world they are just one of many imperfect standards for measuring the impact of data and changes. Real world data distribution, especially over any short period of time, rarely resembles a normal distribution. You are also trying to account for distinct groups with differing propensities of action, instead of one larger representative population. It is also important to note that even in the best case scenario these measures only work if you have a representative data set, meaning that just a few hours or even a couple of days of data will never be representative (unless your Tuesday morning visitors are identical to your Saturday afternoon visitors). What you are left with is a choice of many imperfect measures which are useful, but which are not meaningful enough to be the only tool you use to make decisions.

What is even worse is that people also try to use this value as a predictor of outcome, saying things like “I am 95% confident that I will get 12% lift.” These measures only speak to the likelihood of the pattern of the outcome, so you can say “I am 95% confident that B will be better than A,” but they are not measures of the scale of the outcome, only the pattern.

It is as if someone found a fancy new tool and suddenly has to apply it, because they realize that what they were previously doing was wrong, and now this one thing will suddenly make them perfect. Like any tool at your disposal, there is a lot of value when it is used correctly and with the right amount of discipline. When you are not disciplined in how you evaluate data, you will never really understand it or use it to make good decisions.

So if you cannot rely on confidence alone, how best to determine whether you should act on data? Here are three really simple steps to measure the impact of changes when evaluating causal data sets:

1) Look at performance over time – Look at the graph, look for consistency of data, and look for a lack of inflection points (comparative analysis). Make sure you have at least one week of consistent data (that is not the same as just one week of data). You cannot replace understanding patterns, looking at the data, and understanding its meaning. Nothing can replace the value of just eyeballing your data to make sure you are not getting spiked by a single day and that your data is consistent. This human-level check gives you the context that corrects for many of the imperfections that looking only at the end numbers leaves you open to.

2) Make sure you have enough data – The amount needed changes by site. For some sites, 1,000 conversions per recipe is not enough; for others, 100 per recipe is. Understand your site and your data flow. I cannot stress enough that data without context is not valuable. You can get 99% confidence on 3 conversions over 1, but that doesn’t make it valuable or the data actionable.

3) Make sure you have meaningful differentiation – Make sure you know what the natural variance is for your site (in a visitor-based metric system, it is pretty regularly around 2% after a week). There are many easy ways to figure out what it is in the context of what you are doing. You can be 99% confident at a .5% lift, and I will tell you that you have nothing (a neutral result). You can have a 3% lift at 80% confidence, and if it is over a consistent week and your natural variance is below 3%, I will tell you that you have a decent win (see the sketch after this list).
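To show how these checks fit together, here is a minimal sketch that computes confidence in the pattern (a simple two-proportion z-test) alongside lift and a natural-variance threshold; the conversion counts and the 2% variance figure are hypothetical.

```python
import math

def evaluate_test(conv_a, n_a, conv_b, n_b, natural_variance=0.02):
    """Return lift, confidence in the pattern, and whether the lift is meaningful."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    lift = (rate_b - rate_a) / rate_a

    # Two-proportion z-test: how confident are we that B differs from A at all?
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    confidence = math.erf(abs(rate_b - rate_a) / se / math.sqrt(2))  # two-sided

    return {
        "lift": round(lift, 4),
        "confidence": round(confidence, 3),
        "meaningful": abs(lift) > natural_variance,  # does it beat the site's own noise?
    }

# Hypothetical example: roughly 98% confidence but only about 1.4% lift, which is
# below a 2% natural variance, so by the rules above it should be treated as neutral.
print(evaluate_test(conv_a=50_000, n_a=1_000_000, conv_b=50_700, n_b=1_000_000))
```

High confidence in the pattern with a lift smaller than your natural variance is exactly the vanity trap described above: the number looks impressive but tells you nothing actionable about scale.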

I have gotten into many debates with statisticians about whether confidence provides any value at all in the context of online testing, and my usual answer is that if you understand what it means, it can be a great barometer and another fail-safe that you are making a sound decision. The failure is when you use it as the only tool in your arsenal. I am not saying that there is not a lot of value in P-value based calculations or most statistical models. I will stress, however, that they are not a panacea, nor are they an excuse for not doing the active work to understand and act on your data. You have to be willing to let the data dictate what is right, and that means you must be willing to understand the disciplines of using the data itself.