Category: Statistics

7 Deadly Sins of Testing – Not Understanding Your Data

It doesn’t take long working in a data field to come across data being used in ways other than what it was intended for. George Canning once correctly quipped, “I can prove anything by statistics except the truth.” One of the hardest struggles for anyone trying to make sense of all the various data sources is understanding the data you are dealing with: what it is really telling you, what it is not telling you, and how you should act. We have all this rich, interesting information, but what is the right tool for the job? What is the right way to think about or leverage that data? One of the ways that testing programs lose value over time is that they stop evaluating their data with a critical eye and stop focusing on what it is really telling them. They so want to find meaning in things that they convince themselves and others of answers that the data could never provide. Understand your data, understand the amazing power that it can provide, and understand the things it cannot tell you.

Every tool has its own use, and we get the most value when we use tools in the correct manner. Just having a tool does not mean it is the right fit for all jobs. When you come from an analytics background, you naturally look to solve problems with your preferred analytics solutions. When you come from a testing background, you naturally look to testing as the answer to all problems. The same is true for any background: when you are not sure, you are wired to turn back to what you are comfortable with. The reality is that you get more value when you leverage each tool correctly, and the fastest way to do that is to understand what the data from each tool does and does not tell you.

Analytics is the world of correlative patterns, with a single data stream that you can parse and look backwards at. You can find interesting anomalies, compare rates of action, and build models based on large data sets. It is passive data acquisition that allows you to see where you have been. When used correctly, it can tell you what is not working and help you find things that you should explore. What it cannot do is tell you the value of any action directly, nor can it tell you what the right way to change things is.

Testing is the world of comparative analysis, with only a single data point available to identify patterns. It is not just a random tool for throwing one option against another to settle an internal argument, but a valuable resource for the active acquisition of knowledge. You can change part of a user experience and see its impact on an end goal. What you cannot do is answer “why?” with a single data point, nor can you attribute events that are correlated with your change to the change or to each other. You can add discipline and rigor to both to gain more insight, but at its core all testing really tells you is the value of a specific change. The quality of the output depends on the quality of your input, just as your optimization program depends on the discipline used in designing and prioritizing opportunities.

Yet without fail people look at one tool and claim it can do the job of the other, or that the data tells them more than it really does. It may be confusing rate and value, or believing that a single data point can tell you the relationship between two separate metrics. Where we make mistakes is in thinking that the information itself tells you the direction of the relationship within that information, or the cost of interacting with it. This is vital information for optimization, yet so often groups pretend they have it and make suboptimal decisions.

We also fail to keep perspective on what the data actually represents. We get such tunnel vision on the impact to a specific segment or group that we lose sight of the impact to the whole. To make this even worse, you will find groups targeting or isolating traffic in their tests, such as only new users, and extrapolating the impact to the site as a whole. It does not matter what our ability to target a specific group is unless that change will create a positive outcome for the site. The first rule of any statistics is that your data must be representative. Another of my favorite quotes is, “Before you look at what the statistics are telling you, you must first look at what it is not telling you.”
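The extrapolation problem described above comes down to simple arithmetic. Here is a minimal sketch, assuming two made-up segments with made-up traffic shares and per-visitor lifts (none of these numbers come from a real test). It only shows that a win in one segment says little about the site until you weight it by everyone else.

```python
# Hypothetical segments: share of total traffic and measured lift in
# revenue per visitor (RPV) for the same change. Illustrative numbers only.
segments = {
    "new_users":       {"share": 0.30, "lift": +0.40},
    "returning_users": {"share": 0.70, "lift": -0.25},
}

# The site-wide impact is the traffic-weighted sum of the segment lifts.
site_lift = sum(s["share"] * s["lift"] for s in segments.values())

for name, s in segments.items():
    print(f"{name:16s} share {s['share']:.0%}  lift ${s['lift']:+.2f}")
print(f"whole site lift: ${site_lift:+.3f} per visitor")
```

In this invented case the change looks like a clear winner if you only test new users, and it is a loss for the site as a whole.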

Tools do not understand the quality of their inputs; it is up to the user to know whether their results are biased. Always remember the truth about any piece of information: “Data does not speak for itself – it needs context, and it needs skeptical evaluation.” Failure to do so undermines the data’s ability to inform the best decision. Data in the online world has specific challenges that sampling random people in the physical world does not have to account for. Our industry is littered with reports of results or of best practices that ignore these fundamental truths about tools. It is so much easier to think you have a result and manipulate data to meet your expectations than it is to have discipline and to act in as unbiased a way as possible. When you get this tunnel vision, whether in what you analyze or in the population you leverage, you are violating these rules and leaving the results highly questionable. Not understanding the context of your data is just as bad as, or worse than, not understanding the nature of your data.

The best way to think about analytics is as a doctor uses data. You come for a visit, he talks to you, you give him a pattern of events (my shoulder hurts, I feel sick, etc.). He then uses that information to rule out what he won’t do (if your shoulder hurts, he is not going to x-ray your knee or give you cough medicine). He then starts looking for ways to test that pattern. Really good doctors use those same tests to leave open the possibility that something else is the root cause (maybe a shoulder exam shows that you have back problems). Poor doctors just give you a pain pill and never look deeper into the issue. Knowing what data cannot tell you greatly increases the efficiency of the actions you can take, just as knowing how to actively acquire the information needed for the right answers, and how to act on that data, improves your ability to find the root cause of your problems.

A deep understanding of your data gives you the ability to act. You may not always know why something happens, but you can act decisively if you have clear rules of action and an understanding of how data interacts with the larger world. It is so easy to want to have more data, or to want to create a story that makes it easier for others to understand something. It is not that these are wrong, only that the data presented in no way validates that story, nor could it provide the answers that you are telling others it does. At its worst, you are distracting from the real issue; at its best, it is just additional cost and overhead to action.

The education of others and of yourself on the value and uses of data is vital for the long term growth of any program. If you do not understand the real nature of your data, then you are subject to biases which remove its ability to be valuable. There are thousands of misguided uses of data, all of which are easy to miss unless you are more interested in the usage of data than in the presentation and gathering of data. Do not think that just knowing how to implement a tool, or knowing how to read a report, tells you anything about the real information that is present in it. Take the time to evaluate what the information is really representing, and to understand the pros and cons of any manipulation you do with that data. Just reading a blog or hearing someone speak at a conference does not give you enough information to understand the real nature of the tools at your disposal. Dive deep into the world of data and its disciplines, choose the right tools for the job, and then make sure that others are as comfortable with that information as you are. It can be difficult to get to those levels of conversation or to convince others that they might be looking at data incorrectly, but those moments when you succeed can be the greatest moments for your program.


7 Deadly Sins of Testing – Confusing Rate and Value

One of the most difficult principles for many people in our industry to understand is that rate and value are not the same thing. One of the fastest ways for a program to go astray is to confuse one for the other. It is easy for people to understand the need for agreeing on what you are trying to accomplish, or why they need to have leadership. It is even easy to talk about the need for efficiency and that it is ok to be wrong, yet even when people get past that point, they still consistently miss this critical difference. We so desperately want to explain our value to the company that we confuse the value of our actions with their outcomes in an attempt to justify those actions. This fundamental loss of understanding leads to a wide range of poor decisions and misconceptions that dramatically limit the positive impact a testing or analytics program can have.

A rate is simply a ratio or a description of an actual outcome; it is the same thing as me telling you that I have a $4.23 RPV for a population or that I got 5000 conversions. This is a description of past behavior, and is simply an outcome, not a description of why or how that outcome came to be. Where people lose focus is that the value, or the ability to positively or negatively influence that outcome, is not tied to those gross numbers. A description of rate tells you nothing about an individual action, since you are not comparing that outcome, only describing it. Increasing your conversions does not inherently create more revenue, nor does the revenue by itself reflect positive value generated by an action. We measure things by saying we ran a campaign and then we got $3.56, but this does not tell you anything about the value of that campaign. Value would be the difference between running that particular campaign and not doing anything, or running a different campaign. The rate is the end outcome; the value of that action is how much it improved or decreased performance.
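To make the distinction concrete, here is a minimal sketch in Python. The function names and every number in it are hypothetical, chosen only to echo the $3.56 campaign above; the control figure in particular is invented. The point is that a rate describes one population, while value only exists as a comparison against what would have happened otherwise.

```python
def revenue_per_visitor(revenue, visitors):
    """Rate: a description of what happened for one population."""
    return revenue / visitors


def value_of_action(rate_with_action, rate_without_action):
    """Value: how much the action changed the outcome versus a comparable baseline."""
    return rate_with_action - rate_without_action


# Hypothetical numbers for illustration only.
campaign_rpv = revenue_per_visitor(revenue=35_600.0, visitors=10_000)
control_rpv = revenue_per_visitor(revenue=36_100.0, visitors=10_000)

lift = value_of_action(campaign_rpv, control_rpv)

print(f"campaign RPV (a rate): ${campaign_rpv:.2f}")
print(f"control RPV (a rate):  ${control_rpv:.2f}")
print(f"value of the campaign: ${lift:+.2f} per visitor")
```

In this made-up case the campaign "got $3.56" and still lost money relative to doing nothing, which is something the rate alone could never tell you.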

People are so conditioned to express their contribution or to explain their value as the outcome of a group. I am responsible for the product page, or SEO, or internal campaigns, so therefore I must be the sole reason for the generation of that value. Just because your department or your product produces 10 million dollars, it does not mean that is representative of your value. Value is simply the difference from what would have happened if you had done nothing, or if you had chosen a different route. Value is what we are really talking about in optimization: we are discovering the various tools and options that allow us to influence our current state and improve performance. We have a way to measure the efficiency of different actions and choose actions based on a rational process instead of opinion and “experience”. The value of an action is the amount it increases or decreases the bottom line, which means your value is the ability to choose the best influencers and avoid the worst ones. Stop defining actions by the rate and instead think in terms of value, and you will completely change your view of the world. The question is never what did it do, but what would have happened if I simply ceased to exist, or if we chose any of the other routes available to us. Which one would have generated the highest possible outcome?

So where does this cause people to go astray? The first place is in assuming that past behavior reflects value, instead of the rate of action. People are so used to doing complicated analysis that shows that people who click on section Y are worth $3.45. What they are missing is that this expresses the rate of revenue from people who click on that section, not the value of the section. Using correlative information, it is impossible to know what that section is really influencing. Is it positive or negative? Would you make more or less from having it? Or what about other alternatives? It is a fundamental shift in how you view the world, not focusing on what was, but only focusing on the influence and cost of changes. Getting caught on this definition often leads to misallocation of resources and to groups holding items sacred that are negative to end performance.

This change in viewing the world also requires that every person accepts that their inputs, skills, and responsibilities are part of this system, and that there is rarely going to be a perfect match between what is best for their group and what is best for the organization as a whole. You are not defined by the rate output of your responsibility, but by what you do with it. What matters is the ability to view everything as working together to improve the whole, which means not focusing on individual groups, items, or interactions. When we are trying to generate value for the organization and improve our bottom line, the least important question is what a change does to item X or to section Y. That information is rarely meaningful, as it is a single data point, and it is always going to cause a cognitive dissociation from what is best for the site and the whole. Getting people to act rationally is not inherent to the human condition, but it is vital to getting the best results.

The easiest way to prove this simple dissociation between rate and value with testing is to do an inclusion/exclusion test and simply remove each item one at a time. If you know the rates beforehand, or if you believe that an item is worth some value, then you would expect to lose that entire value when you remove the item. In reality, you will find little connection between that correlative value and outcomes, and will be shocked by how often you find things that you thought were valuable, but that turn out to be negative to the total page performance.
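As a rough sketch of how the results of such an inclusion/exclusion test might be read, assume a baseline page and three hypothetical items, each removed in its own test recipe. Every name and figure below is invented for illustration, and a real test would also need enough data to separate these differences from noise.

```python
# All figures are hypothetical.
baseline_rpv = 4.00  # RPV of the page with every item present

# Analytics view: RPV among visitors who clicked each item (a rate).
clicker_rpv = {"hero_banner": 6.10, "recommendations": 5.20, "promo_tile": 3.45}

# Test view: RPV of the whole page when that one item is removed.
rpv_without = {"hero_banner": 4.12, "recommendations": 3.70, "promo_tile": 4.01}


def item_value(item):
    """Value of keeping an item = baseline RPV minus RPV with the item removed."""
    return baseline_rpv - rpv_without[item]


for item in clicker_rpv:
    print(
        f"{item:16s} clicker RPV ${clicker_rpv[item]:.2f}   "
        f"measured value ${item_value(item):+.2f}"
    )

# Items whose removal improved the page are negative to total performance.
negative_items = [item for item in rpv_without if item_value(item) < 0]
print("negative to page performance:", negative_items)
```

Here the item with the highest clicker RPV is the one the page does better without, which is the kind of disconnect the test is meant to surface.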

Testing is an amazing tool that gives you the ability to see the value of items. It is not very useful for the rate of outcomes, since we are trying to compare outcomes, but it gives you so much more insight than what you had before. It frees you up to see the world differently and to tackle real problems that you could never tackle before. Understanding what information is telling you, what it isn’t, and how best to leverage different types of information together is what changes myth to reality in your use of data. In order to start down that path, you must first deeply understand the difference between rate and value, and understand that your job is not to focus on rate, but instead to discover value.

Understand the Math Behind it All: Bayesian Statistics

Most marketing people have only a passing interaction with statistics, and often only understand it as a measure of how it has impacted their daily life. One of the funny things people don’t realize is that there are two completely different, competing schools of thought when it comes to statistics. Most people are familiar with frequentist statistics, having dealt with things like normal distribution, bell curves, and established probabilities. The other school, Bayesian statistics, is a realm that fewer people are familiar with, but it is just as applicable. In fact, the move over the last few years is for more people to change from the frequentist model to Bayesian techniques.

So what is Bayesian statistics? To put it simply, Bayesian analysis is the use of conditional or evidential probabilities. It looks at what you know of the environment and past knowledge, and allows you to infer probabilities based on that data. It asks what the likelihood of something happening is, based on our knowledge of past conditions and their context in the world. Where frequentist statistics can be viewed as much more of an evaluation of the larger data collection, judging the chances of something happening again based on those results, Bayesian statistics is about the likelihood that a set of results reflects the larger reality and about making inferences based on a limited data set.

A frequentist model looks at an absolute basis for chances. If females make up 52% of the population, then selecting someone at random from my office gives me a 52% chance of picking a female. The chances are based purely on the total probability. The Bayesian approach is to rely on past knowledge and then adjust accordingly. If I know that 75% of my office is male, and I grab a person, then I know that I have a 25% chance of picking a female.

So is it 52% or 25%? Both are correct answers depending on what question you are really asking, but they look at things differently. Frequentists look at the larger perspective of all chances, and base things on that ideal look at the world. Bayesians use much more personal or past knowledge to infer information. Bayesian thinkers would much rather answer what the chances are that the total population is 52% female, given that only 25% are female in this office. The risk with using Bayesian logic is that you are allowing bias and poor data collection to dramatically alter how you view things. The gain is that while frequentist methods will often be right in a controlled setting and over time, Bayesian methods have the chance to give you better information based on what you know. Bayesian logic also allows conditional statements: based on the office scenario above and a little more contextual knowledge, you can answer “what is the likelihood that, if you choose a woman, she has blond hair?” Bayesian techniques are often used for reasoning, because they allow you to draw a conclusion about the likelihood that something is the best answer based on what you know. Both techniques are at risk from black swan events, though Bayesian analysis can be even more influenced by focusing only on the known.
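The office example can be turned into a small calculation. The sketch below assumes a hypothetical office of 20 people with 5 women, 2 of them blond, and it treats the office as if it were a random draw from the population, which a real office almost certainly is not. It also compares just two candidate population rates with equal prior weight, purely to show the mechanics of a Bayesian update. All names and numbers are invented.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent draws with rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical office: 20 people, 5 women, 2 of the women are blond.
office_size, women, blond_women = 20, 5, 2

# Conditional probabilities read straight from what we know about the office.
p_woman = women / office_size                 # chance a random coworker is a woman
p_blond_given_woman = blond_women / women     # chance she is blond, given a woman

# Bayes' rule over two assumed hypotheses about the population rate,
# each given equal prior weight (an assumption made only for this sketch).
priors = {0.52: 0.5, 0.25: 0.5}
likelihoods = {rate: binomial_pmf(women, office_size, rate) for rate in priors}
evidence = sum(priors[rate] * likelihoods[rate] for rate in priors)
posteriors = {rate: priors[rate] * likelihoods[rate] / evidence for rate in priors}

print(f"P(woman) in this office:        {p_woman:.2f}")
print(f"P(blond | woman) in this office: {p_blond_given_woman:.2f}")
for rate, posterior in posteriors.items():
    print(f"posterior weight on a {rate:.0%} population rate: {posterior:.3f}")
```

The answer depends heavily on the prior and on the random-draw assumption, which is exactly the risk of bias and poor data collection described above.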

So why is this important? Testing tools and models almost always rely on frequentist techniques to give you a global view of how often something fits into a pattern. This is why you see things like 92% confidence when evaluating results: under similar circumstances, roughly 92% of intervals built this way would contain the true value. Those techniques give you answers in an ideal situation and over time, but that may not be true of specific periods or non-normal events. They don’t take into account the context of this specific situation, nor prior history relevant specifically to the situation. They often don’t even take into account the contextual knowledge of the other recipes and information contained in that same test. They might be true of normal circumstances, but not of a special sale or seasonal activity. Bayesian techniques rely on prior knowledge that for testing is rarely available, and for analytics is problematic at best. They might reflect special circumstances, but not give a good long term view due to those same mitigating circumstances.
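One common reading of a figure like 92% confidence is that, if you repeated the same experiment many times under ideal conditions, about 92% of the intervals you built would contain the true mean. The simulation below is a minimal sketch of that idea using made-up, perfectly well-behaved Gaussian data and a normal approximation for the interval. The true mean, spread, sample size, and number of trials are all assumptions.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(7)  # fixed seed so the sketch is repeatable

CONFIDENCE = 0.92
# Two-sided critical value for the chosen confidence level.
z = NormalDist().inv_cdf(0.5 + CONFIDENCE / 2)

# Hypothetical "true" world that a real test never gets to see.
TRUE_MEAN, TRUE_SD, SAMPLE_SIZE, TRIALS = 4.23, 2.0, 100, 2000

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    half_width = z * stdev(sample) / SAMPLE_SIZE**0.5
    # Count the trial when the interval around the sample mean covers the truth.
    if abs(mean(sample) - TRUE_MEAN) <= half_width:
        hits += 1

coverage = hits / TRIALS
print(f"{coverage:.1%} of the simulated intervals contained the true mean")
```

The coverage lands near the nominal level only because the simulated data is independent, stable, and unbiased. None of that is guaranteed for a special sale or a seasonal week.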

In all cases, nothing will replace understanding the context of what your data tells you, the patterns of it, and knowing how and when to act. You have to appreciate what the statistics are telling you, but also appreciate what they aren’t telling you. Any overt belief in a single measure by itself is always going to be problematic. Just getting a statistical answer is not a replacement for understanding the context and the environment in which you are gathering data and making a decision.

No matter what techniques you use, no matter which camp you are in for the correct way to look at things, there is never a time when you can ignore the problems of any single type of analysis. You cannot replace using discipline and logic in your actions. Statistics is just a tool; it cannot replace proper reasoning, yet too many people look at it as a magical panacea to remove responsibility for action. Always remember that there are multiple ways to look at a problem, let alone hundreds of ways to solve it. Figuring out the most efficient and best way for you is the real key.

Understand the Math Behind it All: Normal Distribution

If the N-Armed bandit problem is the core struggle of each testing program, then normal distribution and the related central limit theorem is the windmill that groups use to justify their attempts to solve the N-Armed bandit problem. The central limit theorem is something that a lot of people have experience with from their high school and college days, but very few people appreciate where and how it fits into real world problems. It can be a great tool to act on data, but it can also be used blindly to make poor decisions in the name of being easily actionable. Being able to use a tool requires you to understand both its advantages and disadvantages, as without context you really achieve nothing. With that in mind, I want to present normal distribution as the next math subject that every tester should have a much better understanding of.

The first thing to understand about normal distribution is that it is only one type of distribution. Sometimes called a Gaussian distribution, the normal distribution is easily identifiable by its bell curve. Normal distribution shows up so often because of the central limit theorem, which states that, given a sufficiently large number of independent random variables with finite variance, their mean will be approximately normally distributed. To put it another way, if you take any population of people, and they are independent of each other, then the means of unbiased samples drawn from them will settle toward an attractor distribution, so that you can measure a mean and a standard deviation. This gives you the familiar giant clumping of data points around the mean, and as you move farther and farther away from that point, the data becomes sparser in a very predictable way. It guarantees that, for an unbiased collection done over a long period of time, the mean will approach a normal distribution, but in any biased or limited data set you are unlikely to have a perfectly normal distribution.
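A quick simulation can show the central limit theorem at work. The sketch below draws from an exponential distribution, chosen here only because it is heavily skewed and nothing like a bell curve, and then compares single draws against means of 50 draws. The sample sizes, repeat count, and helper names are all arbitrary choices for illustration.

```python
import random
from statistics import mean, pstdev

random.seed(11)  # fixed seed so the sketch is repeatable


def sample_means(sample_size, repeats=5000):
    """Means of `repeats` independent samples drawn from a skewed distribution."""
    return [
        mean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(repeats)
    ]


def skewness(values):
    """Average cubed z-score: roughly 0 for a symmetric bell curve."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def share_within_one_sd(values):
    """Fraction of values within one standard deviation of their mean."""
    m, s = mean(values), pstdev(values)
    return sum(abs(v - m) <= s for v in values) / len(values)


raw_draws = sample_means(sample_size=1)     # individual data points
means_of_50 = sample_means(sample_size=50)  # means of 50 data points each

for label, values in (("single draws", raw_draws), ("means of 50", means_of_50)):
    print(
        f"{label:12s} skewness {skewness(values):.2f}   "
        f"within one SD {share_within_one_sd(values):.1%}"
    )
```

The means of 50 draws should come out far less skewed than the raw data, with roughly two thirds of them within one standard deviation, which is the bell curve behavior the theorem promises. That promise rests on independent, unbiased draws from a stable source.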

The reason that we love these distributions is that they are the easiest to understand and have very common, easy to use assumptions built into them. Schools start with these because they allow an introduction into statistics and are easy to work with, but just because you are familiar with them does not mean the real world always follows this pattern. We know that over time, if we collect enough data in an unbiased way, we will always reach this type of distribution. It allows us to infer a massive amount of information in a short period of time. We can look at the distribution of people to calculate P-Score values, see where we are in a continuum, and easily group and attack larger populations. It allows us to present data and tackle it with a variety of tools and an easy to understand structure, freeing us to focus on the steps of using the data rather than figuring out what tools are even available to us. Because of this, schools spend an inordinate amount of time in classes presenting these problems to people, without informing them of the many real world situations where they may not be as actionable.

The problem is when we force data into this distribution when it does not belong, so that we can make those assumptions and act on a single measure of the “value” of the outcome. When you start trying to apply statistics to data, you must always keep in mind the quote from William Watt, “Do not put your faith in what statistics say until you have carefully considered what they do not say.”

There are a number of real world problems with trying to force real world data into a normal distribution, especially in any short period of time.

Here is just a quick sample of real world influences that can cause havoc when trying to apply the central limit theorem:

1) Data has to be representative – Just because you have a perfect distribution of data for Tuesday at 3am, it has little bearing on being representative of Friday afternoon.

2) Data collection is never unbiased, as you cannot have a negative action in an online context. Equally, each unique group will have a different propensity to act, and you will have an unequal collection of those groups, which keeps things from evening out.

3) We are also stuck with a data set that is constantly shifting and changing, from both internal and external changes over time. Gathering more data takes more time, and the longer it takes, the less representative the data from the earlier gathering period becomes of current conditions.

4) We have great but not perfect data capturing methods. We use representations of representations of representations. No matter what data acquisition technology you use, there are always going to be mechanical issues which add noise on top of the population issues listed above. We need to focus on precision, not become caught in the accuracy trap.

5) We subconsciously bias our data, through a number of fallacies, which leads to conclusions that have little bearing on the real world.

In most real world situations, our data more closely resembles a multivariate distribution than a normal distribution. What this leaves us with is very few cases in the real world that get to the point where we can use normal distribution with complete faith, especially in any short period of time. Using it and its associated tools with blind loyalty can lead to groups making misguided assumptions about their own data, and to poor decision making. It is not “wrong” but it is also not “right”. It is simply another measure of the meaning of a specific outcome.

Even if the central limit theorem worked perfectly in real world situations, you still have to deal with the difference between statistical significance and practical significance. Just because something is not due to noise, it does not mean that it answers the real question at hand. There are no magical solutions that remove the need for an understanding of the discipline of testing, or for the design of tests that answer questions instead of just picking the better of two options.
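Here is a minimal sketch of that gap, using a standard pooled two-proportion z-test on invented numbers: two recipes with 2 million visitors each, converting at 5.00% and 5.10%. The function name and the traffic figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical test: 5.00% vs 5.10% conversion, 2 million visitors per recipe.
visitors = 2_000_000
conversions_a, conversions_b = 100_000, 102_000

p_value = two_proportion_p_value(conversions_a, visitors, conversions_b, visitors)
absolute_lift = conversions_b / visitors - conversions_a / visitors

print(f"p-value: {p_value:.6f}")
print(f"absolute lift: {absolute_lift:.2%} of visitors")
```

With this much traffic the difference is very unlikely to be noise, yet the test by itself says nothing about whether a tenth of a percentage point is worth the cost of the change, or whether a different change would have done better.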

So how then can we use this data?

The best answer is to understand that there is no “perfect” tool to make decisions. You are always going to need multiple measures, and some human evaluation, to improve the accuracy of a decision. A simple look at the graph, and having good rules around when you look at or leverage statistical measures, can dramatically improve their value. Not just running a test because you can, and instead focusing on understanding the relative value of actions, is going to ensure you get the value you desire. Statistics is not evil, but you cannot just have blind faith in it. Each situation and each data set represents its own challenge, so the best thing you can do is focus on the key disciplines of making good decisions. These tools help inform you, but are not meant to replace discipline and your ability to interpret the data and its context.