7 Deadly Sins of Testing – Not Understanding Your Data

It doesn’t take long working in a data field before you come across data being used in ways other than what it was intended for. George Canning once correctly quipped, “I can prove anything by statistics except the truth.” One of the hardest struggles for anyone trying to make sense of all the various data sources is understanding the data you are dealing with: what it is really telling you, what it is not telling you, and how you should act. We have all this rich, interesting information, but what is the right tool for the job? What is the right way to think about or leverage that data? One of the ways that testing programs lose value over time is that they stop evaluating their data with a critical eye and stop focusing on what it is really telling them. They so want to find meaning in things that they convince themselves and others of answers that the data could never provide. Understand your data, understand the amazing power that it can provide, and understand the things it cannot tell you.

Every tool has its own use, and we get the most value when we use tools in the correct manner. Just having a tool does not mean it is the right fit for all jobs. When you come from an analytics background, you naturally look to solve problems with your preferred analytics solutions. When you come from a testing background, you naturally look to testing as the answer to all problems. The same is true for any background: when you are not sure, you are wired to turn back to what you are comfortable with. The reality is that you get more value when you leverage each tool correctly, and the fastest way to do that is to understand what the data from each tool does and does not tell you.

Analytics is the world of correlative patterns, with a single data stream that you can parse and look backwards at. You can find interesting anomalies, compare rates of action, and build models based on large data sets. It is a passive data acquisition that allows you to see where you have been. When used correctly, it can tell you what is not working and help you find things that you should explore. What it cannot do is tell you the value of any action directly, nor can it tell you the right way to change things.

Testing is the world of comparative analysis, with only a single data point available to identify patterns. It is not just a random tool for throwing one option against another to settle an internal argument, but a valuable resource for the active acquisition of knowledge. You can change part of a user experience and see its impact on an end goal. What you cannot do is answer “why?” with a single data point, nor can you attribute correlated events to your change. You can add discipline and rigor to both tools for more insight, but at its core all testing really tells you is the value of a specific change. Its output is beholden to the quality of the input, just as your optimization program is beholden to the discipline used in designing and prioritizing opportunities.

Yet without fail, people look at one tool and claim it can do the other’s job, or that the data tells them more than it really does. Whether it is confusing rate with value, or believing that a single data point can tell you the relationship between two separate metrics, the mistake is the same: thinking that the information itself tells you the direction of the relationship in that information, or the cost of interacting with it. This is vital information for optimization, yet so often groups pretend they have it and make suboptimal decisions.

We also fail to keep perspective on what the data actually represents. We get such tunnel vision on the impact to a specific segment or group that we lose view of the impact to the whole. To make this even worse, you will find groups targeting or isolating traffic in their tests, such as only new users, and extrapolating the impact to the site as a whole. It does not matter how well we can target a specific group unless that change will create a positive outcome for the site. The first rule of any statistics is that your data must be representative. Another of my favorite quotes is, “Before you look at what the statistics are telling you, you must first look at what they are not telling you.”

Tools do not understand the quality of their inputs; it is up to the user to know whether or not they have biased results. Always remember the truth about any piece of information: “Data does not speak for itself – it needs context, and it needs skeptical evaluation.” Failure to do so invalidates the data’s ability to support the best decision. Data in the online world has specific challenges that simply sampling random people in the physical world does not have to account for. Our industry is littered with reports of results or of best practices that ignore these fundamental truths about tools. It is so much easier to think you have a result and manipulate data to meet your expectations than it is to have discipline and act in as unbiased a way as possible. When you get this tunnel vision, whether in what you analyze or in the population you leverage, you are violating these rules and leaving the results highly questionable. Not understanding the context of your data is just as bad as, or worse than, not understanding the nature of your data.

The best way to think about analytics is the way a doctor uses data. You come for a visit, the doctor talks to you, and you give him a pattern of events (my shoulder hurts, I feel sick, etc.). He then uses that information to reduce what he won’t do (if your shoulder hurts, he is not going to x-ray your knee or give you cough medicine). He then starts looking for ways to test that pattern. Really good doctors use those same tests to leave open the possibility that something else is the root cause (maybe a shoulder exam shows that you have back problems). Poor doctors just give you a pain pill and never look deeper into the issue. Knowing what data cannot tell you greatly increases the efficiency of the actions you can take, just as knowing how to actively acquire the information needed for the right answers, and how to act on that data, improves your ability to find the root cause of your problems.

A deep understanding of your data gives you the ability to act. You may not always know why something happens, but you can act decisively if you have clear rules of action and an understanding of how the data interacts with the larger world. It is so easy to want more data, or to want to create a story that makes it easier for others to understand something. It is not that these desires are wrong, only that the data presented in no way validates that story, nor could it provide the answers you are telling others it does. At its worst, this distracts from the real issue; at its best, it is just additional cost and overhead on the way to action.

Educating others and yourself on the value and uses of data is vital for the long-term growth of any program. If you do not understand the real nature of your data, then you are subject to biases which remove its ability to be valuable. There are thousands of misguided uses of data, all of which are easy to miss unless you are more interested in the usage of data than in the presentation and gathering of data. Do not think that just knowing how to implement a tool, or knowing how to read a report, tells you anything about the real information that is present in it. Take the time to evaluate what the information really represents, and to understand the pros and cons of any manipulation you do with that data. Just reading a blog or hearing someone speak at a conference does not give you enough information to understand the real nature of the tools at your disposal. Dive deep into the world of data and its disciplines, choose the right tools for the job, and then make sure that others are as comfortable with that information as you are. It can be difficult to get to those levels of conversation or to convince others that they might be looking at data incorrectly, but those moments when you succeed can be the greatest moments for your program.

Rant – Stop Testing to Only New Users!

As I once again see one of the surest signs that you don’t know what you are doing in testing, I am forced to clarify one of the greatest misconceptions when it comes to testing.

The first rule of any statistical analysis is that the data must be representative. It doesn’t matter how statistically accurate your count of blue cars is if you are trying to measure the impact to the entire freeway…

So why then do people think tests should go only to “new” users? This is one of the most consistent misunderstandings and confusions of cause and effect. People who have been to your site came there for a reason and are coming back for a reason. Just because they saw the old site does not mean they are not fundamentally important to your new analysis. They are not repeat users because of the new site (causation); they are people who have declared an intent and are researching or repeatedly interacting with your brand and products. They represent anyone who has previously wanted or been interested in what you are selling and who happened to have been on your site before (correlation).

This means that while there is some possible interaction effect from seeing both an old and a new experience, ignoring these users is you telling me that you do not care about, and are not interested in, the revenue generated by anyone who has ever come to your site, purchased from you before, or thought about purchasing from you before. There will be some interaction from the change in experience (if they even remember it), but it is spread evenly over each sample, and it is especially mitigated in a visitor-based analysis (which measures performance over time). What is not accounted for when you do not allow them into your test, however, is 100% of all the people who have given you revenue before!

That means that if you have any business where you would like people to repeatedly use or purchase from you, not including them invalidates all your data, since your data set is both biased (limited to people who have never been interested in you before) and not representative of your long-term population.

To put it another way, to match this time of year: this is exactly the same as polling only Fox News watchers to represent all of America in the presidential election, or polling only MSNBC watchers on every social bill. It is a population, you will get a result, and if you really want to ignore reality, you can go with the data, but it in no way represents the entire population or in any way tells you what matters to the entire population.

The Difference between Success and Failure with Personalization

Personalization is such a buzzword right now that it is nearly impossible to have a conversation in the digital marketing space without it coming up. Everyone is on a quest for a “personalized experience,” or to make sure that they are doing what every other group is doing. You constantly hear about all this new technology and all these new ways to accomplish the task. There are more tools and more information about our users now than ever before, and yet there are very few groups or people who can actually differentiate between success and failure for personalization.

The most fundamental thing people forget about “personalization” is that I can “personalize” an experience in almost infinite ways. I can change copy, I can change workflow, I can change the layout or features of the experience. Even better, I can do this for the same user in a thousand different ways. I am a returning user to your site… but I am also a user in the afternoon, who came from Google, who has been on the site 12 times, who has made 3 purchases, and who is using Firefox. So the question is not CAN I personalize an experience; at this point there are a thousand different tools and ways to do so. The simple act of creating an experience is not the goal; the goal is to do so in the way that generates the greatest ROI for my organization.

The question needs to be, how do I discover the most valuable way to change the experience?

What we need to incorporate in any concept of personalization is a way to measure these different concepts against each other. We have to build into every process a period of discovery, using tools that allow us to know the two most valuable pieces of information when it comes to personalization:

What is my ability to change their behavior?

What is the cost to do so?

There is no way to acquire that information without actively making changes and seeing the outcome. Measuring that different groups have different behavior is easy, but what does that tell you about your ability to change that behavior? Just because one group of users purchases twice as often as another, how do you know you can change that behavior? How do you know that a differentiated experience will do anything more than a static, similar experience for both?

And that is the difference between success and failure when it comes to personalization. Are you just serving up an experience because you can? Or have you done the active acquisition of knowledge that shows not only that it improves performance, but that it is the best way to improve performance?

I want to give a functional example so that you can see this in action. Let’s take the exact same concept and see it executed under both ways of thinking.

Let us say that it is coming up on the holiday season, and you want to serve up a holiday shipping message to people who have purchased on your site before.

If my goal is increased revenue, then the steps would be as follows:

Discovery
1. Create multiple executions of the message (with only one offer, how do you know whether the issue is the concept or the execution?)

2. Take 2 to 3 other messages that could be used there (one will most likely be your default content), along with other concepts such as specific products or specific site offerings. Hopefully you are just reusing existing content.

3. Serve all the offers to EVERYONE

4. Look at the results by segment and calculate the total gain from giving a differentiated experience (a sketch of this calculation follows these steps):
i. If you are correct, then the highest performing recipe for the previous-shopper segment will be one or both of the shipping messages, while the default content will be the winner for the non-purchaser segment (the comparable segment).
ii. If you are wrong, then another segment will have a higher winner among the offers. Be open to a permutation winning that you never thought of. Being wrong is always going to provide the greatest return.

Exploitation
5. Push live the highest revenue producing opportunity found
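
To make step 4 concrete, here is a minimal sketch of the calculation in Python. Every offer name, segment share, and revenue-per-visitor figure below is made up for illustration, and a real analysis would also apply confidence checks to each cell before acting; this only shows the shape of the comparison.

```python
# Hypothetical revenue-per-visitor (RPV) results from serving every offer
# to everyone, broken out by one comparable segment pair. All numbers are
# made up for illustration.
results = {
    # offer:          (prev_purchasers, non_purchasers)
    "default":        (2.10, 1.95),
    "shipping_msg_a": (2.60, 1.70),
    "shipping_msg_b": (2.45, 1.65),
    "product_offer":  (2.20, 1.85),
}
share = {"prev_purchasers": 0.3, "non_purchasers": 0.7}  # segment mix

def blended_rpv(offer):
    """RPV if this single offer is served to the whole population."""
    prev, non = results[offer]
    return prev * share["prev_purchasers"] + non * share["non_purchasers"]

# Option A: best single offer for everyone (no targeting)
best_single = max(results, key=blended_rpv)

# Option B: best offer per segment (the differentiated experience)
best_prev = max(results, key=lambda o: results[o][0])
best_non = max(results, key=lambda o: results[o][1])
targeted_rpv = (results[best_prev][0] * share["prev_purchasers"]
                + results[best_non][1] * share["non_purchasers"])

print(f"Best single offer: {best_single} at {blended_rpv(best_single):.3f} RPV")
print(f"Targeted: {best_prev} / {best_non} at {targeted_rpv:.3f} RPV")
print(f"Total gain from differentiation: {targeted_rpv / blended_rpv(best_single) - 1:.1%}")
```

In this made-up data, a shipping message wins for previous purchasers while the default content wins for everyone else, which is the “if you are correct” outcome of step 4; any other winning cell would have been the more valuable discovery.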

Let us see how groups that get little, no, or negative value from “personalization” do the same task:

1. Push the single piece of creative to the repeat purchaser segment.

2. Hope

The fundamental problem is that in the second scenario you have no way of knowing whether it is valuable or not. Blind belief that you are providing value is not the same as providing value. Most groups think that if they just report the outcome, or the rate of action of that group, it somehow represents the value of that action. It doesn’t. Value only comes from the improvement of performance caused by that action. If you aren’t actively acquiring that information, then you have no way of knowing the value of any action. Even worse, we are adding cost and suffering the opportunity cost of the gain we should be getting.

I want to show some simple math to illustrate the difference between the two groups. Let us say that in the first test we have the 5 different experiences and that we are looking at only 10 different comparable segment groups (segments only matter if there is a different outcome for the comparable group). These might include things like new/returning, work hours/non-work hours, search/non-search, Firefox/Chrome/Internet Explorer, or any other of the infinite ways of dividing your users using any and all of the information available to you. You can always do more, but for the sake of argument and of efficiency, 10 different pools of the same population is enough. Segments are only valuable for targeting if we serve things to the comparable segment. If I assume that everything is purely random, then I have a 1 in 5 chance of my offer being the best. I also have a 1 in 10 chance of my segment being the MOST valuable.

(1/5) * (1/10) = 2%

So if everything is random, then I have a 2% chance of having picked the best outcome (the one that drives the highest revenue for my site), which means that in 98% of the scenarios, I have cost my site money. But let’s assume that you are REALLY good at picking segments and content based on your experience and your analysis. Having worked with nearly 300 different organizations, I can say that the best of the people who aren’t relying on causal data are no better than 2 times random at choosing a better option (they guess a right answer twice as often as a random pick).

Most groups do not fall into that category. In reality, most groups actually are worse than random at choosing the best option.

That means the math is only:

(2/5) * (2/10) = 8%

Let’s say you are the best person in the world at what you do, with great analysis and all sorts of tools, so that you are three times better:

(3/5) * (3/10) = 18%

So if you are absolutely amazing at what you do, then 18% of the time you will have guessed the right message for the right group. 82% of the time, another outcome is better, and most likely significantly better. You can reduce the chance that a better performing option goes undiscovered to 0% with a few simple steps and by accepting that we do not always understand the patterns before us. If we go back to random chance, then 20% of the time just doing nothing (your default offer) actually performs better for everyone. If you are the betting type, which would you take: an 8% chance of being right, or 100%? Especially when the scale of impact can be massive.
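
The arithmetic above generalizes easily. Here is a small Python sketch of the same calculation, which simply treats the 2× and 3× figures as multipliers on the odds of a random guess, exactly as the examples above do:

```python
# Chance of guessing both the best offer AND the most valuable segment,
# given N offers, M segments, and a skill multiplier over random guessing.
def p_right_guess(offers=5, segments=10, skill=1.0):
    return min(skill / offers, 1.0) * min(skill / segments, 1.0)

for skill, label in [(1, "random"), (2, "very good"), (3, "world class")]:
    print(f"{label}: {p_right_guess(skill=skill):.0%}")
# random: 2%, very good: 8%, world class: 18%.
# Serving every offer to everyone and reading the results observes the
# winner directly, which is the 100% side of the bet above.
```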

Remember that in all scenarios you are going to get an outcome, so that can’t be the measure of success. The process of finding the right answer is far more important than a conversation about the function of a tool. Nor can discussing only the impact on one segment be the measure, since we are not comparing it to others in context. The question is: did doing this one thing provide MORE value than doing another action (or doing nothing)? The only way to answer that is to compare outcomes. All of the downside comes when you look at “personalization” as just a function that you decide on and do. All of the upside comes when you discover value and then exploit it. There is nothing more valuable than when you are wrong, but the only way to discover that is to create a system that enables it.

The difference between a success and a failure with personalization comes down to this:

If the goal is to make money, the question is not CAN I do personalization, but how do I put steps in place to ensure that my actions have both a discovery and an exploitation phase?

Change the Conversation: Defining Success

One of the more common refrains I hear as I speak with different organizations or read industry blogs is: how do you deal with a failed test? People speak of this as if it is a common or accepted practice, one that you need to help people understand before you move forward. The irony of these statements is that when most groups are speaking, they are measuring the value of the test by whether they got a “winner,” a recipe that beat their control. People almost always come into testing with the wrong idea of what a successful test is. Change what success means, and you will be able to change your entire testing program.

The success or failure of any test is determined before you launch the test, not by the measurement of one recipe versus another. A successful test may have no recipe beat the control, and an unsuccessful test may have a clear single winner. Success is not lift, because lift without context is nice but almost meaningless.

Success is putting an idea through the right system, one which enables you to find out the right answers and allows you to get performance that you would not have gotten otherwise. If all you do is test one idea versus another that you were already considering, you are not generating lift; you are only stopping negative actions. In addition, finding something that beats control by 5% sounds great, until you add the context that the 3 other recipes you could have tested would have resulted in 10%, 15%, and 25% changes. Do you reward the 5% gain, or the 20% opportunity loss?

In the long run, a great idea poorly executed will never beat a mediocre idea executed correctly.

You can measure how successful a test will be by asking some very simple questions before you consider running the test:

1) Are you prepared to act on the test? – Do you know what metric you are using? Do you have the ability to push results live? Is everyone in agreement before you start that no matter what wins, you will go with it? Do you know what the rules of action are, when you can call a winner, and when it is too soon? If you answered no to any of those questions, then any test you run is going to be almost meaningless.

2) Are you challenging an assumption? – This means that you need to be able to see not only if you are correct, but also if you are wrong. It also means that you need more than 1 alternative in a test. Alternatives need to be different from each other and allow for an outcome outside of common opinion to take hold. Consider any test with a single alternative a failure, as there is no way to get a result with context.

3) Are you focusing on should over can? – This is when we get caught up in whether we can do a test, whether we can target a specific group, or making sure that we can track 40 metrics. It is incredibly easy to get lost in the execution of a campaign, but the reality is that most of the things we think are required aren’t, and if we cannot tie an action back to the goal of a test, then there is no reason to do it. These items should be considered based on your infrastructure and based on value. Prioritize campaigns by how efficient they are to run, and never include more than you need to take the actions you need to take. Any conversation that is focused purely on the action is both inefficient and a red herring taking you away from what matters.

So how then do you make sure that you are getting success from a test? If nothing else, you need to build a framework for what will define a successful test, and then make sure that all actions you take fit that framework. Getting people to agree to these rules can seem extremely difficult at first, but having the conversation outside of a specific test, and making it a requirement that the rules be followed, will help ensure that your program is moving down the right path to success.

Here is a really simple sample guideline to make sure all the tests you run will be valuable. Each organization should build its own, but they will most likely be very similar (a sketch of turning rules like these into a pre-launch check follows the list):

  • At least 4 recipes
  • One success metric that is site wide, same as other tests, and directly tied to revenue
  • No more than 4 other metrics, and all of these must be site wide and used in multiple tests
  • Everyone in agreement on how to act with results
  • Everyone prepared to do a follow-up test based on the outcome
  • At least 7 segments and no more than 20, with each segment at least 5-7% of your population and all must have a comparable segment
  • If interested in targeting, the test must be open to the larger population and use segments to either confirm beliefs or prove yourself wrong (e.g. if I want to target Facebook users, I should serve the same experiences to all users; if I am right, then the content I have for Facebook users will be the highest performer for my Facebook segment).
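
To show how mechanical this gate can be, here is a hypothetical Python sketch that encodes some of the rules above as a pre-launch check. The field names are placeholders, not any real tool’s API; adapt them to however your organization records a test plan.

```python
# Hypothetical pre-launch check for a test plan; all field names are
# placeholders for your own test-plan format.
def validate_test_plan(plan: dict) -> list:
    """Return a list of framework violations; an empty list means run it."""
    problems = []
    if len(plan.get("recipes", [])) < 4:
        problems.append("need at least 4 recipes")
    if not plan.get("site_wide_revenue_metric"):
        problems.append("primary metric must be site wide and tied to revenue")
    if len(plan.get("secondary_metrics", [])) > 4:
        problems.append("no more than 4 other metrics")
    segments = plan.get("segments", [])
    if not 7 <= len(segments) <= 20:
        problems.append("need at least 7 and no more than 20 segments")
    if any(s.get("population_share", 0) < 0.05 for s in segments):
        problems.append("every segment must be at least 5% of the population")
    if not plan.get("action_rules_agreed") or not plan.get("follow_up_planned"):
        problems.append("agree on action rules and a follow-up test up front")
    return problems

# Example: a plan that violates two of the rules.
plan = {
    "recipes": ["default", "ship_a", "ship_b"],  # only 3 recipes
    "site_wide_revenue_metric": "rpv",
    "secondary_metrics": ["bounce", "aov"],
    "segments": [{"name": f"seg{i}", "population_share": 0.1} for i in range(8)],
    "action_rules_agreed": True,
    "follow_up_planned": False,                  # no follow-up agreed
}
print(validate_test_plan(plan))
```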

One of the most important things that an optimization program can do is make sure that all tests follow a similar framework. Success in the long run follows from how you approach the problem, not from the outcome of a specific action. You will notice that at no point here is the focus on the creation of test ideas, which is where most people spend far too much time. Any test idea is only as good as the system by which you evaluate it. Tests should never be about my idea versus yours, but instead about the discovery and exploitation of comparative information, where we figure out which option is best, not whether my idea is better than yours.

Which variant won, whose idea it was, and the generation of test ideas are some of the biggest red herrings in testing programs. You have to be able to move the conversation away from the inputs, and instead focus people on the creation of a valuable system by which you filter all of that noise. Do not let yourself get caught in a trap of being reactive; instead, proactively reach out and help groups understand how vital it is that we follow this type of framework.

Change the conversation, change how you measure success, and others will follow. Keep having the same conversation or let others dictate how you are going to act, and you will never be able to prove success.