Rant – Why whichtestwon Makes you a Worse Tester

There is nothing less important than which recipe won a test.

I want to let that sink in.

Everyone loves to get caught up on which recipe won, because it is what you look at and it is what others want to know, but as a tester, it is the way that you arrive at that answer that determines if you actually provide value or just an answer. Individual outcomes interest people who have something invested in being “right,” whereas consistent, meaningful discipline is what matters for people who are invested in improving things consistently. If you knew that something was only the 2nd best out of 10 different feasible alternatives, you wouldn’t pick it, but when you only compare two things, that is most likely what you are doing. You haven’t accomplished anything and you are actually losing money. If you didn’t actually measure outcomes of multiple alternatives, or if you didn’t measure against a global site-wide metric, or if you did not account for the cost to arrive at that conclusion, then you are fooling yourself into thinking you have accomplished something when all you did was take resources from others to make yourself look good. It may impress others, but it has not provided one bit of value to the organization.

In order to find the best alternative, you need the context of the site, the resources, the upkeep and the measure of each alternative’s effectiveness against the others. Even if something is better, without insight into what other alternatives would do, it is simply replicating the worst biases that plague the human mind. Figuring out the better of two options is an answer; finding out the value of different feasible alternatives is providing value. Finding out who was right, “picking the winner,” is great for people’s egos, but making sure you are measuring multiple alternatives and that you are choosing the options that provide the highest return to the largest population for the lowest cost is what makes you successful.
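To put a rough number on the “2nd best out of 10” argument, here is a minimal simulation sketch. It is an illustration only: the function name, the random “true values” and the assumption that the tested alternatives are picked at random are all hypothetical, and it ignores measurement noise entirely.

```python
import random

def chance_of_finding_best(num_alternatives, num_tested, trials=20_000, seed=0):
    """Estimate how often the truly best alternative is among the ones tested.

    Each alternative gets a hypothetical "true value"; a random subset is
    tested, and we check whether the overall best was even in the test.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        true_values = [rng.random() for _ in range(num_alternatives)]
        best = max(range(num_alternatives), key=true_values.__getitem__)
        tested = rng.sample(range(num_alternatives), num_tested)
        if best in tested:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    # Compare testing 2, 5, or all 10 of 10 feasible alternatives.
    for tested in (2, 5, 10):
        print(tested, round(chance_of_finding_best(10, tested), 3))
```

Under these assumptions the chance of even having the best alternative in the test is simply the share of alternatives you tested, so a two-recipe test of ten feasible options will usually crown something that is not the best.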

To make it worse, people then look at the results and think that they will get the same result for their site, and in the worst case, they do. Sites like whichtestwon, which focus on letting people find out what won between two options, sound great and capture people’s attention. They let you guess and pat yourself on the back whether you are right or wrong, but the reality is that they are designed to feel good, not to actually provide value. If you wanted a site like that to provide value, then they would require

The problems of a tester are twofold: first, convincing others to test, and second, improving the testing to make sure that you are maximizing return and lowering cost. A good tester needs to be able to balance both, since there is little to gain outside of personal reward in just foolishly running tests. But sites like whichtestwon? They are designed to assist only the first: to provide evidence for people that you can get a positive outcome (missing that you also get outcomes from other uses of the same resources) without actually giving any real insight into whether you did provide a positive outcome (an outcome, by itself, tells you nothing). They are designed exclusively for people to abuse to push their own agenda. To take a quote directly from their tour:

“Site shows stats from various A/B tests – Finally I’ve got evidence to show clients on a load of design decisions!”

That shows everything that is wrong. Testing should be about seeing what the value of different test variants is, not making the case for a specific one that you want. In order to be successful, you have to prove yourself wrong. If it would have worked the first time, then there was no point in the test (and you are wasting resources to run the test) and you have learned nothing. You should not be given “credit” when you are adding additional cost and providing nothing more than validation for others. When you are wrong, when you have tested what you want and tested other alternatives, and you find other alternatives prove to be more efficient, even if what you wanted was better than control, that is the moment you are truly gaining something from your testing efforts.

There is a plague of people in our industry who try everything they can to show how much value they got from a single test. Who view testing as a way to get what they want up on the site over the HiPPO or someone else. Who abuse testing to push their agenda and who then take credit when they find something that proves better than what was there before. The act of running a test is not a measure of success, nor is having an outcome. Added value only comes from finding an outcome that is different than what you would have already done. In order to do that, you must measure multiple feasible alternatives and find an outcome different than what people want. If you aren’t able to do so, then the most fundamental problem you have is you, and how you think about testing. If you are able to, then the individual outcome, what won, is far less important than how you got there and what you chose not to do. The measure of a testing program is how often it proves people wrong, and how consistently it can do that with the least amount of resources possible.

Being a good tester means that you always know the relative costs. It means that you know how often something works, not just if it did one time. To be good, you should be able to create meaningful, actionable lift on all your tests, not jump for joy and promote yourself to the world when you managed to find one thing better on 1 out of 5 tests. Don’t settle for taking the easy road and trying to take credit. Add value, be better, learn how to look at things and you will actually create value, today and always. If you go down that road, however, then no one cares which variant won; it has no bearing on long-term success. Great, you found the thing to push from this campaign, but that is just one small step on a long road of continuous action. You wouldn’t reward someone because they managed to write their name on a test, so please do not think that whichtestwon somehow does anything to inform you how to be a better tester.

If you really wanted to see a site like whichtestwon matter, then show the variants that didn’t win. Show multiple options for each outcome and show what the best option was. Give us a measure of the cost and give us the internal roadblocks that you had to overcome. Let us know if that outcome was greater or worse than others for that group and what they are doing with the results to get a better, more efficient result next time. If you are interested in anything more than self-promotion, post the things that don’t work. Tell us how often something wins, not the one time it did win. Use the site to find examples of where you were wrong and inform yourself that you are not right… ever. The most we can ever hope to be is a little less wrong and working on a way to speed up the process for discovering just how wrong we are.

Understand the Math Behind it All: Bayesian Statistics

Most marketing people have only a passing interaction with statistics, and often times only understand it as a measure of how it has impacted their daily life. One of the funny things people don’t realize is that there are two completely different competing schools of thought when it comes to statistics. Most people are familiar with frequentist statistics, having dealt with things like normal distribution, bell curves, and established probabilities. The other school, Bayesian statistics, is a realm that fewer people are familiar with, but just as applicable. In fact, the move over the last few years is for more people to change from the frequentist model to Bayesian techniques.

So what is Bayesian statistics? To put it simply, Bayesian analysis is the use of conditional or evidential probabilities. It looks at what you know of the environment and past knowledge, and allows you to infer probabilities based on that data. It asks what the likelihood is of something happening based on our knowledge of past conditions and the context of them in the world. Where frequentist statistics can be viewed as much more of an evaluation of the larger data collection and judging the chances of something happening again based on those results, Bayesian statistics is about the likelihood that a set of results reflects the larger reality and about making inferences based on the limited data set.

A frequentist model looks at an absolute basis for chances. For example, if 52% of the population is female, then if I select someone at random from my office, I have a 52% chance of picking a female. The chances are purely based on the total probability. The Bayesian approach is to rely on past knowledge and then adjust accordingly. If I know that 75% of my office is male, and I grab a person, then I know that I have a 25% chance of picking a female.

So is it 52% or 25%? Both are correct answers depending on what question you are really asking, but both look at things differently. Frequentists look at the larger perspective of all chances, and base things off that ideal look at the world. Bayesian users use much more personal or past knowledge to infer information. Bayesian thinkers would much rather answer what the chances are that the total population is 52% female based on the fact that only 25% are female in this office. The risk with using Bayesian logic is that you are allowing bias and poor data collection to dramatically alter how you view things. The gain is that while frequentist methods will often be right in a controlled setting and over time, Bayesian methods have the chance to give you better information based on what you know. Bayesian logic also allows you to make conditional logic statements: based on the office scenario before and a little more contextual knowledge, you can answer “what is the likelihood that if you choose a woman, she would have blond hair?”. Bayesian techniques are often used for logic reasons, because they allow you to make a conclusion about the likelihood that something is the best answer based on what you know. Both techniques are at risk for black swan type errors in analysis, though Bayesian analysis can be even more influenced by only focusing on the known.
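One common way to formalize “start from past knowledge, then adjust” is a Beta-Binomial update. The sketch below is an illustration under stated assumptions, not a method taken from any testing tool: the 52% population figure is encoded as a prior, the prior strength is a hypothetical choice, and the evidence is a hypothetical sample of 20 colleagues, 5 of them women.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Return the Beta posterior parameters after observing binomial evidence."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution: the updated point estimate."""
    return a / (a + b)

if __name__ == "__main__":
    # Hypothetical evidence: 5 women among 20 colleagues met so far.
    women, men = 5, 15

    # Two hypothetical priors, both centered on 52% but of different strength.
    priors = {"strong prior (100 pseudo-observations)": (52, 48),
              "weak prior (10 pseudo-observations)": (5.2, 4.8)}

    for label, (a, b) in priors.items():
        post_a, post_b = update_beta(a, b, women, men)
        print(f"{label}: estimated share of women = {beta_mean(post_a, post_b):.3f}")
```

The estimate always lands between the 52% prior and the 25% sample. How far it moves depends entirely on how much weight the prior was given, which is the bias risk described above.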

So why is this important? Almost all testing tools and models rely on frequentist techniques to give you the global view of how often something fits into a pattern. This is why you see things like 92% confidence when evaluating results: we know that under similar circumstances, 92% of the sample means will fit into that window. Those techniques give you answers in an ideal situation and over time, but that may not be true of specific periods or non-normal events. They don’t take into account the context of this specific situation, nor prior history relevant specifically to the situation. They oftentimes don’t take into account even the contextual knowledge of the other recipes and information contained in that same test. They might be true of normal circumstances, but not of a special sale or seasonal activity. Bayesian techniques rely on prior knowledge that for testing is rarely available, and for analytics is problematic at best. They might reflect special circumstances, but not give a good long-term view due to those same mitigating circumstances.
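To make the two views concrete, here is a minimal sketch that scores the same two-recipe test both ways. Everything here is an assumption for illustration: the function names are hypothetical, the frequentist figure is a one-sided pooled two-proportion z-test (real tools may calculate “confidence” differently), and the Bayesian figure uses uniform Beta(1, 1) priors with Monte Carlo sampling.

```python
import math
import random

def z_test_confidence(conv_a, n_a, conv_b, n_b):
    """One-sided 'confidence' that B's rate exceeds A's (pooled two-proportion z-test).

    Assumes the pooled rate is strictly between 0 and 1.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += b > a
    return wins / draws

if __name__ == "__main__":
    # Hypothetical counts: 100/1000 conversions for A, 130/1000 for B.
    print("frequentist confidence:", round(z_test_confidence(100, 1000, 130, 1000), 3))
    print("Bayesian P(B > A):     ", round(prob_b_beats_a(100, 1000, 130, 1000), 3))
```

With flat priors and plenty of data the two numbers tend to sit close together. Neither one knows anything about a special sale, a seasonal spike, or the other recipes in the test, which is the point of the paragraph above.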

In all cases, nothing will replace understanding the context of what your data tells you, the patterns of it, and knowing how and when to act. You have to appreciate what the statistics are telling you, but also appreciate what they aren’t telling you. Any excessive belief in a single measure, by itself, is always going to be problematic. Just getting a statistical answer is not a replacement for the context and the environment in which you are gathering data, nor for making a decision.

No matter what techniques you use, no matter which camp you are in for the correct way to look at things, there is never a time when you can ignore the problems of any single type of analysis. You cannot replace using discipline and logic in your actions. Statistics are just a tool; they cannot replace proper reasoning, yet too many people look at them as a magical panacea to remove responsibility for action. Always remember that there are multiple ways to look at a problem, let alone hundreds of ways to solve it. Figuring out the efficient and best way for you is the real key.

Why we do what we do: Why “why?” Doesn’t Matter — the Texas Sharpshooter Fallacy

When choosing the next fallacy to cover, I faced a tough choice as there are so many different fallacies that describe the same human behavior: the belief that we know or can answer things we can’t by assigning pattern or reason to things without actual cause. We are wired to want to explain why things happen, but in order to accomplish that task, we ignore data or use only the data we want, and we substitute our own points of view as the core reason things happen. We believe that the world is far more established and easy to understand than it really is. My favorite fallacy that covers this behavior is the Texas Sharpshooter Fallacy, which is when someone assigns pattern or reason to random chance.

The name Texas Sharpshooter comes from this “story”:

A cowboy takes aim at a barn and starts shooting randomly. When he is done, he walks up and notices that there are a large number of holes in one area and fewer holes in another. He then paints a bull’s-eye over the area where there are a large number of holes. To anyone walking up, it looks like he was a good shot and mostly hit where he was aiming.
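The story is easy to reproduce. In the minimal sketch below (the grid size, shot count and function names are all hypothetical), shots land uniformly at random on a wall split into cells, and the bull’s-eye is painted only after looking at where the holes ended up.

```python
import random
from collections import Counter

def shoot_barn(shots=100, grid=5, seed=0):
    """Fire shots uniformly at random at a wall divided into grid x grid cells."""
    rng = random.Random(seed)
    return Counter((rng.randrange(grid), rng.randrange(grid)) for _ in range(shots))

def paint_bullseye(holes):
    """Pick the cell with the most holes *after* shooting, as the cowboy does."""
    cell, count = holes.most_common(1)[0]
    return cell, count

if __name__ == "__main__":
    holes = shoot_barn()
    cell, count = paint_bullseye(holes)
    average = sum(holes.values()) / 25  # 5 x 5 cells
    print(f"bull's-eye painted on cell {cell}: {count} holes vs. {average:.1f} on average")
```

Some cell always ends up with at least the average number of holes, and usually a good deal more, so there is always somewhere to paint the target. The cluster says nothing about the cowboy’s aim.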

Now while I am sure that we can all think of cases where others have done this with data, the first thing you need to understand is that we all do this… all the time. We see patterns and rationalize our own actions, whether it is why we do things in a certain order or even why we believe certain “truths” about the world. We rationalize decisions after we make them, and while they are not all random, our understanding of why we do things is often flawed at best and completely delusional at other times. The human brain actually engages the rationalization part after the action part, meaning that we always act, then think of why we act, not the other way around. We draw circles around the patterns of our own behavior and then accept those circles as the logic that led to the decision. This makes our understanding of why people do things often extremely flawed, since so much of how we view others’ behaviors is through the context of our own “understanding” of what drives our own actions. We so want to come up with a why, and we dive so deep, that we miss the point that we will never truly know. Nor does it matter, since we are describing a pattern, one that we can engage and interact with and build rules around, without needing to know all the causes of that pattern.

One of my favorite examples of this in the real world is a psychology professor in Baltimore who does the same demonstration each year. He starts his lecture by bringing a chicken up in a cage on stage. The cage has a feeder that is set to dispense food pellets at random time intervals. He then covers the cage and talks for an hour and a half. At the end of the presentation, he takes the cover off and without fail, the chicken is found doing some behavior over and over again; it has convinced itself that this behavior is why the food comes out. The food comes out no matter what it does, and it has no control, but it has convinced itself that it is in control of the situation. We are all like that: we need to explain things so badly that we will believe anything, or will paint bull’s-eyes where they aren’t, to make ourselves feel like we have more control than we really do.

We like to believe we are smarter than that chicken, but we aren’t. In our world, data is our food, so we assign patterns to explain changes in what we observe. Data becomes a crutch to accomplish this task. We so want to have a story to tell others and ourselves that we find one in the data. We believe that because conversions went up, the message must have “resonated,” or that because one group has a different winner than another group, it is because of their socioeconomic status or because they are more familiar with technology. We have no way of knowing this, but we convince ourselves and others that this is the reason why. The reality of the situation is that we need “why” to help us feel like we understand, but acting and using data in no way requires a why so much as it requires a willingness to act.

Looked at from a data perspective, this means that when we see a noticeable, meaningful change, often from testing, we are left to think of why it happened. People are fascinated with the “why?” often at the cost of what comes next. The reality is that we are always going to be looking only at a noticeable change and then applying rationalization after. We get so caught up in the why that we miss the truth that we will never really know, nor does it matter. Having a clear plan of action for our data means that we never need to know the why to be successful, and in fact ensures that the more we dive in and try to answer it, the more we are wasting resources. Acting on data requires willingness and alignment; it is decided before something happens. Rationalization is what happens afterwards. Why does not change your need to act on the data, nor does it allow you to have some sudden insight into human behavior. At best you have a single data point; at worst you are painting bull’s-eyes around holes and calling them insight.

Marketers have been trying to figure out the “why” for a long time, and while there are a lot of people who claim to know, the reality is that at best we have pattern, and at worst we have stories we present to make ourselves look good. You cannot derive pattern from a single data point, yet we are obsessed with trying to do that very thing. If we are honest with how we go about collecting data, and we are open to consistent and meaningful action from testing, then why will never matter. If we are following the data and are disciplined, then we know how we are going to act based on the results, not why the results happened. If we are disciplined in how we think about users, then we know that a story or a single data point will never tell us anything. If we really want to make things personal, then we won’t force “personas” on people, but instead let data tell us the causal value of changing the user experience and for whom it works best.

At its worst, the Texas Sharpshooter Fallacy represents our need to show that we are more in control or know more than we really do. We use the need to explain why to make stories and to help communicate our value to others. My background is in historical analysis, and one of the first things you learn is how little value comes from the first-person narrative. It shows far more about the fallacies of the person speaking than it provides real information about what is really happening. Data at its heart is meant to improve situations, not to allow you to come up with a story that satisfies your world view.

Why is not a question that you can ever truly answer, yet most people in marketing are obsessed with a Sisyphean quest to answer it. The reality is that it is a question that has nothing to do with how you act on data or the disciplines needed to be successful. We do not need to know why for everything, even if it seems to hold all the answers. We just need to know what to do with what is in front of us and to appreciate how little we really know about the world in which we live.

Optimizing the Organization: Getting the Right People in the Right Roles

If you think about the trinity of key elements to a successful optimization organization, the one that really determines the ability to drive the others is the people. It’s not enough to get people who are skilled or know the tool; those things can be trained. What is most important is to find people who will and can challenge the status quo and help change your organization for the better. You need people whose mission in life is to find better ways to do things and who do not accept the status quo. If you keep doing what you were doing, how do you expect to get better?

Your people, and the roles they fill, are what will allow you to change your organization’s view of testing and organize in efficient ways to accomplish tasks. So what people do you need? What roles do you need to fill and how do you know that you are doing the right things in those roles? There are common key roles within each testing program. All these roles might be filled by one person or they may be filled by 5 people each. It is not the number of people, but how they think and act that is most important. What matters most is the ability to put the right hat on when needed and to think and act in a way to be successful. Success here is defined as making the company the most money in the most efficient way possible. If your goal is to make yourself look good rather than to help your company make money, then please ignore the success and failure part of this post.

If you want to succeed, you need each of these roles to be identified, leveraged correctly, and to work together. Having just one bad apple, or letting someone turn into a villain, in your group will guarantee that you are limiting your results.

Program Sponsor –

Role – This is the person whose neck is on the line to make things work. They are there to align teams and to get everyone working towards a common goal. In order to accomplish this, your sponsor needs both the willingness and the power to get change done in your organization. They do not need to be involved in every day-to-day activity, but when it comes time to make sure that other groups are not upset by challenges to their eminent domain and to make sure that there is no infighting, they need to be the real leader who gets that done. They also help make sure the other roles are doing their job and that the program is not being used just to make one person or group look good. Their last duty is to make sure that things don’t just work right one time, but over the long haul, and that the program is assisting the entire organization to change how it views and accepts testing.

You will be successful if – You take the time to understand the disciplines of testing and how to make them work in your organization. You make sure people are held accountable for their actions and for delivering results, and make sure that there is constant, efficient effort from all parts of the group. You stop people from just trying to make stories up about how “right” they are and instead actively try to get people to appreciate being wrong. You make it clear that you and the team are not there to be yes men to other groups, but that by working together you can all work towards a common central goal.

You will fail if – You don’t get involved or get too involved. There is a fine line between setting the stage and allowing others to do what they need. You will also not succeed if you allow the program to cower to other powerful people’s whims or do not deal with the need for IT, design, analytics, and marketing to work together towards a single common site goal. You will also fail if you are more interested in building your empire than actually getting results. More often than not, programs fail when people try to make things look better for themselves over actually achieving results, and if you are not willing to stop that behavior, then you will not have a successful program. If you accept just getting an outcome over getting the best outcome, then your program will never be truly successful.

Testing Owner –

Role – This is the person who runs the day-to-day operation of the testing program. Unlike the sponsor, who sets the higher-level agenda and holds people accountable, your testing owner is the person who makes the magic happen and makes sure that the program is run day to day. Liaising with other groups, setting the daily agenda, making sure consistent resources are leveraged and making sure that the program does not get too caught up in individual tests or ideas are your primary duties. In order to do that, you will spend a large majority of your time educating and working with other groups to change their perceptions of what makes a successful test, and to show that you are measuring the value of every action. Creating the momentum and changing people’s perspectives to ensure a consistent optimization process, and not just a series of projects, is your day-to-day directive.

You will be successful if – You can navigate the political waters and keep the many plates spinning. If you can do all that and make sure that every action is not wasted on non-strategic things, then you have a real chance to be a rock star. You must work proactively to educate people about the disciplines of testing and how it is both different and valuable to the other disciplines in your organization. Being proactive and not letting others set the conversation based on their agendas is the primary directive of this role.

The number one challenge is not the day to day activities, but keeping the conversation at the right level and not allowing any of the small details such as technical implementation, test ideas, resources, or questions from outsiders to derail the program. It’s easy to do once, but doing it every day for months and years is the real discipline.

You will fail if – You are more interested in playing nice or promoting yourself than getting the right answer. If you get caught up in the concepts from the latest blog or conference, or in thinking that just getting a single answer from a test somehow makes you successful, then you are on the path to failure. You have to stop yourself and your team from thinking that just the act of getting a result somehow makes you successful.

Most programs fail to see higher value because the people in this role are so desperate to get a single result that they try to make it like other groups, or they start rewarding other groups for just finding something better and not for finding the best use of resources. They get so frustrated by the politics that they start making excuses, to others and especially to themselves, that they have accomplished more than they really have. More than anything, it is the failure to stand up and to do the tough things, like challenging people and not allowing popular opinion to drive things even if it might help you get ahead politically, that is the leading killer of programs.

Technical resource –

Role – This is the person who knows how everything gets done and can make magic happen with code. They own the creation of test code and the process by which it works with other code on the site. They should be part of every discussion working on the efficiency of efforts and talking about the best of the 50 ways to accomplish a task. They can also be helpful in liaising with other IT resources to make sure that everyone is on the same page.

You will be successful if – You are part of the conversation but understand that just because you can do something doesn’t mean you should. You will need to make sure that we are looking at things from how they fit together in a discipline-based view as opposed to what we can do. In order to do that, you will be part of the conversation talking about efficiency of not just doing one thing, but making sure that all tests are deconstructed and that you are testing multiple feasible alternatives, not just what is being asked originally of you. You are also working closely with and educating the IT group to make sure that other projects do not interfere with, and ideally are enhanced by, your testing efforts, and that testing efforts are not slowed down by getting stuck in IT processes designed for very different types of projects.

You will fail if – You think that your job is just to do what others tell you and that process matters more than outcomes. You will also fail if you try to do testing like other IT projects, especially QA. Many people will want to reward you for making their ideas come to life, but fundamentally if you are not stopping bad ideas, if you think it is someone else’s problem, then you are promoting yourself and not adding value. If you cannot find a way to make things work pragmatically in a speedy manner, then you have no chance of succeeding.

Optimization Analysts –

Role – The optimization analyst is the person who owns the reports and the outcomes of the testing program. They may also be the one who sets things up in the tool and who helps project manage day-to-day activities. When it comes time to make calls on outcomes, they take the lead, creating reports and helping educate others on what was learned from each and every action. They will need to know a little bit about many different things and should be able to hold people accountable to established rules of action, while at the same time talking to other groups on their level.

You will be successful if – You set the way that things are done before a test, not after. If you don’t cower to every question that is asked or pretend that you know something that you don’t. If you are willing to remind people to focus on what matters and not just dive into every question because you could find some data. If you work tirelessly at finding better ways to do things.

Ultimately, you are successful here by what you convince others to not do far more than what you do.

You will not succeed if – You try to answer every question after the fact or if you allow others to change measurements to meet their personal agendas. If you do not understand the biases of people and how they are trying to use you and the data you provide for their own self-promotion. If you are not willing to step up and challenge people or to remind them of what matters, then you will never succeed. If you get caught up on the latest “buzz” or “expert” and forget to work towards the right goals, then you will also not succeed.

Designer –

Role – You work with the testing team to create the materials needed for each test. You make sure that you are testing not just the concept but also the execution of each concept by producing multiple ways for each item thrown your way. You own the relationship with your design teams and you work with the team to make things look good enough to go live on the site, while making sure that you are open to new things that don’t meet current standards. You will be the one making sure that you have a large amount of different feasible alternatives for every effort. You will also be the one helping communicate what is learned and how it changes or breaks current “best practices” or style guides for your group.

You will be successful if – You make sure that you are testing out everything that is feasible and that the alternatives are different enough from each other. You get excited by the chance to try new things and know that the worst thing you can do is try to get “approval” for ideas.

You will fail if – You get too caught up in what has been there before or style guides or “experience” or any of a hundred other excuses and fail to test out things that go outside of everyone’s comfort zone. You don’t get excited by the challenge of being creative and get too caught up on requesting or wanting feedback. You will also fail if you allow tests to get too caught up on content and fail to address the other much more efficient parts of the user experience such as real estate, function, and presentation.

There are of course many other roles, and the nature of each of the ones listed is different for each organization. What does not change are the disciplines needed to succeed. Make sure that you are aware of who is doing what for your group, and even more importantly, that you are doing the things to succeed and avoiding the pitfalls that each person is bound to come across. You won’t be perfect on day one, but that doesn’t allow anyone in any role to avoid the responsibility of getting better and making things work for you.