July 11, 2012

Why we do what we do: The Graveyard of Knowledge

The study of history teaches you many key lessons, such as the unreliability of the first-person narrative, the inability to understand the scope of something while it is happening, and most importantly that history is written by the victors. Howard Zinn made a living out of showing people just how true this is and how little we understand our own world because of this. This same mistake is made not just in historical analysis, but all the time in data analysis, when we only look at attributes of the “winning” side while forgetting to analyze the non-winning side of the data. While closely related to the Halo Effect, this effect shows itself in what people like deGrasse Tyson and Taleb refer to as the graveyard of knowledge.

Neil deGrasse Tyson best explains the graveyard of knowledge with one of his stories:

You read a study that says that 80% of people who survived a plane crash studied the exit routes before the plane took off. Comfortable with this knowledge, the next time you board a plane, you quickly study the exit routes on the plane. As you do this, you start to analyze that data and you come to a sudden realization: what if 100% of the people who did not survive the crash studied the exit routes?

We don’t know the other side of the story, because there is no one there to report. We only know the people who came back and what they can tell us, but we have no clue what was going on with those that did not come back. Knowledge is lost all the time when people only look at the winners. Winners are the only ones left to tell their stories, so we only look to them for details. The reality is that the important details are rarely only on the winning side, and all the people who never returned give us just as vital knowledge. We lose all the really important information in that graveyard of the people who never returned. In fact, we can only start to understand anything if we have both pieces in order to have context for our information. We focus so much on those that “survived” that we completely ignore the context of the people who did not. We look only at the behaviors we want and then extract qualities about that group of people, without looking at the population as a whole or, more importantly, what would have happened if we did nothing.
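To make the plane example concrete, here is a minimal sketch in Python. Every number in it is hypothetical and the function names are made up for illustration. It holds the headline fixed (80 of 100 survivors studied the exits) and changes only what happened in the graveyard.

```python
def survival_rate(survived: int, died: int) -> float:
    """Fraction of a group that survived."""
    return survived / (survived + died)


def rates_by_behavior(survivors_studied, survivors_total, dead_studied, dead_total):
    """Return (survival rate of those who studied, survival rate of those who did not)."""
    studied = survival_rate(survivors_studied, dead_studied)
    skipped = survival_rate(survivors_total - survivors_studied,
                            dead_total - dead_studied)
    return studied, skipped


# Hypothetical crash: 100 survivors, 80 of whom studied the exits (the "80%" headline).
# Scenario A: of 400 who died, only 100 had studied the exits.
scenario_a = rates_by_behavior(80, 100, 100, 400)
# Scenario B: of 400 who died, all 400 had studied the exits.
scenario_b = rates_by_behavior(80, 100, 400, 400)

print(f"A: studied {scenario_a[0]:.1%}, skipped {scenario_a[1]:.1%}")
print(f"B: studied {scenario_b[0]:.1%}, skipped {scenario_b[1]:.1%}")
```

The same 80% statistic sits on top of both scenarios. In A, the people who studied the exits survived far more often than those who did not, and in B it is the reverse. Even then, this is only a comparison of rates, not proof that studying the exits caused either outcome.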

We look for people based on the end of their behavior, and not their definition before. We love to know what are the characteristics of people who made a purchase, or of people who come to our site more than 3 times. We look backwards from that winning behavior because that is all we think we have available to us. We love to describe past behavior through correlative behavior, and then attribute “value” to those actions. People who purchase use internal search 2 times on average, therefore internal search must be the cause of that action. People who come from social sources spend $4.56 on average, therefore social is worth $4.56. We don’t know what would have happened if the same person didn’t use search or come from social: would they have spent more or less? All of these types of analysis attribute past behavior to end value, missing the point that we don’t know what they would have done otherwise. Is looking at the exit routes helping or hurting your ability to survive? We don’t know if more is better; we instead assume a linear relationship. If campaign X is generating value Y, then doubling spend on X will of course generate 2Y.
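A small sketch of the gap between “average spend” and “value”, in Python. It assumes you have a comparable group of visitors who did not come through social (for example a randomized holdout), which the original analysis does not have. The spend figures and the `incremental_value` helper are hypothetical.

```python
from statistics import mean


def incremental_value(exposed_spend, holdout_spend):
    """Average spend with the action minus average spend without it."""
    return mean(exposed_spend) - mean(holdout_spend)


# Hypothetical spend per visitor, in dollars.
exposed = [0.0, 0.0, 9.12, 9.12]   # came from social: averages $4.56
holdout = [0.0, 4.0, 4.0, 9.0]     # comparable visitors without social: averages $4.25

print(f"average spend of social visitors: ${mean(exposed):.2f}")
print(f"incremental value of social:      ${incremental_value(exposed, holdout):.2f}")
```

With these made-up numbers the social visitors do average $4.56, but the difference against what comparable visitors spend anyway is far smaller. With a different holdout it could just as easily be negative, which is exactly what the average alone cannot tell you.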

Looking only at the data from one group, or at the data that defines the “winners”, means that you have completely lost any value from that data. We cannot express how much better or worse an action made things, only that we have X amount of search spend and ended up with Y revenue. Even worse, pretending that you can derive cause and effect from the larger context means that you are not getting value from the actual data itself, but instead propagating your own world view and using the data only to support it. Like the Texas sharpshooter fallacy, you are creating a story to fill in what is most likely random noise from the data. Rates of action, such as 80% of people looked at the exit routes, tell you nothing unless you know both that increasing that number increases your ability to survive, and you know the cost and ability to influence people to make that action. I can tell you that 100% of people who are determined to spend $1000 on your site will spend at least $1000, but that doesn’t tell me how I get those people in the first place, or if it is worth my time to spend the resources there for that small population as opposed to the multitude of other alternatives.

People make this mistake all the time in the world of data analysis when they get so caught up on a set path or on looking backwards from an event. They want to know what all the people who purchased did, or what all the people who come to your site 4 times have in common. There is even a whole world of statistical analysis focused on clustering and personas, which is making a large push in our industry, that is built on this tendency. The mistake people make is that looking at only a small part of your population fails to tell you the context of that information. Like the plane, knowing the attributes of one group doesn’t tell you the attributes of the population as a whole. Even worse, it assumes that those attributes have anything to do with that behavior. We have no way of knowing if people who survived just happened to look at the exit routes, or if people who look at exit routes are more likely to survive.

In the world of testing, this bias makes itself present in people who want to know actions between steps. They want to know, of the people who purchased, whether they went to a product page or a search results page. They want to know what path people took or what they clicked on. Even if this knowledge was not ignoring the graveyard of knowledge, what would it tell you? More people went to the search results page: is that a good thing or a bad thing? You are accomplishing nothing with this data except adding cost and slowing down your ability to make the correct decision. It is easy to get lost in the world of data if you are trying to tell a story or if you want to find a preconceived point, but as soon as you are trying to use the data to find an answer and not just support your point of view, the discipline of what you look at and knowing what it can tell you becomes paramount.

So the question is, 40 years from now, will all the analysis you do be part of the “winning” group, or will it be lost in the graveyard? Stop pretending that data tells you more than it really does and stop only looking at the winning side, and you will be able to derive magnitudes greater value from your data. The discipline of looking at the whole context and of discovering the value of actions is what will grant you results, not just finding stories. Remember that patterns are only patterns, they are neither good nor bad, and it is incredibly easy to forget that even if they are perfect, they tell you nothing about your ability to change them, or the cost to do so. Data can be the most powerful tool in your arsenal, but it can also be abused to no end and provide negative value and a blanket justification for poor decisions.

June 20, 2012

7 Deadly Sins of Testing – Weak Leadership

Leadership is one of those things that is often described as “I can’t describe it, but I will know it when I see it.” It is the mercurial item that everyone knows that they need, but is often lacking on many different levels in larger organizations. If lack of alignment on a single success metric is the greatest “sin” of a testing program, then lack of leadership is easily the second leading factor of programs not achieving success.

The first thing to realize is that leadership is not something that is tied to position. It can come from any person and any level in an organization. While it is true that you need sponsorship and assistance from higher levels in an organization, the truth is that leadership and management are not correlated, and in a really successful program, you should expect it to come from each and every level. Leadership is the ability to change perceptions, to hold people accountable, and to keep a program focused on what matters. It is the ability to keep people from abusing the program or trying to prove themselves right, instead changing the conversation towards one that benefits everyone. There is also the need for leaders to make sure that people stop focusing on can and focus instead on should, and to make sure that no one is doing an action just because it will make a single person, no matter how high up in the organization, happy. In the world of testing, it is far less about getting people to do a certain action than about the ability to stop them from doing any of the large number of actions which drive a program into the ground. It may not be in your job description, but if you care about doing the right thing, then it is everyone’s job to help the company grow and to get better.

If your program is doing one of the following: tracking clicks, having multiple metrics, only testing one thing at a time, only testing “test ideas” as they come filtered down to your program, targeting only, analyzing results like they are analytics, diving into every question, not proving people wrong, trying to answer “why?” for every result, constantly diving into technical questions, not consistently testing, or not constantly improving how you do all of those actions and more, then you are suffering from a lack of leadership.

Leadership needs to be able to frame a conversation, to stop people from just responding to requests and to instead proactively set the definition of success. It is up to everyone, but especially the leader of a program to hold people accountable for their actions and focus on the increase of value on the outcome solely. You will almost always need a program sponsor who is aligned with these goals and who is willing to be the stick to force action, but the truth is anyone from the lowest analyst or developer to the CEO can take the lead on these actions.

Stop worrying about what your title is, and instead focus on what you can do, today, to fight the battles that need to be fought and to help make sure that none of these things are taking place in your organization. What is far more important is a willingness to do so and the ability to do it consistently, over time, especially as it gets hard, frustrating, or complicated. Leadership is dealing with the tough battles, not the easy ones. You cannot worry about getting credit; instead you have to focus on making everyone better. It is not about you, it is about everyone working together. It is not always about being “right”, but instead about working with and driving each component of any program so that it is focused on the discipline of success and of always driving people to be better and to do better.

Testing programs, like any other group, often find themselves from time to time with people who can and will do the things necessary to succeed. They “luck” into individuals that create giant steps forward and improve things. What happens then when these people leave? Or get put on other projects? Or lose their focus on what matters and instead focus on their individual success? These programs then stagnate or revert to a plateau of generally accepted competence. Leaders make sure that others are ready and capable of taking charge, because the more people that can drive the program forward, the less is on any one person’s shoulders. The challenge is to build out not just one person who does these things, but then to make sure that others learn and are expected to do the same. You need to hold people accountable to be just as much a leader, and for being capable of doing these things in the absence of that one person.

So many people are afraid of creating their own replacement or of duplicating skill, even though these skills are universal and it only makes you that much more valuable. Take the time to evaluate where you are and what you can do to be better. Work with people at all levels to get a read on what needs to happen, learn, grow, and have a focus. So many people at companies are capable of talking about leadership, but very few are capable of the actions necessary to succeed. Take the time to think about how much of what you do is just what is asked, and how much you do to actually improve things. Are you just trying to make your boss happy, or are you truly changing things for the better? What about the people around you?

Leadership starts from within. There is no greater difference between a follower and a leader than the willingness to change and the willingness to act. It is easy to blame people from the top, and the reality is there may be many people at many levels who have no interest in improving things or changing, but you can always find someone who sees the need to improve. Don’t wait for tomorrow to make the right call or to challenge the status quo.

Programs fail when they become reactive and stagnant, and when they are used by others to facilitate an agenda. All of those problems can be dealt with and overcome over time with leadership. What you do to be better, and what you do to make others aware of the need for them to be better, is what will define how far you and your program really go and what you can really achieve. Help make people higher than you aware of what they need to think about, challenge the people next to you to be better, and help those lower than you to constantly learn and grow until they are above you. A rising tide raises all ships.

June 4, 2012

The Tao of Testing: The Interplay of Strategy and Process

One of the most common refrains from established testing organizations is the need for process. They use the “need” of improved process as the excuse for why they don’t do more, why it takes so long to run tests, why they only get ok results from the program, and why they don’t impact the business in a bigger way. They love to talk about how many tests they can run, but are hesitant to talk about the real impact of the program. Process becomes this constant refrain which serves no purpose but to explain that person’s existence and to ensure that no one gets upset. It doesn’t matter what is going on as long as consistent action is happening and other people are using it, as if just the act itself is the provider of value. Process gets turned into this mythical jabberwocky whose function is simply to make it impossible for programs to rule the world. Why do groups so easily accept process as the cure-all for what ails them, instead of focusing on improving the efficiency of their program?

Process is just a pattern of actions. It’s what those actions are, and how they are leveraged, that truly matter. A great process makes it easy to do the right things quickly and consistently, but it also enables you to do the wrong things consistently as well. So many groups are limited because they get to a point where they are proficient at running a test, but have not spent the time to learn and accept what makes a real methodology or strategy to testing. They think in terms of how to make their lives easier or how to do more, not about getting more from what they are already doing. They accept the first answer they find and are more than happy to propagate a myth simply because it gets people to do more. People are so happy to get a result from their test that is actionable that they try to recreate ways to do that more and more, instead of focusing on ways to get better results or to increase the likelihood of a positive outcome. Groups in that stage are forced to try and run more and more tests in order to increase the value of the program, confusing excess resource usage with value.

It might seem frustrating because you want to do more, but the question is never how do you get more resources or how do you run more tests, but how do you get more from what you do have. Running a test when you start seems like a Sisyphean effort, but the question should not be how do you make the boulder smaller, but what is waiting for you when you make it to the top of the hill. It is not random, nor is it impossible to get great results consistently, but you have to change how you think about the world before you can change the hill you are trying to climb. You have to stop trying to impress or make people happy and instead truly focus on the discipline of making each action more valuable than the last. To top it off, there are a large number of groups who convince themselves that the larger hill must have a better reward, when often the inverse is correct. Climb the right hill, and the path is shorter and the boulder smaller. Climb the wrong hill, and all you are looking for is a smaller boulder to get off your shoulders.

This is not to say that a proper process cannot dramatically improve a program; however, process becomes a problem when held up as a holy grail as opposed to just a means to an end. I want to walk you through a mental exercise where we evaluate four possible scenarios for an organization, and see what the outcome would be. In this case, we are simply dividing groups by whether they have proper process in place and whether they have proper discipline and methodology in place.

Group #1: Poor Process & Poor Strategy

Groups in this quadrant suffer from resource shock for testing. They have accepted that they might only see anything positive from 25–50% of their tests and are struggling to run a number of tests. They are often frustrated by their inability to do more, and as such they try to make up for it by making larger and more complex tests, often requiring even more resources at each turn and slowing down the process even more. They are often running tests in-line with other groups in the company and trying to piggyback their efforts to show that they add value.

Most organizations start here and sadly never leave, even when they dump resources into hiring more people or hiring an agency. Groups in this phase suffer from inconsistent returns, low value, and poor adoption in the organization. The only saving grace is the low amount of resources that are dumped into the program and the small exposure highlighting their inefficiency. To make it worse, agencies love to keep people in this state, as it makes their efforts to get a result seem much greater than they really are and it justifies massive amounts of hours to achieve these minuscule results. They get one or two results over long periods of time and then spend hours justifying it as a great example of the power of their efforts.

At this point, you are a weak man trying to push a rock up a large hill that you are not sure what is on the other side of. It can be done, but it takes a long time to succeed.

Group #2: Poor Process & Proper Strategy

Organizations in this phase have very poor resources, low communication with other groups, and only the most basic of infrastructures on the site. They have to fight IT and product management and are often stuck running a few tests a month or quarter. The tradeoff is that they are maximizing their return through proper use of resources, learning, and best not better testing. They are at a state where they have lots of roadblocks and low resources available to them, but they are planning their program around the most efficient ways to exploit the resources they have. They are not settling for the excuses, nor are they running a perfect program, but they are focusing on the testing program that they have now instead of trying to make what little they have fit into a predetermined mindset of what “success” is. They are focusing on using what they have to its maximum return instead of facilitating others’ ideas or “test ideas”. They are not trying to fit a square peg into a round hole, but instead trying to find what fits best into the hole they have access to.

At this stage, you expect that every test they run will give them multiple pieces of information about the value of feasible alternatives against each other and expect that 100% of tests will feed the next test and produce actionable insight. They expect that 75%-95% of tests will also deliver actionable meaningful lift of multiple alternatives which they can then use to measure the efficiency against each other to choose the best option. They might only run a few tests a month, but they are usually getting magnitudes greater direct return from their entire program than programs spending 20x on resources.

Groups in this phase use very few resources and due to that are not able to move at the speed they want. These groups, though, are able to maximize their efficiency by focusing on getting the largest return for what they have and making sure that every action returns positive and meaningful information. They could do more with better processes and greater resources, but they are still producing results that knock other initiatives out of the water.

At this point, you are still a weak man, but you have chosen shorter hills and you have a really good idea what is waiting for you on the other side. You are not letting some person miles away dictate the hill you go over, but instead finding the best use of resources. Each hill you summit points you to the next hill to tackle instead of predisposing the order of hills that you will tackle.

Group #3: Good Process & Poor Strategy

Groups in this phase have accepted that testing is a great thing for their organization. They love to test and are great at getting tests live. They can often expect to get hundreds of tests live a year and are great at assigning resources, creating a charter, having QA go through a test, and having multiple resources assigned to get a test live and going. Often times they might also work with agencies and pay outside people to make sure that they have those resources. They are still only getting results from 25–50% of their tests, and are often finding that they have to spend more and more resources to get larger and bigger creative as they run through their “roadmap”, but at least they do have results because they have run so many tests. Often times groups in this stage have come up with non meaningful measures of success, such as click through rate or bounce rate, in order to prove to executives that they are doing far more than they really are. This is the point at which a large number of very mature programs find themselves.

Groups in this phase have large dollar figures they can point to for outcomes, but when looked at for efficiency are often getting very poor ROI compared to efficient testing programs and are taking resources away from other efforts in the organization. They have built an empire for the people running the program and because of the nature of testing, are often one of the top revenue drivers for the company, but are dumping money to do so. I love that these groups appreciate and believe in testing to the degree that they do, but in many ways they are not maximizing their investments. By propagating the myth that more tests equals better results, they allow people to confuse getting a result from a test with getting a meaningful result of viable alternatives. Often times these groups get about the same monetary return from their programs as the organizations in group #2 above, but are spending magnitudes more on resources to do so.

At this point, you are spending money for a giant crew and large industrial equipment to push the same rock up usually now steeper hills where you still don’t know what is waiting for you on the other side. There are lots of groups out there waiting and asking to take your money to be the crew and to give you the tools so that you too can mount this hill.

Group #4: Good Process & Proper Strategy

There are very few groups that are in this quadrant, but those that are constantly see returns and have built out their programs so that they can test everything and have everyone on the same page to act and test what is necessary. Because they are being efficient and learning as they go, each test exponentially has a higher chance of a positive return and they are hitting on 90–100% of their tests to generate at least 2–5% lift and to compare alternatives. They will run a number of tests, but they focus on the outcomes of the tests and not the raw number it takes to get somewhere. They are also changing the path of the entire organization, showing the causal value of different alternatives and helping to stop current activities while discovering new ones.

Groups in this phase see magnitudes higher gross return on their efforts. They are often not spending much more in the way of resources than organizations in group #1, and they are making high multiples of the gross returns of groups in group #3.

At this point, you have gotten some people that know how to move a boulder, and you have bought some equipment to move it, but you have chosen much better hills with much more consistent returns and you are choosing to point the people only at the hills that have the highest returns. The boulder has become irrelevant to the hills you are tackling.

A proper process and more resources simply highlight the dimensions of your program. If you are inefficient or do not know how to run a proper program, it allows you to do more and get a higher gross return. There are many people who hide behind process in order to justify their jobs, and that is a shame, as they could do so much more if they were just willing to play less politics and focus more on achieving results. If you care about being efficient and about getting the maximum amount of return, however, then you have to focus on both parts, and understand that more is not always better. If you are only hitting on 50% or less of your tests, or you only get a single outcome from your tests, then you have to run more than double the number of tests in order to get the same gross return as someone who is hitting on 100% of them (compound returns and efficiency of measuring multiple alternatives against each other).

It might seem frustrating because you want to do more, but the question is never how do you get more resources or how do you run more tests, but how do you get more from what you do have. It is easy to blame process or to think that a better process somehow creates a magical panacea that solves everything, but that is because we are looking externally. It is far easier to blame others than to look internally and make sure you are doing the most with what you have. Be honest with yourself and use what you have, and you will be amazed at the results you achieve and how little in the way of resources you need to achieve them. Change how you and others think first, and then process can truly make a big impact.

May 23, 2012

7 Deadly Sins of Testing – No Single Success Metric

In my nearly 9 years working in testing and data, I have worked with or evaluated close to 300 different sites and their testing programs. While I wish that I had all these great stories of amazing programs left and right and amazing results that they are all receiving, the sad truth is that there is no perfect program and there are very few you would even want to take something from as a starting point. The reasons that programs get into this stage are legion, but there are 7 common “sins” that destroy programs. I want to go through each of these 7 deadly sins and look at how they manifest and how to fight them. What you will find is that all of these sins come from the same place: a lack of understanding, either willful or not, of how to think about testing or what the difference is between being the hero or the villain. What you choose to do about these sins is up to you, as there can be no greater retribution than evaluating your own actions, finding your own weakness, and then turning that into a strength.

The first of these sins, and by far the most evil and damaging, is failure to align your program on a single success metric. So many programs fail because they optimize to their KPIs, or to the concept of the test, or even worse, to what the group running the test is measured on. They optimize to improve their concepts, not to improve the site. What makes this sin especially dangerous is that it will make it look like you are greatly successful, as you will get a return, and because the thing you are tracking is not a site wide revenue metric, you will often find that the magnitude of change is dramatically higher.

The reason this is a sin is that you are mistaking the concept or the area for the end result. You are ignoring the unintentional consequences of the test to focus on what you want to find out, not what you need to find out. You are assuming that the world only works exactly how you think it does and you are abusing the data to prove your point. A classic example of this is testing to improve “bounce rate” or clicks. In both cases, you are mistakenly thinking in a linear fashion, assuming that the rate of action is the same as the value of the action. Only in cases where the rate is the same as value would you see the two tie together, but you will not know the value unless you look at the global impact. To put it more simply, if the reduction of bounce rate or the increase of clicks matters, it will impact the bottom line. If it does not, you will see that in the bottom line as well. In both cases, the intermediary action, the bounce or the click, is simply a means to an end, but we are forgetting this in order to make the concept easier for us to understand. In all cases, you can see a massive increase, but because it is not tied to the end result, you have no clue if it is helping or hurting your site make additional revenue or be more efficient.

In way too many cases, when you go through and evaluate global impact after the fact, you find that the increase that you are shooting for comes at the cost of higher value actions, meaning that improving your click through rate to your section is costing your site total revenue. When you don’t do the work to find out, then you will continue to waste money and decrease performance, while at the same time have many great impressive large lifts to talk about to your boss.
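Here is a minimal sketch, in Python, of how a test can “win” on a sub metric while losing on the site wide one. The `Variant` class, the `lift` helper, and all of the figures are hypothetical, and the sketch ignores statistical noise entirely. It only shows that the two metrics can point in opposite directions.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    visitors: int
    clicks: int       # clicks into the promoted section (the sub metric)
    revenue: float    # total site wide revenue from these visitors

    @property
    def click_rate(self) -> float:
        return self.clicks / self.visitors

    @property
    def revenue_per_visitor(self) -> float:
        return self.revenue / self.visitors


def lift(test_value: float, control_value: float) -> float:
    """Relative change of the test experience against the control."""
    return test_value / control_value - 1


# Hypothetical results for two experiences with equal traffic.
control = Variant("control", visitors=10_000, clicks=500, revenue=25_000.0)
promo = Variant("promo", visitors=10_000, clicks=800, revenue=24_000.0)

print(f"click rate lift:          {lift(promo.click_rate, control.click_rate):+.1%}")
print(f"revenue per visitor lift: "
      f"{lift(promo.revenue_per_visitor, control.revenue_per_visitor):+.1%}")

# Decide on the single site wide metric, not on the sub metric.
winner = max([control, promo], key=lambda v: v.revenue_per_visitor)
print(f"winner on site wide revenue: {winner.name}")
```

Judged on clicks, the promo experience looks like a large win. Judged on revenue per visitor it costs the site money, and the control should stay.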

What is especially frustrating with this sin is that there are many different groups and “experts” out there that are more than happy to propagate the myth or to abuse it to make themselves look good. Agencies are especially notorious for this behavior. They let you pick a sub metric and optimize to that, which has the double advantage of feeding your ego and avoiding dealing with the core issues that will define your success. Even worse, they will talk you into or let you pick multiple metrics, and if the first one doesn’t show how amazing they are, they will find one deeper in to show how big an impact they had for you. Look at any of the hundreds of posted “success stories” that flood the market, and look at how many of them are based on improving a metric that has nothing to do with site success. This is their fail safe to make you feel better about your program while simultaneously sucking more money out of your pocket.

For many groups, figuring out what the purpose of their site is or what defines success site wide (almost always revenue) is a difficult and time consuming task. It is also the single greatest marker of whether you will receive any value from your test. I refuse to work with a group unless they have figured out what they are trying to do for the entire site, and then I will only run a test if they agree to make decisions only off the impact to that bottom line. The results then can tell you so much. If you find that you are not impacting it much, that means that you are only doing better testing and only testing what you want. If you find that promoting item X increases revenue for that item or group, but the site loses money, you should re-evaluate your priorities for merchandising. If you find that getting more people to your cart doesn’t increase revenue, then you are not optimizing to value. In all cases, the actual reason why things are happening is almost completely irrelevant; the value is derived from acting in a meaningful way only on site wide goals.

Look at what you are doing and see if you are committing this sin. Are you tracking different success metrics for different tests? Do you look at dependent metrics, such as a limited product set or only conversions from people who click on something? Are you looking at metrics that have no tie to any site wide success, like bounce rate or clicks? If you are doing any of those, then you are committing the greatest sin of testing. You are wasting your time and energy to sub optimize and are assuring that you can never know the real impact of your tests.

Finding your single success metric can be difficult and can cause a lot of headaches getting buy-in, but unless you are willing to do the hard work, then what is the purpose of your program? You have no chance of finding real value, and the best you can do is make someone think you are having a much greater impact than you really are. There are many bricks on the path into and out of the darkness; it is up to you which direction you are traveling on them.
