The Need for Speed

One of the largest differentiators between starting, middling, and mature successful programs is the speed at which they can execute. The more you move, and the more efficiently you move, the more value you get and the more you learn. You will get nothing from running 100 bad tests, but if you have even a little discipline in your test design, then the faster you move and the faster you act on that data, the more value you will receive and the greater your chance to learn and grow. If it takes you two months just to answer a simple business question, how can you expect to keep up with the industry? Speed multiplies the value you receive once you have enabled and acted on the other disciplines of optimization. You have to move fast to ensure that you are getting the most from your program.

Dealing with hundreds of different testing groups, you see very different views of what counts as an acceptable speed for getting a test live, from concept to execution. In some groups it can take months to get a test live, while others execute in mere hours or minutes. No matter what you think is normal, the reality is that every group needs to work every day to get faster and more efficient with their program. The sad truth is that most groups treat acting fast as a failure of the system, letting the weight of their own size work against them to the point that you feel like Sisyphus whenever you try to get different teams to work together in an expedited manner. If you want to really get value from your program, you have to create the infrastructure before you try to run, in order to maximize speed and make sure that you are executing at a high enough level for success.

One of the most common things I talk about with clients is that nothing will expose your organizational inefficiencies more than trying to act quickly. Whether it is an out-of-touch IT group, a lack of training, a need to get approval for every change, a slow and unresponsive design team, or a thousand other barriers, every group has a large number of roadblocks that stop them from moving quickly. But this doesn't mean that you can accept these barriers as the status quo; you have to maximize each effort and make sure that your conversations and energies are spent on improving the infrastructure of your testing program, not just on running any one test.

There are many different ways to organize for success, whether it is a L3PS model, a hub-and-spoke model, or many others; each group needs to figure out what works best for them. It is the act of finding and creating these efficiencies that makes programs great, not that great programs happen to have these items available to them. Great programs have common items in place and aligned in order to facilitate speed and to make sure that barriers and the inevitable road bumps are dealt with as quickly as possible:

1) Sponsorship – All programs that really achieve levels of success have someone higher up who is taking an active role in the program and making sure that things are communicated properly in order to facilitate the various groups working together. It is nearly impossible to really grow your program without someone greasing the gears of your organization.

2) Consistent resources – Testing can't be run on a project basis; it needs a core team who are learning and growing their skills in order to become better at what they do. There are many different ways to allocate these resources, but successful teams start with dedicated time and usually grow into large teams of people who do nothing but work on optimization efforts.

3) Technical infrastructure – You need to make sure that your site can properly track your single success metric, and that you can get the large majority of tests live with no technical involvement. This means you need global pieces of infrastructure, as well as properly set-up components for tackling the key pages on your site.

4) Knowledge transfer / knowledge repository – There is a large amount of education needed to really move forward with your program. The programs that have the most trouble are the ones that try to run optimization like they do analytics, or vice versa. You need to train people on how best to think about testing, how best to think about data, and how best to answer their key questions. At the same time, you need some sort of living knowledge base that accumulates what you learn and is available to those who want to become more involved. Make sure that you separate the lessons learned from the individual test results.

5) Program built around efficiency – Where programs most often falter and slow down is in trying to run only massive tests built around the one idea that someone higher up wants to try. While you shouldn't shun these tests altogether, you will find that they almost never get the results you expect and are not built for learning. Instead, test with the resources you have and try to maximize the amount of learning and the number of recipes you get from each test. The goal is not to run as many tests as possible, since 50 awful tests will not be as valuable as one well-run test; instead, focus on maximizing returns and keeping a steady cadence of testing so that you are always moving forward.

6) Avoid overly technical programs – This is the real killer of most groups. You have all the energy you need to be successful, but all of it is wasted because you are trying to solve the wrong problem. These groups try to over-leverage their IT teams to facilitate testing, instead of building out the proper infrastructure. For most tests (maybe 8 out of 10) you should be able to train your mother to run the test. CSS and extremely simple, repeatable JavaScript functions are your friend (see the sketch below). If you are relying on IT to write a long, complicated JavaScript offer, or your team does not have the skill to jump in and get a test up on your site in less than an hour, you need to take a deep look at how you are going about your testing. It may seem inconceivable that your current organization could get a test live that quickly, but that is exactly why it is so important to keep fixing those problems, so that one day it seems like just another average task.
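As a minimal sketch of what "simple and repeatable" means, here is the kind of variation that should never require IT: a CSS override plus one tiny helper function, dropped into whatever script slot your testing tool provides. The selectors and the copy are hypothetical placeholders, not anything from a real site; the point is that the entire change is a few declarative lines anyone on the team can learn to write.

```javascript
// Hypothetical variation code for a testing tool's script slot.
// Assumes the page has a promo banner (.hero-banner) and a CTA
// button (#cta-button); both selectors and the copy are placeholders.

// Reusable helper: inject a CSS override for this recipe.
function applyRecipeStyles(css) {
  var style = document.createElement('style');
  style.appendChild(document.createTextNode(css));
  document.head.appendChild(style);
}

// Recipe B: hide the banner and emphasize the call to action.
applyRecipeStyles(
  '.hero-banner { display: none; }' +
  '#cta-button { font-size: 1.25em; }'
);

// Recipe B: swap the button copy (a content change, not new logic).
var cta = document.getElementById('cta-button');
if (cta) {
  cta.textContent = 'Start My Free Trial';
}
```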

There are many other pitfalls that can hurt the infrastructure of a program, from overdone QA processes, too many sign-off points, blocking by or lack of involvement from senior leadership, and unwillingness to listen to numbers or to feedback from your UX or branding teams, to a thousand others. The real key is that you cannot ignore them, nor can you think they get solved with some magic elixir. It takes time to move your program forward, and there are always potholes and road bumps on the way. All programs go through an evolution, but at the end of the day, it is your willingness to move forward and try to overcome those barriers that will determine your success.

You have to want to move fast in order to do the things necessary to accomplish it. You have to have the need for speed if you ever want to achieve it. It is doable, and there are lots of resources out there to help, but in the end, it all starts with your willingness to act.

Why we do what we do: Forced Reality – Conjunction Fallacy

One of the funnier tricks of the human mind is the desire to pigeonhole or describe things in as much detail as possible. While stereotypes and other harmful versions of this exist, the inverse is usually far more likely to cause havoc with your optimization program, and as such it is the next bias you need to be aware of: the conjunction fallacy, or "the tendency to assume that specific conditions are more probable than general ones."

The classic example of this fallacy is to ask someone, "Which is more likely to be true about a person on your site: that they came from search, or that they came from your paid search campaign code, landed on your #3 landing page, and then looked at 3 pages before entering your funnel?" Statistically, there is no way for the second statement to be more likely than the first, since the first incorporates the second plus a much larger audience, meaning that its scale is magnitudes greater. Yet we often find ourselves trying to think or do analysis in the most detailed terms possible, hoping that some persona or other minute sub-segment is somehow more likely to be valuable than the much larger population.
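To make the math behind that answer concrete, here is a tiny sketch with made-up traffic numbers (paid search is a subset of search, so the detailed path is a subset of the general condition):

```javascript
// Made-up monthly traffic, for illustration only.
var totalVisitors = 100000;
var fromSearch = 40000;          // everyone who came from any search
var fromSpecificPaidPath = 1200; // campaign code -> landing page #3 -> 3 pages -> funnel

var pGeneral = fromSearch / totalVisitors;            // 0.40
var pSpecific = fromSpecificPaidPath / totalVisitors; // 0.012

// A conjunction of conditions can never be more probable than any
// single one of its conditions, however vivid the detailed story feels.
console.log(pSpecific <= pGeneral); // true, by construction
```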

This mental worm tends to make its appearance most often when groups set out to do segment analysis or to evaluate user groups. We dive into groups and try to figure out the rates of action that we want to exploit. Whether it is an engagement score, category affinity, or even simple campaign analysis, we dive so deep into the weeds that we miss a very simple truth: if the group is not large enough, then no matter what work we do, it will never be worth the time and effort to exploit it for revenue. The other side of this trade-off is the inability, or unwillingness, to group those same users into larger groups that may be far more valuable to interact with. Whether it is people who have looked at 5 category pages and signed up for newsletters, or some other inefficient level of detail, you need to always keep an eye on your ability to do something with the data.

This also plays out in your biases toward what type of test you run. Even if Internet Explorer and Firefox users may be worth more, or be more exploitable, than campaign code 784567, which covers only 2% of your users, this bias makes you want to target that specific campaign group so much more, both as a sign of your great abilities and because we want to be more specific in our interactions with people. Even if the small group is much more exploitable, the smaller scale of impact still makes it far less valuable to your site.

Here are some very simple rules for segmentation that will make sure you combat this fallacy (a sketch of the size check follows the list):

1) Test all content to all feasible segments; never presuppose that you are targeting group X.

2) Measure all segmentation and targeting against the whole, so that you have the same scale in order to compare relative impact.

3) All segments need to be actionable and comparable, meaning the smallest segments should generally be greater than 7-10% of your population, depending on your traffic volume.

4) Segments need to incorporate more than site behaviors and how people arrived at the site; try to include segments of all descriptions in your analysis. Just because you want to target a specific behavior does not mean that behaviors have more value than non-controllable interactions such as time of day.

5) Be very, very excited when you prove your assumptions wrong about which segment matters most or is the best descriptor of exploitable user behavior.
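As promised above, here is a minimal sketch of rules 2 and 3, assuming you have per-segment visitor counts and per-visitor lift estimates to hand; the segment names, the numbers, and the 7% floor are all placeholders for your own data:

```javascript
// Hypothetical segment report, measured against the whole population (rule 2).
var population = 100000;
var segments = [
  { name: 'IE + Firefox users',            visitors: 55000, liftPerVisitor: 0.10 },
  { name: 'Campaign 784567',               visitors: 2000,  liftPerVisitor: 0.90 },
  { name: '5 category pages + newsletter', visitors: 800,   liftPerVisitor: 1.50 }
];

var MIN_SHARE = 0.07; // rule 3: smallest actionable segment; tune to your traffic volume

segments.forEach(function (s) {
  var share = s.visitors / population;             // same scale for every segment
  var totalImpact = s.visitors * s.liftPerVisitor; // relative impact on the whole
  var note = share >= MIN_SHARE ? '' : ' (too small to act on)';
  console.log(s.name + ': ' + (share * 100).toFixed(1) + '% of population, total impact ' +
    totalImpact.toFixed(0) + note);
});
// The tiniest segments show the biggest per-visitor lift but the smallest
// total impact -- exactly the trap the conjunction fallacy sets for you.
```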

If you follow those rules, you are going to get more value from your segment interactions, and you will stop yourself from falling into this trap. We often have to force a system on ourselves to ensure that we are being better than we really are, but when it is over, we can look back and see how far we have come and how much we grew because of that discipline. Revel in those moments, as they will be the things that give the greatest value to you and your program.

Rant – 2012 Predictions

Since I am seeing a number of 2012 prediction threads for my industry, let me throw out a couple of my own:

1) Hundreds of people will claim that analytics/targeting/testing/modeling/etc. has allowed them to achieve amazing new success (and will publish or speak about it) – that success will come from doing what they were already doing and will be almost 100% pure BS. It amazes me that people think they will actually get value from new tools if they use them to do the same old things that were failing before the tool arrived. Tool salespeople and agencies will continue to push this promise, now framed more around solutions than features, but will continue to push people down roads that lead to the same disaster they were trying to avoid by buying the tool.

You have to think differently and challenge the status quo to get any meaningful result.

2) Old will be new again – "personalization", "dynamic pages", "modular design", and many others will make a comeback, only to be replaced by the end of the year with updated terms for the same old concepts (see Blink, mobile, social, email, and Flash as older examples of things coming back). This will once again be because new "leaders" are brought in to replace the old ones who failed, promising to do the next big thing, but really just replacing the old with the older, as they do only what they are comfortable with.

In the end, if you want a "new" new year, you have to change yourself and challenge yourself; otherwise you will get the same results you always get.

Confidence and Vanity – How Statistical Measures Can Lead You Astray

In dealing with the best ways to change a site to maximize ROI, one of the most common refrains I hear is "Is the change statistically confident?" or "What is the confidence interval?", which often leads to a long discussion about what those measures really mean. One of the funniest things in our industry is the over-reliance on statistical measures to prove that someone is "right". Whether it is a Z-score, a t-test, a chi-squared test, or other measures, people love to throw them out and use them as the end-all, be-all confirmation that they, and they alone, are correct. Reliance on any one tool, discipline, or action to "prove" value does nothing to improve performance or to allow you to make better decisions. These statistical measures can be extremely valuable when used in the right context, and without blind reliance on them to answer any and all questions.

Confidence-based calculations are often used in ways that make them the least effective measure of the true change and importance of data (or of "who is correct") when applied to real-world situations. They work great in a controlled setting and with infinite data, but in the real world they are just one of many imperfect standards for measuring the impact of data and changes. Real-world data distribution, especially over any short period of time, rarely resembles a normal distribution. You are also trying to account for distinct groups with differing propensities of action, rather than one larger representative population. It is also important to note that even in the best-case scenario, these measures only work if you have a representative data set, meaning that just a few hours or even a couple of days of data will never be representative (unless your Tuesday morning visitors are identical to your Saturday afternoon visitors). What you are left with is a choice among many imperfect measures which are useful, but not meaningful enough to be the only tool you use to make decisions.

What is even worse is that people also try to use this value as a predictor of outcome, saying things like "I am 95% confident that I will get a 12% lift." These measures only speak to the likelihood of the pattern of the outcome, so you can say "I am 95% confident that B will be better than A," but they are not measures of the scale of the outcome, only the pattern.
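As a sketch of what a confidence number actually is, here is a standard two-proportion z-test on made-up conversion counts. Note what comes out: a statement about how unlikely the observed difference would be if there were no real difference, not a promise about the size of the lift you will keep seeing.

```javascript
// Two-proportion z-test on hypothetical test results.
function zTest(convA, visA, convB, visB) {
  var pA = convA / visA;
  var pB = convB / visB;
  var pPool = (convA + convB) / (visA + visB);
  var se = Math.sqrt(pPool * (1 - pPool) * (1 / visA + 1 / visB));
  return (pB - pA) / se; // z-score of the observed difference
}

// Made-up numbers: 10,000 visitors per recipe.
var z = zTest(500, 10000, 560, 10000); // A: 5.0%, B: 5.6%
console.log(z.toFixed(2)); // ~1.89, roughly 94% two-sided confidence

// The z-score says "B is probably better than A"; it does NOT say
// "you will get a 12% lift" -- the observed 12% relative lift is a
// separate, much noisier estimate of scale.
```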

It is as if someone found a fancy new tool and suddenly has to apply it, because they realize that what they were previously doing was wrong, and now this one thing will suddenly make them perfect. Like any tool at your disposal, there is a lot of value when it is used correctly and with the right amount of discipline. When you are not disciplined in how you evaluate data, you will never really understand it or use it to make good decisions.

So if you cannot rely on confidence alone, how best to determine whether you should act on data? Here are three really simple steps for measuring the impact of changes when evaluating causal data sets (a combined sketch follows the list):

1) Look at performance over time – Look at the graph, look for consistency of data, and look for a lack of inflection points (comparative analysis). Make sure you have at least 1 week of consistent data (which is not the same as just one week of data). Nothing replaces understanding patterns, looking at the data, and grasping its meaning; nothing can replace the value of simply eyeballing your data to make sure you are not being spiked by a single day and that your data is consistent. This human-level check gives you the context that corrects for the many imperfections that looking only at the end numbers leaves you open to.

2) Make sure you have enough data – The amount needed changes by site. For some sites, 1,000 conversions per recipe is not enough; for others, 100 per recipe is. Understand your site and your data flow. I cannot stress enough that data without context is not valuable. You can get 99% confidence on 3 conversions over 1, but that doesn't make it valuable or the data actionable.

3) Make sure you have meaningful differentiation – Make sure you know the natural variance of your site (in a visitor-based metric system, it is pretty regularly around 2% after a week). There are many easy ways to figure out what it is in the context of what you are doing. You can be 99% confident at a 0.5% lift, and I will tell you that you have nothing (neutral). You can have a 3% lift at 80% confidence, and if it is over a consistent week and your natural variance is below 3%, I will tell you that you have a decent win.
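Here is a minimal sketch of those three checks rolled into one function, under assumptions you should replace with your own: daily lift readings per recipe, a 100-conversion floor, and a 2% natural-variance threshold are all placeholders.

```javascript
// Hypothetical decision check combining the three steps above.
// dailyLifts: one observed lift per day, e.g. [0.031, 0.029, ...]
function shouldCall(dailyLifts, convControl, convChallenger, naturalVariance) {
  // Step 1: at least a week of data, all pointing the same direction.
  if (dailyLifts.length < 7) return 'keep running: less than a week of data';
  var consistent = dailyLifts.every(function (l) { return l > 0; }) ||
                   dailyLifts.every(function (l) { return l < 0; });
  if (!consistent) return 'keep running: daily results flip sign';

  // Step 2: enough conversions per recipe (the floor is site-specific).
  var MIN_CONVERSIONS = 100;
  if (convControl < MIN_CONVERSIONS || convChallenger < MIN_CONVERSIONS) {
    return 'keep running: not enough conversions for context';
  }

  // Step 3: the lift must clear your site's natural variance to mean anything.
  var avgLift = dailyLifts.reduce(function (a, b) { return a + b; }, 0) / dailyLifts.length;
  if (Math.abs(avgLift) <= naturalVariance) return 'call it neutral';

  return avgLift > 0 ? 'decent win' : 'decent loss';
}

// Example: a consistent week at ~3% lift with 2% natural variance.
console.log(shouldCall([0.031, 0.028, 0.033, 0.030, 0.027, 0.032, 0.029], 450, 465, 0.02));
// -> 'decent win'
```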

I have gotten into many debates with statisticians about whether confidence provides any value at all in the context of online testing, and my usual answer is that if you understand what it means, it can be a great barometer and another fail-safe that you are making a sound decision. The failure is using it as the only tool in your arsenal. I am not saying that there is no value in p-value-based calculations, or in most statistical models. I will stress, however, that they are not a panacea, nor are they an excuse for not doing the active work to understand and act on your data. You have to be willing to let the data dictate what is right, and that means you must be willing to understand the disciplines of using the data itself.