Category: Rants

MVT – Why Full Factorial vs. Partial Factorial Misses the Entire Point

One of my first introductions to the larger world of testing was serving on a panel about multivariate testing. I remember how divergent the opinions were and how deep the misconceptions ran about the entire process. Just about everyone I talked to had the same preconceived notions of how to use multivariate testing, and even worse, almost all of those notions were based on their need to propagate their sales pitches. Now, as I work with more and more organizations, I see the same bad ideas replicating, and groups continuing to miss the true value of multivariate testing. MVT holds all these promises, but when done for the wrong reasons it multiplies the worst of testing instead of facilitating the best of testing. Even worse, groups then confuse the issue, focusing on the method of the test and not the fundamental mindset that created it. Many groups get into debates about the “value” of the different multivariate methods out there, which is nothing more than a fool’s errand, since any method is going to fail when it is used to answer the wrong question.

Too many times people get caught up on the “advantages” or “disadvantages” of the various forms of multivariate analysis. Full factorial testing has many advantages: fewer rules, better insight into interactions across tested elements, and the ability to test non-uniform concept arrays. Partial factorial testing has its own advantages: speed, forced conformity to better testing rules, and more efficient use of resources. What does not matter is which one allows you to throw things at a wall and get an answer. When you are busy trying to answer the wrong question, you can fail with any tool. It is only when you are trying to succeed that the differences between tools matter.
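To make the trade-off concrete, here is a minimal sketch, with hypothetical button factors, of what the two designs actually enumerate. A full factorial runs every combination of every level; a partial factorial runs a structured subset (the subset below is a crude stand-in, not a properly constructed orthogonal array):

```python
from itertools import product

# Hypothetical factors and levels for a purchase-button test
factors = {
    "size":  ["small", "medium", "large"],
    "color": ["orange", "red"],
    "copy":  ["Purchase", "Buy Now"],
}

# Full factorial: every combination of every level (3 * 2 * 2 = 12 recipes)
full = list(product(*factors.values()))
print(f"full factorial: {len(full)} recipes")

# Partial factorial: a structured subset; taking every other recipe here is
# just a stand-in for a real fractional design
partial = full[::2]
print(f"partial factorial: {len(partial)} recipes")
```

Either design will spit out a “winner”; neither fixes a badly framed question.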

The fundamental use of multivariate testing for most groups is to combine multiple badly conceived A/B tests, throwing them all together quickly in the hope of finding a combination that increases results. So many groups want to try out a combination of ideas that they assume an MVT campaign is the solution. You can use the test that way; it is a statistically valid approach and will guarantee a result, but at what cost? The problem is that you will be wasting resources and time, and you are guaranteed a suboptimal outcome from this flawed way of thinking. Any form of multivariate testing that is used as nothing more than a massive collection of individual tests is always going to be inefficient, since you are replicating the imperfections of those individual tests in a way that magnifies them. If your goal is simply that individual outcome, and it is for far too many programs and especially agencies, then you will never get any true value from multivariate testing until you change your mindset.

Fundamentally, the concept of trying to just find a combination misses a fundamental truth: you are spending a massive amount of resources creating all these permutations and offers without any understanding of the efficiency of each resource. Two problems compound here:

1) All the ideas come from preconceptions and hypotheses about what will work

2) Every new variant adds cost, both in its creation and in the data acquisition needed for its results to be meaningful

If we instead focus on multivariate testing as a means to filter our resources instead of simply combining them, then we are able to achieve efficiency. If we want to limit our resources and apply them only where we will get the most return, then we must always view multivariate testing as a tool to learn and be efficient, not one for just throwing things out to see what works.

The classic example of a multivariate test is testing a button. Let us say I have a medium orange purchase button currently on my site. I might think that red will be better than orange, and my UX person thinks that “Buy Now” will perform better because he saw it on a few competitor sites. You throw in a slightly larger button for good measure, and you get a predicted best combination of a large orange “Buy Now” button. You slap yourself on the back, and you move forward. The reality is that each of those factors (size, color, copy) has a massive number of feasible alternatives, and all we did was look at a very limited, biased set of them.

Let me propose a better way. Look at that same test, but instead of preconceiving the outcome, look for the value of each factor. Suppose we ran the same test and found out that size matters more than color, despite what we thought going in. If we spend as few resources as possible to reach that understanding, then we have left the maximum amount of resources available to apply to the winning factor or element. Having learned that size matters, we can shift resources away from the less influential elements and apply them to as many different feasible alternatives of the winning factor as possible. Instead of being limited to testing 3-4 sizes, we know the value of size and can create as many different alternatives as we can. Not only have we used fewer resources, but they have been applied to the most influential part of our experience.
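As an illustration of what “the value of each factor” means in practice, here is a minimal sketch, with invented recipe results, that averages conversion across recipes to estimate each factor’s main effect; the factor with the largest spread is where the next round of resources should go:

```python
from collections import defaultdict

# Hypothetical recipe results: (size, color, copy) -> conversion rate
results = {
    ("small", "orange", "Purchase"): 0.020,
    ("small", "red",    "Buy Now"):  0.021,
    ("large", "orange", "Buy Now"):  0.031,
    ("large", "red",    "Purchase"): 0.030,
}

factor_names = ["size", "color", "copy"]

# Average conversion at each level of each factor; the spread between the
# best and worst level is a crude measure of that factor's influence
for i, name in enumerate(factor_names):
    by_level = defaultdict(list)
    for recipe, rate in results.items():
        by_level[recipe[i]].append(rate)
    means = {level: sum(r) / len(r) for level, r in by_level.items()}
    spread = max(means.values()) - min(means.values())
    print(f"{name}: spread {spread:.3f} across levels {means}")
```

In this made-up data the spread for size dwarfs color and copy, which is exactly the kind of learning that tells you where to spend next.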

Even better, I have now learned that size matters most, and I have an outcome that is different from and greater than what I would have had before. In fact, I have shifted the system so that the absolute worst thing that can happen is that I end up with the same alternative I would have had before, but for less time and fewer resources. I have also added a much higher upside, since I can get a better outcome when an alternative I would not have previously included comes out the winner. I have tested more alternatives of the important factor, so I am not limiting my output to the single input of popular opinion. I have leveraged multivariate testing as a way to learn what matters and to focus my future efforts on that. I no longer have to create alternatives for factors that have no influence, and can instead focus resources on testing as many different feasible alternatives as I can for the things that do influence behavior.

The less you spend to reach a conclusion, the greater the ROI. The faster you move, the faster you can get to the next piece of value, which also increases the outcome of your program. What matters most is to use multivariate testing as a learning tool ONLY, one that tells us where to apply resources. One that frees us up to test as many feasible alternatives as possible on the most valuable or influential factor, while eliminating the equivalent waste on factors that do not have the same impact. The goal is the outcome; getting overly caught up in achieving it in one massive step, as opposed to smaller, easier steps, is fool’s gold.

You CAN leverage multivariate tests in a large number of ways, and there are enough 15×8 tests out there to show that it is a statistically valid approach. The question is never what you can do, but what you SHOULD do. Just because I can test a massive number of permutations does not mean that I am being efficient or getting the return on my efforts that I should. You can’t just ignore the context of the output to make yourself feel better about your results. You will get a result no matter what you do; the trick is constantly getting better results for fewer resources.

If you are stuck in the realm of trying to show results from a single test, or are not thinking of your testing program as a learning optimization machine, then you aren’t going to get the results you need no matter what you do. Multivariate tests are useful only in the context of your program; if you are stuck thinking in terms of just the outcome of that specific test, you will never achieve the results that you want.

If you shift to thinking about it in the context of a larger program, then multivariate tests are just one of many tools you have at your disposal to achieve those goals. Don’t let the promises and sales pitches of a few divert your attention away from what matters. And if you are focusing on what matters, then the question of which type of multivariate test you use becomes almost completely moot.

Rant – A Result is Not the Same as the Value

Just to make something clear: you will get a result from any action you take. Some are good, some bad, some neutral, but in all cases there is a result. That result comes not only from what you did, but from the context of the environment in which you did it. Just slapping yourself on the back for the result you get misses the entire point. How is the action you took any better than any other action you could have taken? If you don’t know this, then you have no clue about the value of the action you chose. You could have just chosen the second worst of the 1000 different options in front of you, but because you ignore the context, you are happy, you pat yourself on the back, and you tell people all about how great you are.

The goal is not to get a result, it is to become more efficient, so that you get more for what you are spending, or the same for less. Any action that doesn’t add efficiency is not adding value for your company.

History has shown that it is always the more efficient company that succeeds. That means you need two pieces of information for any action: the “value” of the action (how much better it is than the alternatives) and the cost of that action. To get those, you have to leverage two key things: causal information to measure the relative value of actions, and discipline to measure the cost to achieve them.
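As a trivial illustration, assuming you already have both numbers, this is what ranking actions by value per unit of cost, rather than by raw result, looks like; every figure below is invented:

```python
# Hypothetical actions: measured incremental value vs. cost to execute
actions = [
    {"name": "redesign checkout", "value": 120_000, "cost": 60_000},
    {"name": "resize button",     "value": 40_000,  "cost": 2_000},
    {"name": "new hero image",    "value": 15_000,  "cost": 10_000},
]

# Efficiency, not raw result, drives the ranking
for a in sorted(actions, key=lambda a: a["value"] / a["cost"], reverse=True):
    print(f"{a['name']}: {a['value'] / a['cost']:.1f}x return per dollar")
```

The hard part is filling in the “value” column at all, which is exactly where correlative data falls short.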

Correlative data, by itself, can NEVER tell you this, since it can only tell you a single rate of action for what happened in the past. It can only tell you what did happen, not what could happen, nor the value of those actions. You have people who click on a link and see on average 3 pages per visit? Cool, but what if they had clicked on something else? Or if that wasn’t even a possible action for them? What would they have done? Would they have done more or less? Would they have seen 5 pages? 2? What about the cost to get them to see 5 pages? What about the cost to redesign that page, or the relative cost of impacting that segment versus another? Correlative data only tells you the rate of action; it is impossible for correlative data to tell you the value of those actions, since that is a comparative analysis. That difference is the only thing that matters, and you simply can’t answer it with correlative data alone…

So why then do we hear nothing but the value of this correlative data? Why do we pretend it tells us more than it does? Why do we pretend it informs decisions that it doesn’t? Why do we build fancy dashboards, or initiatives, or entire strategies while missing this fundamental point about the data we are using? It isn’t the data; it has always told you exactly the same thing. The problem is that we pretend we get so much more from it than we really do.

You can’t rely on passive action, you can’t be passive in data acquisition or you actions of that data. Passive data is nice to have, but is not enough. You have to create a cycle of using causal information to inform correlative information, not the other way around if you want real value.

Rant – 2012 Predictions

Since I am seeing a number of 2012 prediction threads for my industry, let me throw out a couple of my own:

1) Hundreds of people will claim that analytics/targeting/testing/modeling/etc. has allowed them to achieve amazing new success (and will publish or speak about it). That success will come from doing what they were already doing and will be almost 100% pure BS. It amazes me that people think they will actually get value from new tools if they use them to do the same old things that were failing before the tool arrived. Tool sales teams and agencies will continue to push this promise, now framed more around solutions than features, but will continue to push people down roads that lead to the same disaster they were trying to avoid by buying the tool.

You have to think differently and challenge the status quo to get any meaningful result.

2) Old will be new again. “Personalization”, “dynamic pages”, “modular design”, and many others will make a comeback, only to be replaced by the end of the year with updated terms for old concepts (see Blink, mobile, social, email, and Flash as older examples of things that came back). This will once again be because new “leaders” are brought in to replace the old ones who failed, promising to do the next big thing, but really just replacing the old with the older as they do only what they are comfortable with.

In the end, if you want a “new” new year, you have to change and challenge yourself; otherwise you will get the same results you always get.

Rant – Statistics is a Tool & Not an Answer

I read a post by a famous blogger today that brought up one of my least favorite things about our industry: the over-reliance on statistical measures to prove that someone is “right”. Whether it is a z-score, t-test, chi-squared, or some other measure, people love to throw them out and use them as the end-all, be-all confirmation that they, and they alone, are correct. Beyond the hubris of the situation, it is just one more tool for people to abuse and not understand.

What is funny about this is that these are some of the worst true measures of the impact and importance of data (or of “who is correct”). They work great in a controlled setting with unlimited data, but in the real world they are just one of many imperfect standards for measuring the impact of data and changes. They cannot replace understanding patterns, looking at the data, and understanding its meaning. I would settle for people understanding what those tests even tell you (hint: it is not that you are 95% confident you will get a 10% lift).
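To be concrete about what such a test does and does not say, here is a minimal sketch of a two-proportion z-test with hypothetical counts; the resulting p-value speaks only to whether the difference is plausibly zero, not to how big the lift will be going forward:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical counts: control vs. variant
conv_a, n_a = 500, 10_000   # 5.0% conversion
conv_b, n_b = 550, 10_000   # 5.5% conversion (a 10% observed lift)

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# The p-value answers "is the difference plausibly zero?" -- it is not a
# statement like "95% confident you will get a 10% lift"
print(f"observed lift: {(p_b - p_a) / p_a:.1%}, z = {z:.2f}, p = {p_value:.3f}")
```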

It is as if someone found a fancy new tool and suddenly has to apply it everywhere, because they realize that what they were previously doing was wrong, but now this one thing will suddenly make them perfect.

For the record, here are three really simple steps to measure the impact of changes (a rough sketch combining them follows the list):

1) Look at performance over time – Look at the graph, look for consistency of data, and look for a lack of inflection points (comparative analysis). Make sure you have at least 1 week of CONSISTENT data (which is not the same as simply 1 week of data). This also gets into why visitor-based metric systems are much better for this analysis, and why you need to think in terms of propensity of action.

2) Make sure you have enough data – The amount needed changes by site. For some sites, 1000 conversions per recipe is not enough; for others, 100 per recipe is. Understand your site and your data flow. I cannot stress enough that data without context is not valuable.

“Information is not knowledge”

3) Make sure you have meaningful differentiation – Make sure you know what the natural variance is for your site (in a visitor-based metric system, it is pretty regularly right at 2% after a week). Make sure that the lift is consistent, and that it is more than mechanical noise (something laboratory stats equations DON’T account for). You can be 99% confident at a .5% lift, and I will tell you that you have nothing (neutral). You can have a 3% lift at 80% confidence, and if it holds over a consistent week (not daily performance), I will tell you that you have a decent win.
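Here is the rough sketch promised above, combining the three checks with hypothetical numbers; the 1,000-conversion floor and the 2% noise floor are stand-ins, since both are site-specific judgment calls per the steps above:

```python
from statistics import mean

# Hypothetical daily lift (%) of variant over control across one full week
daily_lift = [3.1, 2.8, 3.4, 2.9, 3.2, 3.0, 3.3]

# 1) Performance over time: the sign should never flip across the week
consistent = all(d > 0 for d in daily_lift)

# 2) Enough data: a per-site judgment call; 1,000 per recipe is a stand-in
conversions_per_recipe = 1_200
enough_data = conversions_per_recipe >= 1_000

# 3) Meaningful differentiation: average lift must clear the site's
#    natural variance (assumed here to be ~2% after a week, per the text)
noise_floor = 2.0
meaningful = mean(daily_lift) > noise_floor

print(f"consistent week: {consistent}, enough data: {enough_data}, "
      f"lift {mean(daily_lift):.1f}% vs {noise_floor}% noise: {meaningful}")
```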

I am not saying that there is not a lot of value in p-value-based calculations, but I will stress that they are not a panacea, nor are they an excuse for not doing the active work to understand and act on your data.