Confidence and Vanity – How Statistical Measures Can Lead You Astray
When discussing the best ways to change a site to maximize ROI, one of the most common refrains I hear is “is the change statistically confident?” or “what is the confidence interval?”, which often leads to a long discussion about what those measures really mean. One of the funniest things in our industry is the overreliance on statistical measures to prove that someone is “right”. Whether it is a Z-score, T-test, chi-squared, or some other measure, people love to throw them out and use them as the be-all and end-all confirmation that they and they alone are correct. Reliance on any one tool, discipline, or action to “prove” value does nothing to improve performance or to help you make better decisions. These statistical measures can be extremely valuable when used in the right context and without blind reliance on them to answer any and all questions.
When applied to real-world situations, confidence-based calculations are often used in a way that makes them the least effective way to measure true change and the importance of data (or “who is correct”). They work great in a controlled setting and with infinite data, but in the real world they are just one of many imperfect standards for measuring the impact of data and changes. Real-world data distribution, especially over any short period of time, rarely resembles a normal distribution. You are also trying to account for distinct groups with differing propensities of action, instead of one larger representative population. It is also important to note that even in the best-case scenario these measures only work if you have a representative data set, meaning that just a few hours or even a couple of days of data will never be representative (unless your Tuesday morning visitors are identical to your Saturday afternoon visitors). What you are left with is your choice of many imperfect measures, which are useful but not meaningful enough to be the only tool you use to make decisions.
What is even worse is that people also try to use this value as a predictor of outcome, so they say things like “I am 95% confident that I will get 12% lift”. These measures only measure the likelihood of the pattern of outcome, so you can say “I am 95% confident that B will be better than A”, but they are not measures of the scale of the outcome, only the pattern.
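To make the difference between pattern and scale concrete, here is a minimal sketch in Python. The function names, the sample counts, and the choice of an unpooled normal approximation are all assumptions made for illustration, not the method of any particular testing tool. The first function answers “how confident am I that B beats A?”, the second gives a rough range for the size of the lift.

```python
from math import sqrt
from statistics import NormalDist


def _rates_and_se(conv_a, n_a, conv_b, n_b):
    # Conversion rates and the unpooled standard error of their difference.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_a, p_b, se


def confidence_b_beats_a(conv_a, n_a, conv_b, n_b):
    """One-sided confidence that B's true rate is higher than A's."""
    p_a, p_b, se = _rates_and_se(conv_a, n_a, conv_b, n_b)
    return NormalDist().cdf((p_b - p_a) / se)


def relative_lift_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Rough interval for relative lift.

    Simplifying assumption: the interval for the absolute difference is
    divided by A's observed rate, treating that rate as fixed.
    """
    p_a, p_b, se = _rates_and_se(conv_a, n_a, conv_b, n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_b - p_a
    return (diff - z * se) / p_a, (diff + z * se) / p_a


# Hypothetical test: A converts 1000 of 20000, B converts 1120 of 20000,
# which is an observed lift of 12%.
conf = confidence_b_beats_a(1000, 20000, 1120, 20000)
low, high = relative_lift_interval(1000, 20000, 1120, 20000)
print(f"confidence B > A: {conf:.1%}")
print(f"plausible lift: {low:.1%} to {high:.1%}")
```

With these made-up numbers the confidence that B beats A comes out well above 95%, yet the plausible lift spans a wide range around the observed 12%. Being confident in the direction says very little about the size.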
It is like someone found a fancy new tool and suddenly has to apply it everywhere, because they realize that what they were doing before was wrong and believe this one thing will now make them perfect. Like any tool at your disposal, there is a lot of value when it is used correctly and with the right amount of discipline. When you are not disciplined in how you evaluate data, you will never really understand it or use it to make good decisions.
So if you cannot rely on confidence alone, how do you best determine whether you should act on data? Here are three really simple steps to measure the impact of changes when evaluating causal data sets:
1) Look at performance over time – Look at the graph, look for consistency of data, and look for a lack of inflection points (comparative analysis). Make sure you have at least 1 week of consistent data (that is not the same as just one week of data). Nothing can replace understanding patterns, looking at the data, and understanding its meaning. Nothing can replace the value of just eyeballing your data to make sure you are not getting spiked on a single day and that your data is consistent. This human-level check gives you the context that helps correct for the many imperfections that just looking at the end numbers leaves you open to.
2) Make sure you have enough data – The amount needed changes by site. For some sites, 1000 conversions per recipe is not enough; for others, 100 per recipe is. Understand your site and your data flow. I cannot stress enough that data without context is not valuable. You can get 99% confidence on 3 conversions versus 1, but that doesn’t make the result valuable or the data actionable.
3) Make sure you have meaningful differentiation – Make sure you know what the natural variance is for your site (in a visitor-based metric system, it is pretty regularly around 2% after a week). There are many easy ways to figure out what it is for the context of what you are doing. You can be 99% confident at .5% lift, and I will tell you that you have nothing (neutral). You can have 3% lift and 80% confidence, and if it is over a consistent week and your natural variance is below 3%, I will tell you that you have a decent win.
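The three steps above can be sketched as a simple checklist in Python. Everything here is an illustrative assumption: the `Day` record, the `evaluate` function, the default thresholds (7 days, 100 conversions per recipe, 2% natural variance), and the reading of “consistent” as “the variant stays on the same side of the control every day”. Your own site will need its own numbers, and a function like this does not replace looking at the graph.

```python
from dataclasses import dataclass


@dataclass
class Day:
    visitors: int      # visitors for one recipe on one day (assumed > 0)
    conversions: int   # conversions for that recipe on that day


def overall_rate(days):
    # Conversion rate across the whole period.
    return sum(d.conversions for d in days) / sum(d.visitors for d in days)


def evaluate(control, variant, min_days=7, min_conversions=100,
             natural_variance=0.02):
    """Return 'win', 'loss', 'neutral' or 'keep running'."""
    # Step 1: performance over time. Require a full week and no day on
    # which the two lines cross (a crude stand-in for eyeballing the graph).
    if len(control) != len(variant) or len(control) < min_days:
        return "keep running"
    diffs = [v.conversions / v.visitors - c.conversions / c.visitors
             for c, v in zip(control, variant)]
    if not (all(d > 0 for d in diffs) or all(d < 0 for d in diffs)):
        return "keep running"

    # Step 2: enough data in every recipe.
    fewest = min(sum(d.conversions for d in control),
                 sum(d.conversions for d in variant))
    if fewest < min_conversions:
        return "keep running"

    # Step 3: the lift must clear the site's natural variance.
    lift = overall_rate(variant) / overall_rate(control) - 1
    if abs(lift) <= natural_variance:
        return "neutral"
    return "win" if lift > 0 else "loss"


# Hypothetical week of data: the variant is ahead by a small margin each day.
control = [Day(1000, 50)] * 7
variant = [Day(1000, 53)] * 7
print(evaluate(control, variant))
```

Note that confidence does not appear anywhere in this sketch. It can be added as one more check, but a result that fails any of these three steps should not be acted on regardless of what the confidence number says.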
I have gotten into many debates with statisticians about whether confidence provides any value at all in the context of online testing, and my usual answer is that if you understand what it means, it can be a great barometer and another fail-safe that you are making a sound decision. The failure is in using it as the only tool in your arsenal. I am not saying that there is not a lot of value in P-value based calculations or most statistical models. I will stress, however, that they are not a panacea, nor are they an excuse for not doing the active work to understand and act on your data. You have to be willing to let the data dictate what is right, and that means you must be willing to understand the disciplines of using the data itself.
Hi Andrew, great blog, I’ve been enjoying reading your various articles. Quick question: do you have any thoughts on how to calculate a significance score when running revenue-based experiments? For example, my control page sells an item for $50 and we’d like to test a variant which has a discounted price of $35. I haven’t found much, if any, information on this topic, so any thoughts would be appreciated.
All the best, Richard
Thanks for the comments and the question. To me there are really 3 things in this question:
1) Non-discrete (continuous) values for a P-score (Z-test, T-test, etc.). You can usually use some form of ANOVA-type calculation, but the real problem is that you have to continuously update it and account for all the users that are of zero value, which can lead to massive distribution effects. What most tools do, without telling you, is look only at completed sales totals and calculate based on that (using a sum of revenue squared approach), which means that when tools give confidence on RPV (revenue per visitor) they are really giving you confidence on AOV (average order value), which is pretty useless.
2/3) The biggest problem with the 35 vs. 50 question (outside of it being a 2-experience test; try 30, 35, 40, 50, 55, 60, 75 or something to that effect) is that you have to ensure that you are focusing on net revenue, not gross revenue. Confidence means little to start with and has no bearing if you are stuck looking at gross revenue changes. I would suggest checking out this article:
As well as this one about the problems with 2 experience tests:
I hope that helps.
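To illustrate point 1) above, here is a small Python sketch of why confidence calculated only on completed orders answers a different question than confidence on revenue per visitor. The function names and all of the data are made up for illustration, and the comparison uses a normal approximation to a two-sample (Welch-style) test rather than a full ANOVA, purely to keep the example short.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def confidence_b_higher(sample_a, sample_b):
    """Approximate one-sided confidence that B's mean exceeds A's mean."""
    se = sqrt(variance(sample_a) / len(sample_a)
              + variance(sample_b) / len(sample_b))
    return NormalDist().cdf((mean(sample_b) - mean(sample_a)) / se)


def per_visitor(order_values, visitors):
    """Revenue for every visitor: non-buyers count as zero."""
    return list(order_values) + [0.0] * (visitors - len(order_values))


# Hypothetical data: 1000 visitors per experience.
# A: 40 orders averaging $50. B: 60 orders averaging $35.
orders_a = [45.0, 55.0] * 20
orders_b = [30.0, 40.0] * 30

# Orders only: this is really a comparison of AOV.
aov_conf = confidence_b_higher(orders_a, orders_b)

# Every visitor, zeros included: this is a comparison of RPV.
rpv_conf = confidence_b_higher(per_visitor(orders_a, 1000),
                               per_visitor(orders_b, 1000))

print(f"confidence B > A on AOV: {aov_conf:.1%}")
print(f"confidence B > A on RPV: {rpv_conf:.1%}")
```

In this made-up case the orders-only view says B is almost certainly worse, because its average order is smaller, while the per-visitor view shows B slightly ahead with nothing close to a conclusive result. Neither number says anything about net versus gross revenue, which still has to be handled separately.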