My first trip through the common heuristics of conversion rate optimization looked at two of the more common testing ideas and how they usually reach false or limiting conclusions. In my second part I want to look at general testing theory best practices and how they can be major limiting factors in the success of your program.
It is important to remember that you are always going to get an outcome, so this is not about whether you can make money. How you and the people in your organization think about testing is the largest factor in the value that optimization produces. This is an evaluation of the efficiency of the method: how much it produces for the same or fewer resources. In concept you can spend an infinite amount of resources to achieve any end goal, but in reality we are always faced with a finite amount of time and population, which means we must always be looking for ways to improve inefficient systems. If we continue to be limited by these common heuristics, then the industry as a whole will continue to produce minimal results compared to what it can and should be producing.
Always have a Hypothesis –
There is no more misunderstood term than hypothesis. In all likelihood this is because most people are familiar only with their 6th grade (at least in my school) science instruction, or with formal classroom science in college. In those fields we operate as if we have unlimited time and resources, and we are trying to validate whether a drug will cause cancer, not whether a banner will get more clicks if it is blue or red. The stakes are higher and the models are much simpler in controlled classroom studies for cancer. There is a lot to the scientific method, especially when approached from a resource efficiency perspective, that is not considered in such a simplistic view of idea validation.
We must apply scientific rigor, but we must also make sure that all actions make sense in real world situations, which means that efficiency and minimizing regret are more important than validation of an individual’s opinion. The problem is not that the scientific method relies on the use of a hypothesis; it is that we mistake a hypothesis for a correct hypothesis. We seek validation for our opinions and not the discovery of the best way to proceed. Science is also about proving one idea versus all other alternative hypotheses, yet we ignore that part of the discipline because it is not the part that allows someone to see if they are right. In the grand scheme of things we are drastically overvaluing test ideas, and that is distracting from the parts of the process that provide value.
Let’s start with the basics. You should never, and I mean never, run a test if you do not have a single success metric for your entire site. In most cases this is to make more money, but whatever it is, this goal exists outside of the concept of the test. You must also have rigid measurement and action rules that are reproducible, which means that you must understand real world situations like the limitations of confidence and variance.
You can then have an opinion about what you think will happen when you make a change. The problem comes when we confuse that opinion with the measured goals of the test. Even worse, we limit what we compare, resulting in a massively inefficient use of time and effort. You may believe that improving your navigation will get people to spend more time on your site, but that is completely irrelevant to the end goal of making more money. Your belief that more engagement will result in more revenue is not enough to make it so. If you are right AND that also produces more revenue, then you will know that from revenue. If you are wrong, you will only know that from revenue. We must construct our actions to produce answers both to our opinion and to what is best for our organization. Hypotheses and ideas are just a very small part of a much more complex and important picture, and an overfocus on them allows people to avoid the responsibility and the benefit of focusing on all those other parts, which are the ones that really make a difference over time for any and all testing programs.
The worst factor of this is that it allows people to fall for congruence bias and to fail to ask the right questions. We become so used to the conversation around a single idea that the concept of discovery and challenging assumptions is more word than action. Questions can be incredibly important to the success of a program, but only if they are tackled in the right order and used to focus attention, not as the final validation of spent attention. If your hypothesis is that a certain navigation change will result in more engagement, then the correct use of your resources is to test either which of a number of different versions of the navigation will produce the most revenue or, if you can, which section on your site produces the most engagement when changed. In both cases you have adapted your “hypothesis” to present a more efficient and functional use of your time. The hypothesis exists, but it is not the constraint of the test. If you are right, you will see it. If you are wrong, you will make more money.
This means that having a hypothesis is important, but only if it is not the test charter. Having an idea of what you are trying to accomplish, and making sure that you see the value of certain actions compared to each other, is more important. Sometimes the most effective hypothesis is “I believe that we do not know the value of different sections on our pages.” Don’t confuse your opinion on what will win with a successful test. Challenge assumptions and design efforts to maximize what you can do with what you have, and you will never be without opinions. The best answers always come when you are proven wrong, but if you get too caught up in validating your hypothesis, then you will always be missing the largest lessons you could be learning.
We need to optimize X because it is losing Y
This is the classic problem of confusing rate and value, or more correctly correlative and causal inference. We confuse what we want to happen with what is really happening. Just because people were doing X and now they are doing Y, it doesn’t mean that this is directly causing any change, positive or negative, to our end goals. Outside of the three rules of proving causation, the real issue here is that we get tied to our beliefs about a pattern of events even when the data cannot possibly validate that conclusion. Understanding and acting on what you know, as opposed to what you want to have happen, is the difference between being data driven and simply being data justified.
Think about it this way: I have 23% clicks on one section of my page and 0% on another. If I were to improve one of those, which one is going to produce the biggest returns? The answer here is that you do not know. A rate of interaction cannot possibly tell you the value of changing that item. Some of the most important parts of any user experience are things that can’t even be clicked.
This plays out outside of clicks too. We have a product funnel and we see more people leaving on page 3, therefore we need to test on page 3. The reality is that more or fewer people may or may not be tied to more or less revenue. Even if it is tied, it may be a qualification issue higher up the funnel, or a user interaction issue, or simply too many people in a prior step. This is called a linear assumption fallacy, where we assume that if 2 of 5 people convert, then 4 of 10 will convert. Linear models are rare in nature but are easy to understand, so we fall back on comfort over realistic understanding.
The act of figuring out what to test can be difficult, but it is never improved by pretending we have validation of our own ideas when we have nothing to justify them. We need to be open to discovering where we should go and not to focus on some set path. In almost all cases you will find that you are wrong, often dramatically so, about where problems really are and how to fix them. This is why it is so important not to focus solely on more or less correlative actions. We can and should be able to test fast enough and with few enough resources that we will never be limited to this realm unless we are stuck there mentally.
Like so much else, what you spend your time and effort on is incredibly important. There are a thousand things you can improve and there are always new ideas. Justifying them falsely or focusing on them instead of the discipline of testing is nothing but a drag on your entire testing program. Test ideation is about 1% of the value derived from a test program, yet it is 90%+ of where people like to spend their time. A 5% gain that took 2 months is worth a lot less than a 10% gain that took 2 weeks. The most important issues we must face are not about generating test ideas or validating our beliefs about how to improve our site; they are about discovering and applying resources to make sure that we are doing the 10% option and not the 5% option. If we overly focus on test ideas and not the discipline of applying them correctly, we are never going to achieve what should be achieved. If we get lost trying to focus only on where we want to go, then we will always be limited in the possible outcomes we can generate.
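One rough way to see why speed matters as much as the size of the lift is to total the extra revenue each result earns over a fixed horizon. The sketch below is a simplification under stated assumptions, none of which come from a real program: a hypothetical 26-week horizon, a flat baseline, 2 months treated as 8 weeks, and a lift that holds steady once the test is finished. The function name is made up for illustration.

```python
def extra_revenue_in_baseline_weeks(lift, weeks_to_find, horizon_weeks=26):
    """Extra revenue, measured in weeks of baseline revenue, earned by a lift
    that only starts paying once the test that found it has finished.

    Assumes a flat baseline and a constant lift (both hypothetical)."""
    weeks_live = max(horizon_weeks - weeks_to_find, 0)
    return lift * weeks_live


# The two outcomes from the text: 5% after ~8 weeks vs. 10% after 2 weeks.
slow_small = extra_revenue_in_baseline_weeks(0.05, 8)
fast_large = extra_revenue_in_baseline_weeks(0.10, 2)

print(f"5% gain found in 8 weeks:  {slow_small:.1f} baseline weeks of extra revenue")
print(f"10% gain found in 2 weeks: {fast_large:.1f} baseline weeks of extra revenue")
```

Under these assumptions the faster, larger result is worth well over twice the slower one, and that is before counting the additional tests that could be run in the six weeks saved.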
Talk to 5 people in the optimization space and you will get 5 different stories about how best to solve your website. Talk with 50, however, and those 5 will get repeated more often than not. Such is the world we operate in, where “best practices” become so commonplace and repeated that we often do not take the time to really think about or prove their effectiveness. Because of this phenomenon, a lot of actions which are less than ideal or outright bad for companies become reinforced must-do items.
The reality is that discipline is always going to win out over specific actions, and that oftentimes the best answer is to measure everything against each other and take nothing for granted. While all of that is true, it is still important that you understand these common suggestions: where they work, how, why, and more importantly why people believe they are more valuable than they really may be.
Test Free Shipping or Price Changes
This is a really common one for retail sites, as it is easy to understand, a common tactic (thanks Amazon), and easy to sell to the higher-ups. The problem is not actually the concept, but how people measure the impact of it, and what that means for other similar tactics. What can easily seem like a huge win is often a massive loss, and even worse, due to how most back-end systems are designed, the actual amount of work needed to run these tests can be much higher than for other simpler and extremely valuable uses of your finite resources.
Let’s look at the math of a basic free shipping test. In this simplified scenario, we sell 1 item for $90 on our site, with an actual cost of $70 to us ($20 net profit). Our shipping is $10, which means that when it is normally purchased someone pays us $100.
We want to test free shipping, where we pay for the shipping and now sell the same widget for $90. We run the test and we have a 50% increase in sales! We should be getting promotions, and in most cases the person who ran this project is shouting their accomplishments to the entire world and everyone that will listen. Obviously this is the greatest thing ever and everyone should be doing it… except you just lost a lot of money.
The problem here is that we often confuse gross and net profit, especially because in a lot of different tests you are not directly changing the bottom line. In the case of free shipping or pricing tests, though, we are directly changing what a single sale means to us.
Let’s dive into the numbers of the above. Let’s say that we sell 1000 orders in our normal control group.
$100 X 1000 = $100000
But the real number that impacts the business is:
$20 x 1000 = $20000
In the free shipping option, we have cut our profit per order in half by paying for the $10 shipping, which means that at $10 profit per order we actually have to have twice as many orders JUST TO BREAK EVEN.
$20000 / $10 = 2000
This means that if we fall back to the standard RPV reporting that you look at for other types of tests, then the math says that:
$100 X 1000 = $100000
$90 X 2000 = $180000
So any option where RPV does not reach at least 180% of the control (an 80% increase) means we are losing money. So many times you see reports of amazing results from these kinds of optimization efforts which are masking the realities behind the business. It can be hard, no matter how much this makes sense in conversation, to have the discipline to think about a 50% increase as a loss, but that is exactly what happened here. Sadly this hypothetical story plays out often in the real world, with the most likely result being the pushing of the results and not the rational evaluation of the impact to the business.
This same scenario plays out anytime we have varied margin and not as varied gross cost. The other common example is price changes, where the cost of the item remains fixed, but the test is only truly impacting how much margin we make off of the item. In both cases we are forced to set minimum marks prior to starting a test and to treat those as the neutral point, rather than the normal relative percentage lift that we might be accustomed to.
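The break-even math for the free shipping example can be sketched in a few lines. This is a minimal illustration using the numbers from the scenario ($90 item, $70 cost, $10 shipping, 1000 control orders). It assumes that shipping costs us the same $10 we charge for it and that both groups see the same number of visitors, so a ratio of revenue is also a ratio of RPV. The function and variable names are made up for this example.

```python
def free_shipping_break_even(price, shipping_fee, unit_cost, shipping_cost, control_orders):
    """Find the 'neutral point' for a free shipping test.

    Control: the customer pays price + shipping_fee.
    Variant: the customer pays price only, and we absorb shipping_cost.
    """
    control_revenue_per_order = price + shipping_fee
    control_profit_per_order = control_revenue_per_order - unit_cost - shipping_cost
    variant_revenue_per_order = price
    variant_profit_per_order = variant_revenue_per_order - unit_cost - shipping_cost

    control_profit = control_profit_per_order * control_orders
    # Orders the variant needs just to match the control's profit.
    orders_needed = control_profit / variant_profit_per_order
    # Variant revenue at break-even, relative to control revenue.
    revenue_ratio = (variant_revenue_per_order * orders_needed) / (
        control_revenue_per_order * control_orders
    )
    return orders_needed, revenue_ratio, variant_profit_per_order, control_profit


orders_needed, revenue_ratio, variant_profit_per_order, control_profit = (
    free_shipping_break_even(price=90, shipping_fee=10, unit_cost=70,
                             shipping_cost=10, control_orders=1000)
)

# The "winning" test from the story: 50% more orders than control.
profit_at_50_percent_lift = 1500 * variant_profit_per_order

print(f"Orders needed to break even: {orders_needed:.0f}")
print(f"Revenue needed vs. control:  {revenue_ratio:.0%}")
print(f"Profit at +50% orders:       ${profit_at_50_percent_lift:,} vs. ${control_profit:,} in control")
```

Running the same function before a pricing or shipping test gives you the minimum mark to treat as neutral, instead of judging the result against a 0% lift.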
Always repeat content on your site
This and a large number of other common personalization type suggestions (who to target to and how to target to them) actually have a large number of issues inherent to them. The first is that even if what is suggested is true, it does not mean that it is the most valuable way to tackle the problem. Just because repeating content does improve performance by 3%, it doesn’t mean that doing something else completely will not result in a 10% or 50% increase.
The sad truth is that repeating content, when it does work, is often a very small incremental gain and pales in comparison to many other concepts of content that you could be trying. The goal is not to just do something that produces an outcome, as every action produces an outcome; the goal is to find the action that produces the maximum outcome for the lowest amount of resources. In that light, repeating content is often, but not always, a poor use of time and resources. The reason it is talked about is often not its performance but that it is easy to understand and easier to get buy-in for from above.
The second major problem with these is that they skip the entire discipline that leads to the answer. There is no problem with repeating content as long as you also try 3-4 other completely different forms of content. Repeating content may be the right answer, it may be an ok answer, and it may be the worst answer, but you only know that if you are open to discovering the truth. There is no problem with having a certain group or behavior you want to see if you can target; the issue is when you target them without looking at the other feasible alternatives. If you are not testing out multiple concepts to everyone and looking at them for the best combination, then no matter what you do you are losing revenue (and making you and your team do extra work).
The real irony, of course, is that if you test these out in a way that shows the impact compared to other alternatives, the absolute worst case scenario is that you are correct and you target as you would have liked. Any other scenario presents you with a piece of content, a group, or both that results in better performance. Knowing this information allows you to save time and effort in the future as well as spend resources on actions that are more likely to produce a result.
It is not unusual to find that targeting only a specific group will result in that group showing a slight increase, and if that is all that you look at, you would have evidence to present and share internally as success. Looking at the issue more deeply, you commonly find that the overall impact to the business is negligible (within the standard 2% natural variance) or, even worse, negative to the whole. It is also not uncommon to find that a combination you never thought of presents a massive gain.
One of my favorite stories in this line was when I worked with an organization that had decided exactly how and what to target to a number of specific groups based on a very complex statistical analysis of site behaviors. They had built out large amounts of infrastructure to facilitate this exact action. We instead took 100% of the same content they already had and presented it to everyone, looking at the impact of serving it to the groups they envisioned as well as to others. We simply took all their existing content and served it to everyone, and also in a few different dynamic permutations. The result showed that if they had done only what they had envisioned, they would have lost 18% of total leads on the site (this is also a great example of why causal inference is so vital and why not to rely on correlative inference). They also found that by serving 2 of their normal pieces of content based on behaviors they had not envisioned, they would see a 20% gain. They were able to go from causing dramatic harm to their business to a large, meaningful, multimillion dollar gain simply by not relying solely on hearsay and instead testing their assumptions.
In both cases there are many different ways you can manipulate the data to look like there was a positive outcome while actually doing damage. In both cases massive amounts of time and effort were spent to try something only to find an outcome counter to people’s assumptions. In both cases testing out assumptions and exploring to discover the value of different actions beforehand would have better informed the effort and created more value.
In the end, any idea is only going to be as valuable as the system you put it through. There is nothing inherently wrong with either concept as long as they are measured for efficiency and acted on rationally. If you can take a common heuristic and evaluate it properly, there is value to be had. That does not mean that they will act as a magical panacea, nor should you plan your program around such flawed, simple ideas. Focus on building the proper system and you will be able to provide value no matter what concepts get thrown your way.
There are many challenges for anyone entering a new field of study or a new discipline. We all come into any new concept with all of our previously held knowledge and beliefs filtering and changing how we view the new thing before us. Some choose to make it fit their world view, others dismiss it from fear, and others look for how it can change their current world view. Usually in these situations I quote Sherlock Holmes, “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” Nothing represents this challenge more in online marketing than the differences between analytics and optimization, and nothing represents that struggle more than the debate about visit based measurement versus visitor based measurement.
The debate about whether someone should use a visit, impression, or visitor basis for analysis is a perfect example of this problem, as it is not as simple as always using one or the other. When you are doing analytics, visits are usually the best way to look at data. When you are doing optimization, there is never a time when visits would present more relevant information than a visitor based view of the data.
Analytics = Visit
Optimization = Visitor
The only possible exceptions are when you are using adaptive learning tools. While the rules can be simple, a deep understanding of why they hold presents many other opportunities to improve your overall data usage and the value derived from every action.
Since most people reading this start from an analytics background, let’s look at what works best in that environment. Analytics is a single data set correlative data metric system, which is a long way of saying that it counts things on a consistent basis from only one set of data, even if that data has many different dimensions. You are only recording what was, not what could or should be. In that environment, you have to look at data in some very particular ways. The first amongst those is a very tight control on accuracy, since in many cases the use of that data is to represent what the business did, and hopefully to make predictions about the future.
It is also important that you are consistent in how you measure and that you look at things on a common basis. Because most people are comfortable looking at a daily or shorter basis, the easiest method is going to be a visit. It works well because you are trying to look at interactions and to measure a raw count of things that did happen, e.g. how many conversions, or how many people came from SEO. In those cases, a raw count in a correlative area is going to be best represented using a visit basis, since it mitigates lost data (though it is not a massive amount) and it best reflects the common basis on which people look at data.
In the world of optimization, however, you have a completely different usage and type of data. In optimization we are looking at a single comparative data point and trying to represent an entirely different measure, which is influence on behavior over time. It doesn’t matter if your site changes once a year or once an hour, or if your buying cycle is 1 visit or 180 days; all of those things are irrelevant to the fact that you are influencing a population over time. Because behavior is defined as influence on a population, and because we are looking comparatively over time, the measurement techniques used in analytics need to be rethought. Any concern about accuracy, past a simple point, becomes far less important than a measure of precision (consistency of data collection), since all error derived is going to be equally distributed. It doesn’t matter if the common basis is $4.50 or $487.62; what matters is the relative change based on the controlled factor. It is also important that we focus far more on the influence than the raw count, which means we are really talking about the behavior of the population.
In analytics you are thinking in terms of the count of the outcome (rate), whereas in optimization the focus is on the influence (value). To really understand optimization, you have to understand that all groups start with a standard propensity of action, which is represented by your control group. If you do nothing, the people coming to your site, people in all stages and all types of interaction, measure up to one standard measure across your site (though all measurement systems do have a small degree of internal variance). Since we are measuring not what the propensity of action is but what our ability to positively or negatively influence it is, we need to think in terms of reporting based on visitors and based on the change (lift), not the raw count.
You also have the case of time, where we need to measure total impact over time. While it is correct that every time a visitor hits your site you have a chance to influence them, it is important to remember that the existing propensity of action measurement already accounts for this. What we are looking for is a simple measure of what we accomplished in terms of getting them to spend more. This means that we have to think in terms of both long and short term behavior. Some people will purchase today, some 3 visits later, but all of that is part of standard business as usual. It is incredibly easy to have scenarios where you get more immediate actions but fewer long term actions. This means that on a daily basis you might see a short term spike, but for the business overall you are actually going to be making less revenue. This possibility creates two possible measurement scenarios:
1) There is no difference between short term and long term behavior, meaning the short term spike continues through and is positive also in the long term. In this scenario the only way to know that is to look at the long term.
2) There is a difference between short and long term behavior, and we get a different outcome by looking at the visitor metric over time. In this scenario the only view that leads to a positive outcome for the business is the visitor based metric view.
In both cases the visitor based metric view gives us the full picture of what is good for the business, while the visit based metric system either has no additional value or has a negative value by reaching a false conclusion. In either case the only measure that adds value and gives us a full picture is the visitor based view of the world. Visitor is not only the most complete view, no matter the situation, but also the only one that can give you a rational view of the impact of a change. To top it off, the choice to only look at the shorter window creates a distribution bias by valuing short term behavior over long term behavior, which may raise questions about the relevance of the data used to make any conclusion.
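To make the second scenario concrete, here is a small sketch that computes revenue per visit and revenue per visitor from the same log. The visit log is entirely made up for illustration: control visitors tend to come back and buy a $100 item on a later visit, while variant visitors buy a $60 item immediately and never return. The field layout and names are assumptions, not the output of any particular analytics tool.

```python
from collections import defaultdict

# Hypothetical visit log: (visitor_id, test_group, revenue_in_that_visit).
visit_log = [
    ("a1", "control", 0), ("a1", "control", 0), ("a1", "control", 100),
    ("a2", "control", 0), ("a2", "control", 100),
    ("a3", "control", 0),
    ("a4", "control", 0),
    ("b1", "variant", 60),
    ("b2", "variant", 60),
    ("b3", "variant", 0),
    ("b4", "variant", 0),
]


def summarize(log):
    """Return revenue per visit and revenue per visitor for each test group."""
    revenue = defaultdict(float)
    visit_count = defaultdict(int)
    visitors = defaultdict(set)
    for visitor_id, group, amount in log:
        revenue[group] += amount
        visit_count[group] += 1
        visitors[group].add(visitor_id)  # a set counts each visitor once
    return {
        group: {
            "per_visit": revenue[group] / visit_count[group],
            "per_visitor": revenue[group] / len(visitors[group]),
        }
        for group in revenue
    }


summary = summarize(visit_log)
for group, metrics in summary.items():
    print(f"{group}: ${metrics['per_visit']:.2f} per visit, "
          f"${metrics['per_visitor']:.2f} per visitor")
```

In this made-up data the variant wins on a visit basis and loses clearly on a visitor basis, because the visit view divides the control’s revenue across return visits that were part of normal buying behavior all along.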
The visitor vs. visit based view of the world is just one of many massive differences that reduce the value derived from optimization if not understood or not evaluated as a separate discipline. Because it is so easy to rationalize sticking with what is comfortable, it is common to find this massive weakness being propagated throughout organizations with no measure of what the cost really is. While not as damaging as others, like not having a single success metric or not understanding variance, it is vital that you think about visit and visitor based data as attached to the end goal and not as a single answer to everything.
In the end, the debate about which version to use is not really one about visits or visitors; there are clear reasons to choose visits for analytics and visitors for optimization. The real challenge is whether you and your organization understand the different data disciplines that are being leveraged. If you constantly look for different ways to think about each action, you will find new and better ways to improve value. If you fail to do so, you will cause damage throughout your organization and will not even know you are doing it.
You can’t go five minutes in the current business world without the terms big data, predictive, or statistical tool being thrown about. If one were to believe all of the hype, you would have no problems making perfect decisions and acting quickly, and everyone would be improving their performance by millions of dollars every hour. Of course everyone in the field also acknowledges just how far everyone else is from that reality, but they fail to mention the same errors in logic in their own promises and their own analysis. All data is leveraged using mathematical tools, many of which are not understood at the level necessary to maximize their value. Data can be both a powerful and important aid to improving business and a real deciding factor between success and failure. It can also be a crutch used to make poor decisions or to validate one opinion versus another. The fundamental truth is that nothing with “big data” is really all that new, and that in almost all cases the promises that people are making have no basis in reality. It is vital that people understand the core principles of statistics that will enable them to tell which of those two roles data is playing and to help maximize the value that data can bring to their organization.
So how then do you arm yourself to maximize outcomes and to combat poor data discipline? The key is in understanding key concepts of statistics, so that you can find when and how promises are made that cannot possibly be true. You do not need to understand the equations, or even have mastery-level depth on most of these topics, but it is vital that you understand the truth behind certain types of statistical claims. I want to break down the top few that you will hear, how they are misused to make promises, and how you can really achieve that level of success.
Correlation does not Equal Causation –
Problem – I don’t think anyone can get through college without having heard this phrase, and most can quote it immediately, but very few really focus on what it means. The key thing to take from this is that no matter how great your correlative analysis is, it cannot tell you the cause of the outcome nor the value of items without direct active interaction with the data. No matter how much you can prove a linear correlation or even find a micro-conversion that you believe is success, by itself it can never answer even the most basic of real world business questions. Correlations can be guiding lights towards a new revelation, but they can also just be empty noise leading you away from vital information. It is impossible to tell which if you leave the analysis at just basic correlation, yet in almost all cases this is where people are more than happy to leave their analysis. The key is to make sure that you do not jump to conclusions and that you incorporate other pieces of information instead of blindly following the data.
Even if I can prove a perfect correlation between email sign-ups and conversion rate, with both going up, I can never know from correlation alone whether getting more people to sign up for emails CAUSED more conversions, or whether the people we got to convert more are also more interested in signing up for email. In a test this is vital because not only is it easy to see those two points, but you are also limited to a single data point, making even correlation impossible to diagnose. It is incredibly common for people to claim they know the direction and that they need to generate more email signups in order to produce more revenue, but it is impossible to reach that conclusion based on purely correlative information, and it can be massively damaging to a business to point resources in a direction that is as likely to produce negative results as positive ones.
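A quick way to see why correlation cannot carry direction is that the correlation coefficient is symmetric: it returns the same number whichever variable you treat as the “cause”. The sketch below uses a plain Pearson correlation and made-up weekly rates for illustration; the data and variable names are assumptions, not measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical weekly figures that rise together perfectly.
email_signup_rate = [0.020, 0.024, 0.028, 0.032, 0.036]
conversion_rate = [0.010, 0.012, 0.014, 0.016, 0.018]

signups_then_conversions = pearson(email_signup_rate, conversion_rate)
conversions_then_signups = pearson(conversion_rate, email_signup_rate)

print(f"corr(signups, conversions) = {signups_then_conversions:.3f}")
print(f"corr(conversions, signups) = {conversions_then_signups:.3f}")
```

Both calls give the same value, so the statistic has nothing to say about which way the influence runs, or whether some third factor drives both. Only an induced change, such as a controlled test, can answer that.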
The fundamental key is to make sure that you are incorporating consistent ACTIVE interaction with data, where you induce change across a wide variety of items and measure the causal value of them. Combined with, or leading, your correlative information, this lets you discover amazing new lessons that you would never have learned before. Without doing this, the data that many claim is leading them to conclusions is often incomplete or fundamentally wrong and can in no way produce the insights that people are claiming. The core goal is always to minimize the cost of this active interaction with data while maximizing the number and level of alternatives that you are comparing. Failure to do this will inevitably lead to lost revenue and often false directions for entire product road maps, as people leverage data to confirm their opinions and not to truly use data rationally to produce amazing results.
Examples – Multiple success metrics, Attribution, Tracking Clicks, Personas, Clustering
Solution – Causal changes can arm you with the added information needed to answer these questions more directly, but in reality that is not always going to be an option. If nothing else, always remember that for any data to tell you what led to something else, you have to prove three things:
1) That what you saw was not just a random outcome
2) That the two items are correlated with each other, and not just with some other change
3) That the causal direction is the one you claim, since no conclusion holds without it
Just the very act of stopping people from racing ahead or abusing this data to prove their own agenda will dramatically improve the efficiency of your data usage as well as the value derived from your entire data organization.
Rate vs. Value –
Problem – There is nothing more common than finding patterns and anomalies in your analytics. This is probably the single core skill of all analysis, yet it is often among the most misused or abused actions taken with data. It can be segments that have different purchase behavior, channels that behave differently, or even “problems” with certain pages or processes. Finding a pattern or anomaly is at best the halfway point of actionable insight, not the final stop to be followed blindly. Rate is the pattern of behavior, usually expressed as a ratio of actions. Finding rates of action is the single most common and core action in the world of analytics, but the issue usually comes when we confuse the pattern we observe with the action needed to “correct” it. As with Correlation vs. Causation above, a pattern by itself is just noise. It takes active interaction and comparison with other, less easily identified options in order to validate the value of those types of analysis.
Statements like “Google users spend 4.34 min per visit” or “email users’ average visit depth is 3.4 pages” are examples of rates of action. What they are not is a measure of the value of those actions. Value is the change in outcome created by a certain action, not the rate at which people happened to do things in the past. Most people understand “past performance does not ensure future outcomes” but they fail to apply the same logic when it comes to looking for patterns in their own data. Value is expressed as a lift or differentiation, things like adding a button increased conversion by 14% or removing our hero image generated 18% more revenue per visitor.
The main issues come from confusing the ability to measure different actions with knowing how to change someone’s behavior. The simplest example of this is the null hypothesis: what would happen if that item wasn’t there? Suppose 34% of people click on your hero image, by far the highest share on your homepage. What would happen if that image wasn’t there? You wouldn’t just lose 34% of people; they would instead interact with other parts of the page. Would you make more or less revenue? Would it be better or worse?
It also comes down to two different business questions. At face value, the only question you could answer with pattern analysis alone is, “What is an action we can take?” In the ideal business case you would instead answer, “Based on my current finite resources, what is the action I can take to generate the most X?” where X is your single success metric. Rates of action carry no measure of the ability to change behavior or of the cost to do so, and as such they cannot answer many of the business questions that they are erroneously applied to.
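The distinction between a rate and a value can be sketched in a few lines of arithmetic. The 34% click rate and the 18% lift echo the examples in the text; the visitor counts and the revenue-per-visitor figures are made up so that the numbers work out, and the helper names are hypothetical.

```python
def rate(actions, visitors):
    """A rate: how often something happened. It describes past behavior only."""
    return actions / visitors

def lift(control, variant):
    """A value: relative change in the success metric caused by a tested change."""
    return (variant - control) / control

# Rate: 340 of 1000 homepage visitors clicked the hero image (assumed counts).
hero_click_rate = rate(340, 1000)

# Value: revenue per visitor with the hero (control) vs. without it (variant).
# Both dollar figures are invented for illustration.
rpv_with_hero = 2.00
rpv_without_hero = 2.36
hero_removal_lift = lift(rpv_with_hero, rpv_without_hero)

print(f"hero click rate: {hero_click_rate:.0%}")
print(f"lift from removing the hero: {hero_removal_lift:.0%}")
```

Nothing in the click rate predicts the sign or size of the lift. The second number only exists because the page was actually served without the image.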
Examples – Personalization, Funnel Analysis, Attribution, Page Analysis, Pathing, Channel Analysis
Solution – The real key is to make sure that, built into any optimization plan, you are incorporating active data acquisition and that you are always testing null assumptions and measuring the value of items. This information combined with knowledge of influence and cost to change can be vital, but without it your data is likely empty noise. There are entire fields of math dedicated to this, the most common being bandit-based problem solving. Once you have actively acquired knowledge, you start to build information that can inform and lower the cost of data acquisition, but never replace it.
These are but two of the many areas where people consistently make mistakes when leveraging data and concepts from statistics, and reach false conclusions as a result. Data should be your greatest asset, not your greatest liability, but until you help your organization make data-driven decisions and not data-validated decisions there are always going to be massive opportunities for improvement. Make it a focus to improve your organization’s understanding of and interaction with each of these concepts and you will start using far fewer resources and producing far better outcomes. Failure to do so ensures the opposite outcome over time.
Understanding data and data discipline has to become your biggest area of focus, and educating others your primary directive, if you truly want to see your organization take the next step. Don’t let just reporting data or making claims of analysis be enough for you, and you will quickly find that it is not enough for others either.
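For readers who have not met bandit problems, the sketch below shows one of the simplest strategies, epsilon-greedy, as an illustration of trading off the cost of acquiring data against acting on what has been learned. It is only one of many bandit approaches and is not presented as the recommended method. The conversion rates, the 10% exploration share, and the function name are all assumptions.

```python
import random

def epsilon_greedy(true_rates, rounds=20000, epsilon=0.1, seed=11):
    """Serve one of several experiences per visitor and learn which converts best.

    true_rates are the (normally unknown) conversion rates of each experience;
    they are supplied here only so the simulation can generate outcomes.
    """
    rng = random.Random(seed)
    arms = range(len(true_rates))
    pulls = [0] * len(true_rates)  # times each experience was served
    wins = [0] * len(true_rates)   # conversions observed for each experience
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: keep paying a small cost to acquire data on every option.
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: serve the experience with the best observed rate so far.
            arm = max(arms, key=lambda i: wins[i] / pulls[i])
        pulls[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]
    return pulls, wins

pulls, wins = epsilon_greedy([0.05, 0.10, 0.20])
print("times served:", pulls)
print("conversions: ", wins)
```

The exploration share never drops to zero in this sketch, which mirrors the point above: what you have learned can lower the cost of data acquisition but does not replace it.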