How You can Stop Statistics from Taking Advantage of You – Part 1
You can’t go five minutes in the current business world without the terms big data, predictive or statistical tool being thrown about. If one was to believe all of the hype you would have no problems making perfect decisions, acting quickly, and all everyone would be improving their performance by millions of dollars every hour. Of course everyone in the field also acknowledges just how far everyone else is from that reality, but they fail to mention the same errors in logic from their own promises and their own analysis. All data is leveraged using mathematical tools many of which do not have the level of understand that are necessary to maximize their value. Data can both be a powerful and important aid to improving business and a real deciding factor between success and failure. It can also be a crutch used to make poor decisions or to validate one opinion versus another. The fundamental truth is that nothing with “big data” is really all that new, and that in almost all cases, the promises that you people are making have no basis in reality. It is vital that people understand core principles of statistics that will enable them to differentiate when data is being used in either of those two roles and to help maximize the value that data can bring to your organization.
So how then do you arm yourself to maximize outcomes and to combat poor data discipline? The key is in understanding key concepts of statistics, so that you can find when and how promises are made that cannot possibly be true. You do not need to understand the equations, or even have masterly level depth on most of these topics, but it is vital that you understand the truth behind certain types of statistical claims. I want to break down the top few that you will hear, and how they are misused to make promises, and how do you really achieve that level of success.
Correlation does not Equal Causation –
Problem– I don’t think anyone can get through college without having heard this phrase, and most can quote it immediately, but very few really focus on what it means. The key thing to take from this is that no matter how great your correlative analysis is it can not tell you cause of the outcome nor the value of items without direct active interaction with the data. No matter how much you can prove a linear correlation or even find a micro-conversion that you believe is success, by itself it can never answer even the most basic of real world business questions. They can be guiding lights towards a new revelation, but they can also just be empty noise leading you away from vital information. It is impossible to tell if you leave the analysis at just basic correlation, yet in almost all cases this is where people are more then happy to leave their analysis. The key is to make sure that you do not jump to conclusions and that you incorporate other pieces of information instead of blindly following the data.
Just because I can prove a perfect correlation between email sign-ups and conversion rate, that they both go up, I can never know from correlation alone if getting more people to sign-up for emails CAUSED more conversions, or if the people we got to convert more are also more interested in signing up for email. In a test this is vital because not only is it easy see those two points, but you are also limited with only a single data point making even correlation impossible to diagnose. It is incredibly common for people to claim they know the direction and that they need to generate more email signups in order to produce more revenue, but it is impossible to make that conclusion based on purely correlative information alone and it can be massively damaging to a business to point resources in a direction that can equally produce negative and not positive results.
The fundamental key is to make sure that you are incorporating consistent ACTIVE interaction with data, where you induce change across a wide variety of items and measure the casual value of them. Combined or leading your correlative information you can discover amazing new lessons that you would never have learned before. Without doing this the data that many claim is leading them to conclusions is often incomplete for fundamentally wrong and can in no way produce the insights that people are claiming. The core goal is always to minimize the cost of this active interaction with data while maximizing the number and level of alternatives that you are comparing. Failure to do this will inevitably lead to lost revenue and often false directions for entire product road maps as people leverage data to confirm their opinions and not to truly use data rationally to produce amazing results.
Examples – Multiple success metrics, Attribution, Tracking Clicks, Personas, Clustering
Solution – Causal changes can arm you with the added information needed to answer these questions more directly, but in reality that is not always going to be an option. If nothing else, always remember that for any data to tell you what lead to something else, you have to prove three things:
1) That what you saw was not just a random outcome
2) That the two items are correlated with each other, and not just some other change
3) That you need to prove causal direction to be able to prove any conclusion
Just the very act of stopping people from not racing ahead or abusing this data to prove their own agenda will dramatically improve the efficiency of your data usage as well as the value derived from your entire data organization.
Rate vs. Value –
Problem – There is nothing more common than finding patterns and anomalies in your analytics. This probably is the single core skill of all analysis, yet it can often be the most misused or abuse actions taken with data. It can be segments that have different purchase behavior, channels that behave differently, or even “problems” with certain pages or processes. Finding a pattern or anomaly at best is simply the halfway point of actionable insight, not the final stop to be followed blindly. Rate is the pattern of behavior, usually expressed as a ratio of actions. Finding rates of action is the single most common and core action in the world of analytics, but the issue usually comes when we confuse the pattern we observe with the action to “correct” that action. Like Correlation vs. Causation above though, a pattern by itself is just noise. It takes active interaction and comparison with other less identified able options in order to validate the value of those types of analysis.
Just because Google users spend 4.34 min per visit or email users average visit depth is 3.4 pages are examples of rates of action. What this is not is the measure of value of those actions. Value is the change in outcome created by that certain action not the rate at which people happen to do things in the past. Most people understand “past performance does not ensure future outcomes” but they fail to apply the same logic when it comes to looking for patterns in their own data. Value is expressed as a lift or differentiation, things like adding a button increased conversion by 14% or removing our hero image generated 18% more revenue per visitor.
The main issues come from confusing the ability to measure different actions with knowing how to change someone’s behavior. The simplest example of this is the simple null hypothesis of what would happen if that item wasn’t there? Just because 34% of people click on your hero image which is by far the highest amount on your homepage, what would happen if that image wasn’t there? You wouldn’t just lose 34% of people, they would instead interact with other part of the page. Would you make more less revenue? Would it be better or worse?
It also comes down to two different business questions. At face value the only possible question you could answer with just pattern analysis is, “What is an action we can take?”, in the ideal value business case you would instead answer “Based on my current finite resources, what is the action I can take to generate the most X” where X is your single success metric. Rates of value have no measure of ability to change or of cost to do so, and as such they can not answer many of the business questions that they are erroneously applied to.
Examples – Personalization, Funnel Analysis, Attribution, Page Analysis, Pathing, Channel Analysis
Solution – The real key is to make sure that built into any plans of optimization you are incorporating active data acquisition and a that you are always measuring null assumptions and measuring the value of items. This information combined with knowledge of influence and cost to change can be vital, but without it is likely empty noise. There are entire studies in math dedicated to this, with the most common being bandit based problem solving. Once you have actively acquired knowledge, you then will start to build information that can start to inform and improve the cost of data acquisition, but never replace it.
These are but two of the many areas where people consistently make mistakes when leveraging data and concepts from statistics to make false conclusions. Data should be your greatest asset not your greatest liability, but until you help your organization make data driven decisions and not data validated decision there are always going to be massive opportunities for improvement. Make it a focus to improve your organizations understanding and interaction with each of these concepts and you will start using far less resources and making far better outcomes. Failure to do so also insures the opposite outcomes over time.
Understand data and data discipline have to become your biggest areas of focus and educating others your primary directive if you truly want to see your organization take the next step. Don’t let just reporting data or making claims of analysis be enough for you and you will quickly find that it is not enough for others.