
Simpson’s paradox, also known as the Yule-Simpson effect, is a statistical phenomenon in which a trend that appears in each of several groups of data disappears, or even reverses, when the groups are combined. This paradox can be confusing and can lead to incorrect conclusions being drawn.

The paradox is named after Edward H. Simpson, who described it in a 1951 paper, although similar effects had been noted decades earlier by Karl Pearson and Udny Yule (hence the alternative name). It is a common occurrence in statistical analysis and can be seen in many different fields, including social science, economics, and medicine.

One classic example of Simpson’s paradox involves a study on the effectiveness of a new drug. Within each subgroup the drug looked better: it reduced the number of heart attacks more than the control treatment among men, and also among women. Yet when the data for both sexes were pooled, the control treatment appeared more effective overall.

So, why did the two analyses disagree? The answer lies in the fact that the groups were not balanced. Men and women differed both in their baseline risk and in how many of each received the drug rather than the control, so the larger, higher-risk group dominated the pooled numbers and masked the consistent within-group trend.
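To make the reversal concrete, here is a minimal sketch in Python. The counts are invented for illustration and are not from any real trial; “success” here means a patient who did not have a heart attack.

```python
# Hypothetical counts illustrating Simpson's paradox (invented numbers,
# not from any real study). Each entry: (successes, patients).
data = {
    ("drug",    "men"):   (81, 87),
    ("drug",    "women"): (192, 263),
    ("control", "men"):   (234, 270),
    ("control", "women"): (55, 80),
}

def rate(successes, total):
    return successes / total

# Within each sex, the drug has the higher success rate...
for sex in ("men", "women"):
    d = rate(*data[("drug", sex)])
    c = rate(*data[("control", sex)])
    print(f"{sex}: drug {d:.3f} vs control {c:.3f}")

# ...but pooled across sexes, the control arm looks better.
def pooled(arm):
    s = sum(data[(arm, sex)][0] for sex in ("men", "women"))
    n = sum(data[(arm, sex)][1] for sex in ("men", "women"))
    return s / n

print(f"pooled: drug {pooled('drug'):.3f} vs control {pooled('control'):.3f}")
```

The trick is in the denominators: most drug patients are in the higher-risk female group, while most control patients are in the lower-risk male group, so the pooled averages compare very different populations.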

Simpson’s paradox can also be seen in studies on education. For example, a study may find that students in a certain school district perform better on standardized tests than students in a neighboring district. However, once the data are stratified by socioeconomic background, the neighboring district’s students may perform just as well, or better, within each group; the raw comparison was driven by differences in who attends each district rather than by the quality of the schools.

Another example of Simpson’s paradox can be seen in the hiring practices of a company. Aggregate numbers may show the company hiring male candidates at a higher rate than female candidates. However, when the data are broken down by role or qualification level, the hiring rates within each group may be equal, or may even favor women; the aggregate gap can come entirely from how candidates are distributed across roles.

So, how can we avoid the pitfalls of Simpson’s paradox? One way is to analyze the data carefully and look for confounding variables: factors, such as group size or baseline risk, that are associated with both the grouping and the outcome. It is also important to consider sample size, as small subgroups can produce unstable rates.

Another way to avoid Simpson’s paradox is to use a stratified analysis, which involves dividing the data into meaningful subgroups and analyzing each subgroup separately. This can help to identify underlying trends or patterns that are not apparent when the data are analyzed as a whole.

Simpson’s paradox can also be addressed with multivariate analysis, which considers multiple variables at once, for example by including the grouping variable as a covariate in a regression. This can reveal interactions between variables that are not apparent when each variable is considered individually.

Overall, Simpson’s paradox is a common occurrence in statistical analysis, and it is important to be aware of it in order to avoid drawing incorrect conclusions. By carefully analyzing the data and considering all possible factors, it is possible to avoid the pitfalls of Simpson’s paradox and draw more accurate conclusions.

# One Problem but Many Voices – The Many Ways to Explain Rate & Value

One of my great struggles in the entire data world is getting people to understand the difference between rate and value. This problem has a thousand different faces, yet it can be extremely difficult to find the right way to correct the misconception in any particular case. People constantly try to bend data to show that they provided value, or that something is directly tied to an outcome, when the data alone cannot tell you that.

I was recently faced with explaining this to a person new to the data discipline, and found that once again my answer was much longer and more complicated than I would have liked. It seems like such a simple concept, but the truth is that everyone has their own way of understanding and tackling this problem. With that in mind, I reached out to some of the smartest people I know to see how they handle the issue. The specific problem I asked about was explaining the difference between, and the often contradictory nature of, revenue attribution and revenue generation.

Not everyone agrees on the issue or how to express it, and that is why it is so difficult for some, especially those who don’t deal with it on a daily basis. It takes many great voices to find the tools that enable anyone to correctly tackle large, complex issues.

Below are a few of the answers that I was able to gather:

Brent Dykes – Author of Web Analytics Action Hero and general analytics guru –

A rate is simply a calculated metric. We use rates to measure all kinds of things such as the bounce rate of a landing page or conversion rate of a checkout process. In order to get value from rates we need both context and comparisons. On its own, a rate doesn’t tell us anything useful.

For example, if my site’s conversion rate is 10%, you’d think that would be great. In the back of your mind, you may remember reading somewhere that the average conversion rate for most sites is between 2% and 3%, so 10% sounds fantastic. However, when we start to add context and perform comparisons, this number may end up sounding less appealing. What if my site’s conversion rate last year was 15% compared to today’s 10%? What if similar country sites in my organization have 20% conversion rates? What if my closest industry peers recently shared in a media article that they have average conversion rates of 30%? Now the 10% conversion rate doesn’t sound as good.

A rate simply provides us with a number, and what we do with the measure is what adds value. When we analyze what’s happening with the conversion rate, we can determine how to create more value or stop value leakages. Through testing we can confirm what we found in our analysis (correlation vs. causation) before making wholesale changes. It’s important to use the right rates or metrics, but the numbers without any context or comparisons are meaningless. Value only comes from understanding the rates and making changes to improve them over time.

Russell Lewis – Optimization Consultant

Here is one that spawns from my latest fantasy football win.

You have two QBs to play. One has a higher completion rate than the other, which suggests he should post a high score come game time. When you decide to play him, he falls flat on his face. The rate did not give you the value of his performance; it just showed what he has done in the past, completed passes relative to attempted passes. The value of what he actually did is seen in comparison to the QB on the bench who had the 10 additional points needed to win the game for the week. Without the comparison to the other QB and the current matchup, we would have no value.

Anonymous –

To me, revenue allocation has always been a method for ranking performance in much the same way page views or visits are. It gives you something to sort by, and that’s about it. Not to mention that, depending on the type of allocation you are using, you may be inflating your total revenue anyway, so it is inherently not a reliable method of determining revenue-generating sources.

When trying to determine revenue-generating sources, I have always relied on a less granular outlook: rather than saying “this email message generated \$X,” stepping back and saying “email campaigns drove \$X, while SEO drove \$X.” Get much more granular than that, and you begin speculating too much about human nature, which is anything but reliable.

To me that is when you get into the psychology of it, and it gets too nit-picky. I think broadly if you are trying to determine whether to put ad dollars in email or SEO it can help…but when you start saying “well, if we put out an email with this call to action, it will generate \$X in return” you have a problem.

To me it is a gross misuse of the scientific method…you almost need to look at the control group and see what they are doing before you can determine anything. No one looks at the visitors not associated with a campaign…maybe people on the site just buy stuff on their own.

Jared Lees – Business Consulting Manager

• Revenue allocation – similar to attribution or correlation. Assigning credit to an activity. The amount of credit depends on the business rules or attribution model you want to use.

• Revenue generation – total revenue acquired from a singular action. There could be other actions that influenced it, but we aren’t counting that here.
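This distinction between assigning credit and counting revenue can be sketched in code. The channel names, customer journeys, and dollar figures below are invented; the point is that the same three orders produce different channel “credit” depending on the attribution model you choose.

```python
# Sketch of revenue allocation under two hypothetical attribution models.
# A journey is the ordered list of channels touched before a purchase.
# All channels, journeys, and revenue figures are invented for illustration.
orders = [
    (["email", "seo", "email"], 100.0),
    (["seo"],                   50.0),
    (["display", "email"],      80.0),
]

def last_touch(orders):
    """All credit goes to the final channel before the purchase."""
    credit = {}
    for journey, revenue in orders:
        credit[journey[-1]] = credit.get(journey[-1], 0.0) + revenue
    return credit

def linear(orders):
    """Credit is split evenly across every touch in the journey."""
    credit = {}
    for journey, revenue in orders:
        share = revenue / len(journey)
        for channel in journey:
            credit[channel] = credit.get(channel, 0.0) + share
    return credit

print(last_touch(orders))  # email takes full credit for two of the orders
print(linear(orders))      # same $230 of revenue, a very different story
```

Note that under both models the allocated credit still sums to the \$230 actually generated; a “full credit” model, where every touched channel receives the entire order value, would hand out \$410 of credit against \$230 of real revenue, which is exactly the inflation problem the anonymous contributor describes.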

Rhett Norton – Senior Retail Consultant & Team Lead

I think the person was saying something like this: “I looked at a specific channel and it said it generated \$4.50 worth of revenue.” And then you would say, “Who cares what the rate is? We need to find what impacts true value and actually changes revenue, since that number is just a rate.”

I think the best thing that helps explain these types of situations is explaining causation and correlation.

This made me think of the jobs report on the economy. The unemployment rate is 8%, and there is not really anything we can do with that; it is just a number, a rate. Lots of people like to look at different sectors and pretend to know what is going on when they can say that growth increased in the technology sector; this is similar to the page participation example above. Again, there isn’t anything you can do with that, it is just a rate. The real question is how we move the needle: what actions create jobs, and what actions make jobs decline.

Derek Tangren – Principal Analytics Consultant –

I’d describe the two as follows:

• Revenue allocation is a method/means to assign success based on certain behavior
• Revenue generation references an action that you are taking in order to invoke a positive change in driving revenue

I would define revenue generation as the action you take, and revenue allocation as the means by which you measure its success.

There were many more answers, as you would expect. Some said it didn’t matter because the point is to give executives evidence to support their agenda; others reduced the situation to correlation versus causation; and still others didn’t acknowledge the problem at all. Most agreed that the problem is a major one, but were unable to come up with a simple, direct way to convey the message.

Like so much in the online data world, there is no simple answer. Even more, there are as many different agendas and points of view as there are ways to answer the question. Simple answers will always leave you with more questions than answers. How do you deal with this when running your program? Is this the type of battle that you wage, and if not, why? How do you know when you are having the right conversations?