February 27, 2012

Bridging the Gap: Dealing with Variance between Data Systems

One problem that never seems to go away in the world of data is a lack of education about what it means to compare data between systems. When faced with the issue, too many companies take the variance between their different data solutions as a major sign of a problem with their reporting, when in reality variance between systems is expected. One of the hardest lessons a group can learn is to focus on the value and usage of information over the exact measure of the data. This plays out now more than ever as more and more groups find themselves with a multitude of tools, all offering reporting and other features about their sites and their users. As they deal with the reality of multiple reporting solutions, they discover that every tool reports different numbers, be it visits, visitors, conversion rates, or just about anything else. The realization that there is no single measure of what you are or what you are doing can be startling, and for some groups it strips them of their faith in their data. The variance problem is nothing new, but if it is not understood correctly, it can lead to massive internal confusion and distrust of the data.

I had to learn this lesson the hard way. I worked for a large group of websites that used six different systems for basic analytics reporting alone. I led a team that dove into the different systems to understand why they reported different things and to figure out which one was “right.” After losing months of time and almost losing complete faith in our data, we came away with some really important, hard-won lessons. We learned that the use of the data is paramount, that there is no one view or right answer, that variance is almost completely predictable once you learn the systems, and that we would have been far better served spending that time on how to use the data instead of on why the systems were different.

I want to help your organization avoid the mistakes that we made. The truth is that no matter how deep you go, you will never find all the reasons for the differences. The largest lesson we learned was that an organization can get so caught up in the quest for perfect data that it forgets about the actual value of that data. To make sure you don’t get caught in this trap, I want to help you establish whether you have a problem at all, explain the most common reasons for variance between systems, and offer some suggestions about how to think about and use the new data challenge that multiple reporting systems present.

Do you have a problem?

First, we must set some guidelines around when you have a variance problem and when you do not. Systems designed for different purposes will leverage the same data in very different ways. No two systems will match, and in a lot of cases, numbers that are too close mean that artificial constraints have been placed on the data and are actually hindering its usability. At the same time, if the numbers are too far apart, that is a sign that there might be a reporting issue with one or both of the solutions.

Here are two simple questions to evaluate if you do have a variance “problem”:

1) What is the variance percentage?

Normal variance between similar data systems is almost always between 15-20%.
For non-similar data systems the range is much larger, and is usually between 35-50%.

If the gap is too small or too large, then you may have a problem. On similar data systems, a 2% variance is actually a worse sign than a 28% variance.

Many groups run into the issue of trying too hard to constrain variance. The result is that they put artificial constraints on their data, causing the representative nature of the data to be severely hampered. Just because you believe that variance should be lower does not mean that it really should be or that lower is always a good thing.

This analysis should be done on non-targeted groups of the same population (e.g., all users to a unique page). The variance for dependent tracking (segments) is always going to be higher.
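As a rough illustration of the first question, here is a minimal Python sketch. The formula is an assumption on my part (the difference between the two counts, measured against whichever system you treat as the baseline), and the function names and sample counts are hypothetical. The ranges are the rules of thumb quoted above, not hard limits.

```python
# Rule-of-thumb ranges quoted above, in percent. Treat them as rough guides.
TYPICAL_RANGES = {
    "similar": (15.0, 20.0),
    "non_similar": (35.0, 50.0),
}


def variance_pct(baseline: float, other: float) -> float:
    """Percentage difference of `other` relative to `baseline`.

    Assumption: variance is measured against one chosen baseline system.
    """
    if baseline <= 0:
        raise ValueError("baseline count must be positive")
    return abs(baseline - other) / baseline * 100.0


def describe_variance(pct: float, kind: str = "similar") -> str:
    """Say where a variance percentage falls relative to the typical range."""
    low, high = TYPICAL_RANGES[kind]
    if pct < low:
        return "below typical range"
    if pct > high:
        return "above typical range"
    return "within typical range"


# Hypothetical daily visit counts for the same page from two analytics tools.
visits_system_a = 10_000
visits_system_b = 8_300

pct = variance_pct(visits_system_a, visits_system_b)
print(f"{pct:.1f}% variance: {describe_variance(pct)}")
```

A result below the typical range is worth a look, just as a result above it is. Neither one is automatically good news.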

2) Is the variance consistent in a small range?

Over a few days you may see daily variance percentages such as 13, 17, 20, 14, 16, 21, 12, but you should not see 5, 40, 22, 3, 78, 12.
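One simple way to check this is to look at the spread between the highest and lowest daily variance. The sketch below assumes a cutoff of 10 percentage points, which is my own illustrative choice and not a number from any tool. The function name is hypothetical as well.

```python
def is_consistent(daily_variance_pct, max_spread=10.0):
    """Return True if the daily variance values stay within a small range.

    `max_spread` is an assumed cutoff in percentage points; tune it to
    what is normal for your own systems.
    """
    values = list(daily_variance_pct)
    if not values:
        raise ValueError("need at least one daily variance value")
    return max(values) - min(values) <= max_spread


steady = [13, 17, 20, 14, 16, 21, 12]   # the stable series from the text
erratic = [5, 40, 22, 3, 78, 12]        # the unstable series from the text

print(is_consistent(steady))   # spread is 9 points
print(is_consistent(erratic))  # spread is 75 points
```

With the assumed cutoff, the first series passes and the second does not, which matches the intuition described above.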

If your variance is within the normal range and stays consistent from day to day, then congratulations: you are dealing with perfectly normal behavior, and I could not more strongly suggest that you spend your time and energy on how best to use the different data.

Data is only as valuable as how you use it, and while we love the idea of one perfect measure of the online world, we have to remember that each system is designed for a purpose, and that making one universal system comes with the cost of losing specialized function and value.

Always keep in mind these two questions when it comes to your data:

1) Do I feel confident that my data accurately reflects my users’ digital behavior?

2) Do I feel that things are tracked in a consistent and actionable fashion?

If you can’t answer those questions with a yes, then variance is not your issue. Variance is the measure of the differences between systems. If you are not confident in a single system, then there is no point in comparing it. Equally, if you are comfortable with both systems, then the differences between them should mean very little.

The most important thing I can suggest is that you pick a single data system as the system of record for each action you take. Every system is designed for a different purpose, and with that purpose in mind, each one has advantages and disadvantages. You can definitely look at each system for similar items, but when it comes time to act or report, you need to be consistent and have all concerned parties aligned on which system is the one that everyone looks at. Choosing how and why you are going to act before you get to that part of the process is the easiest, fastest way to ensure that organizational barriers are reduced. Getting this agreement is far more important for going forward than diving into the causes behind normal variance.

Why do systems always have variance?

For those of you who are still not completely sold or who need to at least have some quick answers for senior management, I want to make sure you are prepared.
Here are the most common reasons for variance between systems:

1) The rules of the system – Visit-based systems track things very differently than visitor-based systems. They are meant for very different purposes. In most cases, a visit-based system is used for incremental daily counting, while a visitor-based system is designed to measure action over time.

2) Cookies – Each system has different rules about tracking and storing cookie information over time. These rules will dramatically impact what is or is not tracked. This is even more true for 1st versus 3rd party cookie solutions.

3) Rules of inclusion vs. Rules of exclusion – For the most part, all analytics solutions are rules of exclusion, meaning that you really have to do something (IP filter, data scrubbing, etc.) to not be tracked. A lot of other systems, especially testing, are rules of inclusion, meaning you have to meet very specific criteria to be tracked. This will dramatically impact the populations, and also any tracked metrics from those populations.

4) Definitions – What something means can be very specific to a system. Be it a conversion, a segment, a referrer, or even a site action, the very definition can be different. An example of this would be a paid keyword segment. If I land on the site and then see a second page, what is the referrer for that second page? Is it the referrer for the visit or the referring page? Is it something I did on an earlier visit?

5) Mechanical Variance – There are mechanical differences in how systems track things. Are you tracking the click of a button with an onclick? Or is it landing on the previous page? Or is it the server request? Do you use a log file system or a beacon system? Is that a unique request or is it added on to the next page tag? Do you rely on cookies or are all actions independent? What are the different timing mechanisms for each system? Do they collide with each other or with other site functions?
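To make reason 3 more concrete, here is a small Python sketch of how rules of exclusion and rules of inclusion can produce different populations from the exact same traffic. Everything in it is hypothetical: the `Hit` fields, the blocked IP, and the inclusion criteria are stand-ins chosen for illustration, not how any particular product works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    visitor_id: str
    ip: str
    accepts_cookies: bool
    saw_test_page: bool


def tracked_by_exclusion(hits, blocked_ips):
    """Analytics-style: everyone is counted unless a rule removes them."""
    return {h.visitor_id for h in hits if h.ip not in blocked_ips}


def tracked_by_inclusion(hits):
    """Testing-style: only visitors who meet every criterion are counted."""
    return {h.visitor_id for h in hits if h.accepts_cookies and h.saw_test_page}


# Hypothetical traffic: five visitors hitting the same site.
hits = [
    Hit("v1", "10.0.0.1", True, True),
    Hit("v2", "10.0.0.2", True, False),    # never reached the tested page
    Hit("v3", "10.0.0.3", False, True),    # rejects cookies
    Hit("v4", "192.168.0.9", True, True),  # internal IP, filtered by analytics
    Hit("v5", "10.0.0.5", True, True),
]

analytics_population = tracked_by_exclusion(hits, blocked_ips={"192.168.0.9"})
testing_population = tracked_by_inclusion(hits)

print(len(analytics_population), len(testing_population))
```

Both systems saw the same five visitors, yet each one ends up with a different population, and only two visitors appear in both. Any metric built on those populations will differ as well.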

Every system does things differently, and these smaller differences can build up over time, especially when combined with some of the other reasons listed above. There are hundreds of reasons beyond those listed, and the reality is that each situation is unique and is the culmination of the impact of all of those reasons. You will never get to the point where you can describe with 100% certainty why you get the variance you do.

Variance is not a new issue, but it is one that can be the death of programs if not dealt with in a proactive manner. Armed with this information, I would strongly suggest that you hold conversations with your data stakeholders before you run into the questions that inevitably come. Establishing what is normal, how you act, and a few reasons why you are dealing with the issue should help cut all of these problems off at the pass.
