Understand the math behind it all: The N-Armed Bandit Problem

One of the great struggles marketers have when they enter new realms, especially those of analytics and testing, is trying to apply the disciplines of math to what they are doing. They are amazed by the promise of models and of applying a much more rigorous discipline than the normal qualitative discussions they are used to. The problem is that most marketers are not PhDs in statistics, nor have they worked much with the math as it applies to their real-world issues. We have all this data and all this promised power before us, but most lack the discipline to interact with the data and really derive value from it. In this series, I want to explain some of the math concepts that impact daily analysis, especially the ones a majority of people do not realize they are struggling with, and show you how and where to use them, as well as their pragmatic limitations.

In the first of these, I want to introduce the N-Armed Bandit problem, as it is really at the heart of all testing programs and is a fundamental way of evaluating the proper use of resources.

The N-Armed Bandit problem, more commonly known as the multi-armed bandit problem (the name comes from the "one-armed bandit," slang for a slot machine), is the fundamental concept of balancing the acquisition of new knowledge with the exploitation of that knowledge for gain. The concept goes like this:

You walk into a casino with N slot machines. Each machine has a different payoff. If the goal is to walk away with the most money, then you need to go through a process of figuring out which slot machine has the highest payout, yet hold back as much money as possible in order to exploit that machine. How do you balance the need to test the payouts of the different machines against the need to reserve as much money as possible for the machine with the greatest payout?

Which one do you choose?
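To make the setup concrete, here is a minimal sketch of the casino in Python. The number of machines, the hidden payout probabilities, and the budget below are illustrative assumptions, not data from any real program.

```python
import random

random.seed(42)  # fix the randomness so the illustration is repeatable

N = 5  # the "N" in N-Armed Bandit: how many slot machines are in front of you

# Each machine's true average payout is hidden from the player.
true_payouts = [random.uniform(0.1, 0.9) for _ in range(N)]

def pull(machine):
    """Put one coin in a machine; it pays out 1 unit with its hidden probability."""
    return 1 if random.random() < true_payouts[machine] else 0

# The player's dilemma: with a fixed budget of pulls, how many do you spend
# learning which machine pays best, and how many do you save to exploit it?
budget = 1000
```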

Exploring the casino

As we dive into the real-world application of this concept, it is important that you walk away with a few key points about why it matters to you. An evaluation of the N-Armed Bandit problem, and how we interact with it in the real world, leads to two main goals:

1) Discovery of the relative value of actions

2) The most efficient use of resources for that discovery and for exploitation

The N-Armed Bandit problem is at the core of machine learning and of testing programs, and it does not have a one-size-fits-all answer. There is no perfect way to learn and to exploit, but there are a number of well-known strategies. In the real world, where the system is constantly shifting and the values are constantly moving, it gets even more difficult, but that does not make it any less valuable. All organizations face the fundamental struggle of how best to apply resources, especially between doing what they are already doing and exploring new avenues or functional alternatives. Do you put resources where you feel safe, where you think you know the values? Or do you use them to explore and find out the value of other alternatives? The tactics used to solve the N-Armed Bandit problem come down to how greedy you try to be, and they give you ways to think about applying those resources. Where most groups falter is when they fail to balance those two goals, becoming lost in their own fear, egos, or biases, either diving too deep into "trusted" outlets or going too far down the path of discovery. The challenge is to keep to the rules of value and of bounded loss.
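One of the better-known tactics, and the simplest way to see the "how greedy you try to be" knob, is an epsilon-greedy strategy: spend a small fraction of your pulls exploring machines at random and the rest exploiting the best machine you have seen so far. The sketch below continues the casino setup above; the 10% exploration rate is an illustrative choice, not a recommendation.

```python
def epsilon_greedy(pull, n_machines, budget, epsilon=0.1):
    """Play `budget` pulls, exploring with probability `epsilon` and exploiting otherwise."""
    estimates = [0.0] * n_machines  # running average payout observed per machine
    counts = [0] * n_machines       # how many pulls each machine has received
    total = 0

    for _ in range(budget):
        if random.random() < epsilon:
            machine = random.randrange(n_machines)  # explore: try any machine
        else:
            # exploit: play the machine with the best observed payout so far
            machine = max(range(n_machines), key=lambda m: estimates[m])

        reward = pull(machine)
        counts[machine] += 1
        # incremental update of this machine's running average payout
        estimates[machine] += (reward - estimates[machine]) / counts[machine]
        total += reward

    return total, estimates

total, estimates = epsilon_greedy(pull, N, budget, epsilon=0.1)
print(total, [round(e, 2) for e in estimates])
```

The larger you make epsilon, the more of your budget goes to discovery; the smaller you make it, the more you are betting that your current estimates are already right.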

The reason this problem comes into play for all testing programs is that the entire need for testing is the discovery of the values of the various variants, or concepts, measured against one another. If you do not allow this question to enter your testing, then you are only ever throwing resources towards what you assume is the value of a change. Knowing just one outcome can never help you be efficient. How do you know what value you could have gotten by throwing all your money into one slot machine? While it is easy to convince yourself that because you did get a payout, you did the right thing, the evaluation of the different payouts is the heart of improving your performance. You have to focus on applying resources, which are finite for every group, to achieve the highest possible return.

In an ideal world, you would already know all possible values, be able to name the value of each action intrinsically, and then apply all your resources towards the one action that brings you the greatest return (a greedy action). Unfortunately, that is not the world we live in, and the trouble starts when we allow ourselves that delusion. The problem is that we do not know the value of each outcome, and as such we need to maximize our ability to discover those values.
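To see what that delusion costs, set epsilon to zero in the sketch above and you get the pure-greedy player, who acts as if the current estimates were the true values. This is a toy illustration, not a benchmark, but it makes the failure mode plain.

```python
# Pure greed: never explore, always play the machine with the best estimate so far.
# Because every estimate starts at 0.0, the argmax always resolves to machine 0,
# so the entire budget goes into one machine without ever learning the others' payouts.
greedy_total, greedy_estimates = epsilon_greedy(pull, N, budget, epsilon=0.0)
```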

If the goal is to discover the value of each action, and then exploit it, then the fundamental challenge is how best to apply the least amount of resources, in this case time and work, to the discovery of the greatest number of relative values. The challenge becomes one purely of efficiency. We have to create a meaningful testing system and efficiencies in our organization, whether political, infrastructural, or technical, in order to minimize the resources we spend and maximize the number of variations we can evaluate. Every time we get sidetracked, or run a test that does not have this goal of exploration at its heart, or pretend we have a better understanding of the value of things via the abuse of data, we are being inefficient and are failing to answer this question for the highest possible value. The goal is to create a system that allows you to meet this need, to measure each value against the others, to discover and to exploit, in the shortest time and with the least amount of resources.

An example of a suboptimal design for testing based on this is any single-recipe "challenger" test. Ultimately, any "better" test is going to limit your ability to see the relative values. You want to test a banner on your front door, but how do you know that it is more important than your other promos? Or your navigation, or your call to action? Just because you have found an anomaly or pattern in the data, what does that mean for the other alternatives? If you only test or evaluate one thing by itself, or don't test feasible options against each other, then you will never know the relative value of those actions. You are just putting all your money into one slot machine, not knowing if it has a higher payout than the others near it.

This means that any action that is taken by a system that limits the ability to measure values against each other, or that does not allow you to measure values in context, or that does not acknowledge the cost of that evaluation, is inefficient and is limiting the value of the data. Anything that is not directly allowing you the fastest way to figure out the payouts of the different slot machines is losing you value. It also means that any action that requires additional resources for that discovery is suboptimal.

If we have accepted that we have to be efficient in our testing program, we still have to deal with the greatest limiter of impact: the people in the system. Every time we are limited to "best practices" or by a HiPPO (the highest paid person's opinion), we have lowered the possible value we can receive. Some of the great work by students of probability, especially Nassim Nicholas Taleb, has shown that, for systems over time, the more human-level interference there is, or the less organic the system is allowed to be, the lower the value and the greater the pain we create.

Comparing organic versus inorganic systems:

[Figure: Taleb - Value of a System]

We can see that for any inorganic system, one that has all of those rules forced onto it, there is, over time, a lot less unpredictability than people think, and there is almost a guarantee of a loss of value for each rule and each assumption that is entered into the system. One of the fastest ways to improve your ability to discover the various payouts is to understand just how many slot machines are in front of you. Every time you think you are smarter than the system, or you get caught up in "best practices" or popular opinion, you have forced an inorganic limit onto the system. You have artificially said that there are fewer machines available to you. This means that for the discovery part of the system, the best thing for our program, and for gaining value, is to limit human subjectivity and imposed rules in order to ensure the highest amount of value.

An example of these constraints is any hypothesis-based test. If you limit your possible outcomes to only what you "think" will win, you will never be able to test everything that is feasible. Just because you hear a "best practice" or someone has a golden idea, you have to make sure you are still testing it relative to other possibilities, and you cannot let it bias your evaluation of the data. It is okay to have an idea of what you think will win going in, but you cannot limit your testing to it. That is the same as walking up to the slot machine with the flashiest lights, just because the guy next to you said to, and putting your money only in that machine.

Everyone always says the house wins, and in Vegas that is how it works. In the real world, the deck may be stacked against you, but that does not mean you are going to lose. Once you understand the rules of the game and can think in terms of efficiency and exploitation, you have the advantage. If you can agree that at the end of the day your goal is to walk out of that casino with the largest stack of bills possible, then you have to focus on learning and exploiting. The odds really aren't stacked against you here, but the only way to win this game is to be willing to play it the right way. Do you choose the first and flashiest machine? Or do you want to make money?
