How bandit tests really work

In preparation for a presentation, I recently found myself looking for a layperson’s explanation of how bandit testing works and why. I couldn’t find one that suited my needs: the Wikipedia entry on the subject is at once short on detail and rather arcane, while most of the articles I’ve been able to locate on the Internet seem more concerned with establishing bandit testing’s superiority over A/B testing, which, at least without a good number of caveats, is an absurd claim to make. Since I have to write some notes for my presentation anyway, I figured I’d make a blog post out of them in the interest of never having to write this again.

How bandit testing works

Bandit testing is dead easy to understand. Let’s start with a scenario that will probably be familiar to the kind of people who are most likely to read this blog: suppose that you want to optimize the title of an article on your website’s homepage to maximize the number of people who click on it. You get together with your editor and come up with a list of, say, five titles; you believe that your visitors are probably going to like one title best, but you can’t say which one at the outset.

There’s one more kink: You know from your previous measurements that your articles tend to get the vast majority of their traffic—let’s say, on average, 10,000 views—within a short time of their publication. After this time, they fall off the homepage, and an immediately catchy title becomes less important. Therefore, you know that there will be a limited number of opportunities for your title experiment to work its magic, and you want to maximize your success within that limited number of views.

Your bandit test begins here. Since you don’t know which title is best, your first objective is to try to figure out whether one of them performs better than the others. Therefore, you take the first few views (let’s say 1,000) and show each title to 200 of them at random, giving every title an equal chance of succeeding.

At the end of this exploration phase, you find that Titles 1, 2, and 5 have very poor conversion rates, while Titles 3 and 4 fare much better, with Title 4 eking out Title 3 by a small margin.

Based on the information available to you now, therefore, Title 4 is the winner. Title 3 may still turn out to be better in the long run, but further experimentation is expensive: you know that you only have 9,000 more views to invest, and more random testing may take you further away from an optimal result. Besides, you reason, even if Title 3 turns out to be better than Title 4, by now you have enough confidence that the difference will be small, and it might well turn out to be worth less than the additional views you’d have to waste in order to figure it out.

Therefore, you now move from exploration to exploitation. From this point on, you accept that Title 4 is your best choice, and serve it exclusively until the end of the experiment.
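
To make this concrete, here is a minimal sketch of the whole experiment in Python. The titles, the 1,000-view exploration budget, and the simulated click-through rates are all illustrative assumptions, not real data.

```python
import random

TITLES = ["Title 1", "Title 2", "Title 3", "Title 4", "Title 5"]
TRUE_RATES = [0.02, 0.03, 0.09, 0.10, 0.01]  # unknown to the experimenter
EXPLORE_VIEWS = 1_000
TOTAL_VIEWS = 10_000

clicks = [0] * len(TITLES)
shows = [0] * len(TITLES)

# Exploration phase: round-robin serves each title 200 times.
for view in range(EXPLORE_VIEWS):
    i = view % len(TITLES)
    shows[i] += 1
    if random.random() < TRUE_RATES[i]:
        clicks[i] += 1

# Exploitation phase: commit to the best observed title for the rest.
best = max(range(len(TITLES)), key=lambda i: clicks[i] / shows[i])
for _ in range(TOTAL_VIEWS - EXPLORE_VIEWS):
    shows[best] += 1
    if random.random() < TRUE_RATES[best]:
        clicks[best] += 1

print(f"Committed to {TITLES[best]}; total clicks: {sum(clicks)}")
```

Run it a few times and you will occasionally see the test commit to Title 3 rather than Title 4: with only 200 exploration views per title, a small true difference is easy to miss, which is exactly the trade-off described above.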

Tests unbounded

This approach to exploration, called epsilon-first because all the exploration (conventionally identified by the Greek letter $$\epsilon$$) happens at the beginning, makes sense because of three important assumptions:

  • We only have a limited number of opportunities to run our experiment
  • One of the possible choices leads to a better outcome than the others
  • Whatever makes one title the winner is unlikely to change over the time of the experiment

Let’s say that you now want to use a bandit test elsewhere on your site to determine whether the colour of the “buy” button in your online cart can influence users to purchase more often. The assumptions above no longer apply, because you’re now planning to run the experiment indefinitely, rather than for a finite number of views. Besides, the tastes of your audience may well change over time, and doing all your exploration upfront may cost you in the long run.

Therefore, you decide to change the exploration strategy so that each request has a 10 percent chance of being served a colour picked at random from all the possible choices (with each choice having an equal chance of being picked). The remaining 90 percent of requests are automatically served the colour that, at that point in time, has earned the most clicks when it was picked at random.

You set up the test and let it run, confident that, over time, the best possible choice will be served for the majority of the time.
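
Here is a minimal epsilon-greedy sketch, under the same caveat as before: the colours and their click probabilities are invented for illustration. One small liberty: rather than counting only the clicks earned during random exploration, this version exploits the best click-through rate observed across all serves, which is the more common formulation.

```python
import random

COLOURS = ["red", "green", "blue"]
TRUE_RATES = [0.05, 0.07, 0.04]  # unknown to the experimenter
EPSILON = 0.10  # the 10 percent exploration rate described above

clicks = [0] * len(COLOURS)
shows = [1] * len(COLOURS)  # start at 1 to avoid dividing by zero

for _ in range(100_000):
    if random.random() < EPSILON:
        # Exploration: pick a colour uniformly at random.
        i = random.randrange(len(COLOURS))
    else:
        # Exploitation: pick the colour with the best observed rate.
        i = max(range(len(COLOURS)), key=lambda j: clicks[j] / shows[j])
    shows[i] += 1
    if random.random() < TRUE_RATES[i]:
        clicks[i] += 1

for colour, c, s in zip(COLOURS, clicks, shows):
    print(f"{colour}: served {s} times, observed rate {c / s:.3f}")
```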

Exploring further

After running a few of these tests, you realize that this exploration strategy, called epsilon-greedy, is in general quite successful, but tends to suffer from a few inefficiencies.

For one thing, at the beginning, you’re likely to waste a significant portion of the views on the wrong colour, simply because you won’t have explored enough to let the best colour emerge. As a result, there will be a period where the best choice will “wobble” around as the test figures things out. In addition, if your users’ tastes in colours don’t change, you will eventually be throwing away your exploration views, since, in that case, the best choice will remain the same.

Not to worry! You decide to adapt your exploration strategy so that it is now epsilon-adaptive. For example, you could start with a period of pure exploration, followed by a phase during which exploration is interspersed with exploitation at random, as in the epsilon-greedy approach; the rate of exploration could even decrease over time, ramping back up only if you start noticing a change in the overall trends during your exploration phases.
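
One of many possible schedules is sketched below; the warm-up length, decay curve, and floor are all arbitrary illustrative choices, not a canonical recipe.

```python
def epsilon_for(request_number, warmup=1_000, floor=0.01):
    """Exploration rate for a given request in an epsilon-adaptive test."""
    if request_number < warmup:
        return 1.0  # pure exploration at the beginning
    # Decay towards a small floor so some exploration always remains,
    # letting the test notice if tastes drift later on.
    return max(floor, warmup / request_number)
```

Plugging this in place of the fixed `EPSILON` in the previous sketch turns the epsilon-greedy test into an epsilon-decreasing one.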

As you have probably figured out by now, picking the right exploration strategy is what really determines whether a bandit test is ultimately successful. Most tools tend to only implement epsilon-greedy, which offers a good “all-purpose” approach, but a different method can yield significantly better results under the right circumstances.

Why bandit testing works

One final thing to note is that bandit testing is not concerned with picking winners. That is, its goal is not to eventually indicate that you can keep one of your choices and discard the others.

Instead, the aim of bandit testing is to minimize regret. “Regret” has a very specific meaning in decision theory, but, even intuitively, it’s not hard to see that the objective here is to converge as efficiently as possible on a solution that is as good as possible given a necessarily limited amount of exploration—so that one doesn’t regret spending too much time exploring and not enough time exploiting.
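
For the mathematically inclined, the standard formalization runs as follows: if each choice $$i$$ has an unknown expected reward $$\mu_i$$, the best choice has $$\mu^* = \max_i \mu_i$$, and the test serves choice $$a_t$$ on view $$t$$, then the expected regret after $$T$$ views is

$$R_T = T\mu^* - \sum_{t=1}^{T} \mu_{a_t}$$

that is, the reward given up by not serving the best choice every single time. A good exploration strategy keeps $$R_T$$ growing as slowly as possible.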

In other words, bandit tests are really good at two things: dampening mistakes and amplifying successes. They are, however, terrible at telling you that one of your choices is statistically superior to the others.

To be clear, that’s not to say that a bandit test cannot be modified to also provide statistically valid results; however, that’s not a very simple task.

Appendix A: Beyond titles and colours

As I mentioned earlier, I picked two examples for this post that I thought would resonate best with the kind of reader that my blog is likely to have, but the usefulness of bandit testing is by no means limited to websites.

Given that bandits do particularly well when you have a limited set of resources to play with, they are a great choice for maximizing the ROI of your advertising; for example, you could run a bandit test between multiple keywords, or between multiple providers, and know that most of your fixed budget will go towards the choice that brings you the most results. Similarly, you could use a bandit test to allocate your R&D budget among multiple departments, or to optimize e-mail campaigns.

An even more interesting use for bandits is in determining the optimal pricing for a product. While there are some very sophisticated approaches to this particular problem, the basic idea is not that hard to understand: segment a product by a trivial attribute (for example, model name), and then run a bandit in which each choice is a different price point.
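
As a closing sketch, here is what such a pricing bandit might look like, reusing the epsilon-greedy approach from earlier. The price points and purchase probabilities are invented; the key difference from the previous examples is that the reward is revenue (price times a sale) rather than a click.

```python
import random

PRICES = [19.99, 24.99, 29.99]
BUY_RATES = [0.10, 0.07, 0.05]  # unknown to the experimenter
EPSILON = 0.10

revenue = [0.0] * len(PRICES)
offers = [1] * len(PRICES)  # start at 1 to avoid dividing by zero

for _ in range(100_000):
    if random.random() < EPSILON:
        i = random.randrange(len(PRICES))  # explore a random price
    else:
        # Exploit the price with the best revenue per offer so far.
        i = max(range(len(PRICES)), key=lambda j: revenue[j] / offers[j])
    offers[i] += 1
    if random.random() < BUY_RATES[i]:
        revenue[i] += PRICES[i]

for p, r, o in zip(PRICES, revenue, offers):
    print(f"${p}: offered {o} times, ${r / o:.2f} revenue per offer")
```

Note that the winner here need not be the price with the most sales: the bandit converges on whichever price point earns the most revenue per visitor.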