Beating A/B Tests
Over the past year, we've run 39 different experiments on The Muse with a total of 113 variations. This testing has played a pivotal role in our ability to rapidly and effectively iterate toward a better site.
We're currently experimenting with various layouts for social buttons at The Muse. Part of what we're gauging is placement of the buttons. For example, placing them on the side of the page:
Versus the top:
We're also testing button sizes. Here are the bigger buttons to the side:
And we're testing the effect of "priority" social, where we hide the buttons for all but the most popular social sites by default:
These tweaks are small enough to justify quantitative analysis, where we measure the performance of the different variations and go forward with the one that does best. The standard way to do this is A/B testing, but we've opted to use a different method for this, and all of our experiments.
It's called bandit testing, and it's way better than A/B. This article provides a good summary, but the important takeaways are:
- Bandit tests allow you to run as many variations at the same time as you want, versus A/B testing, which limits you to two.
- Bandit tests automatically figure out who should be opted into which variation, whereas in A/B you have to statically assign percentage breakdowns.
Thanks to these two properties, you'll reach a conclusion more quickly while sacrificing less of your user base to lower-converting variations.
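To make that concrete, here's a minimal sketch of an epsilon-greedy bandit, the method most frameworks use (more on it below). The class and method names here are hypothetical illustrations, not oz's or OpBandit's actual API:

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit: with probability epsilon, explore
    a random variation; otherwise exploit the best performer so far."""

    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in arms}     # times each arm was shown
        self.rewards = {arm: 0.0 for arm in arms}  # conversions per arm

    def choose_arm(self):
        if random.random() < self.epsilon:
            # Explore: pick any arm, uniformly at random.
            return random.choice(list(self.counts))
        # Exploit: pick the arm with the highest observed conversion rate.
        return max(self.counts,
                   key=lambda a: self.rewards[a] / max(self.counts[a], 1))

    def record(self, arm, converted):
        self.counts[arm] += 1
        self.rewards[arm] += 1.0 if converted else 0.0
```

Because exploitation dominates, the bulk of traffic automatically flows to whichever variation is converting best - no static percentage breakdowns needed.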
But there's a deeper, less obvious benefit. Had we run these social button variations as a series of A/B tests, we'd be at much greater risk of reaching a local maximum. That is, the odds of us missing the best-performing variation in favor of one that's merely adequate would be higher.
We're testing along three dimensions: button size, button placement, and priority buttons. In an A/B framework, we could run this in one of two ways:
- Run three simultaneous A/B tests, one for each dimension. The problem with this is that it won't let us see the results of combinations. We'd settle on the independently best option for each dimension, but it's possible the best overall combination differs from those options. For example, large buttons may perform better on vertical layouts, but worse on horizontal ones.
- Test each possible combination of layouts, one at a time. This would require a lot of effort - I'd have to go back and change the code for each of the nine possible variations. Further, closing out an A/B test is far more time-sensitive than closing out a bandit experiment, since a static (and likely large) percentage of traffic receives a sub-optimal experience - usually 50% for smaller companies like our own. So I'd have to be on top of my game to make the code changes as soon as the chi-squared statistic was high enough.
With bandit tests, we're running all nine variations simultaneously. We don't have to worry about hitting a local maximum with respect to this experiment, and we can "set it and forget it." Because bandit tests dynamically route the bulk of traffic to the best-performing option, I can rest assured that not too many users are receiving a sub-optimal experience once we've reached confidence and the test is ready to close out.
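To illustrate, each combination of options becomes its own arm. The option lists below are placeholders (the exact nine variations aren't enumerated here), reusing the hypothetical `EpsilonGreedyBandit` sketch from above:

```python
from itertools import product

# Placeholder option lists - illustrative, not the real test matrix.
sizes = ["regular", "large"]
placements = ["side", "top"]
priority = ["all_buttons", "priority_only"]

# Every combination becomes its own arm, so the bandit can surface
# interaction effects (e.g. large buttons winning only on the side).
arms = ["/".join(combo) for combo in product(sizes, placements, priority)]
bandit = EpsilonGreedyBandit(arms, epsilon=0.1)
```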
There are, of course, critiques of bandit tests over A/B. Visual Website Optimizer released a post in response to the aforementioned bandit primer. Aside from the conflict of interest, the critique is utterly ridiculous - there's nothing stopping you from tuning a bandit test to behave exactly like an A/B test. Just limit the number of options to two, as we do with the bulk of our experiments. Or, if you want to perfectly mimic A/B testing behavior and you're using the epsilon-greedy method (as most frameworks do), set the epsilon to 1, in which case users get placed in one of the buckets completely at random, just like a 50/50 A/B split.
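In terms of the hypothetical sketch above, that degenerate case is a one-liner:

```python
# epsilon=1.0 means every request explores: users are assigned to a
# bucket uniformly at random, exactly like a fixed 50/50 A/B split.
ab_style_test = EpsilonGreedyBandit(["control", "variant"], epsilon=1.0)
```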
If you're interested in giving bandit testing a go and are on the Tornado web framework, our bandit-testing code is freely available through oz. Otherwise, plenty of libraries and services are out there - for example, we used OpBandit for some time.
Yusuf is a chef, avid ukulele player, and hip-hop artist. Unfortunately he does all of those poorly, so he sticks to his day job writing software. Prior to The Muse, Yusuf worked as a developer at companies both big (Microsoft, IBM) and small (dotCloud, Transloc), and as an Associate Product Manager at Google. Find him on GitHub, Hacker News, or say hi on Twitter.