A/B testing - statistical hypothesis testing or multi-armed bandit

With the emergence of lean startup and big data, more and more companies are embracing A/B testing. While it's encouraging that the industry is starting to test features statistically using data, very few are aware of a superior alternative to traditional statistical hypothesis testing: the multi-armed bandit. In this post, I will give an overview of why the multi-armed bandit is superior to hypothesis testing for the majority of applications. For readers who are not familiar with why testing features with data is important, more details can be found in "Bridging the gap between lean startup in theory and in practice." For readers who are looking for social proof, a multi-armed bandit solution is also what Google Analytics has adopted for their content experiments.

Overview of statistical hypothesis testing

The objective of hypothesis testing in A/B testing is to tell whether the observed difference in conversion rates can be explained by chance alone. The standard methodology, sketched in code below, involves

  1. Calculating the sampling distribution under the null hypothesis
  2. Calculating the likelihood of the observed difference under the sampling distribution, and
  3. Comparing that probability against a pre-determined threshold.
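
For concreteness, here is a minimal sketch of the three steps for two treatments, using a two-proportion z-test (one common choice; the post does not prescribe a particular test, and the example counts are hypothetical):

```python
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Step 1: the sampling distribution under the null hypothesis
    # (both treatments share the pooled conversion rate).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    # Step 2: how likely is a difference this extreme under that distribution?
    z = (p_a - p_b) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided p-value
    # Step 3: compare against the pre-determined threshold.
    return p_value, p_value < alpha

# e.g., 10/150 vs. 5/100 conversions: p-value ~ 0.59, not significant
print(two_proportion_z_test(10, 150, 5, 100))
```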

While intuitive at first glance, it actually requires substantial statistical knowledge to design the experiment and interpret its results properly. For instance,

  • How to bound type II error?
  • How to test more than two treatments simultaneously? 
  • How many observations are required? 
  • What are the right thresholds?
  • Is it OK to peek at the results and terminate the experiment early?

Overview of multi-armed bandit

The name "multi-armed bandit" describes a scenario in which a gambler faces several slot machines, a.k.a., “one-armed bandit", with different expected payouts. The objective is to maximize the total reward earned through a sequence of lever pulls. To achieve the objective, the multi-armed bandit solution adaptively balances the cost to gather information of the uncertain levers by playing them (exploration) with the cumulative rewards from playing the known good arms (exploitation).

In the context of A/B testing, each slot machine represents a treatment in the experiment, each play represents an impression of a treatment, and the cumulative reward represents the cumulative conversions. While there are many different algorithms, e.g., UCB and epsilon-greedy, in this post we will focus on one algorithm called "Thompson sampling".

High-level overview of Thompson sampling

The idea behind Thompson sampling is very simple. The algorithm maintains the posterior distribution of the payout rate of each arm, plays each arm in proportion to the probability that it is optimal under the posterior, and then updates the posterior as new observations are made. For example, for two treatments with 10 conversions out of 150 impressions and 5 out of 100, the posterior conversion rate distributions would be (ignoring the prior's pseudo-counts) Beta(10, 140) and Beta(5, 95). Now, instead of deterministically playing the first treatment, which has the higher observed conversion rate, for all subsequent trials, Thompson sampling stochastically plays the first treatment with probability P(1st treatment is the best treatment) and the second treatment with probability P(2nd treatment is the best treatment) under the current posterior conversion rate distributions, and it updates the posteriors as new observations are made. For readers who are not familiar with Bayesian statistics: the Beta distribution is the conjugate prior of the Binomial distribution, which is used to model the conversions.
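
A minimal sketch of one round of Thompson sampling, using the counts from the example above (a toy illustration; a production system would choose an explicit prior and persist the counts):

```python
import numpy as np

rng = np.random.default_rng()
successes = np.array([10.0, 5.0])   # conversions per treatment
failures = np.array([140.0, 95.0])  # non-conversions per treatment

def choose_arm():
    # Draw one sample from each arm's Beta posterior and play the largest;
    # this selects each arm with exactly the posterior probability that
    # the arm is optimal.
    return int(np.argmax(rng.beta(successes, failures)))

def update(arm, converted):
    # Conjugacy makes the update trivial: a Bernoulli observation just
    # increments one of the Beta parameters.
    if converted:
        successes[arm] += 1
    else:
        failures[arm] += 1

arm = choose_arm()           # pick the treatment to show
update(arm, converted=True)  # record the outcome of the impression
```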

Comparison

Now, with a basic understanding of both problem formulations and their solutions, let's compare them.

1. Thompson sampling is simple

To properly interpret statistical hypothesis testing, practitioners need a good understanding of basic statistical testing, power analysis, bias correction when peeking at results, bias correction when dealing with multiple treatments, etc. On the other hand, practitioners only need to understand basic Bayesian statistics to understand Thompson sampling. The simpler concept makes interpreting the results much less error-prone.

2. Thompson sampling directly estimates the probability that each arm is optimal

Statistical hypothesis testing tries to answer the question, "what is the probability of observing a difference this extreme, assuming all treatments have the same conversion rate?" Thompson sampling, on the other hand, tries to answer the question, "given the observations, what is the probability that each given arm is optimal?" While both questions are valid, the latter is much easier to understand, and it naturally balances the tradeoff between type I and type II errors.
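
To make this concrete, the quantity Thompson sampling acts on can be estimated by simple Monte Carlo over the example posteriors from above (a sketch; the post does not prescribe an estimation method):

```python
import numpy as np

# Estimate P(arm is optimal) under the posteriors Beta(10, 140) and Beta(5, 95).
rng = np.random.default_rng(0)
draws = rng.beta([10, 5], [140, 95], size=(100_000, 2))
p_optimal = np.bincount(draws.argmax(axis=1), minlength=2) / len(draws)
print(p_optimal)  # approximately [0.71, 0.29]: the first arm is likely better
```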

3. Multi-armed bandit typically converges much faster

As multi-armed bandit solutions are adaptive, the number of trials required to identify the best arm (if one exists) is typically much lower than the number of trials needed for statistical hypothesis testing. However, this does imply that a separate stopping criterion is needed when all treatments perform about the same; one possibility is sketched below.
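
One possible stopping rule (an illustration, not something the post prescribes) is to stop once some arm is optimal with high posterior probability, reusing the Monte Carlo estimate from the previous sketch:

```python
import numpy as np

def should_stop(successes, failures, threshold=0.95, n_draws=100_000):
    # Stop once some arm is optimal with posterior probability >= threshold.
    rng = np.random.default_rng()
    draws = rng.beta(successes, failures, size=(n_draws, len(successes)))
    share = np.bincount(draws.argmax(axis=1), minlength=len(successes)) / n_draws
    return share.max() >= threshold

# With the example counts, neither arm clears 95% yet, so keep running.
print(should_stop(np.array([10, 5]), np.array([140, 95])))  # False
```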

4. Multi-armed bandit generalizes naturally to multiple treatments

This is really where multi-armed bandit shines. Since multi-armed bandit is adaptive, it can quickly determine which arms are unlikely to be optimal and play those inferior arms with lower probability. In statistical hypothesis testing, on the other hand, each treatment is allocated the same number of trials, determined by power analysis before the experiment.
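
A toy simulation makes the difference visible (the three conversion rates are hypothetical and hidden from the algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.04, 0.05, 0.07])  # hypothetical, unknown to the algorithm
wins, losses = np.ones(3), np.ones(3)      # Beta(1, 1) uniform priors

for _ in range(10_000):
    arm = int(np.argmax(rng.beta(wins, losses)))  # Thompson sampling step
    if rng.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

plays = wins + losses - 2
print(plays / plays.sum())  # the 0.07 arm typically receives most of the plays
```

A fixed even split, by contrast, would spend two thirds of its impressions on the inferior treatments for the entire duration of the experiment.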

To sum up, multi-armed bandit has many practical advantages over traditional statistical hypothesis testing: it typically converges much faster, leaves less room for misinterpretation, generalizes better to multiple treatments, and requires fewer tuning parameters. For startups that are serious about A/B testing, multi-armed bandit is strongly recommended.

If you like the post, you can follow me (@chengtao_chu) on Twitter or subscribe to my blog "ML in the Valley". Also, special thanks to Ian Wong (@ihat) and Bob Ren (@bobrenjc93) for reading a draft of this.

12 responses
Hello! Overall, you raise many good points. That said, while auto-optimization / MABs sound great in theory, they too have issues. For example, most of the available MAB-based tools haven't implemented contextually aware algorithms, so you ultimately end up creating a lot more work implementing the same experiment for many different user segments (a time-consuming pain point), or you settle for "what's the best option for most". Separately, overlapping experiments in a MAB setup, where one experiment's better-performing arm can inadvertently skew (positively) the results of another experiment's inferior arm, is a real problem, and most tools available don't offer a solution to it. There are also issues with traffic quality and returning-customer bias (through things like the recency/novelty effect) inadvertently skewing the results of one arm when it in fact might be the inferior one; in this situation, you would inadvertently be maximizing, not minimizing, regret, and depending on your "stopping point criteria", this might lead to wrong conclusions. A frequentist approach *typically* doesn't suffer from the above issues, since you are typically doing a 50/50 split of traffic and can then segment results (although, as you've noted above, you'd still have to account for the multiple comparisons problem). Now, I'm not saying MABs aren't superior; I do think they are the future, but I don't think we are quite there yet with a tool that solves the above issues plaguing MABs. Thanks again for the article. Brian Lang
Brian, thank you for the great and insightful comment. I completely agree with all the problems you raised. However, I do want to encourage people to start using MAB instead of hypothesis testing, as I have seen many people make mistakes interpreting basic experimental results from hypothesis testing, and those experiments weren't even complex.
Brian, I have implemented some MAB algorithms in Storm in one of my open source projects. I do take context (customer segment, etc.) into account; multiple MAB models get built concurrently in real time for different contexts. https://github.com/pranab/avenir
Hi, I'd like to translate this post into Chinese as well, so I'm commenting to ask for your permission, as before (I've translated your "ML Done Wrong" post).
Hi weakish, sure, as long as my name and a link to the original post are cited.
Chinese translation: https://www.jqr.com/article/000414 (the name is cited at the beginning, and there is a link to the original post at the end of the translated text)