With the emergence of lean startup and big data, more and more companies are embracing A/B testing. While it’s encouraging that the industry is starting to test features statistically using data, very few are aware of a superior alternative to traditional statistical hypothesis testing: the multi-armed bandit. In this post, I will give an overview of why multi-armed bandit is superior to hypothesis testing for the majority of applications. For readers who are not familiar with why testing features with data is important, more details can be found in "Bridging the gap between lean startup in theory and in practice." For readers looking for social proof, the multi-armed bandit solution is also what Google Analytics has adopted for its new content experiments.
Overview of statistical hypothesis testing
The objective of hypothesis testing in A/B testing is to tell whether the observed difference in conversion rates can be explained away by chance. The standard methodology involves
- Calculating the sampling distribution under the null hypothesis
- Calculating the likelihood of the observed difference under the sampling distribution, and
- Comparing that probability against a pre-determined threshold.
While intuitive at first glance, it actually requires substantial statistical knowledge to design the experiment and interpret its results properly. For instance,
- How to bound type II error?
- How to test more than two treatments simultaneously?
- How many observations are required?
- What are the right thresholds?
- Is it okay to peek at the results and terminate the experiment early?
Overview of multi-armed bandit
The name "multi-armed bandit" describes a scenario in which a gambler faces several slot machines, a.k.a., “one-armed bandit", with different expected payouts. The objective is to maximize the total reward earned through a sequence of lever pulls. To achieve the objective, the multi-armed bandit solution adaptively balances the cost to gather information of the uncertain levers by playing them (exploration) with the cumulative rewards from playing the known good arms (exploitation).
In the context of A/B testing, each slot machine represents a treatment in the experiment, each play represents an impression of a treatment, and the cumulative reward represents the cumulative conversions. While there are many different algorithms, e.g., UCB and epsilon-greedy, in this post we will focus on one algorithm called "Thompson sampling".
High level overview of Thompson sampling
The idea behind Thompson sampling is very simple. The algorithm maintains the posterior distribution of the payout rate of each arm, plays each arm in proportion to the probability that it is optimal under the posterior, and updates the posterior as new observations are made. For example, for two treatments with 10/150 and 5/100 conversions over impressions observed, the posterior conversion rate distributions would be Beta(10, 140) and Beta(5, 95). Now, instead of deterministically playing the first treatment, which has the higher observed conversion rate, for all subsequent trials, Thompson sampling stochastically plays the first treatment with P(1st treatment is the best treatment) and the second treatment with P(2nd treatment is the best treatment) under the current posterior conversion rate distributions. Lastly, the posteriors are updated as new observations are made. For readers who are not familiar with Bayesian statistics, the Beta distribution is the conjugate prior of the Binomial distribution, which is used here to model conversion rates.
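The sample-play-update loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `true_rates` values are made-up conversion rates used only to simulate outcomes, and a uniform Beta(1, 1) prior is assumed for each arm.

```python
import random

def thompson_step(successes, failures):
    """Draw a conversion rate from each arm's Beta posterior and
    return the index of the arm with the highest draw."""
    draws = [random.betavariate(s + 1, f + 1)  # Beta(1, 1) uniform prior
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])

def run_bandit(true_rates, n_trials, seed=0):
    """Simulate Thompson sampling against arms with known true rates."""
    random.seed(seed)
    k = len(true_rates)
    successes, failures = [0] * k, [0] * k
    for _ in range(n_trials):
        arm = thompson_step(successes, failures)
        if random.random() < true_rates[arm]:  # simulated conversion
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

# Two hypothetical treatments with true conversion rates of 10% and 5%.
successes, failures = run_bandit([0.10, 0.05], n_trials=5000)
plays = [s + f for s, f in zip(successes, failures)]
print(plays)  # the better arm receives the large majority of plays
```

Note how the allocation is adaptive: as evidence accumulates that the first arm is better, its posterior draws win more often, so it gets played more often.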
Comparison
Now, with the basic understanding of both problem formulations and solutions, let's compare them.
1. Thompson sampling is simple
To properly interpret statistical hypothesis testing, practitioners need a good understanding of basic statistical testing, power analysis, bias correction when peeking into results, bias correction when dealing with multiple treatments, etc. On the other hand, practitioners only need to understand basic Bayesian statistics to understand Thompson sampling. The simpler concept makes interpreting the results much less error-prone.
2. Thompson sampling directly estimates the probability of which arm is optimal
Statistical hypothesis testing tries to answer the question "what is the probability of observing a result at least this extreme, assuming all treatments have the same conversion rate?" On the other hand, Thompson sampling tries to answer the question "given the observations, what is the probability that each given arm is optimal?" While both questions are valid, the latter is much easier to understand, and Thompson sampling naturally balances the tradeoff between type I and type II errors.
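The quantity "P(each arm is optimal)" can be estimated directly by sampling the posteriors. Here is a sketch using the example counts from earlier (10/150 and 5/100 conversions over impressions), again assuming a uniform Beta(1, 1) prior:

```python
import random

def prob_each_arm_optimal(successes, failures, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(arm i has the highest conversion rate)
    under independent Beta posteriors for each arm."""
    random.seed(seed)
    k = len(successes)
    wins = [0] * k
    for _ in range(n_samples):
        draws = [random.betavariate(s + 1, f + 1)
                 for s, f in zip(successes, failures)]
        wins[max(range(k), key=lambda i: draws[i])] += 1
    return [w / n_samples for w in wins]

# Treatment 1: 10 conversions / 140 non-conversions (150 impressions).
# Treatment 2:  5 conversions /  95 non-conversions (100 impressions).
probs = prob_each_arm_optimal([10, 5], [140, 95])
print(probs)  # treatment 1 wins roughly two-thirds of the draws
```

These are exactly the probabilities Thompson sampling plays each arm with, so the answer to the business question and the allocation rule coincide.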
3. Multi-armed bandit typically converges much faster
As multi-armed bandit solutions are adaptive, the number of trials required to identify the best arm (if one exists) is typically much lower than the number of trials needed for statistical hypothesis testing. However, this does imply that a separate stopping criterion is needed for the case when all treatments perform about the same.
4. Multi-armed bandit generalizes naturally to multiple treatments
This is really where multi-armed bandit shines. Since multi-armed bandit is adaptive, it can quickly determine which arms are less likely to be optimal and play those inferior arms with lower probability. On the other hand, in statistical hypothesis testing, each treatment is allocated the same number of trials, determined by power analysis before the experiment begins.
To sum up, multi-armed bandit has many practical advantages over traditional statistical hypothesis testing. It typically converges much faster, leaves less room for misinterpretation, generalizes better to multiple treatments, and requires fewer tuning parameters. For startups that are serious about A/B testing, multi-armed bandit is strongly recommended.
If you like the post, you can follow me (@chengtao_chu) on Twitter or subscribe to my blog "ML in the Valley". Also, special thanks to Ian Wong (@ihat) and Bob Ren (@bobrenjc93) for reading a draft of this.