A/B testing is a randomized controlled experiment that lets us test whether there is a causal relationship between a change to a website/app and user behavior. The change can be visible, such as the location of a button on the homepage, or invisible, such as the ranking/recommendation algorithms or the backend infrastructure.
Web/mobile developers and business stakeholders always face the following dilemma: should we continuously try out all ideas and explore every option, or should we exploit the best available option and stick to it? The answer, as in most cases, is a trade-off between the two extremes. If we explore all the time, we'll collect a lot of data but waste resources testing inferior ideas and losing sales (in the e-commerce case). However, if we only exploit the available option and never try new ideas, we'll be left behind and lose in the long term as markets keep changing.
In this series, we'll explore solutions offered by Multi-armed Bandit algorithms, which have two main advantages over traditional A/B testing:
What are Bandit Algorithms? Bandit algorithms learn a rule for selecting a sequence of options that balances exploring the available options, to gain enough knowledge about each one, against exploiting the best-known option to maximize profits. Note that during the experiment we only gain knowledge about the options we actually try. Therefore, every time we select an option that is not the best one, we incur an opportunity cost for not selecting the best option; however, we also gain new knowledge (feedback) about the selected option. In other words, we need enough feedback about each option to learn which one is best. As a result, a good strategy is to explore more at the beginning of the experiment, until we have identified the best option, and then exploit that option.
The epsilon-Greedy algorithm makes this trade-off explicit: with probability $\epsilon$ it explores by picking one of the $N$ available options uniformly at random, and with probability $1 - \epsilon$ it exploits the option with the highest estimated value. As a result, the probability of selecting any given option during exploration is $\epsilon \frac{1}{N}$, while the probability of exploiting the current best option is $1 - \epsilon$ (see figure 1).
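Note that the best option can also be picked during the random exploration step, so the overall probability of selecting it is slightly higher than $1 - \epsilon$:

$$P(\text{select best option}) = (1 - \epsilon) + \frac{\epsilon}{N}$$

For example, with $\epsilon = 0.1$ and $N = 5$, the current best option is selected with probability $0.9 + 0.1 \times \frac{1}{5} = 0.92$.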
Let's import the needed packages and implement the algorithm.
# Import packages
import os
import sys
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# Add module path to system path
sys.path.append(os.path.abspath("../"))
from utils import plot_algorithm, compare_algorithms
%matplotlib inline
plt.style.use("fivethirtyeight")
sns.set_context("notebook")
class EpsilonGreedy:
def __init__(self, epsilon, counts=None, values=None):
self.epsilon = epsilon
self.counts = counts
self.values = values
def initialize(self, n_arms):
self.counts = np.zeros(n_arms, dtype=int)
self.values = np.zeros(n_arms, dtype=float)
def select_arm(self):
z = np.random.random()
if z > self.epsilon:
# Pick the best arm
return np.argmax(self.values)
        # Otherwise explore: pick an arm uniformly at random
return np.random.randint(0, len(self.values))
def update(self, chosen_arm, reward):
# Increment chosen arm's count by one
self.counts[chosen_arm] += 1
n = self.counts[chosen_arm]
# Recompute the estimated value of chosen arm using new reward
value = self.values[chosen_arm]
new_value = value * ((n - 1) / n) + reward / n
self.values[chosen_arm] = new_value
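Before running full simulations, here is a minimal usage sketch of the class; the arm indices and rewards below are made up purely for illustration:

# Minimal usage sketch (illustrative rewards only)
algo = EpsilonGreedy(epsilon=0.1)
algo.initialize(n_arms=2)
arm = algo.select_arm()       # index of the arm to play next
algo.update(arm, reward=1.0)  # feed back the observed reward
algo.update(0, reward=0.0)
algo.update(1, reward=1.0)
print(algo.counts, algo.values)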
A few things to note from the above implementation:
We'll define the reward probability of each option and test the performance of the algorithm using three different metrics:
Moreover, we'll evaluate the algorithm using 5 different values of $\epsilon$: $0.1, 0.2, 0.3, 0.4, 0.5$. Since the literature uses the term arm instead of option for historical reasons, we'll use arm and option interchangeably.
class BernoulliArm:
def __init__(self, p):
self.p = p
def draw(self):
z = np.random.random()
if z > self.p:
return 0.0
return 1.0
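As a quick sanity check, the empirical mean of many draws from a BernoulliArm should be close to its p; a minimal sketch:

np.random.seed(0)
arm = BernoulliArm(p=0.3)
draws = [arm.draw() for _ in range(10000)]
print(np.mean(draws))  # should be close to 0.3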
def test_algorithm(algo, arms, num_simulations, horizon):
# Initialize rewards and chosen_arms with zero 2d arrays
chosen_arms = np.zeros((num_simulations, horizon))
rewards = np.zeros((num_simulations, horizon))
# Loop over all simulations
for sim in range(num_simulations):
# Re-initialize algorithm's counts and values arrays
algo.initialize(len(arms))
        # Loop over the time horizon
for t in range(horizon):
# Select arm
chosen_arm = algo.select_arm()
chosen_arms[sim, t] = chosen_arm
# Draw from Bernoulli distribution to get rewards
reward = arms[chosen_arm].draw()
rewards[sim, t] = reward
            # Update the algorithm's counts and estimated values
algo.update(chosen_arm, reward)
# Average rewards across all sims and compute cumulative rewards
average_rewards = np.mean(rewards, axis=0)
cumulative_rewards = np.cumsum(average_rewards)
return chosen_arms, average_rewards, cumulative_rewards
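The plot_algorithm helper used below presumably wraps this simulation loop; test_algorithm can also be called directly, as in this minimal sketch with two arms:

algo = EpsilonGreedy(epsilon=0.1)
arms = [BernoulliArm(0.1), BernoulliArm(0.9)]
chosen_arms, average_rewards, cumulative_rewards = test_algorithm(
    algo, arms, num_simulations=100, horizon=200)
print(average_rewards[-1], cumulative_rewards[-1])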
np.random.seed(1)
# Average reward by arm
means = [0.1, 0.1, 0.1, 0.1, 0.9]
n_arms = len(means)
# Shuffle the arms
np.random.shuffle(means)
# Each arm will follow a Bernoulli distribution
arms = list(map(lambda mu: BernoulliArm(mu), means))
# Get the index of the best arm to test whether the algorithm can learn it
best_arm_index = np.argmax(means)
# Define epsilon values to check the performance of the algorithm using each one
epsilon = [0.1, 0.2, 0.3, 0.4, 0.5]
# Plot the epsilon-Greedy algorithm
plot_algorithm(alg_name="epsilon-Greedy", arms=arms, best_arm_index=best_arm_index,
hyper_params=epsilon, num_simulations=5000, horizon=500, label="eps")
A few things to note from the above graphs:
Let's run the experiment again to see how the algorithm behaves under the following settings:
np.random.seed(1)
# Average reward by arm
means = [0.1, 0.9]
n_arms = len(means)
# Shuffle the arms
np.random.shuffle(means)
# Each arm will follow a Bernoulli distribution
arms = list(map(lambda mu: BernoulliArm(mu), means))
# Get the index of the best arm to test whether the algorithm can learn it
best_arm_index = np.argmax(means)
# Define epsilon values to check the performance of the algorithm using each one
epsilon = [0.1, 0.2, 0.3, 0.4, 0.5]
# Plot the epsilon-Greedy algorithm
plot_algorithm(alg_name="epsilon-Greedy", arms=arms, best_arm_index=best_arm_index,
hyper_params=epsilon, num_simulations=5000, horizon=500, label="eps")
np.random.seed(1)
# Average reward by arm
means = list(np.random.random(50))
n_arms = len(means)
# Shuffle the arms
np.random.shuffle(means)
# Each arm will follow a Bernoulli distribution
arms = list(map(lambda mu: BernoulliArm(mu), means))
# Get the index of the best arm to test whether the algorithm can learn it
best_arm_index = np.argmax(means)
# Define epsilon values to check the performance of the algorithm using each one
epsilon = [0.1, 0.2, 0.3, 0.4, 0.5]
# Plot the epsilon-Greedy algorithm
plot_algorithm(alg_name="epsilon-Greedy", arms=arms, best_arm_index=best_arm_index,
hyper_params=epsilon, num_simulations=5000, horizon=250, label="eps")
np.random.seed(1)
# Average reward by arm
means = [0.2, 0.18, 0.22, 0.19, 0.21]
n_arms = len(means)
# Shuffle the arms
np.random.shuffle(means)
# Each arm will follow a Bernoulli distribution
arms = list(map(lambda mu: BernoulliArm(mu), means))
# Get the index of the best arm to test whether the algorithm can learn it
best_arm_index = np.argmax(means)
# Define epsilon values to check the performance of the algorithm using each one
epsilon = [0.1, 0.2, 0.3, 0.4, 0.5]
# Plot the epsilon-Greedy algorithm
plot_algorithm(alg_name="epsilon-Greedy", arms=arms, best_arm_index=best_arm_index,
hyper_params=epsilon, num_simulations=5000, horizon=500, label="eps")
We'll evaluate the annealing version using the same settings as before and compare it to the standard version.
class AnnealingEpsilonGreedy(EpsilonGreedy):
def __init__(self, counts=None, values=None):
self.counts = counts
self.values = values
def select_arm(self):
# Epsilon decay schedule
t = np.sum(self.counts) + 1
epsilon = 1 / np.log(t + 0.0000001)
z = np.random.random()
if z > epsilon:
# Pick the best arm
return np.argmax(self.values)
        # Otherwise explore: pick an arm uniformly at random
return np.random.randint(0, len(self.values))
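To get a feel for how fast this schedule decays, here's a quick sketch of the effective epsilon at a few time steps; any value of epsilon >= 1 (as happens early on) means the algorithm always explores:

# Effective epsilon under the 1 / log(t) decay schedule
for t in [1, 2, 5, 10, 100, 1000, 10000]:
    print(t, 1 / np.log(t + 0.0000001))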
np.random.seed(1)
# Average reward by arm
means = [0.1, 0.1, 0.1, 0.1, 0.9]
n_arms = len(means)
# Shuffle the arms
np.random.shuffle(means)
# Each arm will follow a Bernoulli distribution
arms = list(map(lambda mu: BernoulliArm(mu), means))
# Get the index of the best arm to test whether the algorithm can learn it
best_arm_index = np.argmax(means)
# Plot the Annealing epsilon-Greedy algorithm
plot_algorithm(alg_name="Annealing epsilon-Greedy", arms=arms, best_arm_index=best_arm_index,
num_simulations=5000, horizon=500)
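The compare_algorithms helper imported earlier presumably produces a more polished comparison; as a self-contained alternative, here's a minimal sketch that compares the cumulative rewards of the two versions directly via test_algorithm, reusing the arms defined above:

np.random.seed(1)
for name, algo in [("epsilon-Greedy (eps=0.1)", EpsilonGreedy(epsilon=0.1)),
                   ("Annealing epsilon-Greedy", AnnealingEpsilonGreedy())]:
    _, _, cumulative_rewards = test_algorithm(algo, arms, num_simulations=1000, horizon=500)
    plt.plot(cumulative_rewards, label=name)
plt.xlabel("Time step")
plt.ylabel("Cumulative reward")
plt.legend()
plt.show()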
Even though the accuracy of selecting the best option and the average rewards of the Annealing epsilon-Greedy algorithm are lower than those of the standard version, it achieves higher cumulative rewards. Also, since the real world is uncertain and we may have no prior knowledge about the options being tested, the annealing version may be preferable in some scenarios.