Multi-armed Bandits¶

Focus of this talk¶

1. Not Math
2. Not Programming
3. Problem Solving
4. Under-utilized methods
5. Showcase IPython Notebooks

Background – What we do as a company¶

1. Custom ad campaigns for agencies and large companies
2. Targeting platform
  • Hyper-local geo targeting
  • Contextual and performance-based media targeting

Background – What does our system do¶

• Real-time bidding handling 12B-15B ad requests
• We want to serve several hundred million ads
• Data intelligence platform
• Offline and online models to create scored lists for:
  • Users
  • Media
  • Geography
  • Time
• Auction theory
• Bidders execute on scores in under 50 ms round trip

Background – What we do as data scientists¶

1. Focus
• Machine learning
• Automation
• Scale
2. Process
• Prototype examples
• Implement in production if possible

Background – The Media Optimization Problem¶

We have access to many millions of domains, each of which can be broken into smaller segments or even individual URLs. For each ad campaign, how do we pick which parts of which domains to serve ads on? And how much do we value each of those parts?

Why?¶

1. Find the best sites to show ads
2. Always serve the best ads possible
3. Learn performance optimally
4. Beat the competition

Multi-armed bandits (MAB)¶

A MAB is an online algorithm which chooses from a set of strategies in a sequence of trials so as to maximize the total payoff of the chosen strategies.

Rephrased, it is the problem a gambler faces at a row of slot machines, sometimes known as "one-armed bandits", when deciding which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.$^1$

$^1$ Wikipedia

Applications¶

1. Website configuration
2. Finance
3. A/B testing
4. Evolutionary theory

In general, any system that emphasizes continuous improvement, where products remain in a perpetual state of testing and exploration.

Strategies¶

1. Greedy - Always play the arm you think is best
2. Epsilon-greedy - Play the best arm for a proportion (1 - epsilon) of the trials and explore a random arm otherwise
3. Epsilon-first - Use the first epsilon proportion of plays to explore the arms and thereafter exploit
4. Bayesian probabilistic matching - Play an arm based on the probability that it is the best arm (sketched below)
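Strategy 4 is often implemented with Thompson sampling: draw one value from each arm's Beta posterior and play the arm with the largest draw, which picks each arm with exactly the probability that it is the best. A self-contained sketch with made-up posterior counts (this notebook's own implementation appears in the BPM section below):

In [ ]:
import numpy as np

# Made-up posterior pseudo-counts for three arms
successes = np.array([3., 10., 2.])
failures = np.array([5., 4., 2.])

# One draw per arm from Beta(successes, failures); play the argmax
draws = np.random.beta(successes, failures)
print np.argmax(draws)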
In [42]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy.random as rand
import numpy as np
%matplotlib inline
from scipy.stats import beta
import random
import scipy.integrate as integrate
import math

import time
from bokeh.plotting import *
from bokeh.objects import Glyph, Range1d
output_notebook(url="default")


Modeling Random Events¶

Pulling a bandit's arm once is a Bernoulli trial with success probability $p$

$$f(x;p) = p^x(1-p)^{1-x} \hskip2em \text{ for } x \in \{0,1\}$$

The sum of $n$ independent, identically distributed Bernoulli random variables is binomial

$$f(x;n,p) = \binom{n}{x} p^x(1-p)^{n-x} \hskip2em \text{ for } x \in \{0,1,2,\ldots,n\}$$

The beta distribution is the conjugate prior of the binomial distribution, so a binomial likelihood applied to a beta prior yields another beta posterior and updating reduces to adding counts

$$f(x;\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$$

More on the Beta Distribution¶

The mode of the beta distribution is given by

$$\frac{\alpha-1}{\alpha+\beta-2}$$

where $\alpha - 1$ is the count of successful trials and $\beta - 1$ is the count of failed trials.
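For example, 7 successes and 3 failures give $\alpha = 8$ and $\beta = 4$, so the mode is $7/10 = 0.7$, matching the empirical success rate.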

You can also update the beta distribution with counts from new trials, $\alpha_{new}$ and $\beta_{new}$, to get

$$\frac{\Gamma(\alpha'+\beta')}{\Gamma(\alpha')\Gamma(\beta')}x^{\alpha'-1}(1-x)^{\beta'-1}$$

where $\alpha' = \alpha+\alpha_{new}$ and $\beta' = \beta+\beta_{new}$.
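A quick numeric check of this update with scipy.stats (the counts here are made up for illustration):

In [ ]:
# Prior Beta(1, 1) plus 7 successes and 3 failures observed so far
a, b = 1 + 7, 1 + 3

# A new batch of trials arrives: 2 successes and 8 failures
aNew, bNew = 2, 8

posterior = beta(a + aNew, b + bNew)                  # Beta(alpha', beta')
print posterior.mean()                                # posterior mean success probability
print (a + aNew - 1.0)/(a + aNew + b + bNew - 2.0)    # mode, per the formula above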

In [5]:
def betaPlot(prob = 0.5, N = 100):

    figure(tools="",
           title = 'Beta Distribution',
           x_axis_label = 'Success Probability',
           y_axis_label = 'PDF',
           x_range=Range1d(start=0, end=1))

    hold()

    # Start from a uniform Beta(1, 1) prior
    a = 1
    b = 1
    x = np.linspace(0, 1, 100)

    line(x, beta.pdf(x,a,b), color="#2E2EFE", plot_width=800, plot_height=500, legend='Beta Posterior')
    line([prob,prob],[0,10], color="#FE2E2E", line_dash = 'dashed', legend='True Mean')

    grid().grid_line_alpha=0.3

    renderer1 = [r for r in curplot().renderers if isinstance(r, Glyph)][0]
    ds1 = renderer1.data_source

    show()

    for i in range(N):
        # One Bernoulli trial with the true success probability
        test = rand.rand() < prob
        if i < 6:
            print 'a = %s and b = %s' %(a,b)
            raw_input("")
            print 'Payoff: %s' %test.real
        # Conjugate update: success increments alpha, failure increments beta
        a += test
        b += not(test)
        ds1.data["x"] = x
        ds1.data["y"] = beta.pdf(x,a,b)
        ds1._dirty = True
        cursession().store_objects(ds1)
        time.sleep(0.1)


In [ ]:
betaPlot()


Create a Multi-Armed Bandit Object¶

In [6]:
class MAB(object):

    def __init__(self,numArms):
        # True (hidden) success probability for each arm
        self.armPayoffs = rand.uniform(0,1,numArms)
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per arm
        self.successes = np.ones(numArms)
        self.failures = np.ones(numArms)

    def pullArm(self,arm):
        test = rand.uniform()
        result = test < self.armPayoffs[arm]
        # Would the same draw have paid off on the best arm?
        bestResult = test < max(self.armPayoffs)
        regret = bestResult - result
        self.successes[arm] += result
        self.failures[arm] += not(result)
        return result, regret
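A quick illustrative check of the class (outputs vary from run to run):

In [ ]:
mab = MAB(3)
result, regret = mab.pullArm(0)
print result, regret                  # e.g. True 0: arm 0 paid off, so no regret on this pull
print mab.successes, mab.failures     # pseudo-counts after one pull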



So How Do You Measure the Performance Over Time?¶

The standard is to measure cumulative $regret$

$$R_T = \sum^T_{i=1}(m^*-m_{B(i)})$$

where $m^*$ is the payout from the optimal arm and $m_{B(i)}$ is the payout from the arm $B(i)$ chosen at trial $i$
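The simulations below accumulate per-pull regret with np.cumsum. A toy example of the bookkeeping (regret indicators made up):

In [ ]:
# 1 = the best arm would have paid off on this draw but the chosen arm did not
perPullRegret = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 0])
print np.cumsum(perPullRegret)   # cumulative regret R_T after each pull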

Create a Baseline Strategy¶

In [7]:
def strategyStratifiedSample(mab):
    # Pure exploration: play the least-sampled arm, breaking ties at random
    samples = mab.successes + mab.failures
    bestEstArm = samples == min(samples)
    return random.sample(np.where(bestEstArm)[0], 1)[0]

In [8]:
def mab3Plot(strategy, pulls = 100):

    figure(tools="",
           title = 'Multi-Armed Bandit',
           x_axis_label = 'Success Probability',
           y_axis_label = 'PDF',
           x_range=Range1d(start=0, end=1))

    hold()

    mab = MAB(3)
    x = np.linspace(0, 1, 100)

    line(x, beta.pdf(x,mab.successes[0],mab.failures[0]), color="#2E2EFE", plot_width=800, plot_height=500, legend='Bandit 1 Posterior')
    line([mab.armPayoffs[0],mab.armPayoffs[0]],[0,10], color="#2E2EFE", line_dash = 'dashed', legend='Bandit 1 Truth')

    line(x, beta.pdf(x,mab.successes[1],mab.failures[1]), color="#AC58FA", plot_width=800, plot_height=500, legend='Bandit 2 Posterior')
    line([mab.armPayoffs[1],mab.armPayoffs[1]],[0,10], color="#AC58FA", line_dash = 'dashed', legend='Bandit 2 Truth')

    line(x, beta.pdf(x,mab.successes[2],mab.failures[2]), color="#FE642E", plot_width=800, plot_height=500, legend='Bandit 3 Posterior')
    line([mab.armPayoffs[2],mab.armPayoffs[2]],[0,10], color="#FE642E", line_dash = 'dashed', legend='Bandit 3 Truth')

    grid().grid_line_alpha=0.3

    # Posterior line glyphs sit at even renderer indices (0, 2, 4)
    renderer0 = [r for r in curplot().renderers if isinstance(r, Glyph)][0]
    ds0 = renderer0.data_source

    renderer2 = [r for r in curplot().renderers if isinstance(r, Glyph)][2]
    ds2 = renderer2.data_source

    renderer4 = [r for r in curplot().renderers if isinstance(r, Glyph)][4]
    ds4 = renderer4.data_source

    show()

    for i in range(pulls):
        arm = strategy(mab)
        result, regret = mab.pullArm(arm)
        ds0.data["x"] = x
        ds0.data["y"] = beta.pdf(x,mab.successes[0],mab.failures[0])
        ds0._dirty = True

        ds2.data["x"] = x
        ds2.data["y"] = beta.pdf(x,mab.successes[1],mab.failures[1])
        ds2._dirty = True

        ds4.data["x"] = x
        ds4.data["y"] = beta.pdf(x,mab.successes[2],mab.failures[2])
        ds4._dirty = True

        cursession().store_objects(ds0)
        cursession().store_objects(ds2)
        cursession().store_objects(ds4)
        time.sleep(0.01)

In [44]:
mab3Plot(strategy=strategyStratifiedSample)


Simulate a Basic Experiment¶

In [9]:
def stratifiedRegret(cdata):
    # Note: `pulls` comes from the enclosing notebook scope (defined in the next cell)
    figure(tools="",
           title = 'Multi-Armed Bandit Regret',
           x_axis_label = 'Iteration',
           y_axis_label = 'Regret',
           x_range=Range1d(start=0, end=pulls),
           y_range=Range1d(start=0, end=max(cdata)))

    hold()

    line(range(pulls), cdata, color="#FF4000", plot_width=800, plot_height=500, legend='Stratified Sample')

    grid().grid_line_alpha=0.3
    legend()[0].orientation = "top_left"
    show()

In [10]:
mab = MAB(3)
pulls = 1000
data = np.zeros(pulls)

for i in range(pulls):
    arm = strategyStratifiedSample(mab)
    result, regret = mab.pullArm(arm)
    data[i] = regret

cdata = np.cumsum(data)

print mab.armPayoffs

stratifiedRegret(cdata)

[ 0.26952967  0.25392393  0.53431951]


Greedy Strategy¶

In [11]:
def strategyGreedy(mab):
    # Pure exploitation: play the arm with the highest empirical success rate,
    # breaking ties at random
    rates = mab.successes/(mab.successes + mab.failures)
    bestEstArm = rates == max(rates)
    return random.sample(np.where(bestEstArm)[0], 1)[0]

In [45]:
mab3Plot(strategy=strategyGreedy)


Epsilon Greedy Strategy¶

In [12]:
def strategyEpsilonGreedy(mab, epsilon = 0.1):
    # Explore a uniformly random arm with probability epsilon, otherwise exploit
    if random.random() < epsilon:
        return random.randint(0,len(mab.successes)-1)
    else:
        return strategyGreedy(mab)


Epsilon First Strategy¶

In [13]:
def strategyEpsilonFirst(mab, pulls = 100, epsilon = 0.1):
    # Pure exploration for the first epsilon*pulls trials, pure exploitation after
    trials = pulls*epsilon
    samples = mab.successes + mab.failures
    if sum(samples) < trials:
        return strategyStratifiedSample(mab)
    else:
        return strategyGreedy(mab)

In [ ]:
mab3Plot(strategy=strategyEpsilonFirst)

In [14]:
def basicRegretPlot(data):

    randBandit = np.cumsum(data.StratifiedSample.astype(float))
    greedyBandit = np.cumsum(data.Greedy.astype(float))
    epsGreedyBandit = np.cumsum(data.EpsilonGreedy.astype(float))
    epsFirstBandit = np.cumsum(data.EpsilonFirst.astype(float))

    figure(tools="",
           title = 'Multi-Armed Bandit Regret',
           x_axis_label = 'Iteration',
           y_axis_label = 'Regret',
           x_range=Range1d(start=0, end=len(data)),
           y_range=Range1d(start=0, end=max(randBandit)))

    hold()

    line(range(len(randBandit)), randBandit, color="#FF4000", plot_width=800, plot_height=500, legend='Stratified Sample')
    line(range(len(randBandit)), greedyBandit, color="#AC58FA", plot_width=800, plot_height=500, legend='Greedy')
    line(range(len(randBandit)), epsGreedyBandit, color="#0080FF", plot_width=800, plot_height=500, legend='Epsilon Greedy')
    line(range(len(randBandit)), epsFirstBandit, color="#01DF01", plot_width=800, plot_height=500, legend='Epsilon First')

    grid().grid_line_alpha=0.3
    legend()[0].orientation = "top_left"
    show()


Tweak the MAB Class¶

In [15]:
class MABs(object):
    # Same as MAB, but payoffs and the uniform draw are supplied externally
    # so competing strategies can be compared on identical random inputs

    def __init__(self,armPayoffs):
        self.armPayoffs = armPayoffs
        self.successes = np.ones(len(armPayoffs))
        self.failures = np.ones(len(armPayoffs))

    def pullArm(self,arm,test):
        result = test < self.armPayoffs[arm]
        bestResult = test < max(self.armPayoffs)
        regret = bestResult - result
        self.successes[arm] += result
        self.failures[arm] += not(result)
        return result, regret


Compare Strategies¶

In [16]:
def genBasicData(numArms = 3, pulls = 100):

    armPayoffs = rand.uniform(0,1,numArms)

    # One bandit per strategy, all sharing the same true payoffs
    mab0 = MABs(armPayoffs)
    mab1 = MABs(armPayoffs)
    mab2 = MABs(armPayoffs)
    mab3 = MABs(armPayoffs)

    cols = ['StratifiedSample','Greedy','EpsilonGreedy','EpsilonFirst']
    data = pd.DataFrame(columns=cols)

    for i in range(pulls):
        # A single uniform draw decides success for every strategy this round
        test = rand.uniform()
        arm0 = strategyStratifiedSample(mab0)
        result0, regret0 = mab0.pullArm(arm0, test)
        arm1 = strategyGreedy(mab1)
        result1, regret1 = mab1.pullArm(arm1, test)
        arm2 = strategyEpsilonGreedy(mab2, 0.1)
        result2, regret2 = mab2.pullArm(arm2, test)
        arm3 = strategyEpsilonFirst(mab3, pulls, 0.1)
        result3, regret3 = mab3.pullArm(arm3, test)

        row = pd.DataFrame([dict(StratifiedSample = regret0,
                                 Greedy = regret1,
                                 EpsilonGreedy = regret2,
                                 EpsilonFirst = regret3)])
        data = data.append(row, ignore_index=True)

    #print mab0.armPayoffs
    return data

In [43]:
data = genBasicData()
basicRegretPlot(data)


Average Performance¶

In [18]:
def basicSimulation(sims = 100):
    data = genBasicData()
    for i in range(1,sims):
        data += genBasicData()
    # Average regret per pull across the simulations
    return data/float(sims)


In [19]:
data = basicSimulation()

In [20]:
basicRegretPlot(data)


Bayesian Probabilistic Matching¶

The binomial bandit assumes that the rewards in each configuration are independent Bernoulli random variables with success probabilities ($\theta_1,...,\theta_k$)$^1$.

Assume a simple uninformative prior of $\theta_a \sim U(0,1)$, with all $\theta_i$ independent.

Let $Y_{at}$ and $N_{at}$ denote the cumulative number of successes and trials observed for arm $a$ up to time $t$. Then the posterior distribution of $\theta = (\theta_1,...,\theta_k)$ is

$$p(\theta|\boldsymbol y_t) = \prod^k_{a=1}B(\theta_a|Y_{at}+1,N_{at}-Y_{at}+1)$$

Then the probability that arm $a$ is the optimal arm, given all the information at time $t$, is

$$w_{at} = \int^1_0 B(\theta_a|Y_{at}+1,N_{at}-Y_{at}+1) \prod^k_{j \ne a} P(\theta_j<\theta_a|Y_{jt}+1,N_{jt}-Y_{jt}+1)\,d\theta_a$$

$^1$ 'A modern Bayesian look at the multi-armed bandit' - Steven Scott

Bayesian Probabilistic Matching in Code¶

In [21]:
def bpmProbs(mab):
    # Recover raw success/trial counts from the Beta(1, 1) pseudo-counts
    y = mab.successes-1
    n = mab.successes + mab.failures - 2

    k = len(y)
    ans = np.zeros(k)

    for i in range(k):
        indx = range(k)
        del indx[i]

        # Integrand for w_a: arm i's posterior density times the
        # probability that every other arm's theta falls below x
        def f(x):
            r = beta.pdf(x, y[i]+1, n[i]-y[i]+1)
            for j in indx:
                r = r * beta.cdf(x, y[j]+1, n[j]-y[j]+1)
            return r

        ans[i] = integrate.quad(f, 0, 1)[0]

    # Normalize to guard against numerical integration error
    return ans/sum(ans)
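Numerical quadrature gets expensive as the number of arms grows. A Monte Carlo alternative (a sketch, not the implementation used in the rest of this notebook) estimates the same weights by sampling the posteriors and counting wins:

In [ ]:
def bpmProbsMC(mab, draws = 10000):
    # Sample each arm's Beta posterior; the pseudo-counts already include the Beta(1, 1) prior
    samples = np.array([beta.rvs(s, f, size=draws)
                        for s, f in zip(mab.successes, mab.failures)])
    # Fraction of draws on which each arm produced the largest sample
    wins = np.bincount(samples.argmax(axis=0), minlength=len(mab.successes))
    return wins/float(draws)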


Bayesian Probabilistic Matching in Code continued¶

In [22]:
def weightedSample(weights):
    '''Return the index to sample based on weight vector'''
    r = random.random()
    cweight = np.cumsum(weights)
    choice = min(cweight[np.where(cweight >= r)])
    return np.where(cweight == choice)[0][0]
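For comparison, numpy ships an equivalent sampler (assuming the weights sum to one):

In [ ]:
weights = [0.333, 0.333, 0.334]
print np.random.choice(len(weights), p=weights)   # index drawn with the given probabilities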

In [ ]:
pd.Series([weightedSample([0.333,0.333,0.334]) for _ in range(10000)]).hist(figsize=(12,7))
#pd.Series([weightedSample([0.8,0.1,0.1]) for _ in range(10000)]).hist(figsize=(12,7))


Bayesian Probabilistic Matching in Code continued¶

In [23]:
def StrategyBPM(mab):
    weights = bpmProbs(mab)
    return weightedSample(weights)

In [46]:
mab3Plot(strategy=StrategyBPM)

In [24]:
def genBasicData2(numArms = 3, pulls = 100):

    armPayoffs = rand.uniform(0,1,numArms)

    mab0 = MABs(armPayoffs)
    mab1 = MABs(armPayoffs)
    mab2 = MABs(armPayoffs)
    mab3 = MABs(armPayoffs)
    mab4 = MABs(armPayoffs)

    cols = ['StratifiedSample','Greedy','EpsilonGreedy','EpsilonFirst','BPM']
    data = pd.DataFrame(columns=cols)

    for i in range(pulls):
        test = rand.uniform()
        arm0 = strategyStratifiedSample(mab0)
        result0, regret0 = mab0.pullArm(arm0, test)
        arm1 = strategyGreedy(mab1)
        result1, regret1 = mab1.pullArm(arm1, test)
        arm2 = strategyEpsilonGreedy(mab2, 0.1)
        result2, regret2 = mab2.pullArm(arm2, test)
        arm3 = strategyEpsilonFirst(mab3, pulls, 0.1)
        result3, regret3 = mab3.pullArm(arm3, test)
        arm4 = StrategyBPM(mab4)
        result4, regret4 = mab4.pullArm(arm4, test)

        row = pd.DataFrame([dict(StratifiedSample = regret0,
                                 Greedy = regret1,
                                 EpsilonGreedy = regret2,
                                 EpsilonFirst = regret3,
                                 BPM = regret4)])
        data = data.append(row, ignore_index=True)

    #print mab0.armPayoffs
    return data

In [25]:
def basicSimulation2(sims = 100):
    data = genBasicData2()
    dataseries = []
    dataseries.append(data.sum())
    for i in range(1,sims):
        d = genBasicData2()
        data += d
        dataseries.append(d.sum())
    # Average across simulations; dataseries keeps each run's total regret per strategy
    return data/float(sims), dataseries


In [26]:
def basicRegretPlot2(data):

    randBandit = np.cumsum(data.StratifiedSample.astype(float))
    greedyBandit = np.cumsum(data.Greedy.astype(float))
    epsGreedyBandit = np.cumsum(data.EpsilonGreedy.astype(float))
    epsFirstBandit = np.cumsum(data.EpsilonFirst.astype(float))
    BPMBandit = np.cumsum(data.BPM.astype(float))

    figure(tools="",
           title = 'Multi-Armed Bandit Regret',
           x_axis_label = 'Iteration',
           y_axis_label = 'Regret',
           x_range=Range1d(start=0, end=len(data)),
           y_range=Range1d(start=0, end=max(randBandit)))

    hold()

    line(range(len(randBandit)), randBandit, color="#FF4000", plot_width=800, plot_height=500, legend='Stratified Sample')
    line(range(len(randBandit)), greedyBandit, color="#AC58FA", plot_width=800, plot_height=500, legend='Greedy')
    line(range(len(randBandit)), epsGreedyBandit, color="#0080FF", plot_width=800, plot_height=500, legend='Epsilon Greedy')
    line(range(len(randBandit)), epsFirstBandit, color="#01DF01", plot_width=800, plot_height=500, legend='Epsilon First')
    line(range(len(randBandit)), BPMBandit, color="#FF0000", plot_width=800, plot_height=500, legend='BPM')

    grid().grid_line_alpha=0.3
    legend()[0].orientation = "top_left"
    show()

In [27]:
data, dataseries = basicSimulation2()

In [28]:
basicRegretPlot2(data)


A Closer Look¶

In [29]:
pd.concat(dataseries,axis=1).mean(axis=1)

Out[29]:
BPM                  4.83
EpsilonFirst         4.63
EpsilonGreedy        5.25
Greedy               5.15
StratifiedSample    25.72
dtype: float64
In [30]:
pd.concat(dataseries,axis=1).var(axis=1)

Out[30]:
BPM                   8.869798
EpsilonFirst         62.437475
EpsilonGreedy        26.593434
Greedy               91.684343
StratifiedSample    194.789495
dtype: float64

What Happens When The Payoffs Change?¶

In [31]:
def genBasicData3(numArms = 3, pulls = 100):

    armPayoffs = rand.uniform(0,1,numArms)

    mab0 = MABs(armPayoffs)
    mab1 = MABs(armPayoffs)
    mab2 = MABs(armPayoffs)
    mab3 = MABs(armPayoffs)
    mab4 = MABs(armPayoffs)

    cols = ['StratifiedSample','Greedy','EpsilonGreedy','EpsilonFirst','BPM']
    data = pd.DataFrame(columns=cols)

    for i in range(pulls):

        # Halfway through, silently swap in brand-new payoffs
        if i == pulls/2:
            mab0.armPayoffs = mab1.armPayoffs = mab2.armPayoffs = \
                mab3.armPayoffs = mab4.armPayoffs = rand.uniform(0,1,numArms)

        test = rand.uniform()
        arm0 = strategyStratifiedSample(mab0)
        result0, regret0 = mab0.pullArm(arm0, test)
        arm1 = strategyGreedy(mab1)
        result1, regret1 = mab1.pullArm(arm1, test)
        arm2 = strategyEpsilonGreedy(mab2, 0.1)
        result2, regret2 = mab2.pullArm(arm2, test)
        arm3 = strategyEpsilonFirst(mab3, pulls, 0.1)
        result3, regret3 = mab3.pullArm(arm3, test)
        arm4 = StrategyBPM(mab4)
        result4, regret4 = mab4.pullArm(arm4, test)

        row = pd.DataFrame([dict(StratifiedSample = regret0,
                                 Greedy = regret1,
                                 EpsilonGreedy = regret2,
                                 EpsilonFirst = regret3,
                                 BPM = regret4)])
        data = data.append(row, ignore_index=True)

    #print mab0.armPayoffs
    return data

def basicSimulation3(sims = 100):
    data = genBasicData3()
    for i in range(1,sims):
        data += genBasicData3()
    return data/float(sims)

In [32]:
data = basicSimulation3()

In [33]:
basicRegretPlot2(data)

In [34]:
def mab3PlotVarying(strategy, pulls = 100):

    figure(tools="",
           title = 'Multi-Armed Bandit',
           x_axis_label = 'Success Probability',
           y_axis_label = 'PDF',
           x_range=Range1d(start=0, end=1))

    hold()

    # Arms 1 and 3 drift along sine/cosine curves; arm 2 stays fixed at 0.5
    armPayoffs = np.array([0, 0.5, 0])
    x1 = np.linspace(0, 1.5, 100)
    payoff0 = map(math.sin, x1)
    payoff2 = map(math.cos, x1)
    armPayoffs[0] = payoff0[0]
    armPayoffs[2] = payoff2[0]
    mab = MABs(armPayoffs)

    x = np.linspace(0, 1, 100)

    line(x, beta.pdf(x,mab.successes[0],mab.failures[0]), color="#2E2EFE", plot_width=800, plot_height=500, legend='Bandit 1 Posterior')
    line([mab.armPayoffs[0],mab.armPayoffs[0]],[0,10], color="#2E2EFE", line_dash = 'dashed', legend='Bandit 1 Truth')

    line(x, beta.pdf(x,mab.successes[1],mab.failures[1]), color="#AC58FA", plot_width=800, plot_height=500, legend='Bandit 2 Posterior')
    line([mab.armPayoffs[1],mab.armPayoffs[1]],[0,10], color="#AC58FA", line_dash = 'dashed', legend='Bandit 2 Truth')

    line(x, beta.pdf(x,mab.successes[2],mab.failures[2]), color="#FE642E", plot_width=800, plot_height=500, legend='Bandit 3 Posterior')
    line([mab.armPayoffs[2],mab.armPayoffs[2]],[0,10], color="#FE642E", line_dash = 'dashed', legend='Bandit 3 Truth')

    grid().grid_line_alpha=0.3

    renderer0 = [r for r in curplot().renderers if isinstance(r, Glyph)][0]
    ds0 = renderer0.data_source

    renderer1 = [r for r in curplot().renderers if isinstance(r, Glyph)][1]
    ds1 = renderer1.data_source

    renderer2 = [r for r in curplot().renderers if isinstance(r, Glyph)][2]
    ds2 = renderer2.data_source

    renderer3 = [r for r in curplot().renderers if isinstance(r, Glyph)][3]
    ds3 = renderer3.data_source

    renderer4 = [r for r in curplot().renderers if isinstance(r, Glyph)][4]
    ds4 = renderer4.data_source

    renderer5 = [r for r in curplot().renderers if isinstance(r, Glyph)][5]
    ds5 = renderer5.data_source

    show()

    for i in range(pulls):

        test = rand.uniform()

        # Move the true payoffs before each pull
        armPayoffs[0] = payoff0[i]
        armPayoffs[2] = payoff2[i]

        mab.armPayoffs = armPayoffs

        arm = strategy(mab)
        result, regret = mab.pullArm(arm, test)
        ds0.data["x"] = x
        ds0.data["y"] = beta.pdf(x,mab.successes[0],mab.failures[0])
        ds0._dirty = True

        ds1.data["x"] = [payoff0[i],payoff0[i]]
        ds1.data["y"] = [0,10]
        ds1._dirty = True

        ds2.data["x"] = x
        ds2.data["y"] = beta.pdf(x,mab.successes[1],mab.failures[1])
        ds2._dirty = True

        ds3.data["x"] = [0.5,0.5]
        ds3.data["y"] = [0,10]
        ds3._dirty = True

        ds4.data["x"] = x
        ds4.data["y"] = beta.pdf(x,mab.successes[2],mab.failures[2])
        ds4._dirty = True

        ds5.data["x"] = [payoff2[i],payoff2[i]]
        ds5.data["y"] = [0,10]
        ds5._dirty = True

        cursession().store_objects(ds0)
        cursession().store_objects(ds1)
        cursession().store_objects(ds2)
        cursession().store_objects(ds3)
        cursession().store_objects(ds4)
        cursession().store_objects(ds5)
        time.sleep(0.01)

In [ ]:
mab3PlotVarying(strategy=StrategyBPM)


Let's Decay Bandit Memory¶

In [35]:
class MABdecay(object):
    # Like MABs, but pseudo-counts shrink geometrically so old evidence fades

    def __init__(self,armPayoffs,decay):
        self.armPayoffs = armPayoffs
        self.successes = np.ones(len(armPayoffs))
        self.failures = np.ones(len(armPayoffs))
        self.decay = decay

    def pullArm(self,arm,test):
        result = test < self.armPayoffs[arm]
        bestResult = test < max(self.armPayoffs)
        regret = bestResult - result
        self.successes[arm] += result
        self.failures[arm] += not(result)

        # Decay the pulled arm's counts back toward the Beta(1, 1) prior
        self.successes[arm] = max(1, (1-self.decay)*self.successes[arm])
        self.failures[arm] = max(1, (1-self.decay)*self.failures[arm])

        return result, regret
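Note that pullArm only decays the arm that was actually pulled, so arms a strategy abandons keep their stale posteriors. A common variant for nonstationary bandits (a sketch; the simulations below use MABdecay as written) discounts every arm on every pull:

In [ ]:
class MABdecayAll(object):
    # Hypothetical variant of MABdecay: every arm forgets each step, pulled or not

    def __init__(self,armPayoffs,decay):
        self.armPayoffs = armPayoffs
        self.successes = np.ones(len(armPayoffs))
        self.failures = np.ones(len(armPayoffs))
        self.decay = decay

    def pullArm(self,arm,test):
        result = test < self.armPayoffs[arm]
        bestResult = test < max(self.armPayoffs)
        regret = bestResult - result
        self.successes[arm] += result
        self.failures[arm] += not(result)

        # Decay all arms toward the Beta(1, 1) prior, not just the pulled one
        self.successes = np.maximum(1, (1-self.decay)*self.successes)
        self.failures = np.maximum(1, (1-self.decay)*self.failures)

        return result, regret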

In [36]:
def mab3PlotVarying2(strategy, pulls = 100, decay = 0.1):

    figure(tools="",
           title = 'Multi-Armed Bandit',
           x_axis_label = 'Success Probability',
           y_axis_label = 'PDF',
           x_range=Range1d(start=0, end=1))

    hold()

    # Arms 1 and 3 drift along sine/cosine curves; arm 2 stays fixed at 0.5
    armPayoffs = np.array([0, 0.5, 0])
    x1 = np.linspace(0, 1.5, 100)
    payoff0 = map(math.sin, x1)
    payoff2 = map(math.cos, x1)
    armPayoffs[0] = payoff0[0]
    armPayoffs[2] = payoff2[0]
    mab = MABdecay(armPayoffs, decay)

    x = np.linspace(0, 1, 100)

    line(x, beta.pdf(x,mab.successes[0],mab.failures[0]), color="#2E2EFE", plot_width=800, plot_height=500, legend='Bandit 1 Posterior')
    line([mab.armPayoffs[0],mab.armPayoffs[0]],[0,10], color="#2E2EFE", line_dash = 'dashed', legend='Bandit 1 Truth')

    line(x, beta.pdf(x,mab.successes[1],mab.failures[1]), color="#AC58FA", plot_width=800, plot_height=500, legend='Bandit 2 Posterior')
    line([mab.armPayoffs[1],mab.armPayoffs[1]],[0,10], color="#AC58FA", line_dash = 'dashed', legend='Bandit 2 Truth')

    line(x, beta.pdf(x,mab.successes[2],mab.failures[2]), color="#FE642E", plot_width=800, plot_height=500, legend='Bandit 3 Posterior')
    line([mab.armPayoffs[2],mab.armPayoffs[2]],[0,10], color="#FE642E", line_dash = 'dashed', legend='Bandit 3 Truth')

    grid().grid_line_alpha=0.3

    renderer0 = [r for r in curplot().renderers if isinstance(r, Glyph)][0]
    ds0 = renderer0.data_source

    renderer1 = [r for r in curplot().renderers if isinstance(r, Glyph)][1]
    ds1 = renderer1.data_source

    renderer2 = [r for r in curplot().renderers if isinstance(r, Glyph)][2]
    ds2 = renderer2.data_source

    renderer3 = [r for r in curplot().renderers if isinstance(r, Glyph)][3]
    ds3 = renderer3.data_source

    renderer4 = [r for r in curplot().renderers if isinstance(r, Glyph)][4]
    ds4 = renderer4.data_source

    renderer5 = [r for r in curplot().renderers if isinstance(r, Glyph)][5]
    ds5 = renderer5.data_source

    show()

    for i in range(pulls):

        test = rand.uniform()

        armPayoffs[0] = payoff0[i]
        armPayoffs[2] = payoff2[i]

        mab.armPayoffs = armPayoffs

        arm = strategy(mab)
        result, regret = mab.pullArm(arm, test)
        ds0.data["x"] = x
        ds0.data["y"] = beta.pdf(x,mab.successes[0],mab.failures[0])
        ds0._dirty = True

        ds1.data["x"] = [payoff0[i],payoff0[i]]
        ds1.data["y"] = [0,10]
        ds1._dirty = True

        ds2.data["x"] = x
        ds2.data["y"] = beta.pdf(x,mab.successes[1],mab.failures[1])
        ds2._dirty = True

        ds3.data["x"] = [0.5,0.5]
        ds3.data["y"] = [0,10]
        ds3._dirty = True

        ds4.data["x"] = x
        ds4.data["y"] = beta.pdf(x,mab.successes[2],mab.failures[2])
        ds4._dirty = True

        ds5.data["x"] = [payoff2[i],payoff2[i]]
        ds5.data["y"] = [0,10]
        ds5._dirty = True

        cursession().store_objects(ds0)
        cursession().store_objects(ds1)
        cursession().store_objects(ds2)
        cursession().store_objects(ds3)
        cursession().store_objects(ds4)
        cursession().store_objects(ds5)
        time.sleep(0.01)

In [47]:
mab3PlotVarying2(strategy=StrategyBPM)

In [37]:
def genBasicData4(numArms = 3, pulls = 100, decay = 0.1):

    # Arms 1 and 3 drift along sine/cosine curves; arm 2 stays fixed at 0.5
    armPayoffs = [0, 0.5, 0]
    x = np.linspace(0, 2, 100)
    payoff0 = map(math.sin, x)
    payoff2 = map(math.cos, x)
    armPayoffs[0] = payoff0[0]
    armPayoffs[2] = payoff2[0]

    mab0 = MABdecay(armPayoffs, decay)
    mab1 = MABdecay(armPayoffs, decay)
    mab2 = MABdecay(armPayoffs, decay)
    mab3 = MABdecay(armPayoffs, decay)
    mab4 = MABdecay(armPayoffs, decay)

    cols = ['StratifiedSample','Greedy','EpsilonGreedy','EpsilonFirst','BPM']
    data = pd.DataFrame(columns=cols)

    for i in range(pulls):

        armPayoffs[0] = payoff0[i]
        armPayoffs[2] = payoff2[i]

        mab0.armPayoffs = mab1.armPayoffs = mab2.armPayoffs = \
            mab3.armPayoffs = mab4.armPayoffs = armPayoffs

        test = rand.uniform()
        arm0 = strategyStratifiedSample(mab0)
        result0, regret0 = mab0.pullArm(arm0, test)
        arm1 = strategyGreedy(mab1)
        result1, regret1 = mab1.pullArm(arm1, test)
        arm2 = strategyEpsilonGreedy(mab2, 0.1)
        result2, regret2 = mab2.pullArm(arm2, test)
        arm3 = strategyEpsilonFirst(mab3, pulls, 0.1)
        result3, regret3 = mab3.pullArm(arm3, test)
        arm4 = StrategyBPM(mab4)
        result4, regret4 = mab4.pullArm(arm4, test)

        row = pd.DataFrame([dict(StratifiedSample = regret0,
                                 Greedy = regret1,
                                 EpsilonGreedy = regret2,
                                 EpsilonFirst = regret3,
                                 BPM = regret4)])
        data = data.append(row, ignore_index=True)

    return data

def basicSimulation4(sims = 100):
    data = genBasicData4()
    dataseries = []
    dataseries.append(data.sum())
    for i in range(1,sims):
        d = genBasicData4()
        data += d
        dataseries.append(d.sum())
    return data/float(sims), dataseries

In [38]:
data, dataseries = basicSimulation4()

In [39]:
basicRegretPlot2(data)


A Closer Look Again¶

In [40]:
pd.concat(dataseries,axis=1).mean(axis=1)

Out[40]:
BPM                 13.71
EpsilonFirst        24.70
EpsilonGreedy       14.07
Greedy              14.39
StratifiedSample    37.00
dtype: float64
In [41]:
pd.concat(dataseries,axis=1).var(axis=1)

Out[41]:
BPM                 25.420101
EpsilonFirst        38.191919
EpsilonGreedy       65.479899
Greedy              61.715051
StratifiedSample    14.989899
dtype: float64

Summary¶

1. These methods are simple
2. Surprisingly flexible and extendable
• Time varying
• Batch processing
• Differing payoffs
• Pulling multiple arms simultaneously
3. Set-it-and-forget-it
4. Some upfront effort matching the algorithm to your problem
5. Easily testable
6. Epsilon-first is just A/B testing
7. Simulation is awesome!

Our Implementation¶

1. Custom BPM
2. Update our posterior with information from multiple models
3. Update sampled beta beliefs using dynamic linear models
4. Batch process information
5. Sparse information from millions of arms for thousands of campaigns