Puzzle master Wei-Hwa Huang poses this problem (reworded by me):
You are playing a game where a merchant has a stockpile of bags for sale. Each bag contains an unknown number of diamonds. You can buy* a randomly selected bag for 12 coins, look in the bag to see how many diamonds there are, and immediately decide whether to keep it, or sell it back for 11 coins. You can also (as many times as you want and at no cost) peek at a random bag and count the diamonds, without the option to buy. What is a good strategy for maximizing the number of diamonds in the bags you keep?*
In this notebook I solve the problem and along the way demonstrate:
Let's get these out of the way first: imports of necessary modules, and some constants:
from typing import *
from statistics import mean, stdev
import matplotlib.pyplot as plt
import functools
import random
BUY = 12 # How many coins a bag costs
SELL = 11 # How many coins you get from selling a bag back
MEAN = 100 # Average number of diamonds in a bag
N = 10_000 # Default number of bags in the merchant's stockpile
K = 1_000 # Default number of peeks at the stockpile
cache = functools.lru_cache(None) # Decorator
random.seed(42) # For reproducability
There are two different collections of bags that we need to represent:
In both cases we need to know the number of diamonds in each bag, so essentially we want to represent a collection of integers, and how often each of those integers appears. I implement this with the class ProbDist
, a subclass of Python's Counter
class, that works as follows:
P = ProbDist([100, 101, 100])
(or P = ProbDist({100: 2, 101: 1})
) makes a distribution P
with 3 samples.P(d)
gives the probability of d
diamonds (e.g. P(100) == 2/3
).P[d]
gives the count of d
diamonds (e.g. P[100] == 2
).P
is hashable, so that it can be passed to functions with a cache
decorator. Caching is crucial for efficiency.P.n, P.μ, P.σ
give the total count, mean, and standard deviation of the values in P
, respectively.__str__
method gives an informative description of the distribution, including a descriptive name
.class Hashable:
"""A mixin that hashes objects based on their identity."""
def __hash__(self): return id(self)
def __eq__(self, other): return self is other
class ProbDist(Hashable, Counter):
"""A discrete probability distribution, mapping values to frequencies."""
def __init__(self, samples: Union[Iterable, Mapping], name='ProbDist'):
"`samples` can be either an iterable of values, or a {val: count} mapping."
super().__init__(samples)
values = list(self.elements())
self.__dict__.update(μ=mean(values), σ=stdev(values), n=len(values), name=name)
def __call__(self, d) -> float: return self[d] / self.n
def __str__(self) -> str: return f'{self.name}[μ={self.μ:.1f}, σ={self.σ:.1f}, n={self.n:,d}]'
Here's an example of building a probability distribution, P
, from five values, and making some assertions about it:
P = ProbDist([108, 92, 108, 100, 92])
assert P[108] == 2 # number of times 108 occurs
assert P(108) == 2/5 # probability of 108
assert P.n == 5 # total number of samples in the distribution
assert P.μ == 100 # mean of the samples
assert P.σ == 8 # standard deviation of the samples
assert dict(P) == {108: 2, 92: 2, 100: 1}
assert repr(P) == 'ProbDist({108: 2, 92: 2, 100: 1})'
assert str(P) == 'ProbDist[μ=100.0, σ=8.0, n=5]'
I will define the function normal
to create a stockpile of bags that will have a roughly normal (Gaussian) distribution. Then I will define peek
to peek at some of the bags in the stockpile:
def normal(σ, μ=MEAN, n=N) -> ProbDist:
"""A ProbDist of `n` integers from a Normal distribution with mean `μ` and standard deviation `σ`."""
values = [max(0, round(random.gauss(μ, σ))) for _ in range(n)]
return ProbDist(values, 'normal')
def peek(stockpile: ProbDist, k=K) -> ProbDist:
"""A ProbDist formed by peeking at k values sampled randomly from stockpile."""
values = random.choices(list(stockpile.elements()), k=k)
return ProbDist(values, stockpile.name + '_peek')
We see that a peek
at 1,000 bags does a good job of capturing the statistics of a stockpile:
stock1 = normal(σ=1)
peek1 = peek(stock1)
print(stock1)
normal[μ=100.0, σ=1.0, n=10,000]
print(peek1)
normal_peek[μ=100.0, σ=1.0, n=1,000]
stock1.most_common()
[(100, 3808), (99, 2476), (101, 2388), (98, 611), (102, 598), (103, 58), (97, 55), (104, 3), (96, 3)]
peek1.most_common()
[(100, 390), (99, 251), (101, 237), (98, 61), (102, 54), (103, 4), (97, 3)]
We previously explained that it best to do all the peeks first, and to continue buying bags until the coins are exhausted. So we'll assume that all strategies automatically do that, and we're left with the task of making keep/sell choices. I will use two conventions:
strategy(c, d)
, that returns 'keep'
if a player with c
coins should keep a bag with d
diamonds, and otherwise returns 'sell'
.For example, cutoff_strategy
is a strategy maker such that cutoff_strategy(peeks, 0.95)
returns a strategy function that keeps all bags with at least 95% of the mean number of diamonds in peeks
. (Also, if it started with exactly BUY coins, it keeps the bag, because selling it back would not yield enough coins to buy another bag.)
Action = Literal['keep', 'sell'] # The two possible actions
Strategy = Callable[[int, int], Action] # strategy(coins, diamonds) -> 'keep' | 'sell'
def cutoff_strategy(peeks: ProbDist, ratio=1.0) -> Strategy:
"""Strategy to keep all bags with `ratio * P.μ` or more diamonds.
Also keep a bag if selling it back would not yield enough coins to buy another bag."""
cutoff = ratio * peeks.μ
def strategy(c: int, d: int) -> Action:
return 'keep' if (c == BUY or d >= cutoff) else 'sell'
strategy.__name__ = f'cutoff_{cutoff:.0f}' # Give the strategy a descriptive name
return strategy
Here are some examples of a cutoff strategy deciding what action to take:
strategy = cutoff_strategy(peek1, 0.95) # A strategy to keep bags at least 95% of the mean of peek1
assert strategy(13, 92) == 'sell' # 92 is below the cutoff; sell it back
assert strategy(23, 103) == 'keep' # 103 is above the cutoff; keep it
I'll define $E(P, \textit{strategy}, c)$ to be the expected number of diamonds that can be bought with $c$ coins, from a stockpile with distribution $P$, when the player follows the given $\textit{strategy}$. The expected value is 0 if we don't have enough coins to buy any bags, and otherwise is the probability-weighted average, over all possible bags, of the expected value of the result of either keeping or selling back the bag, depending on what the strategy says to do. That can be written as:
$$E(P, \textit{strategy}, c) = \sum_{d \in P} P(d) \left\{ \begin{array}{ l l } 0 & \textbf{if } c < 12 \\ E(P, \textit{strategy}, c - 12) + d & \textbf{if } \textit{strategy}(c, d) = \textit{keep} \\ E(P, \textit{strategy}, c - 12 + 11) & \textbf{if } \textit{strategy}(c, d) = \textit{sell} \end{array}\right.$$Our game is simple enough that it is feasible to do an exact expected value calculation, as long as we remember to cache intermediate results for efficiency:
@cache
def E(P: ProbDist, strategy: Strategy, c: int) -> float:
"""The expected number of diamonds that can be bought with `c` coins using `strategy`,
when the stockpile of bags has distribution `P`."""
return (0 if c < BUY else
sum(P(d) * (E(P, strategy, c - BUY) + d if strategy(c, d) == 'keep' else
E(P, strategy, c - BUY + SELL))
for d in P))
This works because each recursive call reduces the number of coins, and when we get down to fewer than 12 coins, the expected value is always zero. If this was not the case (say, if you could have a negative balance of coins and still continue the game) then this computation of $E$ would lead to an infinite loop.
For any game, the optimal strategy is to take the action that leads to the highest expected value. That may sound trivial or circular: "the best strategy is to choose the best action." But it is actually neither trivial nor circular, and can be implemented as follows:
def optimal_strategy(peeks: ProbDist) -> Strategy:
"""Strategy to take the action (keep or sell) that leads to the highest expected value."""
def optimal(c: int, d: int) -> Action:
return 'keep' if E(peeks, optimal, c - BUY) + d > E(peeks, optimal, c - BUY + SELL) else 'sell'
return optimal
One subtle but important point: if we want to know the true expected value of a situation, we would call E(c, stockpile, strategy)
. That's easy enough for us to do as outside observers of the game. But a player in the game does not have access to the stockpile; all they have is an estimate of the stockpile distribution, gained from peeks. We will see that usually the estimation is quite good.
Here we see the optimal strategy in action:
stock10 = normal(10) # A stockpile with standard deviation 10 (and mean 100)
peek10 = peek(stock10) # A probability distribution formed by peeking at some bags
optimal = optimal_strategy(peek10) # The optimal strategy with respect to peek10
assert 99 < peek10.μ < 101 # The mean of peek10 should be somewhere around 100
assert 9 < peek10.σ < 11 # The standard ddeviation of peek10 should be somewhere around 10
assert optimal(13, 98) == 'sell' # sell back because we expect about 100 on the next buy
assert optimal(13, 103) == 'keep' # keep because we expect about 100 on the next buy
assert optimal(23, 103) == 'sell' # 103 is an above average bag, but we have 10 chances to do better; sell it
assert optimal(23, 115) == 'keep' # Even with 10 more chances, it is tough to beat a bag with 115 diamonds
assert optimal(12, 50) == 'keep' # keep because if we sold we couldn't buy another bag
We can use the strategy to play a random instance of the game:
def play(strategy, coins: int, stockpile: ProbDist, verbose=True) -> List[int]:
"""Play a game using `strategy` and return the list of bags kept. Optionally print each action."""
bags = list(stockpile.elements())
kept = [] # List of bags kept
while coins >= BUY: # While you have enough coins to buy a bag:
coins -= BUY # Buy a bag
bag = random.choice(bags) # Count the diamonds in the bag
action = strategy(coins, bag) # Decide whether to keep it or sell it back
if action == 'keep':
kept.append(bag)
else:
coins += SELL
if verbose:
print(f'{bag:3d} diamond bag: {action} it (total: {sum(kept):3d} diamonds and {coins:3d} coins.')
return kept
play(optimal_strategy(peek10), 95, stock10)
102 diamond bag: sell it (total: 0 diamonds and 94 coins. 105 diamond bag: keep it (total: 105 diamonds and 82 coins. 92 diamond bag: sell it (total: 105 diamonds and 81 coins. 112 diamond bag: keep it (total: 217 diamonds and 69 coins. 100 diamond bag: sell it (total: 217 diamonds and 68 coins. 89 diamond bag: sell it (total: 217 diamonds and 67 coins. 109 diamond bag: keep it (total: 326 diamonds and 55 coins. 106 diamond bag: keep it (total: 432 diamonds and 43 coins. 112 diamond bag: keep it (total: 544 diamonds and 31 coins. 111 diamond bag: keep it (total: 655 diamonds and 19 coins. 99 diamond bag: keep it (total: 754 diamonds and 7 coins.
[105, 112, 109, 106, 112, 111, 99]
We've solved the problem!
We could stop here, but ...
I have some remaining questions!
We have defined the optimal strategy, but do we understand it? In exactly which situations should we keep a bag and when should we sell it back? What's a concise intuitive explanation for these results? Here's some code that might lead to some understanding:
plt.rcParams["figure.figsize"] = (14, 6)
COINS = range(12, 121) # Range of coin values to examine
def plot_strategy(P: ProbDist, strategy):
"""Plot (coins, diamonds) points for which strategy(c, d) == 'sell'."""
points = [(c, d) for c in COINS for d in range(min(P), max(P) + 1)
if strategy(c, d) == 'sell']
plt.scatter(*transpose(points), marker='.', label=strategy.__name__)
decorate(title=f'When to sell back with {strategy.__name__} strategy on {P}')
def decorate(title, xticks=COINS[::BUY], xlabel='coins', ylabel='diamonds'):
"""Decorate the plot with title, grid lines, ticks, and labels."""
plt.grid(True); plt.xticks(xticks)
plt.title(title); plt.xlabel(xlabel); plt.ylabel(ylabel)
def transpose(matrix): return zip(*matrix)
The plot below looks at situations where the x-axis lists the number of coins you have before you buy a bag, and the y-axis lists the number of diamonds in the bag you bought. A blue dot at position (x, y) means that the optimal strategy is to sell the bag back in that situation. This is the true optimal strategy for stock10
, obtained by having exact knowledge of the stockpile, something that a player could never obtain exactly.
plot_strategy(stock10, optimal_strategy(stock10))
We notice three things about this plot:
It would be nice to have a single metric to evaluate how well a strategy does. In the quest to find such a metric, the first question is "how well a strategy does what?" One answer is "how well it gains diamonds, on average, starting with various numbers of coins." I'll define the function metric
to do that. It returns the average number of diamonds per 12 coins, averaged over all starting numbers in the argument coins
:
def metric(strategy, P, coins=COINS) -> float:
"""How many diamonds does the strategy get per 12 coins, averaged over all `coins`."""
return mean(E(P, strategy, c) * BUY / c for c in coins)
optimal10 = optimal_strategy(peek10)
metric(optimal10, stock10)
94.40155233086679
Since the mean of stock10
is very close to 100, you might be surprised to see that the optimal strategy gets considerably less than that; does that mean something is wrong? *No*–it means that when you start with a number of coins that has a non-zero remainder mod 12, you won't be able to buy a full additional bag, and by default the average takes into account all remainders mod 12. We can look at just the coins that are multiples of 12; that should give a metric very near 100:
metric(optimal10, stock10, range(12, 100, 12))
100.05930000000001
And we can look at just the coins that are equal to 11 mode 12; that should give a lower value:
metric(optimal10, stock10, range(11, 100, 12))
74.26480822105012
The function plot_E
below allows us to plot the expected value for a strategy as a function of the number of coins. We can compare different stockpiles, strategies, and/or numbers of peeks. The function produces two plots: one that displays the expected total number of diamonds as a function of the number of coins, and one that displays the expected number of diamonds per 12 coins. (You can think of the second plot as being the same as the first, but rotated so that the x-axis parallels the values where the number of coins mod 12 is zero.)
def plot_E(stockpiles: Iterable[ProbDist], strategy_fns=[optimal_strategy],
ks=[K], coins=COINS, divide=False):
"""Plot the expected number of diamonds for different stockpiles, strategies, and coins."""
peeks = {(stock, k): peek(stock, k) for stock in stockpiles for k in ks} # precompute peeks
for per_coins in ('', ' per 12 coins'):
for k in ks: # k is the number of peeks
for stock in stockpiles:
for strategy_fn in strategy_fns:
strategy = strategy_fn(peeks[stock, k])
diamonds = [E(stock, strategy, c) * ((BUY / c) if per_coins else 1)
for c in coins]
m = metric(strategy, stock, coins)
label = f'{m:.1f}♦︎ {strategy.__name__}({peeks[stock, k]})'
plt.plot(coins, diamonds, 'o:', label=label)
decorate(title='Expected number of diamonds' + per_coins, xticks=coins[::12])
plt.legend(); plt.show()
We start with plots showing the expected number of diamonds obtained by the optimal strategy with 1,000 peeks at the stock10
stockpile:
plot_E([stock10])
You can see that there are big jumps in expected return when the coins reach a multiple of 12, which correspond to being able to buy one more bag. Basically, the strategy with 12b coins is: "buy b bags, and keep them, no matter what, giving you an expected total of 100b diamonds, or 100 diamonds per 12 coins." (There are some exceptions to this rule.)
In between the jumps, the top plot shows gradual increases in total expected diamonds; that's because with each additional coin, we have one more opportunity to sell back a bad bag and hope to get a better one. But the expected number of diamonds per 12 coins (bottom plot) mostly decreases, because the return from the extra opportunities is small. We do see a few slight increases: the diamonds per 12 coins is slightly greater with 110 coins (9 bags plus two extra tries) than it is with 108 coins.
The expected value plots and the strategy plot tell the same story in two different ways: as we get more coins above a multiple of 12, we become pickier and sell back more bags; this leads to a higher expectation for the bags we keep and thus for our total expected value.
Here is a plot comparing stockpiles with standard deviations of 33, 10, and 1:
stock33 = normal(33) # Stockpile with standard deviation about 33
plot_E([stock33, stock10, stock1])
We see that with standard deviation 1 or 10, we can't get very far above 100 diamonds per 12 coins. But with standard deviation 33, the gamble of selling a bag back and buying another starts to really pay off, and we end up with over 110 diamonds per 12 coins.
Below is the strategy plot for the optimal strategy on stock33
. Note that with 12 coins the strategy is still to keep any bag but with multiples of 12 coins starting with 24 there are bag values that will be sold back, even though that means we will end up with one less bag.
plot_strategy(stock33, optimal_strategy(stock33))
The merchant's stockpile might bear no resemblance to a normal distribution. In this section I compare three stockpile distributions with very different shapes: a normal distribution, a uniform distribution, and a bimodal distribution that is the sum of two normal distributions. The three distributions have (roughly) the same mean and standard deviation.
def uniform(μ=MEAN, semiwidth=15, n=N) -> ProbDist:
"""A ProbDist of n integers chosen uniformly from μ ± semiwidth."""
values = [random.randint(μ - semiwidth, μ + semiwidth) for _ in range(n)]
return ProbDist(values, 'uniform')
def plot_shapes(shapes: Iterable[ProbDist]):
"""Plot the shapes of some probability dsitributions."""
markers = iter('+DosP*xX')
for P in shapes:
X = sorted(P)
plt.plot(X, [P(x) for x in X], next(markers) + '-', label=P)
plt.legend()
bimodal = ProbDist(normal(2.5, 92, n=6*N) + normal(2.5, 112, n=4*N), 'bimodal')
shapes3 = [bimodal, normal(10, n=10*N), uniform(semiwidth=17, n=10*N)]
plot_shapes(shapes3)
Now let's look at the optimal strategy on the three shapes:
plot_E(shapes3)
Despite the very different shapes, the expected values differ by less than 1%. I found this very surprising!
So far, we have looked at stockpiles where the bags have a mean of 100 diamonds. Let's compare that to some stockpiles with a mean of just 10 diamonds (with different standard deviations):
plot_E([normal(4, 10), normal(2, 10), normal(1, 10)])
The results are sinmilar to the results with a mean of 100.
The cutoff strategy has the advantage of being simple to implement for a human player, but it is not optimal. How close is it? And what is the best cutoff?
First, we can find the best cutoff for any distribution by exhauastive test and comparison:
@cache
def best_cutoff_strategy(P, coins=COINS, cutoffs=range(50, 150)) -> Strategy:
"""The cutoff strategy that maximizes metric(strategy, P, coins)."""
strategies = [cutoff_strategy(P, pct/100) for pct in range(101)]
return max(strategies, key=lambda strategy: metric(strategy, P, coins))
Here are the best cutoff strategies for the stock10
and stock33
stockpiles, compared to the optimal strategies for those stockpiles:
plot_E(stockpiles=[stock33, stock10], strategy_fns=[optimal_strategy, best_cutoff_strategy])
We see that on both stockpiles, the cutoff strategies lag about 4 diamonds behind the optimal strategies.
Below we can verify that the optimal strategy is indeed better than the cutoff strategy, considering every situation we have looked at so far:
def dominates(S1: Strategy, S2: Strategy, P, coins=COINS) -> bool:
"""Is strategy S1 always as good or better than S2?"""
return all(E(P, S1, c) >= E(P, S2, c) for c in coins)
all(dominates(optimal_strategy(P), best_cutoff_strategy(P), P)
for P in [stock1, stock10, stock33] + shapes3)
True
How many bags do we have to peek at to get a good sampling of the stockpile? I arbitrarily set a default of 1,000. But how much worse would we do with fewer peeks? How much better could we do with more peeks? Let's see:
plot_E([stock33], ks=(10_000, 1000, 100, 10))
There is hardly any difference between 10,000 samples and 100 samples! I found this very surprising!
We have to go all the way down to 10 samples before we see a noticable dropoff in expected value. I guess this is saying that 100 samples is enough to get the general idea that the distribution has a mean of somewhere near 100, and a standard deviation somewhere near 33.
Maybe there would be more of a difference if the distribution was harder to predict. Maybe the bimodal distribution?
plot_E([bimodal], ks=(10_000, 1000, 100, 10))
Still not much difference.
Let's try a distribution that is even more skewed: the distribution from Bernoulli's St. Petersburg Paradox, in which the the payoff is $2^d$ with probability $2^{-d}$ for all values of $d$ from 1 up to a limit.
def st_pete(limit=14) -> ProbDist:
"""The St. Petersburg distribution."""
return ProbDist({2 ** i: 2 ** (limit - i) for i in range(1, limit + 1)}, 'St_Pete')
st_pete()
ProbDist({2: 8192, 4: 4096, 8: 2048, 16: 1024, 32: 512, 64: 256, 128: 128, 256: 64, 512: 32, 1024: 16, 2048: 8, 4096: 4, 8192: 2, 16384: 1})
The mean of this distribution is equal to the limit
(by default 14):
print(st_pete())
St_Pete[μ=14.0, σ=180.5, n=16,383]
I'll compare 10,000 peeks to two mid-size sets of peeks (101 and 100), and two small sizes (11 and 10). There is a lot of variatioon from one run to the next, so I'll set the random seed to a value that gives a representative result, but you can rerun the cell to get different plots:
random.seed(42)
plot_E([st_pete()], ks=(10_000, 101, 100, 11, 10))
The collection of 10,000 peeks gives a very good estimate of the mean and standard deviation, but the smaller samples severely underestimate both the mean and the standard deviation of the distribution, and therefore don't yield as many diamonds.
Let's see what the optimal strategy looks like for this distribution.
P = st_pete()
plot_strategy(P, optimal_strategy(P))
I don't fully understand this plot, but it seems to be saying that up to 24 coins the shape is similar to the optimal strategy on other distributions, but once we get to 36 coins, the strategy is to sell back any bag under about 440 diamonds, regardless of how many coins are left.
Two main conclusions: