This notebook describes a model to predict whether Brandon Sanderson will complete his Cosmere series. I define a stochastic process model for books authored by Sanderson and use Bayesian inference, via the pymc3 package, to fit the model to his work up to February 2016. I then sample from the posterior to simulate entire careers up to retirement and make inferences about the number and type of books he will write. I have been making the models progressively more complex, but have decided to stop here, at least until I have more data. The next ingredient one could add is a career-length dependence of the model parameters.
I model Sanderson as a stochastic mixture-of-exponentials process. Described generatively: Sanderson chooses a type of book to write, then whether it is a Cosmere book, and then a duration of time to spend writing it. The "type of book" is a clustering device for fitting the data; it has nothing to do with whether the book is young adult, a novella, etc. The number of words in the resulting book depends on the length of time he spent writing it, the type of book, and whether it is a Cosmere book. Then he begins his next book. Each book is independent and identically distributed, so the process is not complicated. I assume he will stop writing at an age drawn from a normal distribution centered at 67 years with a standard deviation of 3 years.
The mathematical description of the model is below:
$$ \vec{p} \sim \textrm{Dir}\left(\,\vec{1}\,\right) $$
$$ \vec{w}\,|\,\vec{p} \sim \textrm{Categorical}\left(\,\vec{p}\,\right) \quad\textrm{(one-hot encoded)} $$
$$ \vec{\kappa} \sim \textrm{Dir}\left(\,\vec{1}\,\right) $$
$$ c\,|\,\vec{w} \sim \sum_i w_i\, \textrm{Bernoulli}(\,\kappa_i\,) $$
$$ \vec{\lambda} \sim \textrm{Uniform}\left(0.25\,\textrm{Year}^{-1}, 10\,\textrm{Year}^{-1}\right) $$
$$ t\,|\,\vec{w} \sim \sum_i w_i\, \textrm{Exp}(\,\lambda_i\,) $$
$$ \vec{\alpha}, \vec{\beta}, \vec{\mu} \sim \textrm{Norm}(0, 2) $$
$$ \vec{\tau} \sim \textrm{Gamma}(1, 1) $$
$$ \mu_t = \sum_i w_i \left(\,\mu_i + \alpha_i t + \beta_i c\,\right) $$
$$ l\,|\,\vec{\tau}, \mu_t, \vec{w} \sim \textrm{Norm}\left(\mu_t,\ \sum_i w_i \tau_i\right) $$
where $l$, $c$, and $t$ are observed data corresponding to the number of words, whether the book is Cosmere, and the amount of time since the last publication. $\vec{w}$ is the latent book-category variable, a one-hot indicator whose prior weights are $\vec{p}$. $\vec{\kappa}$, $\vec{\lambda}$, $\vec{\tau}$, and $\vec{\mu}$ are model parameters; the $\tau_i$ are precisions, following the pymc3 convention. $\vec{\alpha}$ and $\vec{\beta}$ are linear coefficients for the effect of $t$ and $c$ on $l$. The word counts are standardized to zero mean and unit variance to avoid working with huge numbers and because I have no sense of a reasonable prior.
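To make the generative story concrete, here is a minimal forward simulator of the process above in plain numpy. It is only a sketch: `simulate_career` is a hypothetical helper, every parameter value below is an illustrative placeholder rather than a posterior estimate, and word counts are on the standardized scale.
import numpy as np

rng = np.random.default_rng(0)

def simulate_career(age=41, life_mean=67, life_sd=3,
                    p=(0.6, 0.4),        # category weights, as if drawn from Dir(1)
                    kappa=(0.7, 0.3),    # P(Cosmere | category)
                    lam=(2.0, 0.8),      # books per year, per category
                    mu=(0.5, -0.5),      # word-count centers (standardized units)
                    alpha=(0.1, 0.05),   # time-to-words coefficients
                    beta=(0.3, 0.1),     # Cosmere-to-words coefficients
                    sd=0.5):
    '''Draw i.i.d. books until a Norm(life_mean, life_sd) retirement age is reached'''
    years_left = rng.normal(life_mean - age, life_sd)
    elapsed, books = 0.0, []
    while True:
        k = rng.choice(len(p), p=p)        # latent book category
        c = rng.binomial(1, kappa[k])      # Cosmere indicator
        t = rng.exponential(1.0 / lam[k])  # years spent writing
        elapsed += t
        if elapsed > years_left:           # retires mid-book
            break
        l = rng.normal(mu[k] + alpha[k] * t + beta[k] * c, sd)
        books.append((t, c, l))
    return books

print(len(simulate_career()), 'books in one simulated career')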
The data were taken from a reddit post by havefunonline and are current up to Calamity (February 2016). I would update, but I cannot find word counts for his latest books.
raw_data = pd.read_excel('books.xlsx')
data = pd.DataFrame({'date': (raw_data['Publication Date'] - raw_data['Publication Date'].min()).dt.days,
'type': pd.Series(raw_data['Type'], dtype='category'),
'words': raw_data['Word count'],
'cosmere': raw_data['Cosmere']})
data.sort_values('date', inplace=True)
matplotlib.style.use('seaborn-darkgrid')
plt.figure()
for u in data.cosmere.unique():
plt.scatter(data[data.cosmere == u].date / 365.,
data[data.cosmere == u].words,
label='Cosmere' if u == 1 else 'Non-Cosmere')
plt.axhline(y=50000, color='gray',
linestyle='--', label='Short Story Cutoff')
plt.ylabel('Words')
plt.xlabel('Years')
plt.legend()
plt.savefig('data.png', dpi=300)
plt.show()
s = Sanderson(2, {'time': data.date.values, 'words': data.words.values, 'cosmere': data.cosmere.values})
s.inference(samples=100000, njobs=4)
100%|██████████| 120000/120000 [3:54:45<00:00, 16.17it/s]
<MultiTrace: 4 chains, 100000 iterations, 13 variables>
plt.figure()
s.traceplot()
plt.savefig('trace.png', dpi=300)
plt.show()
A few interesting conclusions can be drawn from the parameter traces above.
Here I've plotted a few realizations of the observed data using the model. Although the shorter books are modeled well, the longer books appear to have a time dependence that is not present in the model.
N = 2500
ppc = s.sample(samples=N, flatten=False)
plt.figure()
plt.plot([], [], color='gray', alpha=0.5, marker='o', linestyle='', label='Simulated')
for i in range(N):
plt.scatter(np.cumsum(ppc['t_obs'][i]) / 365, ppc['x_obs'][i],
c='gray', alpha=0.25, s=10)
for u in data.cosmere.unique():
plt.scatter(data[data.cosmere == u].date / 365.,
data[data.cosmere == u].words,
label='Cosmere' if u == 1 else 'Non-Cosmere')
plt.ylabel('Words')
plt.xlabel('Years Since Elantris')
plt.ylim(-2*10**5, 8 * 10**5)
plt.xlim(0, 12)
plt.legend(loc='best')
plt.savefig('fit.png', dpi=300)
plt.show()
100%|██████████| 2500/2500 [00:57<00:00, 48.11it/s]
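The introduction mentioned a career-length dependence of the parameters as the next extension, which could capture the time dependence visible in the longer books. One hedged way to do this would be to let the word-count center drift with cumulative career time. The fragment below is a hypothetical illustration with placeholder data (`t_cum`, `words_z`, and the drift coefficient `gamma` are all my inventions), not part of the fitted model:
import numpy as np
import pymc3 as pm

t_cum = np.array([0., 300., 800., 1500.])  # hypothetical days since first book
words_z = np.array([0.1, -0.3, 0.8, 0.5])  # hypothetical standardized word counts

with pm.Model():
    mu0 = pm.Normal('mu0', 0., 2.)               # baseline word-count center
    gamma = pm.Normal('gamma', 0., sd=2. / 365)  # drift in the center per day of career
    tau = pm.Gamma('tau', 1., 1.)
    x_obs = pm.Normal('x_obs', mu0 + gamma * t_cum, tau=tau, observed=words_z)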
Here we simulate Sanderson's career many times. We pass two callback functions to the simulation: one accumulates statistics from every simulated career, and the other plots every 25th career.
#http://coppermind.net/wiki/Unpublished_works
stormlight = 10
mistborn = 3 * 4
whitesand = 1
dragonsteel = 3
elantris = 2 * 3
warbreaker = 2
total_cosmere = stormlight + mistborn + whitesand + dragonsteel + elantris + warbreaker
remaining = total_cosmere - data.cosmere.sum()
short_story_def = 50000
cosmere_s = []
cosmere = []
non_cosmere = []
non_cosmere_s = []
words = []
median_career = {'x': [], 'y': []}
def stats(t_obs, x_obs, c_obs, short = short_story_def):
cosmere.append(np.sum(c_obs * (x_obs > short)))
cosmere_s.append(np.sum(c_obs * (x_obs <= short)))
non_cosmere.append(np.sum(x_obs > short) - cosmere[-1])
non_cosmere_s.append(np.sum(x_obs <= short) - cosmere_s[-1])
words.append(np.sum(x_obs))
    #assert (cosmere[-1] + cosmere_s[-1] + non_cosmere[-1] + non_cosmere_s[-1]) == len(x_obs), \
    #    '{} {}'.format(cosmere[-1] + cosmere_s[-1] + non_cosmere[-1] + non_cosmere_s[-1], len(x_obs))
def plot_career(t_obs, x_obs, c_obs, short=short_story_def):
y = data.cosmere.sum() + np.arange(np.sum(c_obs * (x_obs > short)))
t_cum = np.cumsum(t_obs) / 365
x = t_cum[np.where(c_obs * (x_obs > short) == 1)] + data.date.iloc[-1] / 365.0
plt.plot(x, y, alpha=0.2, color='purple', marker='.', linewidth=0.2)
plt.plot(x[-1], y[-1], alpha=0.7, color='purple', marker='X', linewidth=0.2)
plt.figure()
plt.plot([], [], alpha=1, color='purple', marker='.', linewidth=0.2, label='Projected')
plt.plot([], [], alpha=1, color='purple', marker='X', linewidth=0.2, label='Projected Final Book', linestyle='')
plt.plot(data[data.cosmere == 1].date / 365.,
np.arange(np.sum(data.cosmere == 1)), color='black', alpha=1, marker='.', linewidth=0.5, label='Completed')
s.career([stats, plot_career], [1, 25], samples=25000)
plt.axhline(y=total_cosmere, color='gray', linestyle='--', label='Cosmere Finished')
plt.legend(loc='best')
plt.ylabel('Books Completed')
plt.xlabel('Time since Elantris [Years]')
plt.savefig('final.png', dpi=300)
plt.show()
100%|██████████| 25000/25000 [08:27<00:00, 49.30it/s]
Now we compare the number of novel-length Cosmere works in the simulations with the count Sanderson has projected, to estimate the probability that he completes the series before retiring.
print('{} total books to write, with {} to go'.format(total_cosmere, remaining))
print('With expected retirement age of {}, there is {:.3%} probability of completing cosmere. Simulated {} careers'.format(
67,
sum([ci > remaining for ci in cosmere]) / len(cosmere),
len(cosmere)))
34 total books to write, with 22 to go
With expected retirement age of 67, there is 55.674% probability of completing cosmere. Simulated 10425 careers
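Since this probability is a Monte Carlo estimate over 10425 simulated careers, a quick binomial standard error gives a rough sense of its precision. This is a sketch that treats the simulated careers as independent draws (they share posterior samples, so this is optimistic); `p_hat` and `n` are simply the values from the printed output above.
import math

p_hat, n = 0.55674, 10425  # completion probability and career count from above
se = math.sqrt(p_hat * (1 - p_hat) / n)
print('{:.3%} +/- {:.3%} (Monte Carlo standard error)'.format(p_hat, se))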
for a,l in zip([cosmere, non_cosmere, cosmere_s, non_cosmere_s, words],
['cosmere', 'non_cosmere', 'cosmere_s', 'non_cosmere_s', 'words']):
m = np.median(a)
t = np.percentile(a, [2.5, 97.5])
w = ((t[1] - m) + (m - t[0])) / 2
print('{} = {} +/- {}'.format(l, m, w))
cosmere = 23.0 +/- 12.0
non_cosmere = 25.0 +/- 13.5
cosmere_s = 7.0 +/- 6.0
non_cosmere_s = 16.0 +/- 10.0
words = 8982495.593676297 +/- 4290224.834684873
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import pymc3 as pm
import scipy.stats as ss
import theano, math
matplotlib.style.use('seaborn-notebook')
class ModelVars:
    '''Expose a dict of model variables as object attributes'''
    def __init__(self, vars):
        self.__dict__.update(vars)
class Sanderson:
def __init__(self, cluster_number, data):
assert 'cosmere' in data and 'time' in data and 'words' in data
assert len(data['cosmere']) == len(data['words']) and len(data['words']) == len(data['time'])
        #transform data: standardize word counts to zero mean and unit variance
        data['normed_words'] = (data['words'] - np.mean(data['words'])) / np.std(data['words'])
        #drop the first book, which has no preceding publication to difference against
        data['normed_words'] = data['normed_words'][1:]
        data['time_diff'] = data['time'][1:] - data['time'][:-1]
self.data = data
self.cluster_number = cluster_number
self.obs_n = len(data['time_diff'])
self._inf_model = None
self._gen_model = None
self._inf_trace = None
def inference(self, seed=0, samples=1500, tune_frac=0.2, njobs=1):
'''Returns the given number of samples after tuning period from inference'''
with self.inf_model:
vs = self._inf_vars
tune = math.floor(samples * tune_frac)
step1 = pm.Metropolis([vs.w, vs.rate, vs.c_p])
step2 = pm.CategoricalGibbsMetropolis([vs.category])
step3 = pm.NUTS([vs.mu, vs.tau, vs.alpha, vs.beta])
            trace = pm.sample(samples + tune, tune=tune, random_seed=seed, step=[step1, step2, step3], njobs=njobs)[tune:]
self._inf_trace = trace
return trace
def traceplot(self):
pm.traceplot(self.trace, varnames=['w', 'mu', 'rate', 'alpha', 'beta', 'c_p'], combined=True)
def sample(self, cumtime=None, samples=None, intervals = 5, flatten=True):
        '''Return samples of the observed variables from the generative model
        Args:
            cumtime: Generate batches until this cumulative writing time is exceeded. If None, generate a single batch.
            samples: Number of posterior predictive draws (each the size of the original data) when cumtime is None.
            intervals: Number of draws per batch while accumulating toward cumtime.
            flatten: If True, flatten each returned array to one dimension.
        '''
#create samples
ppc = None
with self.gen_model:
while True:
#generate a batch
s = pm.sample_ppc(self.trace, samples=samples if cumtime is None else intervals)
if flatten:
s = {k: v.flatten() for k,v in s.items()}
#append them to dictionary
if ppc is None:
ppc = s
else:
ppc = {k: np.concatenate((v,s[k])) for k,v in ppc.items()}
#check if we've reached cumulative time
                if cumtime is None or np.sum(ppc['t_obs']) >= cumtime:
break
#trim to as close as possible to cumulative time
if cumtime is not None:
i = np.argmax(np.cumsum(ppc['t_obs']) - cumtime > 0)
            ppc = {k: v[:i] for k, v in ppc.items()}
#transform words back to usual
ppc['x_obs'] *= np.std(self.data['words'])
ppc['x_obs'] += np.mean(self.data['words'])
return ppc
def career(self, fxns, stride, age=41, life_mean=67, life_sd=3, samples=1000):
if type(fxns) != list:
fxns = [fxns]
if type(stride) != list:
stride = [stride]
assert len(stride) == len(fxns)
#create samples
ppc = self.sample(samples=samples)
#create cumulative sum of time
cs = np.cumsum(ppc['t_obs'])
#while we have not exhausted all data points
last_i = 0
count = 0
        while cs[-1] > 0:
            #draw a retirement horizon and shift the cumulative times so that
            #books finished before retirement become negative
            life = ss.norm.rvs(loc=life_mean - age, scale=life_sd)
            cs -= life * 365
            #the first book still unfinished at retirement marks the career boundary
            i = np.argmax(cs > 0)
#have we run out of points?
if cs[i] <= 0:
break
#no
for f,s in zip(fxns, stride):
if count % s == 0:
f(ppc['t_obs'][last_i:i], ppc['x_obs'][last_i:i], ppc['c_obs'][last_i:i])
last_i = i
count += 1
@property
def trace(self):
if self._inf_trace is None:
raise ValueError('Run inference first')
return self._inf_trace
@property
def inf_model(self):
if self._inf_model is None:
self._inf_model = pm.Model()
self._inf_vars = self._build_model(self._inf_model)
self._symmetry_break(self._inf_model, self._inf_vars)
self._force_cluster_min(self._inf_model, self._inf_vars)
return self._inf_model
@property
def gen_model(self):
if self._gen_model is None:
self._gen_model = pm.Model()
self._gen_vars = self._build_model(self._gen_model)
return self._gen_model
def _build_model(self, model):
'''Builds model and returns an object whose properties are defined variables below'''
with model:
#Category prior
w = pm.Dirichlet('w', np.ones(self.cluster_number))
#Latent Category
            category = pm.Categorical('category', p=w, shape=self.obs_n)
#Cosmere Prior
c_p = pm.Dirichlet('c_p', np.ones(self.cluster_number))
#Cosmere
c_obs = pm.Bernoulli('c_obs', c_p[category], observed=self.data['cosmere'][1:], shape=self.obs_n)
#Rate Prior
#rate = pm.Lognormal('rate', math.log(2.0 / 365), math.log(1.0 / 365.0), shape=self.cluster_number)
rate = pm.Uniform('rate', 0.25 / 365, 10 / 365, shape=self.cluster_number)
#Observed time between books
t_obs = pm.Exponential('t_obs', rate[category], observed=self.data['time_diff'], shape=self.obs_n)
#Rate to words coefficient -> word units / time units
alpha = pm.Normal('alpha', 0, sd=2 / 365, shape=self.cluster_number)
#Cosmere identity to words coefficient -> word units
beta = pm.Normal('beta', 0, sd=1, shape=self.cluster_number)
#Word cluster center prior
mu = pm.Normal('mu', 0., 2., shape=self.cluster_number)
#Word precision prior
tau = pm.Gamma('tau', 1, 1., shape=self.cluster_number)
#GLM of words
            mu_t = pm.Deterministic('mu_t', mu[category] + alpha[category] * t_obs + beta[category] * c_obs)
            #Observed words
            x_obs = pm.Normal('x_obs', mu_t, tau=tau[category], observed=self.data['normed_words'], shape=self.obs_n)
return ModelVars(locals())
def _symmetry_break(self, model, vs):
'''Force symmetry breaking so chain doesn't swap cluster identities. Only need to break mu,
since many other parameters depend on mu'''
with model:
            #break symmetry in words: forbid the fully ascending ordering of mu.
            #With two clusters this pins mu[0] >= mu[1]; with more clusters it
            #would not fully break the label symmetry.
order_mu_potential = pm.Potential('order_mu_potential',
theano.tensor.switch(
theano.tensor.all(
[vs.mu[i] < vs.mu[i+1] for i in range(self.cluster_number - 1)]), -np.inf, 0))
def _force_cluster_min(self, model, vs):
        with model:
            #forbid solutions where any cluster weight collapses below 0.1;
            #with weights summing to one, `all` could never trigger here, so use `any`
            force_wmin_potential = pm.Potential('force_wmin_potential',
                                                theano.tensor.switch(
                                                    theano.tensor.any(
                                                        vs.w < 0.1), -np.inf, 0))