Notebook

Brandon Sanderson Bibliography Stochastic Model¶

Andrew White - March 2017¶

Overview¶

This notebook describes a model to predict if Brandon Sanderon will complete his Cosmere series. I define a stochastic process model for books authored by Sanderson and use Bayesian inference to fit the model to his work up until Febuaray 2016 with the pymc3 package. I sample the posterior of the model to predict an entire career up until retirement to make inferences about the number and type of books he will write. I've been making the models progressively more complex, but have decided to stop here, especially until I have more data. The next parameter one could add is a career-length dependence of the model parameters.

The Model¶

I model Sanderson as a stochasitc mixture exponential process. Describing it generatively, Sanderson chooses a type of book to write, then if it is a cosmere book and then a duration of time to spend writing it. The "type of book" is a clustering of sorts and is for fitting the data, not to do with if the book is young adult or a novella, etc. The number of words of the resulting book depends on the length of time he spent writing it, the type of book it is, and if it is a cosmere book. Then he begins his next book. Each book is independent and identically distributed, so not a complicated process. I assume that he will stop writing according to a normal distribution centered at age 67, with a standard deviation of 3.

The mathematical description of the model is below:

$$ \vec{p} \sim \textrm{Dir}\left(\,\vec{1}\,\right) $$$$ \vec{w} \sim \textrm{Bernoulli}\left(\, \vec{p}\, \right) $$$$ \vec{\kappa} \sim \textrm{Dir}\left(\,\vec{1}\,\right) $$$$ c\, |\, \vec{w} \sim \sum w_i \textrm{Bernoulli}(\,\kappa_i\,) $$$$ \vec{\lambda} \sim \textrm{Uniform}\left(0.25 \,\textrm{Year}^{-1}, 10\,\textrm{Year}^{-1}\right) $$$$ t \, |\, \vec{w} \sim \sum w_i \textrm{Exp}(\,\lambda_i\,) $$$$ \alpha, \beta, \vec{\mu} \sim \textrm{Norm}(0, 2) $$$$ \tau \sim \textrm{Gamma}(1,1) $$$$ \mu_t = \alpha t + \beta c + \sum w_i \mu_i $$$$ l \, |\, \tau, \mu_t \sim \textrm{Norm}(\mu_t, \tau) $$

where $l$, $c$, and $t$ are observed data correspdoning to the number of words, if the book is Cosmere, and the amount of time since last publication. $w$ is the latent category of book variable, whose prior is $p$. $\kappa$, $\lambda$, $\tau$ and $\mu$ are model paramters. $\alpha$ and $\beta$ are linear coefficiencts for the effect of $t$ and $c$ on $l$. The number of words in the books are converted to standard normal to avoid working in huge numbers and because I have no sense of a reasonable prior.

The Data¶

The data was taken from a reddit post by havefunonline and is current up to Calamity (Feb, 2016). I would update, but cannot find word counts for his latest books

In [2]:

raw_data = pd.read_excel('books.xlsx')
data = pd.DataFrame({'date': (raw_data['Publication Date'] - raw_data['Publication Date'].min()).dt.days,
                    'type': pd.Series(raw_data['Type'], dtype='category'),
                    'words': raw_data['Word count'],
                    'cosmere': raw_data['Cosmere']})
data.sort_values('date', inplace=True)

matplotlib.style.use('seaborn-darkgrid')
plt.figure()
for u in data.cosmere.unique():
    plt.scatter(data[data.cosmere == u].date / 365., 
                data[data.cosmere == u].words, 
                label='Cosmere' if u == 1 else 'Non-Cosmere')
plt.axhline(y=50000, color='gray', 
            linestyle='--', label='Short Story Cutoff')
plt.ylabel('Words')
plt.xlabel('Years')
plt.legend()
plt.savefig('data.png', dpi=300)
plt.show()

Inference¶

In [9]:

s = Sanderson(2, {'time': data.date.values, 'words': data.words.values, 'cosmere': data.cosmere.values})
s.inference(samples=100000, njobs=4)

100%|██████████| 120000/120000 [3:54:45<00:00, 16.17it/s]

Out[9]:

<MultiTrace: 4 chains, 100000 iterations, 13 variables>

In [10]:

plt.figure()
s.traceplot()
plt.savefig('trace.png', dpi=300)
plt.show()

<matplotlib.figure.Figure at 0x7f6282ac5240>

Model Parameters¶

A few interesting conclusions from the parameters:

The first type of books (orange) are more common, less likely to be cosmere, have fewer words, and are completed quicker than the other type (blue).
The shorter books (orange) are longer if they are cosmere books.
The longer books (blue) are shorter if they are cosmere (probably due to Mistborn vs Wheel of Time).
The longer books (orange) have almost no correlation between time since last book and word count
The shorter books (blue) have a positive correlation between time since last book and word count.
The fit overall is not great, judging from the large overlap in posterior of parameters.

Comparing Model with Data¶

Here I've plotted a few realization of the observed data using the model. Although the shorter books are modeled well, it appears the longer books have a time dependence which is not present in the model.

In [11]:

N = 2500
ppc = s.sample(samples=N, flatten=False)

plt.figure()
plt.plot([], [], color='gray', alpha=0.5, marker='o', linestyle='', label='Simulated')
for i in range(N):
    plt.scatter(np.cumsum(ppc['t_obs'][i]) / 365, ppc['x_obs'][i],
             c='gray', alpha=0.25, s=10)
             
for u in data.cosmere.unique():
    plt.scatter(data[data.cosmere == u].date / 365., 
                data[data.cosmere == u].words, 
                label='Cosmere' if u == 1 else 'Non-Cosmere')
plt.ylabel('Words')
plt.xlabel('Years Since Elantris')
plt.ylim(-2*10**5, 8 * 10**5)
plt.xlim(0, 12)
plt.legend(loc='best')
plt.savefig('fit.png', dpi=300)
plt.show()

100%|██████████| 2500/2500 [00:57<00:00, 48.11it/s]

Simulation¶

Here we simulate Sanderson's career. We call two functions, one for stats and one for graphs, with simulated careers.

In [12]:

#http://coppermind.net/wiki/Unpublished_works
stormlight = 10
mistborn =  3 * 4
whitesand = 1
dragonsteel = 3
elantris = 2 * 3
warbraker = 2
total_cosmere = stormlight + mistborn + whitesand + dragonsteel + elantris + warbraker
remaining = total_cosmere - data.cosmere.sum()

In [13]:

short_story_def = 50000
cosmere_s = []
cosmere = []
non_cosmere = []
non_cosmere_s = []
words = []
median_career = {'x': [], 'y': []}
def stats(t_obs, x_obs, c_obs, short = short_story_def):
    cosmere.append(np.sum(c_obs * (x_obs > short)))
    cosmere_s.append(np.sum(c_obs * (x_obs <= short)))
    non_cosmere.append(np.sum(x_obs > short) - cosmere[-1])    
    non_cosmere_s.append(np.sum(x_obs <= short) - cosmere_s[-1])
    words.append(np.sum(x_obs))
    #assert (cosmere[-1] + cosmere_s[-1] + 
    #    non_cosmere[-1] + 
    #    non_cosmere_s[-1]) == len(x_obs), 
    #    '{} {}'.format(cosmere[-1] + cosmere_s[-1] + non_cosmere[-1] + non_cosmere_s[-1], len(x_obs))
def plot_career(t_obs, x_obs, c_obs, short=short_story_def):
    y = data.cosmere.sum() + np.arange(np.sum(c_obs * (x_obs > short)))
    t_cum = np.cumsum(t_obs) / 365
    x = t_cum[np.where(c_obs * (x_obs > short) == 1)] + data.date.iloc[-1] / 365.0
    
    plt.plot(x, y, alpha=0.2, color='purple', marker='.', linewidth=0.2)
    plt.plot(x[-1], y[-1], alpha=0.7, color='purple', marker='X', linewidth=0.2)
    
plt.figure() 
plt.plot([], [], alpha=1, color='purple', marker='.', linewidth=0.2, label='Projected')
plt.plot([], [], alpha=1, color='purple', marker='X', linewidth=0.2, label='Projected Final Book', linestyle='')
plt.plot(data[data.cosmere == 1].date / 365., 
                np.arange(np.sum(data.cosmere == 1)), color='black', alpha=1, marker='.', linewidth=0.5, label='Completed')
s.career([stats, plot_career], [1, 25], samples=25000)
plt.axhline(y=total_cosmere, color='gray', linestyle='--', label='Cosmere Finished')
plt.legend(loc='best')
plt.ylabel('Books Completed')
plt.xlabel('Time since Elantris [Years]')
plt.savefig('final.png', dpi=300)
plt.show()

100%|██████████| 25000/25000 [08:27<00:00, 49.30it/s]

Analysis¶

Now we compare the number of novel-length cosmere works with what Sanderson has projected to see the probability that he completes on time.

In [14]:

print('{} total books to write, with {} to go'.format(total_cosmere, remaining))
print('With expected retirement age of {}, there is {:.3%} probability of completing cosmere. Simulated {} careers'.format(
    67,
    sum([ci > remaining for ci in cosmere]) / len(cosmere),
     len(cosmere)))

34 total books to write, with 22 to go
With expected retirement age of 67, there is 55.674% probability of completing cosmere. Simulated 10425 careers

In [15]:

for a,l in zip([cosmere, non_cosmere, cosmere_s, non_cosmere_s, words], 
             ['cosmere', 'non_cosmere', 'cosmere_s', 'non_cosmere_s', 'words']):
    m = np.median(a)
    t = np.percentile(a, [2.5, 97.5])
    w = ((t[1] - m) + (m - t[0])) / 2
    print('{} = {} +/- {}'.format(l, m, w))

cosmere = 23.0 +/- 12.0
non_cosmere = 25.0 +/- 13.5
cosmere_s = 7.0 +/- 6.0
non_cosmere_s = 16.0 +/- 10.0
words = 8982495.593676297 +/- 4290224.834684873

The Code¶

In [1]:

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import pymc3 as pm
import scipy.stats as ss
import theano, math
matplotlib.style.use('seaborn-notebook')

class ModelVars:
    def __init__(self, vars):
        self.__dict__.update(vars)
    
    
class Sanderson:
    
    def __init__(self, cluster_number, data):
        
        assert 'cosmere' in data and 'time' in data and 'words' in data        
        assert len(data['cosmere']) == len(data['words']) and len(data['words']) == len(data['time'])
        
        #transform data
        data['normed_words'] = (data['words'] - np.mean(data['words'])) / np.std(data['words'])
        data['normed_words'] = data['normed_words'][1:]
        data['time_diff'] = data['time'][1:] - data['time'][:-1]
        
        self.data = data
        self.cluster_number = cluster_number
        self.obs_n = len(data['time_diff'])
        self._inf_model = None
        self._gen_model = None
        self._inf_trace = None
    
    def inference(self, seed=0, samples=1500, tune_frac=0.2, njobs=1):
        '''Returns the given number of samples after tuning period from inference'''
        with self.inf_model:
            vs = self._inf_vars
            tune = math.floor(samples * tune_frac)
            step1 = pm.Metropolis([vs.w, vs.rate, vs.c_p])  
            step2 = pm.CategoricalGibbsMetropolis([vs.category])
            step3 = pm.NUTS([vs.mu, vs.tau, vs.alpha, vs.beta])
            trace = pm.sample(samples + tune, tune=tune,random_seed=seed, step=[step1, step2, step3], njobs=njobs)[tune:]
        self._inf_trace = trace
        return trace
    
    def traceplot(self):
        pm.traceplot(self.trace, varnames=['w', 'mu', 'rate', 'alpha', 'beta', 'c_p'], combined=True)
    
    def sample(self, cumtime=None, samples=None, intervals = 5, flatten=True):
        '''Return data samples
            
            Args:
                cumtime: Generate samples until this value is exceeded. If None, samples are in batches of size according to original data                            \
                samples: Number of multiples of original data (or cumtime) to generate
        '''                
                        
        #create samples        
        ppc = None
        
        with self.gen_model:
            while True:
                #generate a batch
                s = pm.sample_ppc(self.trace, samples=samples if cumtime is None else intervals)
                if flatten:
                    s = {k: v.flatten() for k,v in s.items()}
                #append them to dictionary
                if ppc is None:
                    ppc = s
                else:
                    ppc = {k: np.concatenate((v,s[k])) for k,v in ppc.items()}
                #check if we've reached cumulative time
                if(cumtime is None or np.cumsum(ppc['t_obs']) >= cumtime):
                    break
        #trim to as close as possible to cumulative time
        if cumtime is not None:
            i = np.argmax(np.cumsum(ppc['t_obs']) - cumtime > 0)
            ppc =  {k: v[:i] for k,v in ppc.items}
        
        #transform words back to usual
        ppc['x_obs'] *= np.std(self.data['words'])
        ppc['x_obs'] += np.mean(self.data['words'])
        return ppc
    
    def career(self, fxns, stride, age=41, life_mean=67, life_sd=3, samples=1000):
        
        if type(fxns) != list:
            fxns = [fxns]
        if type(stride) != list:
            stride = [stride]
            
        assert len(stride) == len(fxns)
        
        #create samples
        ppc = self.sample(samples=samples)
        #create cumulative sum of time
        cs = np.cumsum(ppc['t_obs'])
        #while we have not exhausted all data points
        last_i = 0
        count = 0
        while cs[-1] > 0:
            life = ss.norm.rvs(loc=life_mean - age, scale=life_sd)
            cs -= life * 365
            i = np.argmax(cs > 0)
            #have we run out of points?        
            if cs[i] <= 0:
                break
            #no            
            for f,s in zip(fxns, stride):
                if count % s == 0:
                    f(ppc['t_obs'][last_i:i], ppc['x_obs'][last_i:i], ppc['c_obs'][last_i:i])
            last_i = i
            count += 1
    
    @property
    def trace(self):
        if self._inf_trace is None:
            raise ValueError('Run inference first')
        return self._inf_trace
    
    @property
    def inf_model(self):
        if self._inf_model is None:
            self._inf_model = pm.Model()
            self._inf_vars = self._build_model(self._inf_model)
            self._symmetry_break(self._inf_model, self._inf_vars)
            self._force_cluster_min(self._inf_model, self._inf_vars)
        return self._inf_model
    
    @property
    def gen_model(self):
        if self._gen_model is None:
            self._gen_model = pm.Model()
            self._gen_vars = self._build_model(self._gen_model)
        return self._gen_model
    
    def _build_model(self, model):
        '''Builds model and returns an object whose properties are defined variables below'''
        with model:
            #Category prior
            w = pm.Dirichlet('w', np.ones(self.cluster_number))
            #Latent Category
            category = pm.Categorical('category', p=w, shape= self.obs_n)
            #Cosmere Prior
            c_p = pm.Dirichlet('c_p', np.ones(self.cluster_number))
            #Cosmere
            c_obs = pm.Bernoulli('c_obs', c_p[category], observed=self.data['cosmere'][1:], shape=self.obs_n)
            #Rate Prior
            #rate = pm.Lognormal('rate', math.log(2.0 / 365),  math.log(1.0 / 365.0), shape=self.cluster_number)
            rate = pm.Uniform('rate', 0.25 / 365, 10 / 365, shape=self.cluster_number)
            #Observed time between books
            t_obs = pm.Exponential('t_obs', rate[category], observed=self.data['time_diff'], shape=self.obs_n)
            #Rate to words coefficient -> word units / time units
            alpha = pm.Normal('alpha', 0, sd=2 / 365, shape=self.cluster_number)    
            #Cosmere identity to words coefficient -> word units
            beta = pm.Normal('beta', 0, sd=1, shape=self.cluster_number)
            #Word cluster center prior
            mu = pm.Normal('mu', 0., 2., shape=self.cluster_number)
            #Word precision prior
            tau = pm.Gamma('tau', 1, 1., shape=self.cluster_number)
            #GLM of words
            mu_t = pm.Deterministic('mu_t',mu[category] + alpha[category] * t_obs + beta[category] * c_obs)
            #Observed words
            x_obs = pm.Normal('x_obs', mu_t, tau=tau[category], observed=self.data['normed_words'], shape = self.obs_n)
        
        return ModelVars(locals())
    def _symmetry_break(self, model, vs):
        '''Force symmetry breaking so chain doesn't swap cluster identities. Only need to break mu, 
            since many other parameters depend on mu'''
        with model:
            #break symmetry in words
            order_mu_potential = pm.Potential('order_mu_potential', 
                                            theano.tensor.switch(
                                                theano.tensor.all(
                                                   [vs.mu[i] < vs.mu[i+1] for i in range(self.cluster_number - 1)]), -np.inf, 0))
    def _force_cluster_min(self, model, vs):
        with model:
            force_wmin_potential = pm.Potential('force_wmin_potential',
                                               theano.tensor.switch(
                                                   theano.tensor.all(
                                                       vs.w < 0.1), -np.inf, 0))