#!/usr/bin/env python
# coding: utf-8

# ## KQ3: Pritts Meta-analysis Update
# 
# In key question 3, we address the risk of sarcoma dissemination following morcellation of fibroids. We identified a recently published analysis conducted by Elizabeth Pritts and her colleagues (2015) to estimate the  prevalence of occult leiomyosarcoma at time of treatment for presumed benign tumors (fibroids).  We updated their search and used similar eligibility criteria to identify papers published since 2014. We extracted from these papers the number of women who were treated for uterine fibroids and the cases of leiomyosarcoma subsequently identified.  We have combined these data with the data from the 134 publications that Pritts et al included in their analysis for a total of LMS rates from 148 sources.

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import matplotlib.pyplot as plt
import pdb
sns.set()


# In[2]:


kq3_data = pd.read_excel('data/UF CER KQ3 Data for Analysis.xlsx', sheetname='5.UFKQ3Data', 
                         na_values=['NR', 'NA'],
                        index_col=0)[['Author', 'Year', 'Design', 'Procedure', 'Indication',
                                     'Age, Mean', 'Age, SD', 'Age, Median', 'Age, Min', 'Age, Max', 'Age, Other',
                                     'LMS','Population','Tumors','InPritts']].dropna(thresh=7)
kq3_data.head()


# Missing values

# In[3]:


kq3_data.isnull().sum()


# In[4]:


kq3_data['age_max'] = kq3_data['Age, Max'].replace('50+', 50)
kq3_data = kq3_data.rename(columns={'Age, Min': 'age_min',
                                   'Age, Mean': 'age_mean',
                                   'Age, SD': 'age_sd',
                                   'Age, Median': 'age_med'})


# In[5]:


kq3_data[kq3_data.age_max.isnull()]


# In[6]:


kq3_data.describe()


# In[7]:


kq3_data[[c for c in kq3_data.columns if c.startswith('age')]].hist();


# Breakdown on studies by design.

# In[8]:


kq3_data.Design.value_counts()


# I will drop the cohort study, since it is the only such representative.

# In[9]:


kq3_data = kq3_data[kq3_data.Design!='Pop based cohort']


# In[10]:


kq3_data = pd.concat([kq3_data, pd.get_dummies(kq3_data.Design)[['Prospective', 'RCT']]], axis=1)


# Following [Pritts et al. (2015)](http://www.ncbi.nlm.nih.gov/pubmed/26283890), I fit a binomial random effects model, such that event probabilities on the logit scale are normally distributed with mean $\mu$ and standard deviation $\sigma$. This distribution describes how the probabilities vary across studies, with the degree of variation described by $\sigma$.
# 
# $$\theta_i \sim N(\mu, \sigma^2)$$
# 
# the expected value for study $i$ is then inverse-logit transformed, and used as the event probability $\pi_i$ in a binomial model describing the number of observed tumors $t$:
# 
# $$\log\left[\frac{\pi_i}{1-\pi_i}\right] = \theta_i$$
# 
# $$t_i \sim \text{Bin}(n_i, \pi_i)$$

# In[18]:


import theano.tensor as tt
from numpy.ma import masked_values

model_data = kq3_data

k = model_data.shape[0]
tumors = model_data.Tumors.values.astype(int)
n = model_data.Population.values.astype(int)
X = model_data[['Prospective', 'RCT']].values.astype(int)
age_max_norm = ((model_data.age_max - 60)/10).fillna(0.5).values

poly_terms = 3

def invlogit(x):
    return tt.exp(x) / (1 + tt.exp(x))


with pm.Model() as pritts_update:
    
    # Impute missing max ages
    age_max_missing = masked_values(age_max_norm, value=0.5)
    ν = pm.HalfCauchy('ν', 5, testval=1)
    μ_age = pm.Normal('μ_age', 0, 5, testval=0)
    age_max = pm.StudentT('age_max', ν, mu=μ_age, observed=age_max_missing)
    
    # Study random effect
    μ = pm.Normal('μ', 0, sd=100, testval=-3)
    σ = pm.Uniform('σ', 0, 1000, testval=10)
    θ = pm.Normal('θ', μ, sd=σ, shape=k)
    # Design effects
    β = pm.Normal('β', 0, sd=10, shape=2, testval=np.zeros(2))
    # Polynomial age effect with Lasso
    α = pm.Normal('α', 0, sd=10, testval=0)
    
    # Study-specific probabilities
    π = pm.Deterministic('π', invlogit(θ + tt.dot(X,β) + α*age_max))
    
    # Expected probabilities by design
    p_retro_1000 = pm.Deterministic('p_retro_1000', invlogit(μ)*1000)
    p_prosp_1000 = pm.Deterministic('p_prosp_1000', invlogit(μ + β[0])*1000)
    p_rct_1000 = pm.Deterministic('p_rct_1000', invlogit(μ + β[1])*1000)
    
    obs = pm.Binomial('obs', n=n, p=π, observed=tumors)


# In[19]:


with pritts_update:
    step = pm.NUTS()
    trace = pm.sample(2000, step=step, njobs=2, random_seed=[20140425, 19700903])


# The following plots are the distribution of samples from the posterior distributions for the expected (population) probability of tumor (`p_update`), the inverse-logit expected probability ($\mu$) and the standard deviation of the probabilities on the inverse-logit scale ($\sigma$).

# In[20]:


pm.traceplot(trace[1000:], varnames=['p_retro_1000', 'p_prosp_1000', 'p_rct_1000', 'α', 'μ', 'σ']);


# Summary statistics from each of the above parameters.

# In[19]:


pm.summary(trace[1000:], varnames=['p_retro_1000', 'p_prosp_1000', 'p_rct_1000', 'μ', 'σ'], roundto=4)


# By comparison, estimates from Pritts' supplement:
# 
# ![pritts estimates](pritts_table.png)

# In[58]:


p_vars = ['p_retro_1000', 'p_prosp_1000', 'p_rct_1000']
intervals = [pm.stats.hpd(trace[var]) for var in p_vars]


# In[66]:


meds = [np.median(trace[var]) for var in p_vars]
upper = [i[1] - m for m,i in zip(meds, intervals)]
lower = [m - i[0] for m,i in zip(meds, intervals)]

plt.errorbar(np.arange(3)-0.1, meds, yerr=[lower, upper], fmt='o', label='Updated MA')
plt.errorbar(np.arange(2)+0.1, [0.57, 0.12], yerr=[[0.57-0.17, 0.12], [1.13 - 0.57,0.75 - 0.12]], color='r', fmt='o', label='Pritts MA')
plt.xticks(range(3), ['Retrospective', 'Prospective', 'RCT'])
plt.ylabel('Rate per 1000 surgeries')
plt.xlim(-0.5, 2.25)
plt.legend();


# In[21]:


pm.forestplot(trace[1000:], varnames=['age_max_missing'], ylabels=[''])


# ### Goodness of fit
# 
# To check how well the model fits the data, I conducted posteior predictive checks. This simulates data from the model for each data point (*i.e.* study) in the meta-analysis, and compares the distribution of these simulated values (here, 500 replicates) with the value of the data itself. If the percentile of the datum is very extreme (either large or small), then it is evidence that the model does not adequately fit the data. The distribution of percentiles in the bottom histogram shows that there is no evidence of lack of fit.

# In[22]:


ppc = pm.sample_ppc(trace, model=pritts_update, samples=500)


# In[23]:


from scipy.stats import percentileofscore
        
p = [percentileofscore(s, o).round(2) for s,o in zip(ppc['obs'].T, tumors)]
        
plt.hist(p)
plt.xlabel('percentile')
plt.ylabel('frequency');