This is a reproduction, with a few slight alterations, of "Bayesian Logistic Regression" by J. Benjamin Cook.
Authors: Peadar Coyle and J. Benjamin Cook
How likely am I to make more than $50,000 US dollars?
We also explore model selection techniques: I use DIC and WAIC to select the best model.
The convenience functions are all taken from Jon Sedar's work.
This example also includes some exploration of the features, so it serves as a good example of how exploratory data analysis can guide the model creation and model selection process.
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

from collections import OrderedDict
from time import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
import pymc3 as pm
import theano as thno
import theano.tensor as T

from scipy.optimize import fmin_powell
from scipy import integrate
def run_models(df, upper_order=5):
    '''
    Convenience function:
    Fit a range of pymc3 models of increasing polynomial complexity.
    Suggest limiting the maximum order to 5, since calculation time
    grows rapidly with polynomial order.
    '''
    models, traces = OrderedDict(), OrderedDict()
    for k in range(1, upper_order + 1):
        nm = 'k{}'.format(k)
        fml = create_poly_modelspec(k)
        with pm.Model() as models[nm]:
            print('\nRunning: {}'.format(nm))
            pm.glm.glm(fml, df, family=pm.glm.families.Normal())
            start_MAP = pm.find_MAP(fmin=fmin_powell, disp=False)
            traces[nm] = pm.sample(2000, start=start_MAP, step=pm.NUTS(),
                                   progressbar=True)
    return models, traces
def plot_traces(traces, retain=1000):
    '''
    Convenience function:
    Plot traces with overlaid means and values
    '''
    ax = pm.traceplot(traces[-retain:],
                      figsize=(12, len(traces.varnames) * 1.5),
                      lines={k: v['mean'] for k, v in
                             pm.df_summary(traces[-retain:]).iterrows()})
    for i, mn in enumerate(pm.df_summary(traces[-retain:])['mean']):
        ax[i, 0].annotate('{:.2f}'.format(mn), xy=(mn, 0), xycoords='data',
                          xytext=(5, 10), textcoords='offset points', rotation=90,
                          va='bottom', fontsize='large', color='#AA0022')
def create_poly_modelspec(k=1):
    '''
    Convenience function:
    Create a polynomial modelspec string for patsy
    '''
    return ('income ~ educ + hours + age ' +
            ' '.join(['+ np.power(age,{})'.format(j)
                      for j in range(2, k + 1)])).strip()
The Adult Data Set is commonly used to benchmark machine learning algorithms. The goal is to use demographic features, or variables, to predict whether an individual makes more than $50,000 per year. The data set is almost 20 years old, and therefore not perfect for determining the probability that I will make more than $50K, but it is a nice, simple dataset that can be used to showcase a few benefits of using Bayesian logistic regression over its frequentist counterpart.
My motivation for reproducing this piece of work was to learn how to use odds ratios in Bayesian regression.
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                   header=None,
                   names=['age', 'workclass', 'fnlwgt',
                          'education-categorical', 'educ',
                          'marital-status', 'occupation',
                          'relationship', 'race', 'sex',
                          'capital-gain', 'capital-loss',
                          'hours', 'native-country',
                          'income'])
data
age | workclass | fnlwgt | education-categorical | educ | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours | native-country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |
10 | 37 | Private | 280464 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | Black | Male | 0 | 0 | 80 | United-States | >50K |
11 | 30 | State-gov | 141297 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | India | >50K |
12 | 23 | Private | 122272 | Bachelors | 13 | Never-married | Adm-clerical | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
13 | 32 | Private | 205019 | Assoc-acdm | 12 | Never-married | Sales | Not-in-family | Black | Male | 0 | 0 | 50 | United-States | <=50K |
14 | 40 | Private | 121772 | Assoc-voc | 11 | Married-civ-spouse | Craft-repair | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | ? | >50K |
15 | 34 | Private | 245487 | 7th-8th | 4 | Married-civ-spouse | Transport-moving | Husband | Amer-Indian-Eskimo | Male | 0 | 0 | 45 | Mexico | <=50K |
16 | 25 | Self-emp-not-inc | 176756 | HS-grad | 9 | Never-married | Farming-fishing | Own-child | White | Male | 0 | 0 | 35 | United-States | <=50K |
17 | 32 | Private | 186824 | HS-grad | 9 | Never-married | Machine-op-inspct | Unmarried | White | Male | 0 | 0 | 40 | United-States | <=50K |
18 | 38 | Private | 28887 | 11th | 7 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
19 | 43 | Self-emp-not-inc | 292175 | Masters | 14 | Divorced | Exec-managerial | Unmarried | White | Female | 0 | 0 | 45 | United-States | >50K |
20 | 40 | Private | 193524 | Doctorate | 16 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |
21 | 54 | Private | 302146 | HS-grad | 9 | Separated | Other-service | Unmarried | Black | Female | 0 | 0 | 20 | United-States | <=50K |
22 | 35 | Federal-gov | 76845 | 9th | 5 | Married-civ-spouse | Farming-fishing | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
23 | 43 | Private | 117037 | 11th | 7 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 2042 | 40 | United-States | <=50K |
24 | 59 | Private | 109015 | HS-grad | 9 | Divorced | Tech-support | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
25 | 56 | Local-gov | 216851 | Bachelors | 13 | Married-civ-spouse | Tech-support | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
26 | 19 | Private | 168294 | HS-grad | 9 | Never-married | Craft-repair | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K |
27 | 54 | ? | 180211 | Some-college | 10 | Married-civ-spouse | ? | Husband | Asian-Pac-Islander | Male | 0 | 0 | 60 | South | >50K |
28 | 39 | Private | 367260 | HS-grad | 9 | Divorced | Exec-managerial | Not-in-family | White | Male | 0 | 0 | 80 | United-States | <=50K |
29 | 49 | Private | 193366 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32531 | 30 | ? | 33811 | Bachelors | 13 | Never-married | ? | Not-in-family | Asian-Pac-Islander | Female | 0 | 0 | 99 | United-States | <=50K |
32532 | 34 | Private | 204461 | Doctorate | 16 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |
32533 | 54 | Private | 337992 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | Asian-Pac-Islander | Male | 0 | 0 | 50 | Japan | >50K |
32534 | 37 | Private | 179137 | Some-college | 10 | Divorced | Adm-clerical | Unmarried | White | Female | 0 | 0 | 39 | United-States | <=50K |
32535 | 22 | Private | 325033 | 12th | 8 | Never-married | Protective-serv | Own-child | Black | Male | 0 | 0 | 35 | United-States | <=50K |
32536 | 34 | Private | 160216 | Bachelors | 13 | Never-married | Exec-managerial | Not-in-family | White | Female | 0 | 0 | 55 | United-States | >50K |
32537 | 30 | Private | 345898 | HS-grad | 9 | Never-married | Craft-repair | Not-in-family | Black | Male | 0 | 0 | 46 | United-States | <=50K |
32538 | 38 | Private | 139180 | Bachelors | 13 | Divorced | Prof-specialty | Unmarried | Black | Female | 15020 | 0 | 45 | United-States | >50K |
32539 | 71 | ? | 287372 | Doctorate | 16 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 10 | United-States | >50K |
32540 | 45 | State-gov | 252208 | HS-grad | 9 | Separated | Adm-clerical | Own-child | White | Female | 0 | 0 | 40 | United-States | <=50K |
32541 | 41 | ? | 202822 | HS-grad | 9 | Separated | ? | Not-in-family | Black | Female | 0 | 0 | 32 | United-States | <=50K |
32542 | 72 | ? | 129912 | HS-grad | 9 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 25 | United-States | <=50K |
32543 | 45 | Local-gov | 119199 | Assoc-acdm | 12 | Divorced | Prof-specialty | Unmarried | White | Female | 0 | 0 | 48 | United-States | <=50K |
32544 | 31 | Private | 199655 | Masters | 14 | Divorced | Other-service | Not-in-family | Other | Female | 0 | 0 | 30 | United-States | <=50K |
32545 | 39 | Local-gov | 111499 | Assoc-acdm | 12 | Married-civ-spouse | Adm-clerical | Wife | White | Female | 0 | 0 | 20 | United-States | >50K |
32546 | 37 | Private | 198216 | Assoc-acdm | 12 | Divorced | Tech-support | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |
32547 | 43 | Private | 260761 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | Mexico | <=50K |
32548 | 65 | Self-emp-not-inc | 99359 | Prof-school | 15 | Never-married | Prof-specialty | Not-in-family | White | Male | 1086 | 0 | 60 | United-States | <=50K |
32549 | 43 | State-gov | 255835 | Some-college | 10 | Divorced | Adm-clerical | Other-relative | White | Female | 0 | 0 | 40 | United-States | <=50K |
32550 | 43 | Self-emp-not-inc | 27242 | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
32551 | 32 | Private | 34066 | 10th | 6 | Married-civ-spouse | Handlers-cleaners | Husband | Amer-Indian-Eskimo | Male | 0 | 0 | 40 | United-States | <=50K |
32552 | 43 | Private | 84661 | Assoc-voc | 11 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 45 | United-States | <=50K |
32553 | 32 | Private | 116138 | Masters | 14 | Never-married | Tech-support | Not-in-family | Asian-Pac-Islander | Male | 0 | 0 | 11 | Taiwan | <=50K |
32554 | 53 | Private | 321865 | Masters | 14 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
32555 | 22 | Private | 310152 | Some-college | 10 | Never-married | Protective-serv | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
32561 rows × 15 columns
We need to remove any null entries in income, and we also want to restrict this study to the United States.
data = data[~pd.isnull(data['income'])]
data = data[data['native-country'] == " United-States"]
data
age | workclass | fnlwgt | education-categorical | educ | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours | native-country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |
10 | 37 | Private | 280464 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | Black | Male | 0 | 0 | 80 | United-States | >50K |
12 | 23 | Private | 122272 | Bachelors | 13 | Never-married | Adm-clerical | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
13 | 32 | Private | 205019 | Assoc-acdm | 12 | Never-married | Sales | Not-in-family | Black | Male | 0 | 0 | 50 | United-States | <=50K |
16 | 25 | Self-emp-not-inc | 176756 | HS-grad | 9 | Never-married | Farming-fishing | Own-child | White | Male | 0 | 0 | 35 | United-States | <=50K |
17 | 32 | Private | 186824 | HS-grad | 9 | Never-married | Machine-op-inspct | Unmarried | White | Male | 0 | 0 | 40 | United-States | <=50K |
18 | 38 | Private | 28887 | 11th | 7 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
19 | 43 | Self-emp-not-inc | 292175 | Masters | 14 | Divorced | Exec-managerial | Unmarried | White | Female | 0 | 0 | 45 | United-States | >50K |
20 | 40 | Private | 193524 | Doctorate | 16 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |
21 | 54 | Private | 302146 | HS-grad | 9 | Separated | Other-service | Unmarried | Black | Female | 0 | 0 | 20 | United-States | <=50K |
22 | 35 | Federal-gov | 76845 | 9th | 5 | Married-civ-spouse | Farming-fishing | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
23 | 43 | Private | 117037 | 11th | 7 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 2042 | 40 | United-States | <=50K |
24 | 59 | Private | 109015 | HS-grad | 9 | Divorced | Tech-support | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
25 | 56 | Local-gov | 216851 | Bachelors | 13 | Married-civ-spouse | Tech-support | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
26 | 19 | Private | 168294 | HS-grad | 9 | Never-married | Craft-repair | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K |
28 | 39 | Private | 367260 | HS-grad | 9 | Divorced | Exec-managerial | Not-in-family | White | Male | 0 | 0 | 80 | United-States | <=50K |
29 | 49 | Private | 193366 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
30 | 23 | Local-gov | 190709 | Assoc-acdm | 12 | Never-married | Protective-serv | Not-in-family | White | Male | 0 | 0 | 52 | United-States | <=50K |
31 | 20 | Private | 266015 | Some-college | 10 | Never-married | Sales | Own-child | Black | Male | 0 | 0 | 44 | United-States | <=50K |
32 | 45 | Private | 386940 | Bachelors | 13 | Divorced | Exec-managerial | Own-child | White | Male | 0 | 1408 | 40 | United-States | <=50K |
33 | 30 | Federal-gov | 59951 | Some-college | 10 | Married-civ-spouse | Adm-clerical | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K |
34 | 22 | State-gov | 311512 | Some-college | 10 | Married-civ-spouse | Other-service | Husband | Black | Male | 0 | 0 | 15 | United-States | <=50K |
36 | 21 | Private | 197200 | Some-college | 10 | Never-married | Machine-op-inspct | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32528 | 31 | Private | 292592 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
32529 | 29 | Private | 125976 | HS-grad | 9 | Separated | Sales | Unmarried | White | Female | 0 | 0 | 35 | United-States | <=50K |
32530 | 35 | ? | 320084 | Bachelors | 13 | Married-civ-spouse | ? | Wife | White | Female | 0 | 0 | 55 | United-States | >50K |
32531 | 30 | ? | 33811 | Bachelors | 13 | Never-married | ? | Not-in-family | Asian-Pac-Islander | Female | 0 | 0 | 99 | United-States | <=50K |
32532 | 34 | Private | 204461 | Doctorate | 16 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |
32534 | 37 | Private | 179137 | Some-college | 10 | Divorced | Adm-clerical | Unmarried | White | Female | 0 | 0 | 39 | United-States | <=50K |
32535 | 22 | Private | 325033 | 12th | 8 | Never-married | Protective-serv | Own-child | Black | Male | 0 | 0 | 35 | United-States | <=50K |
32536 | 34 | Private | 160216 | Bachelors | 13 | Never-married | Exec-managerial | Not-in-family | White | Female | 0 | 0 | 55 | United-States | >50K |
32537 | 30 | Private | 345898 | HS-grad | 9 | Never-married | Craft-repair | Not-in-family | Black | Male | 0 | 0 | 46 | United-States | <=50K |
32538 | 38 | Private | 139180 | Bachelors | 13 | Divorced | Prof-specialty | Unmarried | Black | Female | 15020 | 0 | 45 | United-States | >50K |
32539 | 71 | ? | 287372 | Doctorate | 16 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 10 | United-States | >50K |
32540 | 45 | State-gov | 252208 | HS-grad | 9 | Separated | Adm-clerical | Own-child | White | Female | 0 | 0 | 40 | United-States | <=50K |
32541 | 41 | ? | 202822 | HS-grad | 9 | Separated | ? | Not-in-family | Black | Female | 0 | 0 | 32 | United-States | <=50K |
32542 | 72 | ? | 129912 | HS-grad | 9 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 25 | United-States | <=50K |
32543 | 45 | Local-gov | 119199 | Assoc-acdm | 12 | Divorced | Prof-specialty | Unmarried | White | Female | 0 | 0 | 48 | United-States | <=50K |
32544 | 31 | Private | 199655 | Masters | 14 | Divorced | Other-service | Not-in-family | Other | Female | 0 | 0 | 30 | United-States | <=50K |
32545 | 39 | Local-gov | 111499 | Assoc-acdm | 12 | Married-civ-spouse | Adm-clerical | Wife | White | Female | 0 | 0 | 20 | United-States | >50K |
32546 | 37 | Private | 198216 | Assoc-acdm | 12 | Divorced | Tech-support | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |
32548 | 65 | Self-emp-not-inc | 99359 | Prof-school | 15 | Never-married | Prof-specialty | Not-in-family | White | Male | 1086 | 0 | 60 | United-States | <=50K |
32549 | 43 | State-gov | 255835 | Some-college | 10 | Divorced | Adm-clerical | Other-relative | White | Female | 0 | 0 | 40 | United-States | <=50K |
32550 | 43 | Self-emp-not-inc | 27242 | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
32551 | 32 | Private | 34066 | 10th | 6 | Married-civ-spouse | Handlers-cleaners | Husband | Amer-Indian-Eskimo | Male | 0 | 0 | 40 | United-States | <=50K |
32552 | 43 | Private | 84661 | Assoc-voc | 11 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 45 | United-States | <=50K |
32554 | 53 | Private | 321865 | Masters | 14 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
32555 | 22 | Private | 310152 | Some-college | 10 | Never-married | Protective-serv | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
29170 rows × 15 columns
income = 1 * (data['income'] == " >50K")
age2 = np.square(data['age'])
data = data[['age', 'educ', 'hours']].copy()
data['age2'] = age2
data['income'] = income
income.value_counts()
0    24720
1     7841
Name: income, dtype: int64
Let us get a feel for the parameters.
g = seaborn.pairplot(data)
# Compute the correlation matrix
corr = data.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = seaborn.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
seaborn.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
                linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
We see that there are not many strong correlations here; the highest is about 0.30 according to this plot. We see a weak correlation between hours and income (which is logical), and a slightly stronger correlation between education and income (which is the kind of question we are answering).
We will use a simple model, which assumes that the probability of making more than $50K is a function of age, years of education and hours worked per week. We will use PyMC3 to do the inference.
In Bayesian statistics, we treat everything as a random variable, and we want to know the posterior probability distribution of the parameters (in this case, the regression coefficients). The posterior is equal to the likelihood times the prior, normalized by the marginal likelihood: $$p(\theta | D) = \frac{p(D|\theta)p(\theta)}{p(D)}$$
Because the denominator is a notoriously difficult integral, $p(D) = \int p(D | \theta) p(\theta) d \theta$, we would prefer to skip computing it. Fortunately, if we draw samples from the parameter space, with probability proportional to the height of the posterior density at any given point, we end up with an empirical distribution that converges to the posterior as the number of samples approaches infinity.
What this means in practice is that we only need to worry about the numerator.
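To make this concrete, here is a toy random-walk Metropolis sampler for a one-dimensional example. This is purely an illustrative sketch with made-up densities, not what PyMC3 uses (NUTS is far more efficient), but it shows that only the unnormalized numerator is ever evaluated:

```python
import numpy as np

rng = np.random.RandomState(0)

def log_post_unnorm(theta):
    # Toy 1-D posterior: N(0, 1) prior times a N(2, 1) "likelihood".
    # The normalizing constant p(D) never appears.
    return -0.5 * theta**2 - 0.5 * (theta - 2.0)**2

# Random-walk Metropolis: accept a proposal with probability
# min(1, ratio of unnormalized posterior heights).
theta, samples = 0.0, []
for _ in range(20000):
    proposal = theta + rng.normal(scale=1.0)
    if np.log(rng.uniform()) < log_post_unnorm(proposal) - log_post_unnorm(theta):
        theta = proposal
    samples.append(theta)

# The product of N(0, 1) and N(2, 1) densities is proportional to N(1, 0.5),
# so the sample mean (after burn-in) should be close to 1.0.
print(np.mean(samples[5000:]))
```

The empirical mean of the retained draws approximates the true posterior mean even though we never computed $p(D)$.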
Getting back to logistic regression, we need to specify a prior and a likelihood in order to draw samples from the posterior. We could use sociological knowledge about the effects of age and education on income, but instead, let's use the default prior specification for GLM coefficients that PyMC3 gives us, which is $p(\theta) = N(0, 10^{12}I)$. This is a very vague prior that will let the data speak for themselves.
The likelihood is the product of $n$ Bernoulli trials, $\prod^{n}_{i=1} p_{i}^{y_{i}} (1 - p_{i})^{1-y_{i}}$, where $p_i = \frac{1}{1 + e^{-z_i}}$,
$z_{i} = \beta_{0} + \beta_{1}(age)_{i} + \beta_2(age)^{2}_{i} + \beta_{3}(educ)_{i} + \beta_{4}(hours)_{i}$, and $y_{i} = 1$ if income is greater than 50K and $y_{i} = 0$ otherwise.
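To make the likelihood concrete, the same expression can be written directly in NumPy. This is an illustrative sketch with made-up data; PyMC3 builds the equivalent expression for us from the model formula:

```python
import numpy as np

def log_likelihood(beta, age, educ, hours, y):
    """Bernoulli log-likelihood for the logistic model above, with
    z_i = beta0 + beta1*age_i + beta2*age_i^2 + beta3*educ_i + beta4*hours_i."""
    z = beta[0] + beta[1] * age + beta[2] * age**2 + beta[3] * educ + beta[4] * hours
    p = 1.0 / (1.0 + np.exp(-z))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Two made-up observations; with all coefficients zero, p_i = 0.5 for both,
# so the log-likelihood is 2 * log(0.5).
age = np.array([30.0, 50.0])
educ = np.array([13.0, 16.0])
hours = np.array([40.0, 50.0])
y = np.array([0, 1])
print(log_likelihood(np.zeros(5), age, educ, hours, y))  # 2 * log(0.5) ≈ -1.386
```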
With the math out of the way, we can get back to the data. Here I use PyMC3 to draw samples from the posterior. The sampling algorithm used is NUTS, a form of Hamiltonian Monte Carlo in which the parameters are tuned automatically. Notice that we get to borrow the R syntax for specifying GLMs, which is very convenient! I use a convenience function from above to plot the trace information for the last 1000 samples.
with pm.Model() as logistic_model:
    pm.glm.glm('income ~ age + age2 + educ + hours', data,
               family=pm.glm.families.Binomial())
    trace_logistic_model = pm.sample(2000, progressbar=True)
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
plot_traces(trace_logistic_model, retain=1000)
One of the major benefits that makes Bayesian data analysis worth the extra computational effort in many circumstances is that we can be explicit about our uncertainty. Maximum likelihood returns a number, but how certain can we be that we found the right number? Instead, Bayesian inference returns a distribution over parameter values.
I'll use seaborn to look at the distribution of some of these factors.
plt.figure(figsize=(9,7))
trace = trace_logistic_model[1000:]
seaborn.jointplot(trace['age'], trace['educ'], kind="hex", color="#4CB391")
plt.xlabel("beta_age")
plt.ylabel("beta_educ")
plt.show()
So how do age and education affect the probability of making more than $50K? To answer this question, we can show how the probability of making more than $50K changes with age for a few different education levels. Here, we assume that the number of hours worked per week is fixed at 50. PyMC3 gives us a convenient way to plot the posterior predictive distribution. We need to give the function a linear model and a set of points to evaluate. We will pass in three different linear models: one with educ == 12 (finished high school), one with educ == 16 (finished undergrad) and one with educ == 19 (three years of grad school).
# Linear model with hours == 50 and educ == 12
lm = lambda x, samples: 1 / (1 + np.exp(-(samples['Intercept'] +
                                          samples['age'] * x +
                                          samples['age2'] * np.square(x) +
                                          samples['educ'] * 12 +
                                          samples['hours'] * 50)))

# Linear model with hours == 50 and educ == 16
lm2 = lambda x, samples: 1 / (1 + np.exp(-(samples['Intercept'] +
                                           samples['age'] * x +
                                           samples['age2'] * np.square(x) +
                                           samples['educ'] * 16 +
                                           samples['hours'] * 50)))

# Linear model with hours == 50 and educ == 19
lm3 = lambda x, samples: 1 / (1 + np.exp(-(samples['Intercept'] +
                                           samples['age'] * x +
                                           samples['age2'] * np.square(x) +
                                           samples['educ'] * 19 +
                                           samples['hours'] * 50)))
Each curve shows how the probability of earning more than $50K changes with age. The red curve represents 19 years of education, the green curve 16 years, and the blue curve 12 years. For all three education levels, the probability of making more than $50K increases with age until approximately age 60, when the probability begins to drop off. Notice that each curve is a little blurry. This is because we are actually plotting 100 different curves for each level of education. Each curve is a draw from our posterior distribution. Because the curves are somewhat translucent, we can interpret dark, narrow portions of a curve as places where we have low uncertainty and light, spread-out portions as places where we have somewhat higher uncertainty about our coefficient values.
# Plot the posterior predictive distributions of P(income > $50K) vs. age
pm.glm.plot_posterior_predictive(trace, eval=np.linspace(25, 75, 1000), lm=lm, samples=100, color="blue", alpha=.15)
pm.glm.plot_posterior_predictive(trace, eval=np.linspace(25, 75, 1000), lm=lm2, samples=100, color="green", alpha=.15)
pm.glm.plot_posterior_predictive(trace, eval=np.linspace(25, 75, 1000), lm=lm3, samples=100, color="red", alpha=.15)
import matplotlib.lines as mlines
blue_line = mlines.Line2D([], [], color='b', label='High School Education')
green_line = mlines.Line2D([], [], color='g', label='Bachelors')
red_line = mlines.Line2D([], [], color='r', label='Grad School')
plt.legend(handles=[blue_line, green_line, red_line], loc='lower right')
plt.ylabel("P(Income > $50K)")
plt.xlabel("Age")
plt.show()
b = trace['educ']
plt.hist(np.exp(b), bins=20, normed=True)
plt.xlabel("Odds Ratio")
plt.show()
Finally, we can find a credible interval (remember kids: credible intervals are Bayesian and confidence intervals are frequentist) for this quantity. This may be the best part about Bayesian statistics: we get to interpret credible intervals the way we've always wanted to interpret them. We are 95% confident that the odds ratio lies within our interval!
lb, ub = np.percentile(b, 2.5), np.percentile(b, 97.5)
# The factor of 3 gives the odds ratio for three additional years of education
print("P(%.3f < O.R. < %.3f) = 0.95" % (np.exp(3 * lb), np.exp(3 * ub)))
P(1.000 < O.R. < 1.000) = 0.95
The Deviance Information Criterion (DIC) is a fairly unsophisticated method for comparing the deviance of the likelihood across the sample traces of a model run. However, this simplicity apparently yields quite good results in a variety of cases. We'll run the model with a few changes to see what effect higher-order terms have on this model.
One question that was immediately asked was what effect age has on the model, and why it should enter as age^2 rather than age alone. We'll use the DIC to answer this question.
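For reference, DIC is the posterior mean deviance plus a penalty $p_D$ for the effective number of parameters, where $p_D$ is the mean deviance minus the deviance at the posterior mean. Below is a minimal sketch of that computation on made-up data; the `dic` function here is illustrative and not the internals of `pm.stats.dic`:

```python
import numpy as np

def dic(loglik, draws):
    """DIC = mean deviance + p_D, where
    p_D = mean deviance - deviance at the posterior mean.

    loglik: maps one parameter vector to a log-likelihood value
    draws:  posterior draws, shape (n_samples, n_params)
    """
    deviances = np.array([-2.0 * loglik(d) for d in draws])
    mean_dev = deviances.mean()
    dev_at_mean = -2.0 * loglik(draws.mean(axis=0))
    return mean_dev + (mean_dev - dev_at_mean)

# Toy check: normal likelihood with known sigma = 1 and unknown mean.
y = np.array([0.1, -0.2, 0.3])
loglik = lambda mu: -0.5 * np.sum((y - mu)**2)  # log-likelihood up to a constant
draws = np.random.RandomState(1).normal(y.mean(), 0.5, size=(4000, 1))
print(dic(loglik, draws))  # lower DIC indicates a better trade-off of fit vs. complexity
```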
models_lin, traces_lin = run_models(data, 4)
Running: k1
Running: k2
(Sampling was interrupted during this run; the resulting KeyboardInterrupt traceback from the partial run is omitted.)
dfdic = pd.DataFrame(index=['k1','k2','k3','k4'], columns=['lin'])
dfdic.index.name = 'model'
for nm in dfdic.index:
dfdic.loc[nm, 'lin'] = pm.stats.dic(traces_lin[nm],models_lin[nm])
dfdic = pd.melt(dfdic.reset_index(), id_vars=['model'], var_name='poly', value_name='dic')
g = seaborn.factorplot(x='model', y='dic', col='poly', hue='poly', data=dfdic, kind='bar', size=6)
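For intuition about what `pm.stats.dic` is reporting: DIC penalizes the posterior mean deviance by the effective number of parameters, pD. Below is a minimal by-hand sketch on a toy Normal(mu, 1) model with synthetic stand-in posterior draws (this is an illustration of the formula, not the pymc3 internals):

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(1.0, 1.0, size=50)  # toy observed data
# stand-in posterior draws for the Normal mean (variance assumed known = 1)
mu_post = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=2000)

def log_lik(mu):
    """Total Normal(mu, 1) log-likelihood of the data."""
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2)

deviance = np.array([-2.0 * log_lik(m) for m in mu_post])
d_bar = deviance.mean()                  # posterior mean deviance
d_hat = -2.0 * log_lik(mu_post.mean())   # deviance at the posterior mean
p_d = d_bar - d_hat                      # effective number of parameters (pD)
dic = d_bar + p_d                        # DIC = mean deviance + penalty
```

With one free parameter, pD comes out close to 1, which is the sanity check that makes the penalty interpretable: lower DIC means a better fit after accounting for effective complexity.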
There isn't much difference between these models in terms of DIC, so the choice made in the model above is fine; there is little to be gained by going up to age^3, for example. Next we look at WAIC, another model selection technique.
dfdic = pd.DataFrame(index=['k1','k2','k3','k4'], columns=['lin'])
dfdic.index.name = 'model'
for nm in dfdic.index:
dfdic.loc[nm, 'lin'] = pm.stats.waic(traces_lin[nm],models_lin[nm])
dfdic = pd.melt(dfdic.reset_index(), id_vars=['model'], var_name='poly', value_name='waic')
g = seaborn.factorplot(x='model', y='waic', col='poly', hue='poly', data=dfdic, kind='bar', size=6)
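WAIC works from the matrix of pointwise log-likelihoods across posterior draws rather than a plug-in deviance. As a rough sketch on the same toy Normal model with synthetic posterior draws (again an illustration of the formula, not what `pm.stats.waic` does internally):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=50)  # toy observed data
mu_post = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=2000)

# pointwise log-likelihood matrix: posterior draws x observations
log_lik = (-0.5 * np.log(2 * np.pi)
           - 0.5 * (y[None, :] - mu_post[:, None]) ** 2)

# log pointwise predictive density: stable log-mean-exp over draws
lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(log_lik.shape[0]))

# penalty: sum over observations of the variance of the log-likelihood
p_waic = np.sum(np.var(log_lik, axis=0))

waic = -2.0 * (lppd - p_waic)  # on the deviance scale; lower is better
```

Like pD above, p_waic estimates effective model complexity and should be near 1 here; WAIC is generally preferred to DIC because it averages over the posterior pointwise instead of relying on a single plug-in estimate.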
The WAIC confirms our decision to use age^2.