Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, and Jonathan Morgan.
In this tutorial, we'll discuss how to formulate a research question in the machine learning framework; how to transform raw data into something that can be fed into a model; how to build, evaluate, compare, and select models; and how to reasonably and accurately interpret model results. You'll also get hands-on experience using the scikit-learn
package in Python to model the data you're familiar with from previous tutorials.
This tutorial is based on chapter 6 of Big Data and Social Science.
There are a number of terms specific to machine learning that you will find repeatedly in this notebook.

- Training a model: also called fitting or estimating a function, or building a model. These terms are all synonyms and are used interchangeably in the machine learning literature.
- Features: the input variables, also called predictors, covariates, or explanatory variables.
- Overfitting: a model that is tailored too closely to the training data and does not generalize well enough; overly complex models result in poor generalization performance, or applicability of the model to new data.
- Regularization: constraining model complexity to reduce overfitting. For example, you can limit the number of features present in the final model, or require that the weight coefficients applied to the (standardized) features be small.
- Supervised learning: we have an outcome variable we want to predict, or classify data into. Classification, prediction, and regression fall into this category. We call the set of explanatory variables $X$ features, and the outcome variable of interest $Y$ the label.
- Unsupervised learning: we are looking to understand "natural" patterns or groupings in the data - looking to uncover some structure that we do not know about a priori. Clustering is the most common example of unsupervised learning; another example is principal components analysis (PCA).
Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with numpy
, pandas
, and psycopg2
from previous tutorials. Here we'll also be using scikit-learn
to fit models.
%pylab inline
import pandas as pd
import psycopg2
import sklearn
import seaborn as sns
from sklearn.metrics import precision_recall_curve,roc_curve, auc
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
GradientBoostingClassifier,
AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sqlalchemy import create_engine
sns.set_style("white")
sns.set_context("poster", font_scale=1.25, rc={"lines.linewidth":1.25, "lines.markersize":8})
db_name = "appliedda"
hostname = "10.10.2.10"
conn = psycopg2.connect(database=db_name, host = hostname) #database connection
myschema = 'ada_tanf'
The machine learning process is as follows:

1. **Understand the problem and goal.** Projects often start with vague descriptions of a goal - improving health outcomes, increasing graduation rates, understanding the effect of a variable $X$ on an outcome $Y$, etc. It is really important to work with people who understand the domain being studied to dig deeper and define the problem more concretely. What is the analytical formulation of the metric that you are trying to optimize?
2. **Formulate it as a machine learning problem.** Is the goal to build a model that generates a ranked list prioritized by risk, or is it to detect anomalies as new data come in? Knowing what kinds of tasks machine learning can solve will allow you to map the problem you are working on to one or more machine learning settings and give you access to a suite of methods.
3. **Explore and prepare the data.** What data do you need or have access to? What variable will you use to match records for integrating different data sources? What variables exist in the data set? Are they continuous or categorical? What about missing values? Can you use the variables in their original form, or do you need to alter them in some way?
4. **Engineer features.** In machine learning language, predictors (or factors or covariates) are called "features." Creating good features is probably the most important step in the machine learning process. This involves doing transformations, creating interaction terms, or aggregating over data points or over time and space.
5. **Select a method.** There are many methods to choose from. It would be great if there were a single method that always worked best for a specific type of problem. Typically, in machine learning, you take a variety of methods and try them, empirically validating which one is the best approach to your problem.
6. **Deploy the model.** After a period of evaluation and trial, you are ready to put the model into practice. You still have to keep in mind that new data will be coming in, and the model might change over time.
You're probably used to fitting models in physical or social science classes. In those cases, you probably had a hypothesis or theory about the underlying process that gave rise to your data, chose an appropriate model based on prior knowledge and fit it using least squares, and used the resulting parameter or coefficient estimates (or confidence intervals) for inference. This type of modeling is very useful for interpretation.
In machine learning, our primary concern is generalization. This means thinking carefully about a few things.

First, you need to turn your goal into a real objective function. What do you care about? Do you have data on that thing? What action can you take based on your findings? Do you risk introducing any bias based on the way you model something?
Of the TANF recipients whose spell ended in a three-month time period and who were then not on TANF for a full year, which ones will return within the next two years?
This is an example of a binary prediction classification problem.
Note that the time windows are somewhat arbitrary. You could use an outcome window of 5, 3, or 1 years, or 1 day. The outcome window will depend on how often you receive new data, how accurate your predictions are for a given time period, and on what time scale you can act on the model's output.
During the first classes, we explored the data, linked different data sources, and created new variables. A table was put together using similar techniques and written to the class schema. We will now implement our machine learning model on this dataset.
A step-by-step description of how we created the table is provided in the "Data Preparation" notebooks we went over on Friday.
Refer to the 03_2_ML_data_preparation_creating_labels.ipynb notebook for how the labels were created.
Refer to the 03_3_ML_data_preparation_creating_features.ipynb notebook for how the features were created.
Feature engineering is "the process of transforming raw data into features that better represent the underlying problem/data/structure to the predictive models, resulting in improved model accuracy on unseen data" (from Discover Feature Engineering). In text, for example, this might involve deriving traits of the text like word counts, verb counts, or topics to feed into a model rather than simply giving it the raw text. Examples of feature engineering are:
Cleaning data: To run the scikit-learn models we demonstrate in this notebook, your input dataset must have no missing values.
Imputing values to missing or irrelevant data: Once the features are created, always check to make sure the values make sense. You might have some missing values, or impossible values for a given variable (negative values, major outliers). If you have missing values you should think hard about what makes the most sense for your problem; you may want to replace with 0
, the median or mean of your data, or some other value.
Scaling features: Certain models will have issues with features on different scales. For example, an individual's age is typically a number between 0 and 100, while earnings can be a number between 0 and 1,000,000 (or higher). In order to circumvent this problem, we can scale our features to the same range (e.g., [0, 1]).
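As a toy illustration (hypothetical values, not the notebook's data), min-max scaling maps a feature onto $[0, 1]$ via $(x - x_{min}) / (x_{max} - x_{min})$:

```python
import numpy as np

# hypothetical earnings values, just for illustration
earnings = np.array([0.0, 25000.0, 50000.0, 100000.0])

# min-max scaling: (x - min) / (max - min)
scaled = (earnings - earnings.min()) / (earnings.max() - earnings.min())

# the scaled values now span exactly [0, 1]
print(scaled)
```

This is the same transformation that `sklearn.preprocessing.MinMaxScaler` applies later in this notebook.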
We need to munge our dataset into our features (predictors, or independent variables, or $X$ variables) and labels (dependent variables, or $Y$ variables). For ease of reference, in subsequent examples, names of variables that pertain to predictors will start with "X_
", and names of variables that pertain to outcome variables will start with "y_
".
But it's not enough to just build the model; we're going to need a way to know whether or not it generalizes to new data or into the future. Convincing others of the quality of results is often the most challenging part of an analysis. Making repeatable, well-documented work with clear success metrics makes all the difference.
To convince ourselves - and others - that our modeling results will generalize, we need to hold some data back (not using it to train the model), then apply our model to that held-out set and "blindly" predict, comparing the model's predictions to what we actually observed. This is the idea behind cross-validation, and it's the best way we have to estimate how a model will perform on entirely novel data. We call the data used to build the model the training set, and the rest the test set.
In general, we'd like our training set to be as large as possible, to allow our model to be built with as much data as possible. However, you also want to be as confident as possible that your model will generalize to new data. In practice, you'll have to balance these two objectives in a reasonable way.
There are also many ways to split up your data into training and testing sets. Since you're trying to evaluate how your model will perform in practice, it's best to emulate the true use case of your model as closely as possible when you decide how to evaluate it. A good tutorial on cross-validation can be found on the scikit-learn
site.
One simple and commonly used method is *k-fold* cross-validation, which entails splitting up our dataset into k groups, holding out one group while training a model on the rest of the data, evaluating model performance on the held-out "fold," and repeating this process k times (we'll get back to this in the text-analysis tutorial). Another method is temporal validation, which involves building a model using all the data up until a given point in time, and then testing the model on observations that happened after that point.
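The k-fold idea can be sketched with scikit-learn's `KFold` on toy data (the array below is illustrative, not this notebook's tables):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy observations, 2 features

kf = KFold(n_splits=5, shuffle=False)
test_folds = []
for train_idx, test_idx in kf.split(X):
    # each iteration holds out a different fifth of the data as the test fold
    test_folds.append(test_idx)

# every observation appears in exactly one test fold
all_test = np.sort(np.concatenate(test_folds))
```

Temporal validation replaces these random folds with a cut in time: train on everything before the cut, test on everything after.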
Our current problem is situated in time: we are trying to predict an event in the future. If you use information from the future to predict the past, temporal effects will artificially inflate the accuracy of your predictions. Since we cannot use the future to predict the past in real life, it is important to use temporal validation and create our training and test sets accordingly.
Note: it is important to segregate your data based on time when creating features. Otherwise there can be "leakage," where you accidentally use information that you would not have known at the time. This happens often when calculating aggregation features; for instance, it is quite easy to calculate an average using values that go beyond our training set time-span and not realize it.
In the 03_2_data_preparation_creating_labels notebook, we created one row for each individual whose TANF spell ended in the 3 months prior to 1 year before our prediction dates, and labeled them with a 1
if they returned to TANF in the 2 years after our prediction dates.
In the 03_3_data_preparation_creating_features notebook, we created features for each person in our cohort.
For both the training and test sets, let's now combine the labels and features into analytical dataframes.
# For the Training Set:
sql = '''
SELECT a.*,
b.recp_age_beg,
b.recp_age_end,
c.tot_earn_before,
c.tot_earn_during,
c.tot_earn_after,
c.avg_earn_before,
c.avg_earn_during,
c.avg_earn_after,
c.qtr_full_empl_before,
c.qtr_full_empl_during,
c.qtr_full_empl_after,
d.ed_level,
d.martl_status,
e.avg_len_days,
e.district,
e.homeless
FROM {schema}.labels_20080101 AS a
LEFT JOIN {schema}.features_age_20080101 AS b
ON a.recptno = b.recptno
LEFT JOIN {schema}.features_employment_20080101 AS c
ON a.recptno = c.recptno
LEFT JOIN {schema}.features_member_info_20080101 d
ON a.recptno = d.recptno
LEFT JOIN {schema}.features_case_info_20080101 e
ON a.recptno = e.recptno
'''.format(schema=myschema)
df_training = pd.read_sql(sql, conn)
# For the Testing Set:
sql = '''
SELECT a.*,
b.recp_age_beg,
b.recp_age_end,
c.tot_earn_before,
c.tot_earn_during,
c.tot_earn_after,
c.avg_earn_before,
c.avg_earn_during,
c.avg_earn_after,
c.qtr_full_empl_before,
c.qtr_full_empl_during,
c.qtr_full_empl_after,
d.ed_level,
d.martl_status,
e.avg_len_days,
e.district,
e.homeless
FROM {schema}.labels_20090101 AS a
LEFT JOIN {schema}.features_age_20090101 AS b
ON a.recptno = b.recptno
LEFT JOIN {schema}.features_employment_20090101 AS c
ON a.recptno = c.recptno
LEFT JOIN {schema}.features_member_info_20090101 d
ON a.recptno = d.recptno
LEFT JOIN {schema}.features_case_info_20090101 e
ON a.recptno = e.recptno
'''.format(schema=myschema)
df_testing = pd.read_sql(sql, conn)
df_training.describe(include='all', percentiles=[.5, .9, .99])
df_testing.describe(include='all')
Before running any machine learning algorithms, we have to ensure there are no NULL
(or NaN
) values in the data. As you have heard before, never remove observations with missing values without considering the data you are dropping. One easy way to check if there are any missing values with Pandas
is to use the .info()
method, which returns a count of non-null values for each column in your DataFrame.
df_training.info()
df_testing.info()
Let's check how much data we have, and what the split is between positive (1) and negative (0) labels in our training dataset. It's good to know what the "baseline" is in our dataset, to be able to intelligently evaluate our performance.
print('Number of rows: {}'.format(df_training.shape[0]))
df_training['label'].value_counts(normalize=True)
Crosstabs can be useful for summarizing our data and for finding trends and patterns.
pd.crosstab(index=df_training['label'], columns=df_training['qtr_full_empl_before'])
We'll first list all the columns we have in our data frame and then decide which ones to use as predictors, which one as the label/outcome, and which ones to ignore.
print(list(df_training))
print(list(df_testing))
# Let's remove the ID variables, the label, and the dates from our list of features.
sel_features = list(df_training)
sel_features.remove('recptno')
sel_features.remove('start_date')
sel_features.remove('end_date')
sel_features.remove('label')
sel_features.remove('district') # good to revisit, too many and it's unclear what they mean
sel_label = 'label'
Categorical variables need to be converted to a series of binary (0/1) values for scikit-learn to use them - this is also what happens in the background of other packages you may be familiar with (e.g., wrapping a variable in C(<your-categorical-variable>) in a statsmodels regression formula creates a series of binary indicators).
pandas
has a very nice get_dummies
function to make our lives easier.
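A tiny example of what `get_dummies` does (the education values below are made up, not our actual `ed_level` codes):

```python
import pandas as pd

toy = pd.DataFrame({'ed_level': ['HS', 'College', 'HS', 'GED']})

# one binary column per category; each row has exactly one 1
dummies = pd.get_dummies(toy['ed_level'], prefix='ed_level')
print(dummies.columns.tolist())
```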
# this is what pandas.DataFrame[column].dtype gives you for two different columns:
df_training['avg_earn_before'].dtype, df_training['ed_level'].dtype
# find our "sel_features" columns that are strings
feat_to_dummy = [c for c in sel_features if df_training[c].dtype == 'O']
# remove these "feat_to_dummy" columns from our "sel_features" list
for c in feat_to_dummy:
sel_features.remove(c)
# new "sel_features" list
print(sel_features)
# get dummy values:
df_train_dummies = pd.get_dummies(df_training[feat_to_dummy])
df_test_dummies = pd.get_dummies(df_testing[feat_to_dummy])
# check if column list is the same for train and test dummy sets
sorted(df_train_dummies.columns.tolist()) == sorted(df_test_dummies.columns.tolist())
print(df_train_dummies.columns.tolist())
print(df_test_dummies.columns.tolist())
# handle problem when training or testing set has additional categories
# in the categorical columns
# if dummy columns not equal
if not (sorted(df_train_dummies.columns.tolist()) == sorted(df_test_dummies.columns.tolist())):
# if training column list longer than testing list
# assume need to add columns to testing dummy dataframe
if len(df_train_dummies.columns.tolist()) > len(df_test_dummies.columns.tolist()):
print('train longer')
# get columns to add
add_cols = [c for c in df_train_dummies.columns.tolist()
if c not in df_test_dummies.columns.tolist()]
# add missing columns as zeros b/c binary 0 means does not exist
for c in add_cols:
df_test_dummies[c] = 0
# same but when test has categories train doesn't
elif len(df_train_dummies.columns.tolist()) < len(df_test_dummies.columns.tolist()):
print('test longer')
add_cols = [c for c in df_test_dummies.columns.tolist()
if c not in df_train_dummies.columns.tolist()]
# put additional categories into an "other" column since training set doesn't have them
print('to be updated')
# check if additional categories are in variable that has
# an "unknown" option
# in case same length but not equal
else:
print('case not handled. stop and check.')
else:
print('train and test set dummies are equal')
df_test_dummies.info()
df_train_dummies.info()
sorted(df_train_dummies.columns.tolist()) == sorted(df_test_dummies.columns.tolist())
# add dummy columns to full training and testing dataframes
df_training = df_training.merge(df_train_dummies, left_index=True, right_index=True)
df_testing = df_testing.merge(df_test_dummies, left_index=True, right_index=True)
# update the "sel_features" with the dummy columns:
for c in df_train_dummies.columns.tolist():
sel_features.append(c)
# now we can easily access just the columns we want for modeling:
df_training[sel_features].info()
Certain models will have issues with values on different scales. In your analysis cohort, the job count may range from zero to ten while the total wages may range from zero to hundreds of thousands or millions. Traditional regression methods, for example, tend to give features (aka right-hand variables, Xs, etc) with small values larger coefficients than features with large values. On the other hand, some models - like decision trees - are not generally affected by having variables on different scales.
To more easily use any model, we'll scale all of our continuous data to values between 0 and 1.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# With few features it is relatively easy to hand code which columns to scale
# but we can also make our lives a bit easier by doing it programmatically
# get a list of columns with values <0 and/or >1:
cols_to_scale = [c for c in list(df_training[sel_features]) if
df_training[c].min()<0 or df_training[c].max()>1]
print(cols_to_scale)
# add a '*_scl' version of the column for each of our "columns to scale"
# and replace our "sel_features" list
for c in cols_to_scale:
# create a new column name by adding '_scl' to the end of each column to scale
new_column_name = c+'_scl'
# fit MinMaxScaler to training set column
# reshape because scaler built for 2D arrays
scaler.fit(df_training[c].values.reshape(-1, 1))
# update training and testing datasets with new data
df_training[new_column_name] = scaler.transform(df_training[c].values.reshape(-1, 1))
df_testing[new_column_name] = scaler.transform(df_testing[c].values.reshape(-1, 1))
# add new column to our "selection features"
sel_features.append(new_column_name)
# and remove the unscaled column
sel_features.remove(c)
# now our selection features are all scaled between 0-1;
# check the distribution of the training set
df_training[sel_features].describe().T
# get the underlying numpy.array data for use in scikit-learn
X_train = df_training[sel_features].values
y_train = df_training[sel_label].values
X_test = df_testing[sel_features].values
y_test = df_testing[sel_label].values
In this phase, we will run the machine learning model on our training set. The training set's features will be used to predict the labels. Once our model has been built on the training set, we will assess its quality by applying it to the test set and comparing the predicted values to the actual values for each record in the testing data set.
Python's scikit-learn
is a commonly used, well documented Python library for machine learning. This library can help you split your data into training and test sets, fit models and use them to predict results on new data, and evaluate your results.
We will start with the simplest LogisticRegression
model and see how well that does.
You can use any number of metrics to judge your models (see model evaluation), but we'll use accuracy_score()
(ratio of correct predictions to total number of predictions) as our measure.
# Let's fit a model
from sklearn import linear_model
model = linear_model.LogisticRegression(penalty='l1', C=1)
model.fit( X_train, y_train )
print(model)
When we print the model results, we see different parameters we can adjust as we refine the model based on running it against test data (values such as intercept_scaling
, max_iters
, penalty
, and solver
). Example output:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr',
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0)
To adjust these parameters, one would alter the call that creates the LogisticRegression()
model instance, passing it one or more of these parameters with a value other than the default. So, to re-fit the model with max_iter
of 1000, intercept_scaling
of 2, and solver
of "lbfgs" (pulled from thin air as an example), you'd create your model as follows:
model = LogisticRegression( max_iter = 1000, intercept_scaling = 2, solver = "lbfgs" )
The basic way to choose values for, or "tune," these parameters is the same as the way you choose a model: fit the model to your training data with a variety of parameters, and see which perform the best on the test set. An obvious drawback is that you can also overfit to your test set; in this case, you can alter your method of cross-validation.
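One common way to automate this search is scikit-learn's `GridSearchCV`, which cross-validates every parameter combination on the training data. This is a sketch on synthetic data; the parameter grid and variable names are illustrative, not part of this notebook's pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic training data, just for illustration
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 3)
y_toy = (X_toy[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

# try a few values of the regularization strength C,
# scoring each with 3-fold cross-validation on the training set
grid = GridSearchCV(LogisticRegression(solver='lbfgs'),
                    param_grid={'C': [0.01, 0.1, 1, 10]},
                    cv=3)
grid.fit(X_toy, y_toy)

print(grid.best_params_)  # the parameter setting with the best mean CV score
```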
Machine learning models usually do not produce a prediction (0 or 1) directly. Rather, models produce a score between 0 and 1 (that can sometimes be interpreted as a probability), which in effect ranks all of the observations from most likely to least likely to have a label of 1. The 0-1 score is then turned into a 0 or 1 based on a threshold.
If you use the sklearn method .predict()
then the model will select a threshold for you (generally 0.5) - it is almost never a good idea to let the model choose the threshold for you. Instead, you should get the actual score and test different threshold values.
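Concretely, turning scores into 0/1 predictions at a threshold you choose is one line of numpy (the scores below are made up for illustration):

```python
import numpy as np

scores = np.array([0.03, 0.25, 0.60, 0.10])  # hypothetical model scores

threshold = 0.2
# 1 where the score clears the threshold, 0 elsewhere
predictions = (scores >= threshold).astype(int)
print(predictions)
```

Lowering the threshold flags more cases as positive; raising it flags fewer.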
# get the prediction scores
y_scores = model.predict_proba(X_test)[:,1]
Look at the distribution of scores:
sns.distplot(y_scores, kde=False, rug=False)
df_testing['y_score'] = y_scores
# see our selected features and prediction score
df_testing[sel_features + ['y_score']].head()
Tools like sklearn
often have a default threshold of 0.5, but a good threshold is selected based on the data, the model, and the specific problem you are solving. As a trial run, let's set the threshold to 0.2.
# you can make a simple function in one line using "lambda":
calc_threshold = lambda x,y: 0 if x < y else 1
# given the distribution of the scores, what threshold would you set?
selected_threshold = 0.2
# create a list of our predicted outcomes
predicted = np.array( [calc_threshold(score, selected_threshold) for score in y_scores] )
# and our actual, or expected, outcomes
expected = y_test
Once we have converted our scores to 0 or 1 for classification, we create a confusion matrix, which has four cells: true negatives, true positives, false negatives, and false positives. Each data point belongs in one of these cells, because it has both a ground truth and a predicted label. If an example was predicted to be negative and is negative, it's a true negative. If an example was predicted to be positive and is positive, it's a true positive. If an example was predicted to be negative and is positive, it's a false negative. If an example was predicted to be positive and is negative, it's a false positive.
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(expected,predicted)
print(conf_matrix)
The count of true negatives is conf_matrix[0,0]
, false negatives conf_matrix[1,0]
, true positives conf_matrix[1,1]
, and false_positives conf_matrix[0,1]
.
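A toy check of that layout (made-up labels, not our data):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
# rows are ground truth (0 then 1), columns are predictions (0 then 1)
tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
print(tn, fp, fn, tp)  # 2 1 1 2
```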
Accuracy is the ratio of the correct predictions (both positive and negative) to all predictions. $$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$
# generate an accuracy score by comparing expected to predicted.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(expected, predicted)
print( "Accuracy = " + str( accuracy ) )
df_training['label'].value_counts(normalize=True)
What do we think about this accuracy? Good? Bad?
Two metrics that are often more relevant than overall accuracy are precision and recall.
Precision measures the accuracy of the classifier when it predicts an example to be positive. It is the ratio of correctly predicted positive examples to examples predicted to be positive.
$$ Precision = \frac{TP}{TP+FP}$$Recall measures the accuracy of the classifier to find positive examples in the data.
$$ Recall = \frac{TP}{TP+FN} $$By selecting different thresholds we can vary and tune the precision and recall of a given classifier. A conservative classifier (threshold 0.99) will classify a case as 1 only when it is very sure, leading to high precision. On the other end of the spectrum, a low threshold (e.g. 0.01) will lead to higher recall.
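The two formulas can be checked with a quick worked example (hypothetical counts):

```python
# hypothetical confusion-matrix counts
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # 30 / 40 = 0.75
recall = tp / (tp + fn)     # 30 / 50 = 0.60

print(precision, recall)
```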
from sklearn.metrics import precision_score, recall_score
precision = precision_score(expected, predicted)
recall = recall_score(expected, predicted)
print( "Precision = " + str( precision ) )
print( "Recall= " + str(recall))
If we care about our whole precision-recall space, we can optimize for a metric known as the area under the curve (AUC-PR), which is the area under the precision-recall curve. The maximum AUC-PR is 1.
def plot_precision_recall(y_true,y_score):
"""
Plot a precision recall curve
Parameters
----------
y_true: ls
ground truth labels
y_score: ls
score output from model
"""
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true,y_score)
plt.plot(recall_curve, precision_curve)
plt.xlabel('Recall')
plt.ylabel('Precision')
auc_val = auc(recall_curve,precision_curve)
print('AUC-PR: {0:.3f}'.format(auc_val))
plt.show()
plt.clf()
plot_precision_recall(expected, y_scores)
If we only care about a specific part of the precision-recall curve we can focus on more fine-grained metrics. For instance, say there is a special program for those most likely to need assistance within the next year, but that it can only cover 1% of our test set. In that case, we would want to prioritize the 1% who are most likely to need assistance within the next year, and it wouldn't matter too much how accurate we were on the overall data.
Let's say that, out of the approximately 300,000 observations, we can intervene on 1% of them, or the "top" 3000 in a year (where "top" means highest likelihood of needing intervention in the next year). We can then focus on optimizing our precision at 1%.
def plot_precision_recall_n(y_true, y_prob, model_name):
"""
y_true: ls
ls of ground truth labels
y_prob: ls
ls of predic proba from model
model_name: str
str of model name (e.g, LR_123)
"""
from sklearn.metrics import precision_recall_curve
y_score = y_prob
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
precision_curve = precision_curve[:-1]
recall_curve = recall_curve[:-1]
pct_above_per_thresh = []
number_scored = len(y_score)
for value in pr_thresholds:
num_above_thresh = len(y_score[y_score>=value])
pct_above_thresh = num_above_thresh / float(number_scored)
pct_above_per_thresh.append(pct_above_thresh)
pct_above_per_thresh = np.array(pct_above_per_thresh)
plt.clf()
fig, ax1 = plt.subplots()
ax1.plot(pct_above_per_thresh, precision_curve, 'b')
ax1.set_xlabel('percent of population')
ax1.set_ylabel('precision', color='b')
ax1.set_ylim(0,1.05)
ax2 = ax1.twinx()
ax2.plot(pct_above_per_thresh, recall_curve, 'r')
ax2.set_ylabel('recall', color='r')
ax2.set_ylim(0,1.05)
name = model_name
plt.title(name)
plt.show()
plt.clf()
plot_precision_recall_n(expected,y_scores, 'LR')
def precision_at_k(y_true, y_scores,k):
threshold = np.sort(y_scores)[::-1][int(k*len(y_scores))]
y_pred = np.asarray([1 if i > threshold else 0 for i in y_scores ])
return precision_score(y_true, y_pred)
p_at_10 = precision_at_k(expected,y_scores, 0.1)
print('Precision at 10%: {:.3f}'.format(p_at_10))
Now that we have evaluated our model overall, let's look at the coefficients for each feature, along with the coefficients scaled by each feature's standard deviation (which makes them comparable across features).
print("The coefficients for each of the features are ")
list(zip(sel_features, model.coef_[0]))
std_coef = np.std(X_train,0)*model.coef_
list(zip(sel_features, std_coef[0]))
from sklearn.tree import DecisionTreeClassifier
# packages to display a tree in Jupyter notebooks
from io import StringIO  # (sklearn.externals.six is deprecated)
from IPython.display import Image
from sklearn.tree import export_graphviz
import graphviz as gv
import pydotplus
model = DecisionTreeClassifier(max_depth=3, min_samples_split=100)
model.fit( X_train, y_train )
print(model)
# visualize the tree
# object to hold the graphviz data
dot_data = StringIO()
# create the visualization
export_graphviz(model, out_file=dot_data, filled=True,
rounded=True, special_characters=True,
feature_names=df_training[sel_features].columns.values)
# convert to a graph from the data
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# print out the graph to zoom in
# graph.write_pdf('./output/model_eval_tree1.pdf')
# or view it directly in notebook
Image(graph.create_png())
# get the prediction scores
y_scores = model.predict_proba(X_test)[:,1]
fig, ax = plt.subplots(figsize=(10,5))
sns.distplot(y_scores, kde=False, rug=False, ax=ax)
# given the distribution of the scores, what threshold would you set?
selected_threshold = 0.15
# create a list of our predicted outcomes
predicted = np.array( [calc_threshold(score, selected_threshold) for score in y_scores] )
# and our actual, or expected, outcomes
expected = y_test
conf_matrix = confusion_matrix(expected,predicted)
print(conf_matrix)
accuracy = accuracy_score(expected, predicted)
print( "Accuracy = " + str( accuracy ) )
model_precision = precision_score(expected, predicted)
recall = recall_score(expected, predicted)
print( "Precision = " + str( model_precision ) )
print( "Recall= " + str(recall))
plot_precision_recall(expected, y_scores)
plot_precision_recall_n(expected,y_scores, 'DTC')
# Decision Trees have a `.feature_importances_` rather than `.coef_` attribute:
list(zip(sel_features, model.feature_importances_))
It is important to check our model against a reasonable baseline to know how well our model is doing.
Without any context, over 85% accuracy can sound great... but it's not so great when you remember that you could do as well or better by simply predicting the majority class for everyone, which would be a useless model.
A good place to start is checking against a random baseline, assigning every example a label (positive or negative) completely at random.
random_score = [random.uniform(0,1) for i in enumerate(y_test)]
random_p_at_selected = precision_at_k(expected,random_score, selected_threshold)
print(random_p_at_selected)
Another good practice is checking against an "expert" or rule of thumb baseline.
Here, ...
# avg_wage_pct_10 = np.percentile(df_testing['avg_wage'], 10)
# expert_predicted = np.array([1 if (avg_wage<avg_wage_pct_10)
# else 0
# for avg_wage in df_testing['avg_wage'].values])
# expert_precision = precision_score(expected, expert_predicted)
# print(expert_precision)
# confusion_matrix(expected, expert_predicted)
Another good baseline to compare against is the "all label" (label is always 1).
Here, let's compare to a prediction where every TANF leaver returns.
none_predicted = np.array([1 for i in range(df_testing.shape[0])])
none_precision = precision_score(expected, none_predicted)
print(none_precision)
sns.set_style("white")
sns.set_context("poster", font_scale=2.25, rc={"lines.linewidth":2.25, "lines.markersize":8})
fig, ax = plt.subplots(1, figsize=(22,12))
sns.barplot(['Random', 'All Return', 'Our Model'],
# [random_p_at_1, none_precision, expert_precision, max_p_at_k],
[random_p_at_selected, none_precision, model_precision],
# palette=['#6F777D','#6F777D','#6F777D','#800000'])
palette=['#6F777D','#6F777D','#800000'])
sns.despine()
plt.ylim(0,1)
plt.ylabel('precision at {}%'.format(selected_threshold*100));
When working on machine learning projects, it is a good idea to structure your code as a modular pipeline, which contains all of the steps of your analysis, from the original data source to the results that you report, along with documentation. This has many advantages:
- Reproducibility: it allows you, or anyone else, to see what you did, follow the exact same process, and come up with the exact same results. It also means that someone else can follow the steps you took and see what decisions you made, whether that person is a collaborator, a reviewer for a journal, or the agency you are working with.
- Adaptability: if there are updates to the data you used, you can easily substitute the new data and reproduce the process without starting from scratch.
We have only scratched the surface of what we can do with our models. We've tried just two classifiers (logistic regression and a decision tree), and there are plenty more classification algorithms in sklearn. Let's try them!
clfs = {'RF': RandomForestClassifier(n_estimators=1000, n_jobs=-1),
'ET': ExtraTreesClassifier(n_estimators=1000, n_jobs=-1),
'LR': LogisticRegression(penalty='l1', C=1e5),
'SGD':SGDClassifier(loss='log'),
'GB': GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, random_state=17
, n_estimators=10),
'NB': GaussianNB(),
'DT': DecisionTreeClassifier(max_depth=10, min_samples_split=10)
}
sel_clfs = ['RF', 'ET', 'LR', 'SGD', 'GB', 'NB', 'DT']
max_p_at_k = 0
df_results = pd.DataFrame()
for clfNM in sel_clfs:
clf = clfs[clfNM]
clf.fit( X_train, y_train )
print(clf)
y_score = clf.predict_proba(X_test)[:,1]
predicted = np.array(y_score)
expected = np.array(y_test)
plot_precision_recall_n(expected,predicted, clfNM)
p_at_1 = precision_at_k(expected,y_score, 0.01)
p_at_5 = precision_at_k(expected,y_score,0.05)
p_at_10 = precision_at_k(expected,y_score,0.10)
fpr, tpr, thresholds = roc_curve(expected,y_score)
auc_val = auc(fpr,tpr)
df_results = df_results.append([{
'clfNM':clfNM,
'p_at_1':p_at_1,
'p_at_5':p_at_5,
'p_at_10':p_at_10,
'auc':auc_val,
'clf': clf
}])
#feature importances
if hasattr(clf, 'coef_'):
feature_import = dict(
zip(sel_features, clf.coef_.ravel()))
elif hasattr(clf, 'feature_importances_'):
feature_import = dict(
zip(sel_features, clf.feature_importances_))
print("FEATURE IMPORTANCES")
print(feature_import)
if max_p_at_k < p_at_1:
max_p_at_k = p_at_1
print('Precision at 1%: {:.2f}'.format(p_at_1))
# df_results.to_csv('output/modelrun.csv')
Let's compare all models at 1%
random_score = [random.uniform(0,1) for i in enumerate(y_test)]
random_p_at_1 = precision_at_k(expected, random_score, 0.01)
print(random_p_at_1)
sns.set_style("white")
sns.set_context("poster", font_scale=2.25, rc={"lines.linewidth":2.25, "lines.markersize":8})
fig, ax = plt.subplots(1, figsize=(22,12))
sns.barplot(['Random', 'All Return', 'Best Model'],
# [random_p_at_1, none_precision, expert_precision, max_p_at_k],
[random_p_at_1, none_precision, max_p_at_k],
# palette=['#6F777D','#6F777D','#6F777D','#800000'])
palette=['#6F777D','#6F777D','#800000'])
sns.despine()
plt.ylim(0,1)
plt.ylabel('precision at 1%')
We have just scratched the surface with our model. Try the following:
The notebook used to create features and labels is the "Data Preparation" notebook. Take the time to look at it and understand how every metric was created and added to the data table.