Created by Data School. Watch all 10 videos on YouTube. Download the notebooks from GitHub.
Note: This notebook uses Python 3.9.1 and scikit-learn 0.23.2. The original notebook (shown in the video) used Python 2.7 and scikit-learn 0.16.
Pima Indians Diabetes dataset originally from the UCI Machine Learning Repository
# added empty cell so that the cell numbering matches the video
# read the data into a pandas DataFrame
import pandas as pd
path = 'data/pima-indians-diabetes.data'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(path, header=None, names=col_names)
# print the first 5 rows of data
pima.head()
| | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
Question: Can we predict the diabetes status of a patient given their health measurements?
# define X and y
feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
X = pima[feature_cols]
y = pima.label
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)
LogisticRegression(solver='liblinear')
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
Classification accuracy: percentage of correct predictions
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
0.6927083333333334
Null accuracy: accuracy that could be achieved by always predicting the most frequent class
# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()
0    130
1     62
Name: label, dtype: int64
# calculate the percentage of ones
y_test.mean()
0.3229166666666667
# calculate the percentage of zeros
1 - y_test.mean()
0.6770833333333333
# calculate null accuracy (for binary classification problems coded as 0/1)
max(y_test.mean(), 1 - y_test.mean())
0.6770833333333333
# calculate null accuracy (for multi-class classification problems)
y_test.value_counts().head(1) / len(y_test)
0    0.677083
Name: label, dtype: float64
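As a quick sanity check (an aside, not in the original video), scikit-learn's DummyClassifier with strategy='most_frequent' always predicts the majority class, so its accuracy on the testing set should match the null accuracy computed above:
# sanity check: a classifier that always predicts the most frequent class
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test))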
Comparing the true and predicted response values
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])
True: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0]
Pred: [0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
Conclusion:
- Classification accuracy is the easiest classification metric to understand
- But, it does not tell you the underlying distribution of response values
- And, it does not tell you what "types" of errors your classifier is making
Table that describes the performance of a classification model
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred_class))
[[118  12]
 [ 47  15]]
Basic terminology:
- True Positives (TP): we correctly predicted that they do have diabetes
- True Negatives (TN): we correctly predicted that they don't have diabetes
- False Positives (FP): we incorrectly predicted that they do have diabetes (a "Type I error")
- False Negatives (FN): we incorrectly predicted that they don't have diabetes (a "Type II error")
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])
True: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0]
Pred: [0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
Classification Accuracy: Overall, how often is the classifier correct?
print((TP + TN) / (TP + TN + FP + FN))
print(metrics.accuracy_score(y_test, y_pred_class))
0.6927083333333334
0.6927083333333334
Classification Error: Overall, how often is the classifier incorrect?
print((FP + FN) / (TP + TN + FP + FN))
print(1 - metrics.accuracy_score(y_test, y_pred_class))
0.3072916666666667
0.30729166666666663
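An equivalent shortcut (an aside): scikit-learn also exposes the misclassification rate directly as zero_one_loss:
# zero_one_loss returns the fraction of misclassified samples
print(metrics.zero_one_loss(y_test, y_pred_class))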
Sensitivity: When the actual value is positive, how often is the prediction correct?
print(TP / (TP + FN))
print(metrics.recall_score(y_test, y_pred_class))
0.24193548387096775
0.24193548387096775
Specificity: When the actual value is negative, how often is the prediction correct?
print(TN / (TN + FP))
0.9076923076923077
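There is no dedicated specificity function in scikit-learn, but since specificity is just the recall of the negative class, recall_score with pos_label=0 should give the same value (an aside, not in the video):
# specificity is the recall of the negative class (label 0)
print(metrics.recall_score(y_test, y_pred_class, pos_label=0))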
False Positive Rate: When the actual value is negative, how often is the prediction incorrect?
print(FP / (TN + FP))
0.09230769230769231
Precision: When a positive value is predicted, how often is the prediction correct?
print(TP / (TP + FP))
print(metrics.precision_score(y_test, y_pred_class))
0.5555555555555556
0.5555555555555556
Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc.
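As a brief illustration (an aside, not in the video), both of those are available in sklearn.metrics:
# F1 score: harmonic mean of precision and recall
print(metrics.f1_score(y_test, y_pred_class))
# Matthews correlation coefficient: correlation between true and predicted classes
print(metrics.matthews_corrcoef(y_test, y_pred_class))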
Conclusion:
- Confusion matrix gives you a more complete picture of how your classifier is performing
- Also allows you to compute various classification metrics, and these metrics can guide your model selection
Which metrics should you focus on?
- Choice of metric depends on your business objective
- Spam filter (positive class is "spam"): optimize for precision or specificity, because false negatives (spam goes to the inbox) are more acceptable than false positives (non-spam is caught by the spam filter)
- Fraudulent transaction detector (positive class is "fraud"): optimize for sensitivity, because false positives (normal transactions that are flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected)
Adjusting the classification threshold
# print the first 10 predicted responses
logreg.predict(X_test)[0:10]
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
# print the first 10 predicted probabilities of class membership
logreg.predict_proba(X_test)[0:10, :]
array([[0.63247571, 0.36752429],
       [0.71643656, 0.28356344],
       [0.71104114, 0.28895886],
       [0.5858938 , 0.4141062 ],
       [0.84103973, 0.15896027],
       [0.82934844, 0.17065156],
       [0.50110974, 0.49889026],
       [0.48658459, 0.51341541],
       [0.72321388, 0.27678612],
       [0.32810562, 0.67189438]])
# print the first 10 predicted probabilities for class 1
logreg.predict_proba(X_test)[0:10, 1]
array([0.36752429, 0.28356344, 0.28895886, 0.4141062 , 0.15896027,
       0.17065156, 0.49889026, 0.51341541, 0.27678612, 0.67189438])
# store the predicted probabilities for class 1
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
# histogram of predicted probabilities
plt.hist(y_pred_prob, bins=8)
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability of diabetes')
plt.ylabel('Frequency')
Text(0, 0.5, 'Frequency')
Decrease the threshold for predicting diabetes in order to increase the sensitivity of the classifier
# predict diabetes if the predicted probability is greater than 0.3
from sklearn.preprocessing import binarize
y_pred_class = binarize([y_pred_prob], threshold=0.3)[0]
# print the first 10 predicted probabilities
y_pred_prob[0:10]
array([0.36752429, 0.28356344, 0.28895886, 0.4141062 , 0.15896027,
       0.17065156, 0.49889026, 0.51341541, 0.27678612, 0.67189438])
# print the first 10 predicted classes with the lower threshold
y_pred_class[0:10]
array([1., 0., 0., 1., 0., 0., 1., 1., 0., 1.])
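The same result can be produced without binarize by comparing the probabilities to the threshold directly (a sketch; binarize maps values strictly greater than the threshold to 1, and this version yields integer rather than float labels):
# equivalent NumPy comparison: values strictly greater than 0.3 become 1
y_pred_class_alt = (y_pred_prob > 0.3).astype(int)
y_pred_class_alt[0:10]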
# previous confusion matrix (default threshold of 0.5)
print(confusion)
[[118  12]
 [ 47  15]]
# new confusion matrix (threshold of 0.3)
print(metrics.confusion_matrix(y_test, y_pred_class))
[[80 50]
 [16 46]]
# sensitivity has increased (used to be 0.24)
print(46 / (46 + 16))
0.7419354838709677
# specificity has decreased (used to be 0.91)
print(80 / (80 + 50))
0.6153846153846154
Conclusion:
- Threshold of 0.5 is used by default (for binary problems) to convert predicted probabilities into class predictions
- Threshold can be adjusted to increase sensitivity or specificity
- Sensitivity and specificity have an inverse relationship
Question: Wouldn't it be nice if we could see how sensitivity and specificity are affected by various thresholds, without actually changing the threshold?
Answer: Plot the ROC curve!
# IMPORTANT: first argument is true values, second argument is predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for diabetes classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
# define a function that accepts a threshold and prints sensitivity and specificity
def evaluate_threshold(threshold):
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])
evaluate_threshold(0.5)
Sensitivity: 0.24193548387096775
Specificity: 0.9076923076923077
evaluate_threshold(0.3)
Sensitivity: 0.7258064516129032
Specificity: 0.6153846153846154
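To see the trade-off across several thresholds at once, the helper can simply be called in a loop (a sketch, not in the video):
# sweep a few candidate thresholds to see the sensitivity/specificity trade-off
for t in [0.2, 0.3, 0.4, 0.5, 0.6]:
    print('Threshold:', t)
    evaluate_threshold(t)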
AUC is the percentage of the ROC plot that is underneath the curve:
# IMPORTANT: first argument is true values, second argument is predicted probabilities
print(metrics.roc_auc_score(y_test, y_pred_prob))
0.7245657568238213
# calculate cross-validated AUC
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
0.7378233618233618
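For comparison (an aside), the same cross-validation setup can score by classification accuracy instead:
# cross-validated accuracy, for comparison with the cross-validated AUC
cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()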
Confusion matrix advantages:
- Allows you to calculate a variety of metrics
- Useful for multi-class problems (more than two response classes)
ROC/AUC advantages:
- Does not require you to set a classification threshold
- Still useful when there is high class imbalance