Find a Machine Learning (ML) model that accurately predicts breast cancer based on the 30 features described below.
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. n the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/
Also can be found on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Attribute Information:
Ten real-valued features are computed for each cell nucleus:
The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
All feature values are recoded with four significant digits.
Missing attribute values: none
Class distribution: 357 benign, 212 malignant
When it comes to diagnosing breast cancer, we want to make sure we don't have too many false-positives (you don't have cancer, but told you do and go on treatment) or false-negatives (you have cancer, but told you don't and don't get treatment). Therefore, the highest overall accuracy model is chosen.
The Data was split into 80% training (~455 people) and 20% testing (~114 people).
Several different models were evaluated through k-fold Cross-Validation with GridSearchCV, which iterates on different algorithm's hyperparameters:
All of the models performed well after fine tunning their hyperparameters, but the best model is the one the highest overall accuracy. Out of the 20% of data witheld in this test (114 random individuals), only a handful were misdiagnosed. No model is perfect, but I am happy about how accurate my model is here. If on average less than a handful of people out of 114 are misdiagnosed, that is a good start for making a model. Furthermore, the Feature Importance plots show that the "concave points worst" and "concave points mean" were the significant features. Therefore, I recommend the concave point features should be extracted from each future biopsy as a strong predictor for diagnosing breast cancer.
import warnings
import os # Get Current Directory
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score
import pandas as pd # data processing, CSV file I/O (e.i. pd.read_csv)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from time import time
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.decomposition import PCA
from scipy import stats
import subprocess
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.utils.multiclass import unique_labels
import itertools
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings("ignore")
pd.set_option('mode.chained_assignment', None)
currentDirectory=os.getcwd()
print(currentDirectory)
/Users/stevensmiley/Documents/GitHub/BreastCancer
def folder_path(path):
try:
os.mkdir(path)
except OSError:
print ("Creation of the directory %s failed" % path)
return(path)
else:
print ("Successfully created the directory %s " % path)
return(path)
return(path)
# OUTPUTS: Folder for storing OUTPUTS
OUTPUT_path=folder_path(currentDirectory+'/Outputs')
# Models: Folder for storing models
models_path=folder_path(OUTPUT_path+'/Models')
# Figures: Folder for storing figures
figures_path=folder_path(OUTPUT_path+'/Figures')
Creation of the directory /Users/stevensmiley/Documents/GitHub/BreastCancer/Outputs failed Creation of the directory /Users/stevensmiley/Documents/GitHub/BreastCancer/Outputs/Models failed Creation of the directory /Users/stevensmiley/Documents/GitHub/BreastCancer/Outputs/Figures failed
try:
data= pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')
except OSError:
print ("Input file not found at location:",'/kaggle/input/breast-cancer-wisconsin-data/data.csv')
INPUT_path=currentDirectory+'/Inputs'
data_path=os.path.join(INPUT_path,'data.csv')
data= pd.read_csv(data_path)
print ("Successfully loaded Input file from:",data_path)
else:
print ("Successfully loaded Input file from:",'/kaggle/input/breast-cancer-wisconsin-data/data.csv')
data.head(10) # view the first 10 columns
Input file not found at location: /kaggle/input/breast-cancer-wisconsin-data/data.csv Successfully loaded Input file from: /Users/stevensmiley/Documents/GitHub/BreastCancer/Inputs/data.csv
id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 | 843786 | M | 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.15780 | 0.08089 | ... | 23.75 | 103.40 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.12440 | NaN |
6 | 844359 | M | 18.25 | 19.98 | 119.60 | 1040.0 | 0.09463 | 0.10900 | 0.11270 | 0.07400 | ... | 27.66 | 153.20 | 1606.0 | 0.1442 | 0.2576 | 0.3784 | 0.1932 | 0.3063 | 0.08368 | NaN |
7 | 84458202 | M | 13.71 | 20.83 | 90.20 | 577.9 | 0.11890 | 0.16450 | 0.09366 | 0.05985 | ... | 28.14 | 110.60 | 897.0 | 0.1654 | 0.3682 | 0.2678 | 0.1556 | 0.3196 | 0.11510 | NaN |
8 | 844981 | M | 13.00 | 21.82 | 87.50 | 519.8 | 0.12730 | 0.19320 | 0.18590 | 0.09353 | ... | 30.73 | 106.20 | 739.3 | 0.1703 | 0.5401 | 0.5390 | 0.2060 | 0.4378 | 0.10720 | NaN |
9 | 84501001 | M | 12.46 | 24.04 | 83.97 | 475.9 | 0.11860 | 0.23960 | 0.22730 | 0.08543 | ... | 40.68 | 97.65 | 711.4 | 0.1853 | 1.0580 | 1.1050 | 0.2210 | 0.4366 | 0.20750 | NaN |
10 rows × 33 columns
As the background stated, no missing values should be present. The following verifies that. The last column doesn't hold any information and should be removed. In addition, the diagnosis should be changed to a binary classification of 0= benign and 1=malignant.
data.isnull().sum()
id 0 diagnosis 0 radius_mean 0 texture_mean 0 perimeter_mean 0 area_mean 0 smoothness_mean 0 compactness_mean 0 concavity_mean 0 concave points_mean 0 symmetry_mean 0 fractal_dimension_mean 0 radius_se 0 texture_se 0 perimeter_se 0 area_se 0 smoothness_se 0 compactness_se 0 concavity_se 0 concave points_se 0 symmetry_se 0 fractal_dimension_se 0 radius_worst 0 texture_worst 0 perimeter_worst 0 area_worst 0 smoothness_worst 0 compactness_worst 0 concavity_worst 0 concave points_worst 0 symmetry_worst 0 fractal_dimension_worst 0 Unnamed: 32 569 dtype: int64
# Drop Unnamed: 32 variable that has NaN values.
data.drop(['Unnamed: 32'],axis=1,inplace=True)
# Convert Diagnosis for Cancer from Categorical Variable to Binary
diagnosis_num={'B':0,'M':1}
data['diagnosis']=data['diagnosis'].map(diagnosis_num)
# Verify Data Changes, look at first 5 rows
data.head(5)
id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 842302 | 1 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
1 | 842517 | 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
2 | 84300903 | 1 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
3 | 84348301 | 1 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
4 | 84358402 | 1 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 32 columns
A strong correlation is indicated by a Pearson Correlation Coefficient value near 1. Therefore, when looking at the Heatmap, we want to see what correlates most with the first column, "diagnosis." It appears that the features of "concave points worst" [0.79] has the strongest correlation with "diagnosis".
#fix,ax = plt.subplots(figsize=(25,25))
fix,ax = plt.subplots(figsize=(22,22))
heatmap_data = data.drop(['id'],axis=1)
sns.heatmap(heatmap_data.corr(),vmax=1,linewidths=0.01,square=True,annot=True,linecolor="white")
bottom,top=ax.get_ylim()
ax.set_ylim(bottom+0.5,top-0.5)
heatmap_title='Figure 1: Heatmap with Pearson Correlation Coefficient for Features'
ax.set_title(heatmap_title)
heatmap_fig=os.path.join(figures_path,'Figure1.Heatmap.png')
plt.savefig(heatmap_fig,dpi=300,bbox_inches='tight')
plt.show()
X = data.drop(['id','diagnosis'], axis= 1)
y = data.diagnosis
#Standardize Data
scaler = StandardScaler()
X=StandardScaler().fit_transform(X.values)
X = pd.DataFrame(X)
X.columns=(data.drop(['id','diagnosis'], axis= 1)).columns
A good rule of thumb is to hold out 20 percent of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 42)
#Standardize Data
scaler = StandardScaler()
#Fit on training set only.
scaler.fit(X_train)
#Apply transform to both the training and test set
X_train=scaler.transform(X_train)
X_test=scaler.transform(X_test)
# Feature Extraction: Principal Component Analysis: PC1, PC2
pca = PCA(n_components=2, random_state=42)
# Only fit to the training set
pca.fit((X_train))
# transform with PCA model from training
principalComponents_train = pca.transform(X_train)
principalComponents_test = pca.transform(X_test)
# Use Pandas DataFrame
X_train = pd.DataFrame(X_train)
X_test=pd.DataFrame(X_test)
X_train.columns=(data.drop(['id','diagnosis'], axis= 1)).columns
X_test.columns=(data.drop(['id','diagnosis'], axis= 1)).columns
y_train = pd.DataFrame(y_train)
y_test=pd.DataFrame(y_test)
X_train['PC1']=principalComponents_train[:,0]
X_train['PC2']=principalComponents_train[:,1]
X_test['PC1']=principalComponents_test[:,0]
X_test['PC2']=principalComponents_test[:,1]
tr_features=X_train
tr_labels=y_train
val_features = X_test
val_labels=y_test
Verify the data was split correctly
print('X_train - length:',len(X_train), 'y_train - length:',len(y_train))
print('X_test - length:',len(X_test),'y_test - length:',len(y_test))
print('Percent heldout for testing:', round(100*(len(X_test)/len(data)),0),'%')
X_train - length: 455 y_train - length: 455 X_test - length: 114 y_test - length: 114 Percent heldout for testing: 20.0 %
In order to find a good model, several algorithms are tested on the training dataset. A senstivity study using different Hyperparameters of the algorithms are iterated on with GridSearchCV in order optimize each model. The best model is the one that has the highest accuracy without overfitting by looking at both the training data and the validation data results. Computer time does not appear to be an issue for these models, so it has little weight on deciding between models.
class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)[source]¶
Exhaustive search over specified parameter values for an estimator.
Important members are fit, predict.
GridSearchCV implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.
The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
def print_results(results,name,filename_pr):
with open(filename_pr, mode='w') as file_object:
print(name,file=file_object)
print(name)
print('BEST PARAMS: {}\n'.format(results.best_params_),file=file_object)
print('BEST PARAMS: {}\n'.format(results.best_params_))
means = results.cv_results_['mean_test_score']
stds = results.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, results.cv_results_['params']):
print('{} {} (+/-{}) for {}'.format(name,round(mean, 3), round(std * 2, 3), params),file=file_object)
print('{} {} (+/-{}) for {}'.format(name,round(mean, 3), round(std * 2, 3), params))
print(GridSearchCV)
<class 'sklearn.model_selection._search.GridSearchCV'>
Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
Regularization is when a penality is applied with increasing value to prevent overfitting. The inverse of regularization strength means as the value of C goes up, the value of the regularization strength goes down and vice versa.
'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
# Make Directory
path=folder_path(OUTPUT_path+'/Models/LR')
Creation of the directory /Users/stevensmiley/Documents/GitHub/BreastCancer/Outputs/Models/LR failed
LR_model_dir=os.path.join(path,'LR_model.pkl')
if os.path.exists(LR_model_dir) == False:
lr = LogisticRegression()
parameters = {
'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
cv=GridSearchCV(lr, parameters, cv=5)
cv.fit(tr_features,tr_labels.values.ravel())
print_results(cv,'Logistic Regression (LR)',os.path.join(path,'LR_GridSearchCV_results.txt'))
cv.best_estimator_
joblib.dump(cv.best_estimator_,LR_model_dir)
else:
print('Already have LR')
Already have LR
Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
A linear kernel type is good when the data is Linearly seperable, which means it can be separated by a single Line. A radial basis function (rbf) kernel type is an expontential function of the squared Euclidean distance between two vectors and a constant. Since the value of RBF kernel decreases with distance and ranges between zero and one, it has a ready interpretation as a similiarity measure.
'kernel': ['linear','rbf']
Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
Regularization is when a penality is applied with increasing value to prevent overfitting. The inverse of regularization strength means as the value of C goes up, the value of the regularization strength goes down and vice versa.
'C': [0.1, 1, 10]
print(SVC())
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto_deprecated', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
# Make Directory
path=folder_path(OUTPUT_path+'/Models/SVM')
Creation of the directory /Users/stevensmiley/Documents/GitHub/BreastCancer/Outputs/Models/SVM failed
SVM_model_dir=os.path.join(path,'SVM_model.pkl')
if os.path.exists(SVM_model_dir) == False:
svc = SVC()
parameters = {
'kernel': ['linear','rbf'],
'C': [0.1, 1, 10]
}
cv=GridSearchCV(svc,parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv,'Support Vector Machine (SVM)',os.path.join(path,'SVM_GridSearchCV_results.txt'))
cv.best_estimator_
joblib.dump(cv.best_estimator_,SVM_model_dir)
else:
print('Already have SVM')
Already have SVM
The ith element represents the number of neurons in the ith hidden layer.
A rule of thumb is (2/3)*(# of input features) = neurons per hidden layer.
'hidden_layer_sizes': [(10,),(50,),(100,)]
Activation function for the hidden layer.
'hidden_layer_sizes': [(10,),(50,),(100,)]
Learning rate schedule for weight updates.
Only used when solver='sgd'.
'learning_rate': ['constant','invscaling','adaptive']
print(MLPClassifier())
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant', learning_rate_init=0.001, max_iter=200, momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5, random_state=None, shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False, warm_start=False)
# Make Directory
path=folder_path(OUTPUT_path+'/Models/MLP')
Creation of the directory /Users/stevensmiley/Documents/GitHub/BreastCancer/Outputs/Models/MLP failed
MLP_model_dir=os.path.join(path,'MLP_model.pkl')
if os.path.exists(MLP_model_dir) == False:
mlp = MLPClassifier()
parameters = {
'hidden_layer_sizes': [(10,),(50,),(100,)],
'activation': ['relu','tanh','logistic'],
'learning_rate': ['constant','invscaling','adaptive']
}
cv=GridSearchCV(mlp, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv,'Neural Network (MLP)',os.path.join(path,'MLP_GridSearchCV_results.txt'))
cv.best_estimator_
joblib.dump(cv.best_estimator_,MLP_model_dir)
else:
print('Already have MLP')
Already have MLP
The number of trees in the forest.
Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.
Usually 500 does the trick and the accuracy and out of bag error doesn't change much after.
'n_estimators': [500],
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
None usually does the trick, but a few shallow trees are tested.
'max_depth': [5,7,9, None]
print(RandomForestClassifier())
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
# Make Directory
path=folder_path(OUTPUT_path+'/Models/RF')
Creation of the directory /Users/stevensmiley/Documents/GitHub/BreastCancer/Outputs/Models/RF failed
RF_model_dir=os.path.join(path,'RF_model.pkl')
if os.path.exists(RF_model_dir) == False:
rf = RandomForestClassifier(oob_score=False)
parameters = {
'n_estimators': [500],
'max_depth': [5,7,9, None]
}
cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv,'Random Forest (RF)',os.path.join(path,'RF_GridSearchCV_results.txt'))
cv.best_estimator_
joblib.dump(cv.best_estimator_,RF_model_dir)
else:
print('Already have RF')
Already have RF
The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.
Usually 500 does the trick and the accuracy and out of bag error doesn't change much after.
'n_estimators': [5, 50, 250, 500],
maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
A variety of shallow trees are tested.
'max_depth': [1, 3, 5, 7, 9],
learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.
A variety was chosen because of the trade-off.
'learning_rate': [0.01, 0.1, 1]
print(GradientBoostingClassifier())
GradientBoostingClassifier(criterion='friedman_mse', init=None, learning_rate=0.1, loss='deviance', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_iter_no_change=None, presort='auto', random_state=None, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False)
# Make Directory
path=folder_path(OUTPUT_path+'/Models/GB')
Creation of the directory /Users/stevensmiley/Documents/GitHub/BreastCancer/Outputs/Models/GB failed
GB_model_dir=os.path.join(path,'GB_model.pkl')
if os.path.exists(GB_model_dir) == False:
gb = GradientBoostingClassifier()
parameters = {
'n_estimators': [5, 50, 250, 500],
'max_depth': [1, 3, 5, 7, 9],
'learning_rate': [0.01, 0.1, 1]
}
cv=GridSearchCV(gb, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv,'Gradient Boost (GB)',os.path.join(path,'GR_GridSearchCV_results.txt'))
cv.best_estimator_
joblib.dump(cv.best_estimator_,GB_model_dir)
else:
print('Already have GB')
Already have GB
Usually 500 does the trick and the accuracy and out of bag error doesn't change much after.
'n_estimators': [5, 50, 250, 500],
Maximum tree depth for base learners.
A variety of shallow trees are tested.
'max_depth': [1, 3, 5, 7, 9],
Boosting learning rate (xgb’s “eta”)
A variety was chosen because of the trade-off.
'learning_rate': [0.01, 0.1, 1]
# Make Directory
path=folder_path(OUTPUT_path+'/Models/XGB')
Creation of the directory /Users/stevensmiley/Documents/GitHub/BreastCancer/Outputs/Models/XGB failed
XGB_model_dir=os.path.join(path,'XGB_model.pkl')
if os.path.exists(XGB_model_dir) == False:
xgb = XGBClassifier()
parameters = {
'n_estimators': [5, 50, 250, 500],
'max_depth': [1, 3, 5, 7, 9],
'learning_rate': [0.01, 0.1, 1]
}
cv=GridSearchCV(xgb, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv,'eXtreme Gradient Boost (XGB)',os.path.join(path,'XGB_GridSearchCV_results.txt'))
cv.best_estimator_
joblib.dump(cv.best_estimator_,XGB_model_dir)
else:
print('Already have XGB')
Already have XGB
## all models
models = {}
#for mdl in ['LR', 'SVM', 'MLP', 'RF', 'GB','XGB']:
for mdl in ['LR', 'SVM', 'MLP', 'RF', 'GB','XGB']:
model_path=os.path.join(OUTPUT_path,'Models/{}/{}_model.pkl')
models[mdl] = joblib.load(model_path.format(mdl,mdl))
def evaluate_model(fig_path,name, model, features, labels, y_test_ev, fc):
CM_fig=os.path.join(fig_path,'Figure{}.A_{}_Confusion_Matrix.png'.format(fc,name))
VI_fig=os.path.join(fig_path,'Figure{}.B_{}_Variable_Importance_Plot.png'.format(fc,name))
start = time()
pred = model.predict(features)
end = time()
y_truth=y_test_ev
accuracy = round(accuracy_score(labels, pred), 3)
precision = round(precision_score(labels, pred), 3)
recall = round(recall_score(labels, pred), 3)
print('{} -- Accuracy: {} / Precision: {} / Recall: {} / Latency: {}ms'.format(name,
accuracy,
precision,
recall,
round((end - start)*1000, 1)))
pred=pd.DataFrame(pred)
pred.columns=['diagnosis']
# Convert Diagnosis for Cancer from Binary to Categorical
diagnosis_name={0:'Benign',1:'Malginant'}
y_truth['diagnosis']=y_truth['diagnosis'].map(diagnosis_name)
pred['diagnosis']=pred['diagnosis'].map(diagnosis_name)
class_names = ['Benign','Malginant']
cm = confusion_matrix(y_test_ev, pred, class_names)
FP_L='False Positive'
FP = cm[0][1]
FN_L='False Negative'
FN = cm[1][0]
TP_L='True Positive'
TP = cm[1][1]
TN_L='True Negative'
TN = cm[0][0]
#TPR_L= 'Sensitivity, hit rate, recall, or true positive rate'
TPR_L= 'Sensitivity'
TPR = round(TP/(TP+FN),3)
#TNR_L= 'Specificity or true negative rate'
TNR_L= 'Specificity'
TNR = round(TN/(TN+FP),3)
#PPV_L= 'Precision or positive predictive value'
PPV_L= 'Precision'
PPV = round(TP/(TP+FP),3)
#NPV_L= 'Negative predictive value'
NPV_L= 'NPV'
NPV = round(TN/(TN+FN),3)
#FPR_L= 'Fall out or false positive rate'
FPR_L= 'FPR'
FPR = round(FP/(FP+TN),3)
#FNR_L= 'False negative rate'
FNR_L= 'FNR'
FNR = round(FN/(TP+FN),3)
#FDR_L= 'False discovery rate'
FDR_L= 'FDR'
FDR = round(FP/(TP+FP),3)
ACC_L= 'Accuracy'
ACC = round((TP+TN)/(TP+FP+FN+TN),3)
stats_data = {'Name':name,
ACC_L:ACC,
FP_L:FP,
FN_L:FN,
TP_L:TP,
TN_L:TN,
TPR_L:TPR,
TNR_L:TNR,
PPV_L:PPV,
NPV_L:NPV,
FPR_L:FPR,
FNR_L:FDR}
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm,cmap=plt.cm.gray_r)
plt.title('Figure {}.A: {} Confusion Matrix on Unseen Test Data'.format(fc,name),y=1.08)
fig.colorbar(cax)
ax.set_xticklabels([''] + class_names)
ax.set_yticklabels([''] + class_names)
# Loop over data dimensions and create text annotations.
for i in range(len(class_names)):
for j in range(len(class_names)):
text = ax.text(j, i, cm[i, j],
ha="center", va="center", color="r")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.savefig(CM_fig,dpi=400,bbox_inches='tight')
#plt.show()
if name == 'RF' or name == 'GB' or name == 'XGB':
# Get numerical feature importances
importances = list(model.feature_importances_)
importances=100*(importances/max(importances))
feature_list = list(features.columns)
sorted_ID=np.argsort(importances)
plt.figure(figsize=[10,10])
plt.barh(sort_list(feature_list,importances),importances[sorted_ID],align='center')
plt.title('Figure {}.B: {} Variable Importance Plot'.format(fc,name))
plt.xlabel('Relative Importance')
plt.ylabel('Feature')
plt.savefig(VI_fig,dpi=300,bbox_inches='tight')
#plt.show()
return accuracy,name, model, stats_data
def sort_list(list1, list2):
zipped_pairs = zip(list2, list1)
z = [x for _, x in sorted(zipped_pairs)]
return z
ev_accuracy=[None]*len(models)
ev_name=[None]*len(models)
ev_model=[None]*len(models)
ev_stats=[None]*len(models)
count=1
for name, mdl in models.items():
y_test_ev=y_test
ev_accuracy[count-1],ev_name[count-1],ev_model[count-1], ev_stats[count-1] = evaluate_model(figures_path,
name,
mdl,
val_features,
val_labels,
y_test_ev,
count+1)
diagnosis_name={'Benign':0,'Malginant':1}
y_test['diagnosis']=y_test['diagnosis'].map(diagnosis_name)
count=count+1
LR -- Accuracy: 0.991 / Precision: 1.0 / Recall: 0.977 / Latency: 18.5ms SVM -- Accuracy: 0.982 / Precision: 1.0 / Recall: 0.953 / Latency: 1.2ms MLP -- Accuracy: 0.974 / Precision: 0.976 / Recall: 0.953 / Latency: 1.0ms RF -- Accuracy: 0.965 / Precision: 0.976 / Recall: 0.93 / Latency: 48.9ms GB -- Accuracy: 0.956 / Precision: 0.952 / Recall: 0.93 / Latency: 3.0ms XGB -- Accuracy: 0.965 / Precision: 0.953 / Recall: 0.953 / Latency: 4.1ms
best_name=ev_name[ev_accuracy.index(max(ev_accuracy))] #picks the maximum accuracy
print('Best Model:',best_name,'with Accuracy of ',max(ev_accuracy))
best_model=ev_model[ev_accuracy.index(max(ev_accuracy))] #picks the maximum accuracy
if best_name == 'RF' or best_name == 'GB' or best_name == 'XGB':
# Get numerical feature importances
importances = list(best_model.feature_importances_)
importances=100*(importances/max(importances))
feature_list = list(X.columns)
sorted_ID=np.argsort(importances)
plt.figure(figsize=[10,10])
plt.barh(sort_list(feature_list,importances),importances[sorted_ID],align='center')
plt.title('Figure 8: Variable Importance Plot -- {}'.format(best_name))
plt.xlabel('Relative Importance')
plt.ylabel('Feature')
plt.savefig(os.path.join(figures_path,'Figure8.png'),dpi=300,bbox_inches='tight')
plt.show()
Best Model: LR with Accuracy of 0.991
When it comes to diagnosing breast cancer, we want to make sure we don't have too many false-positives (you don't have cancer, but told you do and go on treatment) or false-negatives (you have cancer, but told you don't and don't get treatment). Therefore, the highest overall accuracy model is chosen.
All of the models performed well after fine tunning their hyperparameters, but the best model is the one the highest overall accuracy. Out of the 20% of data witheld in this test (114 random individuals), only a handful were misdiagnosed. No model is perfect, but I am happy about how accurate my model is here. If on average less than a handful of people out of 114 are misdiagnosed, that is a good start for making a model. Furthermore, the Feature Importance plots show that the "concave points worst" and "concave points mean" were the significant features. Therefore, I recommend the concave point features should be extracted from each future biopsy as a strong predictor for diagnosing breast cancer.
ev_stats=pd.DataFrame(ev_stats)
print(ev_stats.head(10))
Name Accuracy False Positive False Negative True Positive \ 0 LR 0.991 0 1 42 1 SVM 0.982 0 2 41 2 MLP 0.974 1 2 41 3 RF 0.965 1 3 40 4 GB 0.956 2 3 40 5 XGB 0.965 2 2 41 True Negative Sensitivity Specificity Precision NPV FPR FNR 0 71 0.977 1.000 1.000 0.986 0.000 0.000 1 71 0.953 1.000 1.000 0.973 0.000 0.000 2 70 0.953 0.986 0.976 0.972 0.014 0.024 3 70 0.930 0.986 0.976 0.959 0.014 0.024 4 69 0.930 0.972 0.952 0.958 0.028 0.048 5 69 0.953 0.972 0.953 0.972 0.028 0.047