In this project, my aim is to use data from the National Health and Nutrition Examination Survey (NHANES) to predict the likelihood of a US adult being obese. Since my target is a binary ("either/or") variable, I use machine learning classification models: I split the data into training and testing sets, build models on the training set, and use the testing set to evaluate the predictions. According to the CDC, an obese person is someone with a Body Mass Index (BMI) of 30 or higher; hence, adults with a BMI above 29 are labeled obese in the data used here.
The specific models I will employ are Logistic Regression, Decision Trees and Random Forest.
Finally, I use accuracy and the Receiver Operating Characteristic (ROC) curve to measure the models' performance and compare them to determine which one predicts obesity best.
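To make the labeling rule concrete, here is a minimal, hypothetical sketch of how such a label could be derived from BMI; the NHANES extract used below already ships with an obese column, so this is illustration only.
import pandas as pd

# Hypothetical example: derive an obese label from BMI using the CDC cutoff (BMI of 30 or higher)
example = pd.DataFrame({'BMI': [22.5, 29.9, 30.0, 41.2]})
example['obese'] = (example['BMI'] >= 30).map({True: 'Yes', False: 'No'})
print(example)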
pip install -U scikit-learn
Requirement already up-to-date: scikit-learn in c:\jupyter\lib\site-packages (0.22.2.post1)
Requirement already satisfied, skipping upgrade: numpy>=1.11.0 in c:\jupyter\lib\site-packages (from scikit-learn) (1.16.5)
Requirement already satisfied, skipping upgrade: scipy>=0.17.0 in c:\jupyter\lib\site-packages (from scikit-learn) (1.3.1)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in c:\jupyter\lib\site-packages (from scikit-learn) (0.13.2)
Note: you may need to restart the kernel to use updated packages.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is deprecated; the standard-library StringIO works with pydotplus
from IPython.display import Image
import pydotplus
from sklearn.tree import DecisionTreeClassifier
from pydot import graph_from_dot_data
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import r2_score
from rfpimp import permutation_importances
import warnings
warnings.filterwarnings('ignore')
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
from sklearn.impute import SimpleImputer
After importing the data, I conduct some Exploratory Data Analysis (EDA) to find patterns when and where they exist. This involves cleaning the data, visualizing it, creating dummy variables, and building pairplots.
nhanes1 = pd.read_excel('C:/nhanes data.xlsx')
print(nhanes1.shape)
print(list(nhanes1.columns))
(3834, 20)
['Unnamed: 0', 'SEQN', 'gender', 'age', 'race', 'foreign_born', 'education', 'poverty_ratio', 'Tot_Chol', 'BMI', 'Unnamed: 10', 'Blood_Pressure', 'obese', 'MaritalStatus', 'SleepTrouble', 'PhysActive', 'Smoke100n', 'SexEver', 'Diabetes', 'UrineVol1']
# Here, I fill missing values of the continuous variables with their column means
mean_value = nhanes1[['Tot_Chol', 'Blood_Pressure', 'UrineVol1']].mean()
nhanes1[['Tot_Chol', 'Blood_Pressure', 'UrineVol1']] = nhanes1[['Tot_Chol', 'Blood_Pressure', 'UrineVol1']].fillna(mean_value)
nhanes1.head()
|   | Unnamed: 0 | SEQN | gender | age | race | foreign_born | education | poverty_ratio | Tot_Chol | BMI | Unnamed: 10 | Blood_Pressure | obese | MaritalStatus | SleepTrouble | PhysActive | Smoke100n | SexEver | Diabetes | UrineVol1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 83732 | Male | 62 | White | USA_Born | College_Degree | 4.39 | 113.000000 | 113.0 | NaN | 101.1 | No | Married | Yes | No | Smoker | Yes | No | 352.0000 |
| 1 | 2 | 83733 | Male | 53 | White | Foreign_Born | No_College_Degree | 1.32 | 113.000000 | 113.0 | NaN | 107.9 | Yes | Married | Yes | No | Smoker | Yes | No | 352.0000 |
| 2 | 3 | 83735 | Female | 56 | White | USA_Born | College_Degree | 5.00 | 113.000000 | 113.0 | NaN | 110.1 | Yes | Married | Yes | No | Smoker | Yes | No | 352.0000 |
| 3 | 4 | 83736 | Female | 42 | Blakc | USA_Born | No_College_Degree | 1.23 | 105.346124 | 20.3 | NaN | 80.4 | No | NaN | NaN | NaN | NaN | NaN | No | 124.6465 |
| 4 | 5 | 83741 | Male | 22 | Blakc | USA_Born | No_College_Degree | 2.08 | 112.000000 | 28.0 | NaN | 86.6 | No | LivePartner | Yes | No | Smoker | Yes | No | 77.0000 |
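As an aside, the same mean imputation could be done with scikit-learn's SimpleImputer, which is imported above but otherwise unused; a minimal sketch, assuming the same three columns:
# Equivalent mean imputation via SimpleImputer (sketch; same columns as above)
imputer = SimpleImputer(strategy='mean')
cols = ['Tot_Chol', 'Blood_Pressure', 'UrineVol1']
nhanes1[cols] = imputer.fit_transform(nhanes1[cols])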
# Select Variables
nhanes1 = nhanes1[[ 'gender', 'age', 'race', 'foreign_born', 'education', 'poverty_ratio', 'Tot_Chol',
'Blood_Pressure', 'obese', 'MaritalStatus', 'SleepTrouble', 'PhysActive', 'Smoke100n',
'SexEver', 'Diabetes', 'UrineVol1']]
nhanes1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3834 entries, 0 to 3833
Data columns (total 16 columns):
gender            3834 non-null object
age               3834 non-null int64
race              3834 non-null object
foreign_born      3834 non-null object
education         3834 non-null object
poverty_ratio     3834 non-null float64
Tot_Chol          3834 non-null float64
Blood_Pressure    3834 non-null float64
obese             3834 non-null object
MaritalStatus     2794 non-null object
SleepTrouble      3018 non-null object
PhysActive        3208 non-null object
Smoke100n         2795 non-null object
SexEver           2215 non-null object
Diabetes          3772 non-null object
UrineVol1         3834 non-null float64
dtypes: float64(4), int64(1), object(11)
memory usage: 479.4+ KB
fig, axs = plt.subplots(1,2,figsize=(10,6))
sns.countplot(x='obese',data=nhanes1,ax=axs[0])
axs[0].set_title("Frequency of Obesity")
nhanes1.obese.value_counts().plot(kind='pie', ax=axs[1], autopct='%1.2f%%')
axs[1].set_title("Percentage of Obesity")
plt.savefig('count_plot')  # save before plt.show(), which empties the active figure
plt.show()
The plots above show that normal-weight adults outnumber obese adults in the dataset.
count_normal = len(nhanes1[nhanes1['obese']=='No'])
count_obese = len(nhanes1[nhanes1['obese']=='Yes'])
pct_of_normal = count_normal/(count_normal+count_obese)
print("percentage of normal is", pct_of_normal*100)
pct_of_obese = count_obese/(count_normal+count_obese)
print("percentage of obese is", pct_of_obese*100)
percentage of normal is 52.92123109024518
percentage of obese is 47.07876890975483
The classes are only slightly imbalanced, with a normal-to-obese ratio of about 53:47.
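Since the imbalance is mild, I leave the classes as they are. Had it been more severe, the SMOTE import above could be put to work; a minimal sketch, assuming it is applied to the training split created later (never to the test set):
# Hypothetical oversampling with SMOTE -- fit on the training split only,
# so the test set keeps the original class ratio
smote = SMOTE(random_state=0)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)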
nhanes1.groupby('obese').mean()
| obese | age | poverty_ratio | Tot_Chol | Blood_Pressure | UrineVol1 |
|---|---|---|---|---|---|
| No | 43.385904 | 2.539788 | 97.566230 | 107.535107 | 123.449053 |
| Yes | 46.170083 | 2.406609 | 114.091501 | 115.640908 | 125.992549 |
To see the association of obesity with categorical factors, I use grouped bar graphs.
%matplotlib inline
pd.crosstab(nhanes1.gender,nhanes1.obese).plot(kind='bar')
plt.title('Obesity Frequency Across Gender')
plt.xlabel('Gender')
plt.ylabel('Frequency of Obesity')
plt.savefig('obese_fre_gender')
The proportion of females who are obese is higher than that of males.
%matplotlib inline
pd.crosstab(nhanes1.race,nhanes1.obese).plot(kind='bar')
plt.title('Obesity Frequency Across Race')
plt.xlabel('Race')
plt.ylabel('Frequency of Obesity')
plt.savefig('obese_fre_race')
The frequency of obesity depends a great deal on race. Thus, race can be another good predictor of the outcome variable.
%matplotlib inline
pd.crosstab(nhanes1.foreign_born,nhanes1.obese).plot(kind='bar')
plt.title('Obesity Frequency Across Place of Birth')
plt.xlabel('Place of Birth')
plt.ylabel('Frequency of Obesity')
plt.savefig('obese_fre_pob')
Place of birth can be another good predictor of the outcome variable.
%matplotlib inline
pd.crosstab(nhanes1.education,nhanes1.obese).plot(kind='bar')
plt.title('Obesity Frequency Across Education')
plt.xlabel('Education')
plt.ylabel('Frequency of Obesity')
plt.savefig('obese_fre_edu')
The frequency of obesity also depends on education. Hence, education could be another good predictor of the outcome variable.
%matplotlib inline
pd.crosstab(nhanes1.MaritalStatus,nhanes1.obese).plot(kind='bar')
plt.title('Obesity Frequency Across Marital Status')
plt.xlabel('Marital Status')
plt.ylabel('Frequency of Obesity')
plt.savefig('obese_fre_marital')
The frequency of obesity also depends on marital status. Hence, marital status could be another good predictor of the outcome variable.
%matplotlib inline
pd.crosstab(nhanes1.Smoke100n,nhanes1.obese).plot(kind='bar')
plt.title('Obesity Frequency Based on Smoking')
plt.xlabel('Smoking')
plt.ylabel('Frequency of Obesity')
plt.savefig('obese_fre_smoke')
nhanes1 = pd.get_dummies(nhanes1, columns = ['gender', 'race', 'foreign_born', 'education', 'obese', 'Diabetes',
                                             'MaritalStatus', 'SleepTrouble',
                                             'PhysActive', 'Smoke100n', 'SexEver'],
                         prefix_sep = '_', drop_first = True)  # Drop the first level of each factor to avoid the dummy-variable trap
# Rename dummies for readability; note that the raw data misspell 'Black' as 'Blakc'
nhanes1.rename(columns= {'foreign_born_Foreign_Born':'foreign_born',
                         'foreign_born_USA_Born':'USA_born', 'race_Blakc':'race_Black',
                         'obese_Yes':'obese'}, inplace = True)
# Pairplot Colored by Incidence of Obesity
sns.pairplot(nhanes1, hue="obese", vars=["Tot_Chol", "age", "Blood_Pressure"], diag_kind = 'kde',
plot_kws = { 's': 50, 'edgecolor': 'k'}
)
# Title
plt.suptitle('PairPlot of Factors of Obesity Among US adults');
Since the pairplot above shows no clear relationship among most variables, I add correlation coefficients to the grid below.
# Function to calculate correlation coefficient between two arrays
def corr(x, y, **kwargs):
# Calculate the value
coef = np.corrcoef(x, y)[0][1]
# Make the label
label = r'$\rho$ = ' + str(round(coef, 2))
# Add the label to the plot
ax = plt.gca()
ax.annotate(label, xy = (0.2, 0.95), size = 20, xycoords = ax.transAxes)
# Create a pair grid instance
grid = sns.PairGrid(data= nhanes1,
                    vars = ["Tot_Chol","age", "Blood_Pressure"], height = 3)  # 'height' replaces the deprecated 'size' argument
# Map the plots
grid = grid.map_upper(plt.scatter, color = 'darkred')
grid = grid.map_upper(corr)
grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')
grid = grid.map_diag(plt.hist, bins = 10, edgecolor = 'k', color = 'darkred');
X = nhanes1.loc[:, nhanes1.columns != 'obese'] # Includes all features except the response
y = nhanes1.loc[:, nhanes1.columns == 'obese']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
columns = X_train.columns
X_train.columns.values
array(['age', 'poverty_ratio', 'Tot_Chol', 'Blood_Pressure', 'UrineVol1', 'gender_Male', 'race_Black', 'race_Hispanic', 'race_Mexican', 'race_Other', 'race_White', 'foreign_born', 'USA_born', 'education_No_College_Degree', 'Diabetes_Yes', 'MaritalStatus_LivePartner', 'MaritalStatus_Married', 'MaritalStatus_NeverMarried', 'MaritalStatus_Separated', 'MaritalStatus_Widowed', 'SleepTrouble_Yes', 'PhysActive_Yes', 'Smoke100n_Smoker', 'SexEver_Yes'], dtype=object)
To select the best features for my prediction, I use Recursive Feature Elimination (RFE).
data_final_vars=nhanes1.columns.values.tolist()
y=['obese']
X=[i for i in data_final_vars if i not in y]
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=20)  # keep the 20 top-ranked features
rfe = rfe.fit(X_train, y_train.values.ravel())
print(rfe.support_)
print(rfe.ranking_)
[False True True True False True True True True True True True True True True True True True True True False True True False]
[5 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 2]
Based on the RFE output (the features whose support is False), I dropped age, urine volume (UrineVol1), sleep trouble, and sexual history from my features.
feature_cols=['poverty_ratio', 'Tot_Chol', 'Blood_Pressure',
'gender_Male', 'race_Black', 'race_Hispanic', 'race_Mexican',
'race_Other', 'race_White', 'foreign_born', 'USA_born',
'education_No_College_Degree', 'Diabetes_Yes',
'MaritalStatus_LivePartner', 'MaritalStatus_Married',
'MaritalStatus_NeverMarried', 'MaritalStatus_Separated',
'MaritalStatus_Widowed', 'PhysActive_Yes',
'Smoke100n_Smoker']
X=X_train[feature_cols]
y=y_train['obese']
Having chosen the best predictors based on RFE, I am ready to estimate the first model: a logit.
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
Optimization terminated successfully.
         Current function value: inf
         Iterations 7
                                   Results: Logit
===============================================================================
Model:                Logit              Pseudo R-squared:  inf
Dependent Variable:   obese              AIC:               inf
Date:                 2020-04-18 04:48   BIC:               inf
No. Observations:     2683               Log-Likelihood:    -inf
Df Model:             19                 LL-Null:           0.0000
Df Residuals:         2663               LLR p-value:       1.0000
Converged:            1.0000             Scale:             1.0000
No. Iterations:       7.0000
-------------------------------------------------------------------------------
                               Coef.    Std.Err.     z      P>|z|   [0.025   0.975]
-------------------------------------------------------------------------------
poverty_ratio                  0.0288    0.0342    0.8428  0.3993   -0.0382   0.0959
Tot_Chol                       0.0847    0.0039   21.8508  0.0000    0.0771   0.0923
Blood_Pressure                 0.0569    0.0034   16.9645  0.0000    0.0504   0.0635
gender_Male                   -0.5984    0.0993   -6.0276  0.0000   -0.7930  -0.4038
race_Black                     1.1720    0.2306    5.0826  0.0000    0.7201   1.6240
race_Hispanic                  1.5027    0.2192    6.8561  0.0000    1.0731   1.9322
race_Mexican                   1.4125    0.2160    6.5391  0.0000    0.9892   1.8359
race_Other                     1.6203    0.3165    5.1191  0.0000    0.9999   2.2407
race_White                     0.7309    0.2250    3.2478  0.0012    0.2898   1.1719
foreign_born                 -15.8323    0.6735  -23.5076  0.0000  -17.1523 -14.5123
USA_born                     -15.6394    0.6902  -22.6605  0.0000  -16.9921 -14.2867
education_No_College_Degree    0.1632    0.1269    1.2855  0.1986   -0.0856   0.4120
Diabetes_Yes                  -0.5605    0.1978   -2.8329  0.0046   -0.9483  -0.1727
MaritalStatus_LivePartner     -0.3251    0.2288   -1.4206  0.1554   -0.7736   0.1234
MaritalStatus_Married         -0.6247    0.1212   -5.1545  0.0000   -0.8623  -0.3872
MaritalStatus_NeverMarried    -0.2055    0.1692   -1.2141  0.2247   -0.5371   0.1262
MaritalStatus_Separated       -1.0379    0.3528   -2.9419  0.0033   -1.7293  -0.3464
MaritalStatus_Widowed         -1.4196    0.2565   -5.5333  0.0000   -1.9224  -0.9167
PhysActive_Yes                -0.2269    0.0995   -2.2796  0.0226   -0.4220  -0.0318
Smoke100n_Smoker              -0.2034    0.1121   -1.8153  0.0695   -0.4231   0.0162
===============================================================================
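Note that the summary reports an infinite log-likelihood (and hence infinite pseudo R-squared, AIC, and BIC). One possible contributor is that sm.Logit does not add an intercept by default; a minimal sketch of a fit with an explicit constant, offered as an assumption rather than a verified fix:
# Sketch (assumption): refit with an explicit intercept, which sm.Logit does not add on its own
X_const = sm.add_constant(X)
result_c = sm.Logit(y, X_const).fit()
print(result_c.summary2())
Either way, the coefficient table is still usable, so I next drop the clearly insignificant predictors (poverty ratio, education, and the live-in-partner and never-married marital dummies) and refit.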
feature_cols=[ 'Tot_Chol', 'Blood_Pressure',
'gender_Male', 'race_Black', 'race_Hispanic', 'race_Mexican',
'race_Other', 'race_White', 'foreign_born', 'USA_born',
'Diabetes_Yes',
'MaritalStatus_Married',
'MaritalStatus_Separated',
'MaritalStatus_Widowed', 'PhysActive_Yes',
'Smoke100n_Smoker']
X=X_train[feature_cols]
y=y_train['obese']
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
Optimization terminated successfully.
         Current function value: inf
         Iterations 7
                              Results: Logit
===========================================================================
Model:               Logit              Pseudo R-squared:  inf
Dependent Variable:  obese              AIC:               inf
Date:                2020-04-18 04:49   BIC:               inf
No. Observations:    2683               Log-Likelihood:    -inf
Df Model:            15                 LL-Null:           0.0000
Df Residuals:        2667               LLR p-value:       1.0000
Converged:           1.0000             Scale:             1.0000
No. Iterations:      7.0000
---------------------------------------------------------------------------
                            Coef.    Std.Err.     z      P>|z|   [0.025   0.975]
---------------------------------------------------------------------------
Tot_Chol                    0.0844    0.0039   21.8862  0.0000    0.0768   0.0920
Blood_Pressure              0.0565    0.0033   16.9618  0.0000    0.0500   0.0631
gender_Male                -0.5908    0.0989   -5.9733  0.0000   -0.7847  -0.3969
race_Black                  1.2146    0.2245    5.4106  0.0000    0.7746   1.6546
race_Hispanic               1.5413    0.2141    7.1976  0.0000    1.1216   1.9610
race_Mexican                1.4554    0.2069    7.0354  0.0000    1.0499   1.8608
race_Other                  1.6638    0.3135    5.3063  0.0000    1.0492   2.2783
race_White                  0.7745    0.2222    3.4850  0.0005    0.3389   1.2100
foreign_born              -15.6736    0.6601  -23.7432  0.0000  -16.9675 -14.3798
USA_born                  -15.4754    0.6744  -22.9484  0.0000  -16.7971 -14.1537
Diabetes_Yes               -0.5555    0.1975   -2.8130  0.0049   -0.9425  -0.1684
MaritalStatus_Married      -0.5268    0.1062   -4.9599  0.0000   -0.7350  -0.3186
MaritalStatus_Separated    -0.9356    0.3475   -2.6929  0.0071   -1.6166  -0.2546
MaritalStatus_Widowed      -1.3172    0.2492   -5.2869  0.0000   -1.8056  -0.8289
PhysActive_Yes             -0.2477    0.0983   -2.5199  0.0117   -0.4404  -0.0550
Smoke100n_Smoker           -0.2588    0.1068   -2.4221  0.0154   -0.4682  -0.0494
===========================================================================
Next, I exponentiate the coefficients to obtain odds ratios and interpret them.
np.exp(result.params)
Tot_Chol                   1.088069e+00
Blood_Pressure             1.058178e+00
gender_Male                5.538810e-01
race_Black                 3.369063e+00
race_Hispanic              4.670792e+00
race_Mexican               4.286018e+00
race_Other                 5.279163e+00
race_White                 2.169407e+00
foreign_born               1.559674e-07
USA_born                   1.901606e-07
Diabetes_Yes               5.737946e-01
MaritalStatus_Married      5.904796e-01
MaritalStatus_Separated    3.923355e-01
MaritalStatus_Widowed      2.678771e-01
PhysActive_Yes             7.805712e-01
Smoke100n_Smoker           7.719920e-01
dtype: float64
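To interpret: an odds ratio above 1 raises the odds of obesity, and one below 1 lowers them. For example, each additional unit of total cholesterol multiplies the odds of being obese by about 1.09, holding the other predictors fixed; men have roughly 45% lower odds than women (0.55); and physically active adults have roughly 22% lower odds (0.78).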
I then use scikit-learn's LogisticRegression to fit the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
Accuracy of logistic regression classifier on test set: 0.78
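Accuracy alone can hide where the errors fall; the confusion_matrix imported earlier gives the breakdown. A minimal sketch on the same test predictions:
# Confusion matrix of the logistic predictions
# (rows: actual class, 0 = not obese / 1 = obese; columns: predicted class)
print(confusion_matrix(y_test, y_pred))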
# Does standardizing the features improve accuracy? Scale and refit to check.
from sklearn.preprocessing import scale
Xs = scale(X)
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.3, random_state=0)
logreg2 = LogisticRegression()
logreg2.fit(Xs_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
y_pred = logreg2.predict(Xs_test)  # predict on the scaled test set that matches the model
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg2.score(Xs_test, y_test)))
Accuracy of logistic regression classifier on test set: 0.78
Lo and behold, it did not! Scaling leaves the test accuracy unchanged, so I keep the logistic model fit on the unscaled data.
logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])  # AUC from predicted probabilities, not hard labels
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for LOGIT')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
The bigger the area under the curve, the better. Hence, the logistic regression does a better job of predicting obesity than a random classifier.
Here, I fit a decision tree classifier on the same training set and evaluate it in the same way as the logistic regression model.
# Create Decision Tree classifier object
dt = DecisionTreeClassifier()
# Train the Decision Tree classifier
dt_train = dt.fit(X_train,y_train)
# Predict the response for the test dataset
y_pred = dt_train.predict(X_test)
print('Accuracy of Decision Tree classifier on test set: {:.2f}'.format(dt_train.score(X_test, y_test)))
Accuracy of Decision Tree classifier on test set: 0.79
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
dot_data = StringIO()
export_graphviz(dt_train, out_file=dot_data,
filled=True, rounded=True,
special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('obese.png')
Image(graph.create_png())
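The fully grown tree above is deep and hard to read, and a tree this flexible tends to overfit the training data. I therefore refit it with entropy as the splitting criterion and the depth capped at three.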
# Create a pruned Decision Tree classifier object
dt_clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
# Train the Decision Tree classifier
dt_clf = dt_clf.fit(X_train,y_train)
# Predict the response for the test dataset
y_pred = dt_clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print('Accuracy of Decision Tree classifier on test set: {:.2f}'.format(dt_clf.score(X_test, y_test)))
Accuracy of Decision Tree classifier on test set: 0.83
dot_data = StringIO()
export_graphviz(dt_clf, out_file=dot_data,
filled=True, rounded=True,
special_characters=True, feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('obese.png')
Image(graph.create_png())
dt_roc_auc = roc_auc_score(y_test, dt_clf.predict_proba(X_test)[:,1])  # AUC from the pruned tree's probabilities
fpr, tpr, thresholds = roc_curve(y_test, dt_clf.predict_proba(X_test)[:,1])  # same (pruned) tree as the AUC above
plt.figure()
plt.plot(fpr, tpr, label='Decision Tree (area = %0.2f)' % dt_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for Decision Tree')
plt.legend(loc="lower right")
plt.savefig('DT_ROC')
plt.show()
# Instantiate model with 1000 decision trees
rf = RandomForestClassifier(n_estimators = 1000, random_state = 40, max_features = 'sqrt')
# Train the model on training data
rf.fit(X_train, y_train);
# Actual class predictions
rf_predictions = rf.predict(X_test)
# Probabilities for each class
rf_probs = rf.predict_proba(X_test)[:, 1]
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(rf.score(X_test, y_test)))
Accuracy of Random Forest classifier on test set: 0.85
rf_roc_auc = roc_auc_score(y_test, rf_probs)  # AUC from the class-1 probabilities computed above
fpr, tpr, thresholds = roc_curve(y_test, rf_probs)
plt.figure()
plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for Random Forest')
plt.legend(loc="lower right")
plt.savefig('RF_ROC')
plt.show()
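Finally, I examine which variables the random forest relied on, using its impurity-based feature importances.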
features = feature_cols
importances = rf.feature_importances_
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
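Impurity-based importances can favor continuous or high-cardinality features, which is why rfpimp's permutation importance was imported at the top. A minimal sketch of the same idea using scikit-learn's own implementation (available from version 0.22), offered as an alternative check rather than part of the original analysis:
from sklearn.inspection import permutation_importance

# Permutation importance on the held-out test set: shuffle one feature at a time
# and record the resulting drop in accuracy
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=40)
for i in perm.importances_mean.argsort()[::-1]:
    print('%-30s %.4f' % (feature_cols[i], perm.importances_mean[i]))
With the importances inspected, I close by overlaying the three models' ROC curves on one plot.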
plt.figure()
# Models to compare on a single ROC plot, using the settings chosen earlier
models = [
    {
        'label': 'Logistic Regression',
        'model': LogisticRegression(),
    },
    {
        'label': 'Decision Tree',
        'model': DecisionTreeClassifier(criterion="entropy", max_depth=3),  # the pruned tree from above
    },
    {
        'label': 'Random Forest',
        'model': RandomForestClassifier(n_estimators=1000, random_state=40, max_features='sqrt')
    }
]
# The loop below fits each model and adds its ROC curve to the plot
for m in models:
    model = m['model']  # select the model
    model.fit(X_train, y_train)  # train the model
    # Compute false positive rate and true positive rate from predicted probabilities
    fpr, tpr, thresholds = metrics.roc_curve(y_test, model.predict_proba(X_test)[:,1])
    # Calculate the area under the curve to display on the plot
    auc = metrics.roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
    # Now, plot the computed values
    plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (m['label'], auc))
# Custom settings for the plot
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity (False Positive Rate)')
plt.ylabel('Sensitivity (True Positive Rate)')
plt.title('Receiver Operating Characteristic for Logit, DT, and RF')
plt.legend(loc="lower right")
plt.show() # Display
Of the three machine learning models applied to the NHANES dataset to predict the incidence of obesity among US adults, the Random Forest performs best on both the accuracy and ROC metrics.
I also find that total cholesterol, blood pressure, and gender are the most important variables in predicting obesity.