In this exercise we will create a simple logistic regression model using the scikit-learn package. We will then create some model evaluation metrics and test the predictions against those metrics. Let's load the feature data from the first exercise.
We should always treat training a machine learning model as an iterative process: begin with a simple model, then use evaluation metrics to assess its performance and guide improvements.
import pandas as pd
feats = pd.read_csv('../data/OSI_feats_e3.csv')
target = pd.read_csv('../data/OSI_target_e2.csv')
We begin by splitting the data into training and test datasets. We will fit the model on the training dataset and evaluate its performance on the test dataset. Later in the lesson we will add a validation dataset to help us tune the hyperparameters.
We will use `test_size = 0.2`, which means that 20% of the data will be reserved for testing.
from sklearn.model_selection import train_test_split
test_size = 0.2
random_state = 42
X_train, X_test, y_train, y_test = train_test_split(feats, target, test_size=test_size, random_state=random_state)
Let's make sure our dimensions are correct.
print(f'Shape of X_train: {X_train.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of X_test: {X_test.shape}')
print(f'Shape of y_test: {y_test.shape}')
Shape of X_train: (9864, 68)
Shape of y_train: (9864, 1)
Shape of X_test: (2466, 68)
Shape of y_test: (2466, 1)
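The row counts line up: 2466 is 20% of the 12330 total rows. As a minimal, self-contained sanity check of how `train_test_split` divides rows (using small synthetic arrays, since the real data lives in the CSVs above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and target vector.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# With test_size=0.2, 20% of the 50 rows are held out for testing.
print(X_train.shape)  # (40, 2)
print(X_test.shape)   # (10, 2)
```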
We create our model by first instantiating it, then fitting it to the training data.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=42, max_iter=10000)
model.fit(X_train, y_train['Revenue'])
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=10000, multi_class='auto', n_jobs=None, penalty='l2', random_state=42, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
To test the model's performance we will predict the outcome on the test features (X_test) and compare those predictions to the real values (y_test).
y_pred = model.predict(X_test)
Now let's compare against the true values. We'll start with accuracy, which is defined as the proportion of correct predictions out of the total number of predictions.
from sklearn import metrics
accuracy = metrics.accuracy_score(y_pred=y_pred, y_true=y_test)
print(f'Accuracy of the model is {accuracy*100:.4f}%')
Accuracy of the model is 87.0641%
87.0641% - that's not bad for a simple model with little feature engineering!
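Accuracy alone can flatter a model on an imbalanced dataset, though: if only around 15% of sessions generate revenue (roughly the case here), always predicting the majority class already scores about 85%. A quick sketch of that baseline using scikit-learn's `DummyClassifier` on synthetic labels (the 15% positive rate is an assumption for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Illustrative imbalanced target: roughly 15% positives.
rng = np.random.default_rng(42)
y = rng.random(1000) < 0.15
X = np.zeros((1000, 1))  # features are ignored by the dummy model

# Always predict the most frequent class (here: False).
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
print(f'Baseline accuracy: {baseline.score(X, y):.3f}')
```

Any real model should be judged against this kind of majority-class baseline, not against 0%.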
Other common metrics for classification models are precision, recall, and F1 score. Recall is defined as the proportion of correct positive predictions relative to the total number of true positive values. Precision is defined as the proportion of correct positive predictions relative to the total number of predicted positive values. The F1 score is the harmonic mean of precision and recall: 2 times the product of precision and recall, divided by their sum.
It's useful to use these evaluation metrics in addition to accuracy when the distribution of true and false values is imbalanced. We want these values to be as close to 1.0 as possible.
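To make those definitions concrete, here is a small hand computation on toy labels (made up for illustration); the results match what `metrics.precision_recall_fscore_support` would return:

```python
import numpy as np

# Toy ground truth and predictions: 4 positives, 6 negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # correct positive predictions
fp = np.sum((y_pred == 1) & (y_true == 0))  # false alarms
fn = np.sum((y_pred == 0) & (y_true == 1))  # missed positives

precision = tp / (tp + fp)                        # 2/3
recall = tp / (tp + fn)                           # 2/4
f1 = 2 * precision * recall / (precision + recall)
print(f'Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}')
```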
precision, recall, fscore, _ = metrics.precision_recall_fscore_support(y_pred=y_pred, y_true=y_test, average='binary')
print(f'Precision: {precision:.4f}\nRecall: {recall:.4f}\nfscore: {fscore:.4f}')
Precision: 0.7347
Recall: 0.3504
fscore: 0.4745
We can see here that while the accuracy is high, the recall is much lower, which means that we're missing most of the true positive values.
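One way to see where those misses live is a confusion matrix. A toy sketch (labels made up to mimic high accuracy with low recall, not the real model's output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# 10 true positives in 50 samples; the model finds only 3 of them.
y_true = np.array([1] * 10 + [0] * 40)
y_pred = np.array([1] * 3 + [0] * 7 + [0] * 40)

# For binary labels, ravel() yields (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TP={tp} FN={fn} FP={fp} TN={tn}')
# Accuracy is 43/50 = 86%, yet 7 of the 10 positives are missed.
```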
We can look at which features are important by examining the magnitude of the coefficients. Features with larger coefficients make a greater contribution to the result. Those with a positive coefficient push the prediction toward the true result, that the customer will subscribe to the product, while those with a negative coefficient push the prediction toward the false result, that the customer will not subscribe.
As a note, since the features were not normalized to the same scale, the values of these coefficients should serve only as a rough guide to which features add predictive power.
coef_list = [f'{feature}: {coef}' for coef, feature in sorted(zip(model.coef_[0], X_train.columns.values.tolist()))]
for item in coef_list:
    print(item)
TrafficType_13: -0.9393317018656502
VisitorType_Returning_Visitor: -0.7126379729869377
Month_Dec: -0.6356666079086347
ExitRates: -0.6168306621684505
Month_Mar: -0.5531772345591857
Region_9: -0.5493990371550316
TrafficType_3: -0.5230504004211978
OperatingSystems_3: -0.5047311736766499
SpecialDay: -0.48888883272346506
BounceRates: -0.4573686067908481
Month_May: -0.4436363104925222
Month_June: -0.4225194836012355
OperatingSystems_8: -0.35057329371369783
Browser_6: -0.33033671140440707
TrafficType_6: -0.2572321108188088
TrafficType_1: -0.24969535181259417
Browser_3: -0.23765128996809284
VisitorType_New_Visitor: -0.22945892368475135
Browser_1: -0.22069737949723414
Region_7: -0.21116529737609177
Browser_13: -0.20773332314846657
Region_4: -0.20645936733062473
Browser_4: -0.18452552602906916
OperatingSystems_4: -0.17537032410289136
OperatingSystems_2: -0.17087815382440244
OperatingSystems_1: -0.14530926674716454
TrafficType_15: -0.12601954689866632
TrafficType_4: -0.12551302296797587
Browser_2: -0.12254444691952127
Region_3: -0.116409339032699
TrafficType_9: -0.09345050196986791
Browser_8: -0.07432180699436479
Browser_5: -0.06731941488695285
TrafficType_19: -0.04763319631540111
Browser_10: -0.03030326779492614
TrafficType_14: -0.02486754694456821
Region_1: -0.024392989712640506
TrafficType_18: -0.02222257922449895
TrafficType_20: -0.018331800703584155
OperatingSystems_6: -0.016786449649954342
TrafficType_7: -0.006542353054798274
TrafficType_12: -0.0032342542351401346
Browser_11: -0.002452753984304908
Informational_Duration: -0.00032045144921367014
Administrative_Duration: -0.00010008862449623993
ProductRelated_Duration: 4.6077899325827885e-05
ProductRelated: 0.003291131517956643
Administrative: 0.008809132521965357
TrafficType_2: 0.025894902253396974
Browser_7: 0.028686788285342275
Region_8: 0.029319493036519817
OperatingSystems_7: 0.03298640042309421
TrafficType_16: 0.047341484936212506
Informational: 0.08555002045301442
TrafficType_5: 0.0859889420171317
PageValues: 0.08672528112710322
Region_6: 0.09309020409318655
Month_Aug: 0.09668425308005028
Browser_12: 0.1189651797379178
is_weekend: 0.11966844048422016
Month_Sep: 0.12544889935651957
Region_2: 0.13313545468089413
TrafficType_11: 0.19223716898106263
Month_Jul: 0.21082793061040983
Month_Oct: 0.2715030204884287
TrafficType_10: 0.35298265536282414
TrafficType_8: 0.4020350043660541
Month_Nov: 0.5044070793869467
We can see from the coefficients that the traffic type is a key indicator, as is the month in which the user browsed.
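If you want coefficient magnitudes to be directly comparable, one option is to standardize the features before fitting. A minimal sketch using a scikit-learn pipeline on synthetic data (standing in for the OSI features, which are loaded from CSV above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with 5 features.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# StandardScaler gives every feature zero mean and unit variance,
# so the fitted coefficients live on a common scale.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X, y)

coefs = pipe.named_steps['logisticregression'].coef_[0]
print(coefs)
```

Putting the scaler inside the pipeline also ensures the test set is transformed with statistics learned from the training set only, avoiding data leakage.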