Lecture 7: Logistic Regression and Decision Trees

This week we're discussing more classifiers and their applications.

Logistic Regression

Logistic regression, like linear regression, is a generalized linear model. However, the final output of a logistic regression model is not continuous; it's binary (0 or 1). The following sections will explain how this works.

What is Conditional Probability?

Conditional probability is the probability that an event (A) will occur given that some condition (B) is true. For example, say you want to find the probability that a student will take the bus rather than walk to class today (A) given that it's snowing heavily outside (B). The probability that the student takes the bus when it's snowing is likely higher than the probability that they would take the bus on some other day.
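As a concrete illustration, P(A | B) can be estimated from counts: of the days it snowed, what fraction also saw the student take the bus? The numbers below are invented for the example.

```python
# Estimating a conditional probability from (made-up) observation counts.
# P(A | B) = P(A and B) / P(B)
days_snowing = 40            # days where B (snow) occurred
days_snowing_and_bus = 34    # days where both A (bus) and B occurred

p_bus_given_snow = days_snowing_and_bus / days_snowing
print(p_bus_given_snow)      # 0.85
```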

An Overview

The goal of logistic regression is to take a set of datapoints and classify them. This means that we expect to have discrete outputs representing a set of classes. In simple logistic regression, this must be a binary set: our classes must be one of only two possible values. Here are some things that are sometimes modeled as binary classes:

• Male or Female
• Rainy or Dry
• Democrat or Republican
The objective is to find an equation that can take input data and classify it into one of the two classes. Luckily, the logistic equation is suited to just such a task.

The Logistic Equation

The logistic equation is the basis of the logistic regression model. It looks like this:

$$f(t) = \frac{1}{1 + e^{-t}}$$

The t in the equation is some linear combination of n variables, or a linear function in an n-dimensional feature space. The formulation of t therefore has the form ax + b (for multiple features, a and x are vectors and ax is their dot product). In fitting a logistic regression model, the goal is therefore to minimize the error of the logistic equation with the chosen t (of the form ax + b) by tuning a and b.
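To make the fitting process concrete, here is a minimal sketch of tuning a and b by gradient descent on the log-loss over a toy one-dimensional dataset. The data, learning rate, and iteration count are all invented for illustration; scikit-learn's actual solvers are more sophisticated.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# toy 1-D data: class 1 tends to have larger x (made up for illustration)
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

# fit t = a*x + b by gradient descent on the average log-loss
a, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    p = sigmoid(a * x + b)
    # gradient of the average log-loss with respect to a and b
    a -= lr * np.mean((p - y) * x)
    b -= lr * np.mean(p - y)

print(sigmoid(a * 2.0 + b) > 0.5)   # True: a clearly positive x lands in class 1
```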

The logistic equation (also known as the sigmoid function) works as follows:

1. Takes an input of n variables
2. Takes a linear combination of the variables as parameter t (this is another way of saying t has the form ax+b)
3. Outputs a value given the input and parameter t

The output of the logistic equation is always between 0 and 1.
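The three steps above can be sketched directly in Python; the coefficients a and b below are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(t):
    """The logistic (sigmoid) function: maps any real t into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# t is a linear combination of the input features: t = a*x + b
a, b = 2.0, -1.0                      # illustrative coefficients
x = np.array([-3.0, 0.0, 0.5, 3.0])   # illustrative inputs

probs = sigmoid(a * x + b)
print(probs)   # every value lies strictly between 0 and 1
```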

A visualization of the outputs of the logistic equation is shown below (note that this is just one possible output of a logistic regression model):

Threshold Value

The final output of a logistic regression model should be a binary set of numbers - that is, 0 or 1. However, you'll notice that the output of the logistic equation is a continuous set of numbers between 0 and 1 - the function output itself is never exactly 0 or 1.

We convert the output to a 0 or 1 by picking a threshold value. This is a value between 0 and 1 such that if f(x) > threshold, we assign the class 1; otherwise we assign the class 0.
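A minimal sketch of thresholding, with made-up probabilities:

```python
import numpy as np

def classify(probs, threshold=0.5):
    # map each probability to 1 if it exceeds the threshold, else 0
    return (np.asarray(probs) > threshold).astype(int)

probs = [0.1, 0.4, 0.6, 0.9]
print(classify(probs))        # [0 0 1 1]
print(classify(probs, 0.8))   # [0 0 0 1] - raising the threshold flips 0.6 to 0
```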

The threshold value is the epsilon value in the equation. Usually, the threshold value is set to 0.5: in binary classification, a probability greater than 0.5 for one class guarantees that it is the higher probability - if one probability is greater than 0.5, then the other must be less than 0.5, as the two probabilities must sum to 1.

The threshold value epsilon determines two key characteristics of a logistic regression classifier:

1. Sensitivity
2. Specificity

Sensitivity and Specificity

The Confusion Matrix

Sensitivity, also known as the true positive rate, is the proportion of true positives out of all "actual positives" - that is, it is the proportion of positives that are correctly identified as positives.

Sensitivity = True Positives / (True Positives + False Negatives)



Specificity, also called the true negative rate, is the proportion of true negatives out of all "actual negatives" - that is, it is the proportion of negatives that are correctly identified as negatives.

Specificity = True Negatives / (True Negatives + False Positives)



There is always a trade-off between the two characteristics. Both depend on the threshold value we choose; the higher the threshold, the lower the sensitivity and the higher the specificity. If we have an arbitrarily high threshold value (i.e. 1), all points will be classified as negative; sensitivity = 0 and specificity = 1. The opposite will be true if we set the threshold to be arbitrarily low (i.e. 0).
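This trade-off can be checked numerically. The sketch below computes sensitivity and specificity at a few thresholds from made-up labels and probabilities; as the threshold rises, sensitivity falls and specificity rises.

```python
import numpy as np

def sensitivity_specificity(y_true, probs, threshold):
    y_true = np.asarray(y_true)
    preds = (np.asarray(probs) > threshold).astype(int)
    tp = np.sum((preds == 1) & (y_true == 1))   # true positives
    fn = np.sum((preds == 0) & (y_true == 1))   # false negatives
    tn = np.sum((preds == 0) & (y_true == 0))   # true negatives
    fp = np.sum((preds == 1) & (y_true == 0))   # false positives
    return tp / (tp + fn), tn / (tn + fp)

# made-up labels and predicted probabilities
y_true = [0, 0, 1, 1, 1]
probs  = [0.2, 0.6, 0.4, 0.7, 0.9]

for thr in (0.3, 0.5, 0.8):
    sens, spec = sensitivity_specificity(y_true, probs, thr)
    print(thr, sens, spec)   # sensitivity drops, specificity climbs
```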

The ROC Curve

The ROC curve represents how well a model performs in terms of sensitivity and specificity over all possible thresholds. Sensitivity (on the y-axis) is plotted against 1-specificity, or equivalently the false positive rate (on the x-axis) as the threshold value varies from 0 to 1. An example:
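A sketch of computing the points of an ROC curve with scikit-learn's roc_curve (the labels and probabilities below are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# toy labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
probs  = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])

# fpr = 1 - specificity, tpr = sensitivity; one point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, probs)

# area under the curve summarizes the whole trade-off:
# 1.0 is a perfect classifier, 0.5 is chance
print(auc(fpr, tpr))   # 0.875
```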

Example 1: Predicting Income from Census Data

We'll use logistic regression to predict whether annual income is greater than $50k based on census data. You can read more about the dataset here.

In [47]:
# import necessary packages
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

In [48]:
inc_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None, names = ['age', 'workclass', 'fnlwgt', 'education', 'education.num', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week', 'native.country', 'income'])

# drop null values
inc_data = inc_data.dropna()

The following uses LabelEncoder() from scikit-learn to encode all features as categorical integer values. Many features in this particular dataset, such as race and sex, are represented as strings with a limited number of possible values. LabelEncoder() re-labels these values as integers between 0 and (number of classes - 1).

In [49]:
# the education column is present in both categorical and numeric form
del inc_data['education']

# convert all features to categorical integer values
enc = LabelEncoder()
for i in inc_data.columns:
    inc_data[i] = enc.fit_transform(inc_data[i])

In [50]:
# target is stored in y
y = inc_data['income']

# X contains all other features, which we will use to predict target
X = inc_data.drop('income', axis=1)

Here we split the data into train and test sets, where the test set is 30% of the initial dataset.
In [51]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [52]:
# build model and fit on train set
logit = LogisticRegression()
logit.fit(X_train, y_train)

Out[52]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [53]:
# make predictions on test set
pred_logit = logit.predict(X_test)
pred_logit

Out[53]:
array([0, 0, 0, ..., 0, 1, 1])

In [54]:
# measure accuracy
accuracy_score(y_true = y_test, y_pred = pred_logit)

Out[54]:
0.8110349063363701

Example 2: Predict Iris Species

In [55]:
# import necessary packages
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [68]:
from sklearn import datasets

# here we load the built-in iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]
Y = iris.target

# relabel the target as a binary class: 1 if setosa, 0 otherwise
isSetosa = Y == 0
isNot = Y > 0
Y[isSetosa] = 1
Y[isNot] = 0

In [69]:
# here we create the train/test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

In [70]:
# building the model
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

Out[70]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [71]:
# predictions
pred = logreg.predict(X_test)

# accuracy
accuracy_score(y_true = Y_test, y_pred = pred)

Out[71]:
0.97777777777777775

Multinomial Logistic Regression

We won't discuss this in detail here, but it's worth mentioning briefly. Multinomial logistic regression is another classification algorithm.
The difference is that the output isn't binary; there can be multiple possible categories for the target, as implied by the name. For example, we can use multinomial regression to predict which movie genre people will like based on their other characteristics. If you're interested in learning how this model works in more detail, there are a lot of good resources on the internet and we encourage you to explore.

Decision Trees

The decision tree algorithm can be used for both classification and regression, and it has the advantage of not assuming a linear model. Decision trees are usually easy to represent visually, which makes it easy to understand how the model actually works. A frequently used decision tree algorithm is CART, or Classification and Regression Trees.

Geometric Interpretation

Mathematical Formulation

The hard part is constructing the tree from the data set. The heart of the CART algorithm lies in deciding how and where to split the data (choosing the right feature). The idea is to associate a quantitative measure with the quality of a split, so that we can simply choose the best feature to split on. A very common measure is the Shannon entropy: given a discrete probability distribution $(p_1, p_2, \dots, p_n)$, the Shannon entropy $E(p_1, p_2, \dots, p_n)$ is:

$$E(p_1, p_2, \dots, p_n) = -\sum_{i = 1}^n p_i \log_2(p_i)$$

The goal of the algorithm is to take the necessary steps to minimize this entropy, by choosing the right features at every stage to accomplish this.
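A quick sketch of the entropy measure itself, evaluated on two hypothetical class distributions:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy of a discrete distribution (p_1, ..., p_n)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # 1.0 - a 50/50 split is maximally impure
print(shannon_entropy([1.0]))        # 0.0 - a pure node has zero entropy
```

Splits that drive node distributions toward purity (low entropy) are preferred.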

Example 1: Are Mushrooms Poisonous or Not?

We'll use the decision tree classifier to predict whether mushrooms are poisonous. You can read about the dataset we're using here. The data shortens the categorical variables to just letters, so the data overview is especially helpful.

In [1]:
# import necessary packages
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

In [2]:
m_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', header = None, names = ['class','cap-shape','cap-surface','cap-color','bruises','odor','gill-attachment','gill-spacing','gill-size','gill-color','stalk-shape','stalk-root','stalk-surface-above-ring','stalk-surface-below-ring','stalk-color-above-ring','stalk-color-below-ring','veil-type','veil-color','ring-number','ring-type','spore-print-color','population','habitat'])

# drop null values
m_data = m_data.dropna()

# convert all features to categorical integer values
enc = LabelEncoder()
for i in m_data.columns:
    m_data[i] = enc.fit_transform(m_data[i])

In [3]:
# target is stored in y
y = m_data['class']

# X contains all other features, which we will use to predict target
X = m_data.drop('class', axis=1)


The following note may seem self-evident, but just to be extra clear:

In the cell above, we create X so that it contains all features except for the target variable, and we'll make predictions using X. This doesn't have to be the case, and in fact is usually not the best practice; we can pick features that we think are significant rather than using the entire dataset, and doing so often results in more accurate predictions. For simplicity's sake, however, we omit this in this example.

In [4]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [5]:
# build model and fit on train set
tree_classifier = DecisionTreeClassifier(max_leaf_nodes=15)
tree_classifier.fit(X_train, y_train)

Out[5]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=15, min_impurity_split=1e-07,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')

Python does not have a good built-in method to visually display a decision tree. If you want to see it, run the code below, which writes the tree to a file in Graphviz format. Then go here, where you can paste the file's contents into the webpage and render it.

In [9]:
# creates a file with the decision tree plotted
with open("decisiontree.txt", 'w') as f:
    export_graphviz(tree_classifier, out_file=f, feature_names=list(X))

In [7]:
# make predictions on test set
tree_pred = tree_classifier.predict(X_test)
tree_pred

Out[7]:
array([0, 0, 1, ..., 0, 1, 1])
In [8]:
# measure accuracy
accuracy_score(y_true = y_test, y_pred = tree_pred)

Out[8]:
0.99138638228055787

Example 2: Predict Higgs Boson Signal

In [72]:
# import necessary packages
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [73]:
# this code is only necessary because the data is in an ARFF file
# ('higgs.arff' below is a placeholder path; the original file location was not given)
from scipy.io import arff

dataset = arff.loadarff('higgs.arff')
higgs = pd.DataFrame(dataset[0], columns=dataset[1].names())

# target is stored in y
Y = higgs['class']

# X contains all other features, which we will use to predict target
X = higgs.drop('class', axis=1)

In [74]:
# train/test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

# build model and fit on train set
dTree = DecisionTreeClassifier(max_leaf_nodes=15)
dTree.fit(X_train, Y_train)

Out[74]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=15, min_impurity_split=1e-07,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
In [75]:
# make predictions on test set
dTree_pred = dTree.predict(X_test)
dTree_pred

Out[75]:
array(['1', '1', '1', ..., '1', '0', '0'], dtype=object)
In [76]:
# measure accuracy
accuracy_score(y_true = Y_test, y_pred = dTree_pred)

Out[76]:
0.6762345679012346