Lecture 7: Logistic Regression and Decision Trees

This week we're discussing more classifiers and their applications.


Logistic Regression

Logistic regression, like linear regression, is a generalized linear model. However, the final prediction of a logistic regression classifier is not a continuous value; it is a binary class label (0 or 1). The following sections explain how this works.


What is Conditional Probability?

Conditional probability is the probability that an event (A) will occur given that some condition (B) is true. For example, say you want to find the probability that a student will take the bus rather than walk to class today (A) given that it is snowing heavily outside (B). The probability that the student takes the bus when it is snowing is likely higher than the probability that they would take the bus on some other day.
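
To make this concrete, here is a small sketch that estimates a conditional probability from counts. All of the numbers are made up purely for illustration.

    # Sketch: estimating P(A | B) from made-up counts.
    # A = student takes the bus, B = it is snowing heavily.
    total_days = 200              # all observed days (hypothetical)
    days_bus = 70                 # days the student took the bus (hypothetical)
    days_snow = 40                # days it snowed heavily (hypothetical)
    days_snow_and_bus = 30        # snowy days on which the student took the bus (hypothetical)

    p_bus = days_bus / total_days                     # unconditional P(A) = 0.35
    p_bus_given_snow = days_snow_and_bus / days_snow  # P(A | B) = P(A and B) / P(B) = 0.75

    print(p_bus, p_bus_given_snow)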


An Overview

The goal of logistic regression is to take a set of datapoints and classify them. This means that we expect to have discrete outputs representing a set of classes. In simple logistic regression, this must be a binary set: our classes must be one of only two possible values. Here are some things that are sometimes modeled as binary classes:

  • Male or Female
  • Rainy or Dry
  • Democrat or Republican

The objective is to find an equation that is able to take input data and classify it into one of the two classes. Luckily, the logistic equation is suited to just such a task.


    The Logistic Equation

    The logistic equation is the basis of the logistic regression model. It looks like this:

    $$\sigma(t) = \frac{1}{1 + e^{-t}}$$

    The t in the equation is a linear combination of the n input variables - a linear function over an n-dimensional feature space. In the single-variable case t has the form ax + b; more generally, $t = b + a_1 x_1 + a_2 x_2 + \dots + a_n x_n$. Fitting a logistic regression model therefore means minimizing the error of the logistic equation by tuning the coefficients a and the intercept b of the chosen t.

    The logistic equation (also known as the sigmoid function) works as follows:

    1. Takes an input of n variables
    2. Takes a linear combination of the variables as parameter t (this is another way of saying t has the form ax+b)
    3. Outputs a value given the input and parameter t

    The output of the logistic equation is always between 0 and 1.
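
    As a quick illustration, the sketch below implements the sigmoid in NumPy and applies it to a linear combination of two features; the coefficient values and inputs are made up purely for illustration.

    import numpy as np

    def sigmoid(t):
        # the logistic (sigmoid) function maps any real t to a value strictly between 0 and 1
        return 1.0 / (1.0 + np.exp(-t))

    # illustrative coefficients and intercept for two features
    a = np.array([0.8, -1.5])
    b = 0.3

    x = np.array([2.0, 1.0])    # one observation with two feature values
    t = np.dot(a, x) + b        # linear combination: t = a1*x1 + a2*x2 + b
    print(sigmoid(t))           # prints a value between 0 and 1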

    A visualization of the outputs of the logistic equation is the familiar S-shaped (sigmoid) curve rising from 0 toward 1 (note that this is but one possible fit of a logistic regression model). [figure: sigmoid curve]


    Threshold Value

    The final output of a logistic regression classifier should be binary - that is, 0 or 1. However, you'll notice that the output of the logistic equation is a continuous value between 0 and 1 - the function output itself is not 0 or 1.

    We convert the output to a 0 or 1 by picking a threshold value. This is a value between 0 and 1 such that if f(x) > threshold, we give it the value 1, and otherwise it is 0:

    $$\hat{y} = \begin{cases} 1 & \text{if } f(x) > \epsilon \\ 0 & \text{otherwise} \end{cases}$$

    The threshold value is the epsilon ($\epsilon$) in the equation above. Usually, the threshold is set to 0.5: in binary classification, a probability greater than 0.5 for one class guarantees that it is the higher of the two probabilities - if one probability is greater than 0.5, the other must be less than 0.5, since the two probabilities sum to 1.
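
    As an example, here is a small sketch of applying a custom threshold to the probabilities produced by scikit-learn's LogisticRegression; the toy data and the 0.7 threshold are purely illustrative (predict() itself uses 0.5).

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # toy binary classification data, for illustration only
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)
    model = LogisticRegression().fit(X, y)

    # predict_proba returns P(class 0) and P(class 1) for each row; take P(class 1)
    probs = model.predict_proba(X)[:, 1]

    threshold = 0.7                          # the epsilon value
    labels = (probs > threshold).astype(int)
    print(labels[:10])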

    The threshold value epsilon determines two key characteristics of a logistic regression classifier:

    1. Sensitivity
    2. Specificity


    Sensitivity and Specificity

    The confusion matrix summarizes a classifier's predictions against the actual labels:

                            Predicted Positive      Predicted Negative
    Actual Positive         True Positive (TP)      False Negative (FN)
    Actual Negative         False Positive (FP)     True Negative (TN)

    Sensitivity, also known as the true positive rate, is the proportion of true positives out of all "actual positives" - that is, it is the proportion of positives that are correctly identified as positives.

    Sensitivity = True Positives / (True Positives + False Negatives)
    
    

    Specificity, also called the true negative rate, is the proportion of true negatives out of all "actual negatives" - that is, it is the proportion of negatives that are correctly identified as negatives.

    Specificity = True Negatives / (True Negatives + False Positives)
    
    

    There is always a trade-off between the two characteristics. Both depend on the threshold value we choose: the higher the threshold, the lower the sensitivity and the higher the specificity. If we set the threshold as high as possible (1), all points will be classified as negative, so sensitivity = 0 and specificity = 1. The opposite holds if we set the threshold as low as possible (0).
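
    To make the trade-off concrete, the sketch below computes sensitivity and specificity at a few thresholds on toy data; the data, model, and thresholds are purely illustrative.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    # toy data and model, for illustration only
    X, y = make_classification(n_samples=500, n_features=5, random_state=1)
    model = LogisticRegression().fit(X, y)
    probs = model.predict_proba(X)[:, 1]

    for threshold in (0.2, 0.5, 0.8):
        labels = (probs > threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y, labels).ravel()
        sensitivity = tp / (tp + fn)    # true positive rate
        specificity = tn / (tn + fp)    # true negative rate
        print(threshold, round(sensitivity, 3), round(specificity, 3))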


    The ROC Curve

    The ROC curve shows how well a model performs in terms of sensitivity and specificity over all possible thresholds. Sensitivity (on the y-axis) is plotted against 1 - specificity, or equivalently the false positive rate (on the x-axis), as the threshold value varies from 0 to 1. [figure: example ROC curve]
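
    A minimal sketch of computing and plotting an ROC curve with scikit-learn's roc_curve, again on illustrative toy data:

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc

    # toy data and model, for illustration only
    X, y = make_classification(n_samples=500, n_features=5, random_state=2)
    model = LogisticRegression().fit(X, y)
    probs = model.predict_proba(X)[:, 1]

    # one (fpr, tpr) point per threshold; fpr = 1 - specificity, tpr = sensitivity
    fpr, tpr, thresholds = roc_curve(y, probs)

    plt.plot(fpr, tpr, label='ROC curve (AUC = %.2f)' % auc(fpr, tpr))
    plt.plot([0, 1], [0, 1], linestyle='--', label='random guessing')
    plt.xlabel('1 - specificity (false positive rate)')
    plt.ylabel('sensitivity (true positive rate)')
    plt.legend()
    plt.show()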


    Example 1: Predicting Income from Census Data

    We'll use logistic regression to predict whether annual income is greater than $50k based on census data. You can read more about the dataset (the UCI Adult / Census Income dataset) at the UCI Machine Learning Repository.

    In [47]:
    # import necessary packages
    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.linear_model import LogisticRegression
    
    In [48]:
    inc_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None, names = ['age', 'workclass', 'fnlwgt', 'education', 'education.num', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week', 'native.country', 'income'])
    
    # drop null values
    inc_data = inc_data.dropna()
    

    The following uses LabelEncoder() in scikit-learn to encode all features as categorical integer values. Many features in this particular dataset, such as race and sex, are represented as strings with a limited number of possible values. LabelEncoder() re-labels these values as integers between 0 and n_classes - 1.
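
    For example, here is a quick illustration of what LabelEncoder() does to a small string column (the values are made up):

    from sklearn.preprocessing import LabelEncoder

    enc = LabelEncoder()
    print(enc.fit_transform(['Male', 'Female', 'Female', 'Male']))   # [1 0 0 1]
    print(enc.classes_)                                              # ['Female' 'Male']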

    In [49]:
    # 'education' is also present in numeric form as 'education.num', so drop the string column
    del inc_data['education']
    
    # convert all features to categorical integer values
    enc = LabelEncoder()
    for i in inc_data.columns:
        inc_data[i] = enc.fit_transform(inc_data[i])
    
    In [50]:
    # target is stored in y
    y = inc_data['income']
    
    # X contains all other features, which we will use to predict target
    X = inc_data.drop('income', axis=1)
    

    Here we split the data into train and test sets, where the test set is 30% of the initial dataset.

    In [51]:
    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    
    In [52]:
    # build model and fit on train set
    logit = LogisticRegression()
    logit.fit(X_train, y_train)
    
    Out[52]:
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)
    In [53]:
    # make predictions on test set
    pred_logit = logit.predict(X_test)
    pred_logit
    
    Out[53]:
    array([0, 0, 0, ..., 0, 1, 1])
    In [54]:
    # measure accuracy
    accuracy_score(y_true = y_test, y_pred = pred_logit)
    
    Out[54]:
    0.8110349063363701


    Example 2: Predict Iris Species (Setosa or Not)

    In [55]:
    # import necessary packages
    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    
    In [68]:
    from sklearn import datasets

    # load the built-in iris dataset, keeping only the first two features
    iris = datasets.load_iris()
    X = iris.data[:, :2]
    Y = iris.target

    # convert the 3-class target into a binary one: 1 = setosa, 0 = not setosa
    isSetosa = Y == 0
    isNot = Y > 0
    Y[isSetosa] = 1
    Y[isNot] = 0
    
    In [69]:
    #Here we create the train/test split
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
    
    In [70]:
    #Building the model
    logreg = LogisticRegression()
    logreg.fit(X_train, Y_train)
    
    Out[70]:
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)
    In [71]:
    #Predictions
    pred = logreg.predict(X_test)
    
    #Accuracy
    accuracy_score(y_true = Y_test, y_pred = pred)
    
    Out[71]:
    0.97777777777777775


    Multinomial Logistic Regression

    We won't discuss this in detail here, but it's worth mentioning briefly. Multinomial logistic regression is another classification algorithm. The difference is that the output isn't binary; there can be multiple possible categories for the target, as implied by the name. For example, we can use multinomial regression to predict which movie genre people will like based on their other characteristics. If you're interested in learning how this model works in more detail, there are a lot of good resources on the internet and we encourage you to explore.
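
    If you want to try it, scikit-learn's LogisticRegression supports a multinomial formulation via the multi_class='multinomial' option (with a compatible solver such as 'lbfgs'). Here is a minimal sketch on the full three-class iris target:

    from sklearn import datasets
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    iris = datasets.load_iris()
    X, Y = iris.data, iris.target    # three classes: setosa, versicolor, virginica

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

    # multinomial (softmax) logistic regression over all three classes
    multi_logit = LogisticRegression(multi_class='multinomial', solver='lbfgs')
    multi_logit.fit(X_train, Y_train)

    print(accuracy_score(Y_test, multi_logit.predict(X_test)))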


    Decision Trees

    The decision tree algorithm can be used for both classification and regression, and it has the advantage of not assuming a linear model. Decision trees are usually easy to represent visually, which makes it easy to understand how the model actually works.

    A frequently used decision tree algorithm is CART, or Classification and Regression Trees. [figure: example decision tree]


    Geometric Interpretation

    [figure: decision regions of a tree] A decision tree splits the feature space one feature at a time with axis-aligned cuts, so the resulting decision regions are rectangles (boxes in higher dimensions).


    Mathematical Formulation

    The hard part is constructing the tree from the data set. The heart of the CART algorithm lies in deciding how and where to split the data, i.e. choosing the right feature. The idea is to associate a quantitative measure with the quality of a split, so that at each step we can simply choose the split that scores best.

    A very common measure is the Shannon entropy. Given a discrete probability distribution $(p_1, p_2, \dots, p_n)$, the Shannon entropy $E(p_1, p_2, \dots, p_n)$ is: $$-\sum_{i = 1}^n p_i \log_2(p_i)$$

    At every stage, the algorithm chooses the split that most reduces this entropy - that is, the split with the largest information gain.
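
    As an illustrative sketch (with made-up labels), here is one way to compute the entropy of a set of class labels and the information gain of a candidate split:

    import numpy as np

    def entropy(labels):
        # Shannon entropy of an array of class labels
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(parent, left, right):
        # entropy of the parent node minus the weighted entropy of the two children
        n = float(len(parent))
        weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        return entropy(parent) - weighted

    # made-up labels: a mixed node split into two purer children
    parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    left = np.array([0, 0, 0, 1])
    right = np.array([0, 1, 1, 1])

    print(entropy(parent))                        # 1.0 (a perfectly mixed node)
    print(information_gain(parent, left, right))  # positive: the split reduces entropy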


    Example 1: Are Mushrooms Poisonous or Not?

    We'll use the decision tree classifier to predict whether mushrooms are poisonous. You can read about the dataset (the UCI Mushroom dataset) at the UCI Machine Learning Repository. The data shortens the categorical values to single letters, so the dataset's attribute descriptions are especially helpful.

    In [1]:
    # import necessary packages
    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.tree import export_graphviz
    
    In [2]:
    m_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', header = None, names = ['class','cap-shape','cap-surface','cap-color','bruises','odor','gill-attachment','gill-spacing','gill-size','gill-color','stalk-shape','stalk-root','stalk-surface-above-ring','stalk-surface-below-ring','stalk-color-above-ring','stalk-color-below-ring','veil-type','veil-color','ring-number','ring-type','spore-print-color','population','habitat'])
    
    # drop null values
    m_data = m_data.dropna()
    
    # convert all features to categorical integer values
    enc = LabelEncoder()
    for i in m_data.columns:
        m_data[i] = enc.fit_transform(m_data[i])
    
    In [3]:
    # target is stored in y
    y = m_data['class']
    
    # X contains all other features, which we will use to predict target
    X = m_data.drop('class', axis=1)
    

    The following note may seem self-evident, but just to be extra clear:

    In the cell above, we create X so that it contains all features except for the target variable, and we'll make predictions using X. This doesn't have to be the case, and in fact is usually not the best practice; we can pick features that we think are significant rather than using the entire dataset, and doing so often results in more accurate predictions. For simplicity's sake, however, we omit this in this example.
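
    For instance, here is a quick sketch of training on a hand-picked subset of columns instead of every feature; the choice of 'odor', 'gill-size', and 'spore-print-color' is purely illustrative, and the sketch reuses the imports and the encoded m_data from the cells above.

    # illustrative feature subset
    X_subset = m_data[['odor', 'gill-size', 'spore-print-color']]

    Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_subset, y, test_size=0.3)

    subset_tree = DecisionTreeClassifier(max_leaf_nodes=15)
    subset_tree.fit(Xs_train, ys_train)
    print(accuracy_score(ys_test, subset_tree.predict(Xs_test)))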

    In [4]:
    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    
    In [5]:
    # build model and fit on train set
    tree_classifier = DecisionTreeClassifier(max_leaf_nodes=15)
    tree_classifier.fit(X_train, y_train)
    
    Out[5]:
    DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                max_features=None, max_leaf_nodes=15, min_impurity_split=1e-07,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, presort=False, random_state=None,
                splitter='best')

    Python does not have a convenient built-in way to visually display a decision tree. If you want to see it, run the code below, which writes the tree to a file in Graphviz DOT format. Then paste the file's contents into an online Graphviz viewer to render it.

    In [9]:
    # creates a file with the decision tree plotted
    with open("decisiontree.txt", 'w') as f:
        export_graphviz(tree_classifier, out_file=f, feature_names=list(X))
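
    Alternatively, if the graphviz Python package is installed (not imported above), the DOT text can be rendered locally; a minimal sketch:

    import graphviz

    # read the DOT text written above and render it to a PNG file
    with open("decisiontree.txt") as f:
        dot_source = f.read()

    graphviz.Source(dot_source).render("decisiontree", format="png", cleanup=True)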
    
    In [7]:
    # make predictions on test set
    tree_pred = tree_classifier.predict(X_test)
    tree_pred
    
    Out[7]:
    array([0, 0, 1, ..., 0, 1, 1])
    In [8]:
    # measure accuracy
    accuracy_score(y_true = y_test, y_pred = tree_pred)
    
    Out[8]:
    0.99138638228055787


    Example 2: Predict Higgs Boson Signal

    In [72]:
    # import necessary packages
    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    
    In [73]:
    # this code is only necessary because the data is in an arff file
    from io import StringIO
    from urllib.request import urlopen
    from scipy.io.arff import loadarff

    # download the arff file and parse it into a pandas DataFrame
    data = urlopen('https://www.openml.org/data/download/2063675/phpZLgL9q').read().decode('utf-8')
    dataset = loadarff(StringIO(data))
    higgs = pd.DataFrame(dataset[0], columns=dataset[1].names())
    
    # target is stored in y
    Y = higgs['class']
    
    # X contains all other features, which we will use to predict target
    X = higgs.drop('class', axis=1)
    
    In [74]:
    # train/test split
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
    
    # build model and fit on train set
    dTree = DecisionTreeClassifier(max_leaf_nodes=15)
    dTree.fit(X_train, Y_train)
    
    Out[74]:
    DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                max_features=None, max_leaf_nodes=15, min_impurity_split=1e-07,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, presort=False, random_state=None,
                splitter='best')
    In [75]:
    # make predictions on test set
    dTree_pred = dTree.predict(X_test)
    dTree_pred
    
    Out[75]:
    array(['1', '1', '1', ..., '1', '0', '0'], dtype=object)
    In [76]:
    # measure accuracy
    accuracy_score(y_true = Y_test, y_pred = dTree_pred)
    
    Out[76]:
    0.6762345679012346