Welcome to this exercise! You'll review how to interpret the coefficients in a logistic regression model.
```python
# Import libraries that you will use in this notebook
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
```
```python
# Read in the data
data = pd.read_csv("dummy_data.csv", index_col=0)

# View a few rows of the data
data.head()
```
| | sex | age | obstruct | outcome | TRTMT |
|---|---|---|---|---|---|
| 1 | 0 | 57 | 0 | 1 | True |
| 2 | 1 | 68 | 0 | 0 | False |
| 3 | 0 | 72 | 0 | 0 | True |
| 4 | 0 | 66 | 1 | 1 | True |
| 5 | 1 | 69 | 0 | 1 | False |
Here is a description of all the fields:

- `sex` (binary): 1 if Male, 0 if Female
- `age` (int): age of patient at the beginning of the study
- `obstruct` (binary): 1 if the colon was obstructed by the tumor, 0 otherwise
- `outcome` (binary): 1 if the patient died within 5 years, 0 otherwise
- `TRTMT` (binary): 1 if the patient was treated, 0 otherwise
You'll want to pay close attention to the `TRTMT` and `outcome` columns.

- `TRTMT`: Whether a treatment is effective, and how effective it is for particular patients, is what you are interested in determining from a randomized controlled trial.
- `outcome`: To measure the effectiveness of treatment, you'll use the 5-year survival rate. This is stored in the `outcome` variable, a binary variable with two possible values: 1 indicates that the patient died, and 0 indicates that the patient did not die during the 5-year period.

The formula for computing a logistic regression has the following form:
$$\sigma(\theta^T x^{(i)}) = \frac{1}{1 + e^{\left(-\theta^T x^{(i)}\right)}}$$

where $x^{(i)}$ refers to example $i$ (a particular patient, or generally, a single row in a data table).
$\theta^T x^{(i)} = \sum_{j} \theta_j x^{(i)}_j$ is the linear combination of the features $x_1^{(i)}$, $x_2^{(i)}$, $x_3^{(i)}$, etc., weighted by the coefficients $\theta_1$, $\theta_2$, $\theta_3$, etc.
So for this example, $\theta^T x^{(i)} = \theta_{TRTMT} x^{(i)}_{TRTMT} + \theta_{AGE}x_{AGE}^{(i)} + \theta_{SEX}x^{(i)}_{SEX}$
Also, $\sigma$ is the sigmoid function, defined as $\sigma(a) = \frac{1}{1 + e^{(-a)}}$ for some variable $a$. The output of the sigmoid function ranges from 0 to 1, so it's useful for representing probabilities (whose values also range from 0 to 1).
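To make the sigmoid concrete, here is a minimal sketch in NumPy (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(a):
    """Map any real number a to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

# Even extreme inputs stay strictly between 0 and 1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# [4.53978687e-05 5.00000000e-01 9.99954602e-01]
```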
If $x^{(i)}$ is the input vector and $OUTCOME$ is the target variable, then $\sigma(\theta^T x^{(i)})$ models the probability of death within 5 years.
For example, if the data has three features, $TRTMT$, $AGE$, and $SEX$, then the patient's probability of death is estimated by:
$$Prob(OUTCOME=1) = \sigma(\theta^T x^{(i)}) = \frac{1}{1 + e^{\left(-\theta_{TRTMT} x^{(i)}_{TRTMT} - \theta_{AGE}x_{AGE}^{(i)} - \theta_{SEX}x^{(i)}_{SEX}\right)}}$$

Let's separate the data into the target variable and the features and fit a logistic regression to it. Notice that in this case you are not separating the data into train and test sets, because you're interested in the interpretation of the model, not its predictive capabilities.
```python
# Get the labels
y = data.outcome

# Get the features (exclude the label)
X = data.drop('outcome', axis=1)

# Fit the logistic regression on the features and labels
classifier = LogisticRegression(solver='lbfgs').fit(X, y)
```
Looking at the underlying equation, you can't interpret the model in the same way as with a regular linear regression.
Let's introduce the concept of odds, and you'll see how this helps with the interpretation of the logistic regression.
If an outcome is binary (either an event happens or the event doesn't happen), the odds of the event are defined as the probability that it happens divided by the probability that it doesn't.

Going back to the logistic regression, recall that the sigmoid function $\sigma$ ranges between 0 and 1, and so it's a useful function for representing a probability.

The odds, defined in terms of the probability $p$ of an event, are:
$$\mathrm{odds} = \frac{p}{1 - p}$$

Substitute $p = \sigma(\theta^T x^{(i)})$ to get:

$$\mathrm{odds} = \frac{\sigma(\theta^T x)}{1 - \sigma(\theta^T x)}$$
Substitute the definition of the sigmoid, $\sigma(\theta^T x^{(i)}) = \frac{1}{1 + e^{(-\theta^T x)}}$:

$$\mathrm{odds} = \frac{\frac{1}{1 + e^{(-\theta^T x)}}}{1 - \frac{1}{1 + e^{(-\theta^T x)}}}$$
Multiply top and bottom by $1 + e^{(-\theta^T x)}$ and simplify to get:

$$\mathrm{odds} = \frac{1}{\left( 1 + e^{(-\theta^T x)} \right) - 1} = \frac{1}{e^{(-\theta^T x)}}$$
Do some more cleanup to get:

$$\mathrm{odds} = e^{\left(\theta^T x^{(i)}\right)}$$
So what is this saying? It says that the odds of the event are simply the exponential of the linear combination $\theta^T x^{(i)}$.
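You can sanity-check this identity numerically. A minimal sketch, using an arbitrary value for $\theta^T x^{(i)}$:

```python
import numpy as np

theta_x = 1.5                             # arbitrary value for theta^T x
p = 1.0 / (1.0 + np.exp(-theta_x))        # probability from the sigmoid
odds = p / (1.0 - p)                      # odds = p / (1 - p)
print(np.isclose(odds, np.exp(theta_x)))  # True: the odds equal e^(theta^T x)
```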
Since the exponential of something isn't necessarily easier to think about, you can apply one more transformation to get rid of the exponential, coming up next.
Note that the inverse of exponentiation is the natural log. So if you want to "remove" the exponential $e$, you can apply the natural log function, which we'll write as $\mathrm{log}$. You may have seen the natural log written as $\mathrm{ln}$ as well, but we'll use $\mathrm{log}$ because Python libraries usually name the natural log function `log` (for example, NumPy's `np.log`).
Note that the log of the odds is known as the logit function:

$$\mathrm{logit}(a) = \mathrm{log}\frac{a}{1-a}$$
Apply the $\mathrm{log}()$ to the odds:
$$\text{logit} = \log(\text{odds}) = \log\left(\frac{p}{1 - p}\right)= \log\left( e^{\left(\theta^T x^{(i)}\right)}\right) = \theta^T x^{(i)}$$

So, what's nice about this?
The logit is linear in the features: each coefficient $\theta_j$ tells you exactly how much the logit changes when the feature $x_j$ increases by one unit. This is an improvement in the interpretability of your model.
For a small example, let's say the coefficient for age is 0.2, patient A has age=40, and the logit for patient A is 3.

$$\text{logit} = \theta_{age} \times x_{age} + \cdots $$
Patient A (now):
$$ 3 = \theta_{age} \times 40 + \cdots $$
If you increase patient A's age by 1 year, then this increases the logit by 0.2 (which is the coefficient for age).
Patient A (one year older):
$$ 3 + 0.2 = 0.2 \times (40 + 1) + \cdots $$
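A quick numeric check of this example. Here the remaining (non-age) terms are set to a hypothetical $-5$ so that patient A's logit comes out to 3:

```python
theta_age = 0.2
other_terms = -5.0                          # hypothetical sum of all non-age terms
logit_now = theta_age * 40 + other_terms    # patient A at age 40 -> logit of 3.0
logit_older = theta_age * 41 + other_terms  # patient A one year older -> 3.2
print(logit_older - logit_now)              # ~0.2, i.e. exactly theta_age
```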
A nice feature of the logit (log odds) is its range of possible values: the logit can be any real number, from $-\infty$ to $+\infty$.

One way to see this is to look at the ranges of values for the sigmoid, the odds, and the logit: the sigmoid ranges from 0 to 1, so the odds $\frac{p}{1-p}$ range from 0 to $+\infty$, and taking the log stretches that range out to $-\infty$ to $+\infty$.
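The sketch below traces those ranges numerically: the sigmoid stays in $(0, 1)$, the odds stay positive, and the logit recovers the original linear combination, which can be any real number.

```python
import numpy as np

a = np.array([-20.0, -5.0, 0.0, 5.0, 20.0])  # arbitrary values of theta^T x
p = 1.0 / (1.0 + np.exp(-a))                 # sigmoid: strictly between 0 and 1
odds = p / (1.0 - p)                         # odds: between 0 and +infinity
logit = np.log(odds)                         # logit: any real number
print(np.allclose(logit, a))                 # True (up to floating-point error)
```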
To check the coefficients of the model, you can use the model's `coef_` attribute.
```python
# Get the coefficients (the thetas, or weights for each feature)
thetas = classifier.coef_
thetas
```

```
array([[-0.21704833,  0.0460642 ,  0.37798496, -0.418984  ]])
```
This will return a numpy array containing the coefficient for each feature variable. Let's print it in a nicer way:
```python
# Print the name of the feature and the coefficient for each feature
for i in range(len(X.columns)):
    print("Feature {:<9s}: coefficient = {:<10f}".format(X.columns[i], thetas[0, i]))
```

```
Feature sex      : coefficient = -0.217048
Feature age      : coefficient = 0.046064
Feature obstruct : coefficient = 0.377985
Feature TRTMT    : coefficient = -0.418984
```
The coefficient for `age` is 0.046. This means that when the `age` variable increases by one, the logit increases by 0.046.
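You can verify this with the fitted model: `decision_function` returns $\theta^T x^{(i)}$ plus the intercept, i.e., the logit. Bumping a patient's age by one year should change it by exactly the age coefficient (the first row of `X` is used here just as an example):

```python
# Take one patient, and a copy of that patient who is one year older
x = X.iloc[[0]].copy()
x_older = x.copy()
x_older['age'] += 1

# The two logits differ by exactly the age coefficient (~0.046)
print(classifier.decision_function(x_older) - classifier.decision_function(x))
```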
In order to fully leverage the information that the odds provide, there's one more very useful concept: the "Odds Ratio", which we will write as OR for short.
The OR allows you to compare the odds of one situation versus another (by dividing one odds by another odds).
When computing the OR for binary variables, it's defined as the odds when the variable is 1 divided by the odds when the variable is 0. For example:
$$OR_{TRTMT} = \frac{\text{odds}(TRTMT=1)}{\text{odds}(TRTMT=0)}$$

In contrast, when computing the OR for continuous variables, it's defined as the ratio between the odds of the variable plus one unit and the odds of the variable. For example:
$$OR_{age} = \frac{\text{odds}(age+1)}{\text{odds}(age)}$$

In both cases, after applying the appropriate algebra to the formula, you should find that the OR for a variable is equal to $e$ raised to the power of the coefficient associated with it.
$$OR_{x_j} = e^{\theta_j}$$

Let's try it for the variable $AGE$:
$$OR_{age} = \frac{\text{odds}(age+1)}{\text{odds}(age)} = \frac{e^{\left(\theta_{INTERCEPT} + \theta_{SEX}x^{(i)}_{SEX} + \theta_{AGE}(1+x_{AGE}^{(i)}) + \theta_{OBSTRUCT}x_{OBSTRUCT}^{(i)} + \theta_{TRTMT} x^{(i)}_{TRTMT}\right)}}{e^{\left(\theta_{INTERCEPT} + \theta_{SEX}x^{(i)}_{SEX} + \theta_{AGE}x_{AGE}^{(i)} + \theta_{OBSTRUCT}x_{OBSTRUCT}^{(i)} + \theta_{TRTMT} x^{(i)}_{TRTMT}\right)}} = e^{\theta_{AGE}}$$

The case for binary variables is quite similar. You can see the derivation for the binary variable `TRTMT` in this week's graded assignment.
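Notice that $\theta_{INTERCEPT}$ appears in both the numerator and the denominator above, so it cancels out of the ratio. In scikit-learn, the intercept is not part of `coef_`; it's stored separately, in case you ever want to reconstruct the full logit (the reconstruction line below is just a sketch of that):

```python
# theta_INTERCEPT is stored apart from the feature coefficients
print(classifier.intercept_)

# Reconstructing the full logit for every patient: theta^T x + intercept
logits = X.to_numpy(dtype=float) @ classifier.coef_[0] + classifier.intercept_[0]
```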
Now, let's compute the ORs for the feature variables:
```python
# Compute the Odds Ratios for each feature
odds_ratios = np.exp(thetas)
odds_ratios
```

```
array([[0.80489107, 1.04714164, 1.45934099, 0.65771472]])
```
```python
# Display the coefficient and odds ratio for each feature
for i in range(len(X.columns)):
    print("Feature {:<10s}: coefficient = {:<10f} // OR = {:.2f}".format(X.columns[i], thetas[0, i], odds_ratios[0, i]))
```

```
Feature sex       : coefficient = -0.217048 // OR = 0.80
Feature age       : coefficient = 0.046064  // OR = 1.05
Feature obstruct  : coefficient = 0.377985  // OR = 1.46
Feature TRTMT     : coefficient = -0.418984 // OR = 0.66
```
A negative coefficient, like the one for `TRTMT`, indicates that the feature reduces the outcome (reduces the risk of death), but it's not easy to interpret how much a coefficient value of -0.419 actually reduces the risk of death.

However, now that you have calculated the Odds Ratio of 0.66 for `TRTMT`, you can interpret this value: receiving treatment multiplies the odds of dying within 5 years by 0.66. In other words, the odds of death for a treated patient are about 34% lower than the odds for an otherwise identical untreated patient.
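A minimal sketch that reproduces this number from the fitted model: take any patient, toggle `TRTMT` on and off, and compare the two odds (the first row of `X` is used here just as an example):

```python
# The same patient, with and without treatment
x_treated = X.iloc[[0]].copy()
x_untreated = X.iloc[[0]].copy()
x_treated['TRTMT'] = True
x_untreated['TRTMT'] = False

# Probability of death (class 1) in each scenario
p_t = classifier.predict_proba(x_treated)[0, 1]
p_u = classifier.predict_proba(x_untreated)[0, 1]

# The ratio of the two odds is the Odds Ratio for TRTMT: ~0.66 = exp(theta_TRTMT)
print((p_t / (1 - p_t)) / (p_u / (1 - p_u)))
```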
As you can see from the Odds Ratios, the given data indicates that the following reduce the risk of death:

- `TRTMT` (OR = 0.66): receiving treatment
- `sex` (OR = 0.80): being male (`sex` = 1)

Both have an Odds Ratio below 1, meaning the odds of death decrease when the variable equals 1.
Congratulations on finishing this notebook! By now, you should have a clearer understanding of how to interpret a logistic regression model, as well as a better intuition for odds and odds ratios. Great job!