In the previous lecture, we learned about linear regression, which models a **linear** relationship between the independent and dependent variables. Logistic regression is analogous to linear regression in that both are generalized linear models. However, there are several key differences that make logistic regression behave very differently. Let us explore those differences and understand how logistic regression works.

One key difference between logistic regression and linear regression is that the final output of a logistic regression model is binary, that is, 0 or 1, whereas linear regression has no such property. Thus a logistic regression model always maps from the real numbers to the binary set {0, 1}. Let us examine how logistic regression does this.

The key equation underlying the model is the **logistic function**, formulated as below:

F(t) = 1 / (1 + e^(-t))

In this case, t is a linear combination of the n variables, i.e. a linear function over an n-dimensional feature space: t = β0 + β1x1 + ... + βnxn. The formulation of t is therefore identical to the linear regression formula.

To summarise the logistic equation:

- Takes an input of n variables
- Takes a linear combination of the variables as parameter t
- Using the parameter, outputs a value that always lies between 0 and 1
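As a quick sketch (the coefficients and inputs below are made-up illustrative values, not fitted parameters), the logistic function and its linear-combination parameter can be written directly in R:

```r
# Logistic function: maps any real number t to a value in (0, 1)
logistic <- function(t) {
  1 / (1 + exp(-t))
}

# t is a linear combination of the input variables,
# e.g. for two variables (illustrative coefficients):
b0 <- -1; b1 <- 0.5; b2 <- 2
x1 <- 3;  x2 <- -0.5
t <- b0 + b1 * x1 + b2 * x2

logistic(t)  # always strictly between 0 and 1
```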

A visualization of the outputs of the logistic equation is shown below (note that this is only one possible output of a logistic regression model):

It's important to realize that logistic regression should output a **binary** value, namely 0 or 1. While the logistic equation does produce an output between 0 and 1, that output is continuous. So how do we convert it to 0 or 1?

We use something called a threshold value: if F(x) > threshold, the model outputs 1; otherwise, it outputs 0. As a general formula:

prediction = 1 if F(x) > ε, otherwise 0

The threshold is the epsilon (ε) value in the equation, and it is a key parameter in logistic regression because it determines two key characteristics of a logistic regression classifier:

- **Sensitivity**
- **Specificity**
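As a minimal sketch in R (the probabilities and the 0.5 threshold are illustrative), the thresholding step converts the continuous logistic output into a binary prediction:

```r
threshold <- 0.5                        # the epsilon value; illustrative choice
probs <- c(0.10, 0.45, 0.51, 0.90)      # sample outputs of the logistic equation
preds <- as.numeric(probs > threshold)  # 1 if above the threshold, else 0
preds                                   # 0 0 1 1
```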

**The Confusion Matrix**

The confusion matrix is a good representation of the predictive power of a logistic regression model.

**Sensitivity**, otherwise known as a **True Positive Rate**, is the proportion of true positives out of the entire pool of "actual positives."

```
The formula is True Positive / ( True Positive + False Negatives )
```

**Specificity**, otherwise known as a **True Negative Rate**, is the proportion of true negatives out of the entire pool of "actual negatives."

```
The formula is True Negative / ( True Negative + False Positives )
```
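The two formulas can be computed directly from confusion-matrix counts; the counts below are made up purely for illustration:

```r
# Illustrative confusion-matrix counts
TP <- 40; FN <- 10   # the "actual positives" pool
TN <- 35; FP <- 15   # the "actual negatives" pool

sensitivity <- TP / (TP + FN)  # true positive rate: 40 / 50 = 0.8
specificity <- TN / (TN + FP)  # true negative rate: 35 / 50 = 0.7
c(sensitivity = sensitivity, specificity = specificity)
```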

It is important to understand that there will always be a trade-off between the two characteristics. This trade-off is best understood in terms of how we set our threshold value.

Let us consider the trivial cases: classify everything as the same value.
If we classify all points as positive, **sensitivity = 1 and specificity = 0**. Every actual positive has been classified correctly, but every actual negative has been classified as positive as well.

On the contrary, if everything is classified as negative, **sensitivity = 0 and specificity = 1**. All negative points have been classified correctly, but no positive point has.

Sensitivity **decreases** as the threshold grows, since the predictor will classify more and more positive points incorrectly.
Specificity **increases** as the threshold grows, since the predictor will classify more and more negative points correctly.
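This behaviour can be seen by sweeping the threshold over a small made-up set of scores and true labels:

```r
scores <- c(0.2, 0.4, 0.6, 0.8, 0.3, 0.7)  # illustrative logistic outputs
labels <- c(0,   0,   1,   1,   1,   0)    # true classes

for (th in c(0.25, 0.50, 0.75)) {
  preds <- as.numeric(scores > th)
  sens <- sum(preds == 1 & labels == 1) / sum(labels == 1)
  spec <- sum(preds == 0 & labels == 0) / sum(labels == 0)
  cat(sprintf("threshold %.2f: sensitivity %.2f, specificity %.2f\n", th, sens, spec))
}
# As the threshold grows, sensitivity falls and specificity rises
```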

This trade-off is represented by the ROC curve, which tells us how well a model performs in terms of sensitivity and specificity. A sample ROC curve is shown below.

In [10]:

```
# Preprocess iris to create a binary case
iris_mod <- iris
iris_mod <- dplyr::mutate(iris_mod, is_setosa = as.numeric(Species == 'setosa'))[-5]
head(iris_mod)
# Split into train and test sets (80/20)
set.seed(1)  # fix the RNG so the split is reproducible
ind <- sample(nrow(iris_mod), 0.8 * nrow(iris_mod))
train <- iris_mod[ind, ]
test <- iris_mod[-ind, ]
```

In [11]:

```
# Use glm function to predict species just with Sepal.Width
fit_logit <- glm(data=train,family=binomial,formula = is_setosa ~ Sepal.Width)
print(summary(fit_logit))
```

In [14]:

```
# Predict on the test set; type = "response" returns probabilities
pred_logit <- predict(object = fit_logit, newdata = test, type = "response")
# Use the roc function from the pROC library
library(pROC)
roc_curve <- roc(predictor = pred_logit, response = test$is_setosa)
plot(roc_curve)
auc(roc_curve)
```

The closer the area under the ROC curve (AUC) is to 1, the stronger the model. An AUC of 0.5, the diagonal line, corresponds to a model with no discriminative power, such as one that classifies everything as the same class.

In [15]:

```
auc(roc_curve)
```

Linear and logistic regression aren't the only ways to make predictions; we can also use a method called classification and regression trees, or CART.

CART builds what is called a tree by splitting on the values of the independent variables. To predict the outcome for a new observation or case, you follow the splits in the tree and, at the end, predict the most frequent outcome among the training observations that followed the same path. Some advantages of CART are that it does not assume a linear model, unlike logistic regression or linear regression, and that it is very easy to interpret how the model works.

Let's make a simple CART model. We'll be attempting to predict Supreme Court decisions, as mentioned in lecture.

In [2]:

```
install.packages("rpart")
library(rpart)
install.packages("rpart.plot")
library(rpart.plot)
TrainCourt = read.csv("resources/TrainCourt.csv")
TestCourt = read.csv("resources/TestCourt.csv")
SupremeCourtTree = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt +
                         Unconst, data = TrainCourt, method = "class", minbucket = 25)
```

In the above code, we're trying to predict the value of Reverse (whether or not the Supreme Court will reverse a lower court's decision) using the Circuit, Issue, Petitioner, Respondent, LowerCourt, and Unconst features, all found in the TrainCourt dataset.

The minbucket parameter controls how many splits are made in our tree by setting the minimum number of observations allowed in each terminal bucket of the tree. If it's too small, overfitting will occur (variance); if it's too large, the model will be too simple and inaccurate (bias). You'll learn more about bias vs. variance in future lectures.

It's very easy to visualize our CART model, which can be done with the prp function. This is one of the reasons CART is more interpretable than logistic regression: we can see exactly how it works.

In [8]:

```
prp(SupremeCourtTree)
```

The predict() function allows us to apply our model and predict future cases.

In [16]:

```
PredictCART = predict(SupremeCourtTree, newdata = TestCourt, type = "class")
```

**Key Terms**

- Logistic Equation / Logistic function
- Confusion Matrix
- Sensitivity
- Specificity