Spam classification using logistic regression

We have 4,601 emails, each described by 57 features. The data can be found on GitHub, and the column names can be found here. We want to predict whether an email is spam, so we have a binary response, where 1 indicates spam and 0 indicates non-spam.

You can find some utility functions on GitHub.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
from classifiers import read_spam_data, transform_log, transform_binary
#from sklearn.linear_model import LogisticRegression # reference sklearn implementation
from classifiers import LogisticRegression

train_data, test_data = read_spam_data()
train_data.head()
Out[1]:
spam word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order ... word_freq_conference char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total
0 1 0.21 0.28 0.5 0.0 0.14 0.28 0.21 0.07 0.00 ... 0.0 0.0 0.132 0.0 0.372 0.180 0.048 5.114 101.0 1028.0
1 1 0.00 0.00 0.0 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.0 0.0 0.137 0.0 0.137 0.000 0.000 3.537 40.0 191.0
2 1 0.00 0.00 0.0 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.0 0.0 0.135 0.0 0.135 0.000 0.000 3.537 40.0 191.0
3 1 0.00 0.00 0.0 0.0 1.85 0.00 0.00 1.85 0.00 ... 0.0 0.0 0.223 0.0 0.000 0.000 0.000 3.000 15.0 54.0
4 1 0.00 0.00 0.0 0.0 1.92 0.00 0.00 0.00 0.00 ... 0.0 0.0 0.054 0.0 0.164 0.054 0.000 1.671 4.0 112.0

5 rows × 58 columns

In logistic regression, we let

\begin{equation} y \mid \mathbf{x} \sim \operatorname{Bernoulli}\left(\frac{1}{1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal \mathbf{x}\right)\right)}\right), \end{equation}

so if we make $N$ observations the likelihood is

\begin{equation} L\left(w_0, \mathbf{w}\right) = \prod_{i=1}^N\left(\frac{1}{1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)\right)}\right)^{y_i}\left(\frac{1}{1 + \exp\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)}\right)^{1-y_i}. \end{equation}

Define $\tilde{y}_i = 2y_i - 1$, which maps the labels to $\pm 1$; both factors of the likelihood then collapse into a single expression, and the negative log-likelihood, which we want to minimize, is

\begin{equation} J\left(w_0, \mathbf{w}\right) = \sum_{i=1}^N\log\left(1 + \exp\left(-\tilde{y}_i\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)\right)\right). \end{equation}

This function is convex, but not always strictly convex, so the minimizer need not be unique; worse, when the classes are linearly separable the minimum is not attained at all and the weights diverge. Adding some regularization fixes both problems. Now, we minimize

\begin{equation} J_\lambda\left(w_0, \mathbf{w}\right) = \sum_{i=1}^N\log\left(1 + \exp\left(-\tilde{y}_i\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)\right)\right) + \frac{\lambda}{2}\mathbf{w}^\intercal\mathbf{w}. \end{equation}
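Evaluating $J_\lambda$ naively overflows when the margins are large; `np.logaddexp(0, -z)` computes $\log\left(1 + e^{-z}\right)$ stably. A minimal sketch of the objective (the helper names are mine, not from the `classifiers` module on GitHub):

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def regularized_nll(w0, w, X, y, lam):
    """Regularized negative log-likelihood J_lambda(w0, w).

    Uses np.logaddexp(0, -z) = log(1 + exp(-z)) to avoid overflow
    for large margins. The intercept w0 is not penalized.
    """
    y_tilde = 2 * y - 1                  # map labels {0, 1} -> {-1, +1}
    margins = y_tilde * (w0 + X @ w)     # ytilde_i * (w0 + w^T x_i)
    return np.sum(np.logaddexp(0.0, -margins)) + 0.5 * lam * w @ w
```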

Note that there is no penalty for $w_0$. One way to minimize this is with Newton's method. To do so, we need to calculate the gradient $\mathbf{g}$ and Hessian $\mathbf{H}$. In exercise 3 (not shown), I show that

\begin{align} \mathbf{g}\left(w_0, \mathbf{w}\right) &= \tilde{\mathbf{X}}^\intercal\left(\boldsymbol\mu - \mathbf{y}\right) + \lambda\begin{pmatrix}0 \\ \mathbf{w}\end{pmatrix} \\ \mathbf{H}\left(w_0, \mathbf{w}\right) &= \tilde{\mathbf{X}}^\intercal \mathbf{M} \tilde{\mathbf{X}} + \lambda \begin{pmatrix} 0 & \mathbf{0} \\ \mathbf{0} & I \end{pmatrix}, \end{align}

where $\tilde{\mathbf{X}}$ is the data matrix with a column of 1s inserted as the first column. Also, we have defined

\begin{equation} \boldsymbol\mu = \begin{pmatrix} \left(1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal\mathbf{x}_1\right)\right)\right)^{-1}\\ \left(1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal\mathbf{x}_2\right)\right)\right)^{-1}\\ \vdots \\ \left(1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal\mathbf{x}_N\right)\right)\right)^{-1}\\ \end{pmatrix}, ~\text{and}~ \mathbf{M} = \operatorname{diag}\left( \mu_1(1-\mu_1), \mu_2(1-\mu_2), \ldots,\mu_N(1-\mu_N) \right). \end{equation}

Let $\boldsymbol\theta = (w_0,\mathbf{w})^\intercal$ and choose an initial estimate $\boldsymbol\theta_0$. Then each iteration of Newton's method solves

\begin{equation} \mathbf{H}\left(\boldsymbol\theta_k\right)\left(\boldsymbol\theta_{k+1} - \boldsymbol\theta_{k}\right) = -\mathbf{g}\left(\boldsymbol\theta_k\right). \end{equation}
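As a sanity check on the derivation, the update above can be implemented directly in NumPy. This is a minimal sketch, not the `LogisticRegression` class from `classifiers` (the function name `newton_logreg` is mine):

```python
import numpy as np

def newton_logreg(X, y, lam=1.0, n_iter=50, tol=1e-8):
    """Fit l2-regularized logistic regression with Newton's method.

    X is the (N, D) data matrix, y holds labels in {0, 1}, and lam is
    the ridge penalty on the weights (the intercept is not penalized).
    Returns theta = (w0, w) as a vector of length D + 1.
    """
    N, D = X.shape
    Xt = np.hstack([np.ones((N, 1)), X])          # prepend a column of 1s
    theta = np.zeros(D + 1)
    penalty = lam * np.eye(D + 1)
    penalty[0, 0] = 0.0                           # no penalty for w0
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Xt @ theta)))  # mu_i = sigma(w0 + w^T x_i)
        g = Xt.T @ (mu - y) + penalty @ theta     # gradient
        H = Xt.T @ ((mu * (1.0 - mu))[:, None] * Xt) + penalty  # Hessian
        step = np.linalg.solve(H, -g)             # H (theta_{k+1} - theta_k) = -g
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    return theta
```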

In [2]:
#logReg = LogisticRegression(solver='newton-cg', C=1) # sklearn implementation, only newton-cg converges
logReg = LogisticRegression(regularization = 1)
logReg.fit(train_data.drop('spam', axis=1).to_numpy(), train_data['spam'].to_numpy())
print(logReg.intercept_)
print(logReg.coef_)
print(1 - logReg.score(train_data.drop('spam', axis=1).to_numpy(), train_data['spam'].to_numpy()))
print(1 - logReg.score(test_data.drop('spam', axis=1).to_numpy(), test_data['spam'].to_numpy()))
-1.49981367068
[ -1.26600248e-01  -1.56094390e-01   2.54708222e-01   8.58581226e-01
   4.09630906e-01   8.96135831e-01   2.19302802e+00   7.60903407e-01
   6.71931927e-01   1.01103996e-01  -2.82796444e-01  -1.83924290e-01
  -1.11320045e-01   1.03485121e-01   5.54364725e-01   1.18430618e+00
   1.06659155e+00   1.31293171e-01   1.43401894e-01   7.11353892e-01
   2.31623776e-01   3.36640589e-01   1.87194517e+00   6.08773300e-01
  -1.82320650e+00  -5.14474564e-01  -4.14759176e+00  -1.98631296e-01
  -1.18020809e+00  -1.99879005e-01  -1.77496082e-01   1.48991710e-02
  -7.45306456e-01  -1.68448134e-01  -5.55552499e-01   8.47931331e-01
   3.72525690e-02  -6.12885760e-01  -1.00486600e+00  -1.36561188e-01
  -1.16561300e+00  -1.59350006e+00  -9.86968673e-01  -1.38680860e+00
  -7.55505052e-01  -1.82108130e+00  -6.55434252e-01  -1.61610798e+00
  -1.18107595e+00  -3.44812567e-01  -3.76344925e-01   2.25717319e-01
   4.05625095e+00   1.06168683e+00  -1.38267945e-02   8.02586523e-03
   4.62754722e-04]
0.0698205546493
0.0703125

The default settings work pretty well, giving roughly a 7% misclassification rate on both the training and test sets. Let us try a few transformations of the features:

  • Standardize, that is, $\displaystyle\frac{\mathbf{x}_j - \hat{\mu}_j}{\hat{\sigma}_j}$
  • log, $\log\left(\mathbf{x}_j + 0.1\right)$
  • Binarize, $\mathbb{I}\left(\mathbf{x}_j > 0\right)$
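The transforms might be implemented along these lines. This is a sketch of what `transform_log` and `transform_binary` from the `classifiers` module plausibly do, given the formulas above (the notebook itself uses `preprocessing.scale` for standardization):

```python
import numpy as np

def standardize(x):
    """Center a feature column and scale it to unit variance."""
    return (x - x.mean()) / x.std()

def transform_log(x):
    """Log transform; the 0.1 offset keeps zero frequencies finite."""
    return np.log(x + 0.1)

def transform_binary(x):
    """Indicator of whether the feature appears at all."""
    return (x > 0).astype(np.float64)
```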

You can see the transformations and the logistic regression class that I wrote by hand on GitHub.

In [3]:
from sklearn import preprocessing

# transform the data
ytrain = train_data['spam'].to_numpy()
ytest = test_data['spam'].to_numpy()
Xtrain_raw = train_data.drop('spam', axis=1).to_numpy()
Xtest_raw = test_data.drop('spam', axis=1).to_numpy()
Xtrain_standard = preprocessing.scale(Xtrain_raw, axis=0)
Xtest_standard = preprocessing.scale(Xtest_raw, axis=0)
Xtrain_log = np.apply_along_axis(transform_log, axis = 0, arr=Xtrain_raw)
Xtest_log = np.apply_along_axis(transform_log, axis = 0, arr=Xtest_raw)
Xtrain_binary = np.apply_along_axis(transform_binary, axis = 0, arr=Xtrain_raw)
Xtest_binary = np.apply_along_axis(transform_binary, axis = 0, arr=Xtest_raw)

data_transform = ['Raw', 'Standard', 'Log', 'Binary']
Xtrain = [Xtrain_raw, Xtrain_standard, Xtrain_log, Xtrain_binary]
Xtest = [Xtest_raw, Xtest_standard, Xtest_log, Xtest_binary]

## now run lots of models to find regularization parameter
regularization = np.linspace(0, 20, num=41)
misclassification_rates = pd.DataFrame(dtype=np.float64,
                                       index = np.arange(len(regularization)),
                                       columns = ['Regularization'] + 
                                       list(map(lambda x : x + ' Train',  data_transform)) + 
                                       list(map(lambda x : x + ' Test',  data_transform)))
for i, reg in enumerate(regularization):
    misclassification_rates.loc[i, 'Regularization'] = reg
    if reg == 0:
        reg = 0.01 # Newton's method fails to converge with no regularization at all
    logReg = LogisticRegression(regularization=reg)
    for j, transform in enumerate(data_transform):
        logReg.fit(Xtrain[j], ytrain)
        misclassification_rates.loc[i, transform + ' Train'] = 1 - logReg.score(Xtrain[j], ytrain)
        misclassification_rates.loc[i, transform + ' Test'] = 1 - logReg.score(Xtest[j], ytest)
misclassification_rates
Out[3]:
Regularization Raw Train Standard Train Log Train Binary Train Raw Test Standard Test Log Test Binary Test
0 0.0 0.070799 0.070147 0.051876 0.064600 0.076172 0.086589 0.058594 0.074219
1 0.5 0.069494 0.072431 0.051223 0.064274 0.070964 0.087891 0.059245 0.072266
2 1.0 0.069821 0.074062 0.050571 0.063295 0.070312 0.087240 0.059896 0.072266
3 1.5 0.071126 0.075367 0.049918 0.063295 0.070312 0.085938 0.060547 0.072917
4 2.0 0.072757 0.076020 0.050245 0.063295 0.069661 0.084635 0.061198 0.073568
5 2.5 0.073409 0.075367 0.050571 0.063295 0.071615 0.085286 0.059896 0.074219
6 3.0 0.072757 0.075693 0.051876 0.063948 0.072266 0.084635 0.059896 0.074219
7 3.5 0.074715 0.075367 0.051223 0.063295 0.075521 0.085286 0.059896 0.074219
8 4.0 0.075693 0.076020 0.051223 0.063622 0.075521 0.083984 0.059896 0.074219
9 4.5 0.076020 0.076020 0.051223 0.063295 0.076823 0.084635 0.059896 0.074219
10 5.0 0.075041 0.076020 0.050897 0.063622 0.076823 0.085286 0.059896 0.074870
11 5.5 0.074062 0.076346 0.050897 0.063622 0.077474 0.085938 0.059896 0.075521
12 6.0 0.074388 0.076998 0.050571 0.063948 0.076823 0.085286 0.059896 0.076172
13 6.5 0.074388 0.076998 0.050245 0.064600 0.075521 0.083984 0.058594 0.076172
14 7.0 0.074388 0.077651 0.050245 0.064600 0.076172 0.084635 0.058594 0.076172
15 7.5 0.074388 0.078303 0.050897 0.064600 0.076172 0.084635 0.057943 0.076172
16 8.0 0.074388 0.078303 0.051550 0.064927 0.076823 0.083984 0.057943 0.076172
17 8.5 0.074715 0.078630 0.051550 0.064927 0.077474 0.083984 0.059245 0.075521
18 9.0 0.075367 0.078630 0.051876 0.064600 0.076823 0.083984 0.059245 0.075521
19 9.5 0.074715 0.079282 0.051876 0.065253 0.076823 0.083984 0.059245 0.075521
20 10.0 0.074715 0.078956 0.051876 0.064600 0.076823 0.085286 0.059245 0.075521
21 10.5 0.074715 0.078956 0.052202 0.064600 0.076823 0.084635 0.059245 0.075521
22 11.0 0.074388 0.079282 0.051550 0.065253 0.077474 0.084635 0.059245 0.074219
23 11.5 0.075367 0.079282 0.051223 0.066232 0.077474 0.084635 0.059245 0.074870
24 12.0 0.075367 0.079608 0.051550 0.069168 0.077474 0.084635 0.059245 0.075521
25 12.5 0.075693 0.079608 0.051550 0.069821 0.077474 0.085286 0.059245 0.074219
26 13.0 0.075693 0.079608 0.051550 0.070147 0.076823 0.085286 0.059245 0.074219
27 13.5 0.075693 0.079608 0.051550 0.070147 0.076823 0.085286 0.059896 0.074219
28 14.0 0.076020 0.080587 0.052202 0.070147 0.076823 0.085286 0.059896 0.075521
29 14.5 0.076346 0.080587 0.052202 0.069821 0.077474 0.085286 0.060547 0.075521
30 15.0 0.076346 0.080587 0.051876 0.069821 0.079427 0.085286 0.060547 0.075521
31 15.5 0.076346 0.080914 0.051876 0.070473 0.079427 0.085286 0.060547 0.078125
32 16.0 0.076346 0.080914 0.052202 0.071778 0.080078 0.085286 0.061849 0.078776
33 16.5 0.076346 0.081240 0.052202 0.071778 0.080078 0.085286 0.061849 0.078776
34 17.0 0.077325 0.080914 0.051876 0.071452 0.080078 0.085938 0.061849 0.078776
35 17.5 0.077651 0.080914 0.051876 0.071452 0.080729 0.085938 0.062500 0.078776
36 18.0 0.077651 0.080914 0.051550 0.071452 0.080729 0.085938 0.062500 0.078776
37 18.5 0.077651 0.081240 0.051550 0.071452 0.080729 0.085938 0.062500 0.079427
38 19.0 0.077651 0.081566 0.051876 0.071452 0.081380 0.086589 0.062500 0.079427
39 19.5 0.077651 0.081566 0.052529 0.071778 0.081380 0.086589 0.062500 0.080078
40 20.0 0.077651 0.081566 0.052529 0.072431 0.081380 0.086589 0.062500 0.080078

Let's plot these results.

In [4]:
import matplotlib.pyplot as plt
colors = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3']
plt.figure(figsize=(12,8))
for i in range(len(data_transform)):
    plt.plot(misclassification_rates['Regularization'], misclassification_rates[data_transform[i] + ' Train'],
             color=colors[i], linestyle='--', linewidth=2, marker='.', label=data_transform[i] + ' Train')
    plt.plot(misclassification_rates['Regularization'], misclassification_rates[data_transform[i] + ' Test'],
             color=colors[i], linestyle='-', linewidth=2, marker='.', label=data_transform[i] + ' Test')
plt.grid(True)
plt.legend(loc='center left', bbox_to_anchor=(1,0.5), title="Dataset")
plt.ylabel("Misclassification Rate")
plt.xlabel(r"Regularization Parameter ($\lambda$)")
plt.title("Misclassification Rates for Various Transforms and Regularizations")
plt.show()

The problem tells us to select the regularization parameter with cross-validation, but that seems unnecessary here: the log transform clearly performs best, achieving a test misclassification rate of about 5.8% at $\lambda = 8$.