# Spam classification using logistic regression

We have 4601 emails, each with 57 features. The data can be found on GitHub. The column names can be found here. We want to predict whether an email is spam, so we have a binary response, where 1 indicates spam and 0 indicates non-spam.

You can find some utility functions on GitHub.

In [1]:
```python
%matplotlib inline

import numpy as np
import pandas as pd
from classifiers import read_spam_data, transform_log, transform_binary
#from sklearn.linear_model import LogisticRegression # reference sklearn implementation
from classifiers import LogisticRegression

# load the train/test split; read_spam_data is one of the GitHub utilities and
# is assumed to return two DataFrames whose 'spam' column is the binary response
train_data, test_data = read_spam_data()
train_data.head()
```


Out[1]:

| | spam | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | … | word_freq_conference | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | … | 0.0 | 0.0 | 0.132 | 0.0 | 0.372 | 0.180 | 0.048 | 5.114 | 101.0 | 1028.0 |
| 1 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | … | 0.0 | 0.0 | 0.137 | 0.0 | 0.137 | 0.000 | 0.000 | 3.537 | 40.0 | 191.0 |
| 2 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | … | 0.0 | 0.0 | 0.135 | 0.0 | 0.135 | 0.000 | 0.000 | 3.537 | 40.0 | 191.0 |
| 3 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 1.85 | 0.00 | 0.00 | 1.85 | 0.00 | … | 0.0 | 0.0 | 0.223 | 0.0 | 0.000 | 0.000 | 0.000 | 3.000 | 15.0 | 54.0 |
| 4 | 1 | 0.00 | 0.00 | 0.0 | 0.0 | 1.92 | 0.00 | 0.00 | 0.00 | 0.00 | … | 0.0 | 0.0 | 0.054 | 0.0 | 0.164 | 0.054 | 0.000 | 1.671 | 4.0 | 112.0 |

*5 rows × 58 columns*

In logistic regression, we let

$$y \mid \mathbf{x} \sim \operatorname{Bernoulli}\left(\frac{1}{1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal \mathbf{x}\right)\right)}\right),$$

so if we make $N$ observations, the likelihood is

$$L\left(w_0, \mathbf{w}\right) = \prod_{i=1}^N\left(\frac{1}{1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)\right)}\right)^{y_i}\left(\frac{1}{1 + \exp\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)}\right)^{1-y_i}.$$

Define $\tilde{y}_i = 2y_i - 1$. We want to minimize the negative log-likelihood, which is

$$J\left(w_0, \mathbf{w}\right) = \sum_{i=1}^N\log\left(1 + \exp\left(-\tilde{y}_i\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)\right)\right).$$

This function is convex but not always *strictly* convex (the Hessian can be singular, and for separable data the minimizer need not be finite), so we add some regularization. Now, we minimize

$$J_\lambda\left(w_0, \mathbf{w}\right) = \sum_{i=1}^N\log\left(1 + \exp\left(-\tilde{y}_i\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)\right)\right) + \frac{\lambda}{2}\mathbf{w}^\intercal\mathbf{w}.$$

Note that there is no penalty for $w_0$. One way to minimize this is with Newton's algorithm. To do so, we need to calculate the gradient $\mathbf{g}$ and Hessian $\mathbf{H}$. In exercise 3 (not shown), I show that

\begin{align}
\mathbf{g}\left(w_0, \mathbf{w}\right) &= \tilde{\mathbf{X}}^\intercal\left(\boldsymbol\mu - \mathbf{y}\right) + \lambda\begin{pmatrix}0 \\ \mathbf{w}\end{pmatrix}, \\
\mathbf{H}\left(w_0, \mathbf{w}\right) &= \tilde{\mathbf{X}}^\intercal \mathbf{M} \tilde{\mathbf{X}} + \lambda \begin{pmatrix} 0 & \mathbf{0} \\ \mathbf{0} & I \end{pmatrix},
\end{align}

where $\tilde{\mathbf{X}}$ is the data matrix with a column of 1s inserted as the first column. Also, we have defined

$$\boldsymbol\mu = \begin{pmatrix} \left(1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal\mathbf{x}_1\right)\right)\right)^{-1}\\ \left(1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal\mathbf{x}_2\right)\right)\right)^{-1}\\ \vdots \\ \left(1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal\mathbf{x}_N\right)\right)\right)^{-1} \end{pmatrix}, \quad \mathbf{M} = \operatorname{diag}\left(\mu_1(1-\mu_1), \mu_2(1-\mu_2), \ldots, \mu_N(1-\mu_N)\right).$$

Let $\boldsymbol\theta = (w_0,\mathbf{w})^\intercal$. Choose some initial estimate for $\boldsymbol\theta_0$.
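To make these quantities concrete, here is a minimal NumPy sketch of $\mathbf{g}$ and $\mathbf{H}$, written directly from the formulas above. The helper below is mine, not part of the `classifiers` module; the hand-written class on GitHub is the real implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_hessian(theta, X, y, lam):
    """Gradient and Hessian of J_lambda at theta = (w_0, w).

    X is the N x D design matrix *without* the column of 1s;
    y is the 0/1 response vector."""
    X_tilde = np.column_stack([np.ones(len(y)), X])  # prepend the 1s column
    mu = sigmoid(X_tilde @ theta)                    # mu_i = P(y_i = 1)
    penalty = np.concatenate([[0.0], theta[1:]])     # (0, w): w_0 is unpenalized
    g = X_tilde.T @ (mu - y) + lam * penalty
    M = np.diag(mu * (1.0 - mu))
    reg = lam * np.eye(len(theta))
    reg[0, 0] = 0.0                                  # again, no penalty on w_0
    H = X_tilde.T @ M @ X_tilde + reg
    return g, H

# Each Newton step then solves H(theta_k) delta = -g(theta_k):
#   g, H = gradient_hessian(theta, X, y, lam)
#   theta = theta + np.linalg.solve(H, -g)
```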
Then, by Newton's method,

$$\mathbf{H}\left(\boldsymbol\theta_k\right)\left(\boldsymbol\theta_{k+1} - \boldsymbol\theta_{k}\right) = -\mathbf{g}\left(\boldsymbol\theta_k\right).$$

In [2]:

```python
#logReg = LogisticRegression(solver='newton-cg', C=1) # sklearn implementation, only newton-cg converges
logReg = LogisticRegression(regularization=1)
logReg.fit(train_data.drop('spam', axis=1).as_matrix(), train_data['spam'].as_matrix())
print(logReg.intercept_)
print(logReg.coef_)
print(1 - logReg.score(train_data.drop('spam', axis=1).as_matrix(), train_data['spam'].as_matrix()))
print(1 - logReg.score(test_data.drop('spam', axis=1).as_matrix(), test_data['spam'].as_matrix()))
```

```
-1.49981367068
[ -1.26600248e-01  -1.56094390e-01   2.54708222e-01   8.58581226e-01
   4.09630906e-01   8.96135831e-01   2.19302802e+00   7.60903407e-01
   6.71931927e-01   1.01103996e-01  -2.82796444e-01  -1.83924290e-01
  -1.11320045e-01   1.03485121e-01   5.54364725e-01   1.18430618e+00
   1.06659155e+00   1.31293171e-01   1.43401894e-01   7.11353892e-01
   2.31623776e-01   3.36640589e-01   1.87194517e+00   6.08773300e-01
  -1.82320650e+00  -5.14474564e-01  -4.14759176e+00  -1.98631296e-01
  -1.18020809e+00  -1.99879005e-01  -1.77496082e-01   1.48991710e-02
  -7.45306456e-01  -1.68448134e-01  -5.55552499e-01   8.47931331e-01
   3.72525690e-02  -6.12885760e-01  -1.00486600e+00  -1.36561188e-01
  -1.16561300e+00  -1.59350006e+00  -9.86968673e-01  -1.38680860e+00
  -7.55505052e-01  -1.82108130e+00  -6.55434252e-01  -1.61610798e+00
  -1.18107595e+00  -3.44812567e-01  -3.76344925e-01   2.25717319e-01
   4.05625095e+00   1.06168683e+00  -1.38267945e-02   8.02586523e-03
   4.62754722e-04]
0.0698205546493
0.0703125
```

The default settings work pretty well and give us a roughly 7% misclassification rate on both the training and test sets. Let us try various transformations:

- Standardize, that is, $\displaystyle\frac{\mathbf{x}_j - \hat{\mu}_j}{\hat{\sigma}_j}$
- Log, $\log\left(\mathbf{x}_j + 0.1\right)$
- Binarize, $\mathbb{I}\left(\mathbf{x}_j > 0\right)$

You can see the transformations and the logistic regression class that I wrote by hand on GitHub; a sketch of the two transform helpers follows below.
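The transforms themselves are one-liners. `transform_log` and `transform_binary` live in the GitHub utilities; a plausible sketch, assuming they map a feature column elementwise per the formulas above:

```python
import numpy as np

# Hypothetical implementations, consistent with the formulas above;
# the actual versions are in the classifiers module on GitHub.
def transform_log(x):
    # log transform: log(x_j + 0.1) keeps zero frequencies finite
    return np.log(x + 0.1)

def transform_binary(x):
    # binarize: I(x_j > 0), cast to float for the solver
    return (x > 0).astype(np.float64)
```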
In [3]:

```python
from sklearn import preprocessing

# transform the data
ytrain = train_data['spam'].as_matrix()
ytest = test_data['spam'].as_matrix()
Xtrain_raw = train_data.drop('spam', axis=1).as_matrix()
Xtest_raw = test_data.drop('spam', axis=1).as_matrix()
Xtrain_standard = preprocessing.scale(Xtrain_raw, axis=0)
Xtest_standard = preprocessing.scale(Xtest_raw, axis=0)
Xtrain_log = np.apply_along_axis(transform_log, axis=0, arr=Xtrain_raw)
Xtest_log = np.apply_along_axis(transform_log, axis=0, arr=Xtest_raw)
Xtrain_binary = np.apply_along_axis(transform_binary, axis=0, arr=Xtrain_raw)
Xtest_binary = np.apply_along_axis(transform_binary, axis=0, arr=Xtest_raw)

data_transform = ['Raw', 'Standard', 'Log', 'Binary']
Xtrain = [Xtrain_raw, Xtrain_standard, Xtrain_log, Xtrain_binary]
Xtest = [Xtest_raw, Xtest_standard, Xtest_log, Xtest_binary]

## now run lots of models to find regularization parameter
regularization = np.linspace(0, 20, num=41)
misclassification_rates = pd.DataFrame(
    dtype=np.float64, index=np.arange(len(regularization)),
    columns=['Regularization']
            + [x + ' Train' for x in data_transform]
            + [x + ' Test' for x in data_transform])
for i in range(len(regularization)):
    misclassification_rates.loc[i, 'Regularization'] = regularization[i]
    if regularization[i] == 0:
        regularization[i] += 0.01  # hack when there's no convergence
    logReg = LogisticRegression(regularization=regularization[i])
    for j in range(len(data_transform)):
        logReg.fit(Xtrain[j], ytrain)
        misclassification_rates.loc[i, data_transform[j] + ' Train'] = 1 - logReg.score(Xtrain[j], ytrain)
        misclassification_rates.loc[i, data_transform[j] + ' Test'] = 1 - logReg.score(Xtest[j], ytest)
misclassification_rates
```

Out[3]:

| | Regularization | Raw Train | Standard Train | Log Train | Binary Train | Raw Test | Standard Test | Log Test | Binary Test |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.070799 | 0.070147 | 0.051876 | 0.064600 | 0.076172 | 0.086589 | 0.058594 | 0.074219 |
| 1 | 0.5 | 0.069494 | 0.072431 | 0.051223 | 0.064274 | 0.070964 | 0.087891 | 0.059245 | 0.072266 |
| 2 | 1.0 | 0.069821 | 0.074062 | 0.050571 | 0.063295 | 0.070312 | 0.087240 | 0.059896 | 0.072266 |
| 3 | 1.5 | 0.071126 | 0.075367 | 0.049918 | 0.063295 | 0.070312 | 0.085938 | 0.060547 | 0.072917 |
| 4 | 2.0 | 0.072757 | 0.076020 | 0.050245 | 0.063295 | 0.069661 | 0.084635 | 0.061198 | 0.073568 |
| 5 | 2.5 | 0.073409 | 0.075367 | 0.050571 | 0.063295 | 0.071615 | 0.085286 | 0.059896 | 0.074219 |
| 6 | 3.0 | 0.072757 | 0.075693 | 0.051876 | 0.063948 | 0.072266 | 0.084635 | 0.059896 | 0.074219 |
| 7 | 3.5 | 0.074715 | 0.075367 | 0.051223 | 0.063295 | 0.075521 | 0.085286 | 0.059896 | 0.074219 |
| 8 | 4.0 | 0.075693 | 0.076020 | 0.051223 | 0.063622 | 0.075521 | 0.083984 | 0.059896 | 0.074219 |
| 9 | 4.5 | 0.076020 | 0.076020 | 0.051223 | 0.063295 | 0.076823 | 0.084635 | 0.059896 | 0.074219 |
| 10 | 5.0 | 0.075041 | 0.076020 | 0.050897 | 0.063622 | 0.076823 | 0.085286 | 0.059896 | 0.074870 |
| 11 | 5.5 | 0.074062 | 0.076346 | 0.050897 | 0.063622 | 0.077474 | 0.085938 | 0.059896 | 0.075521 |
| 12 | 6.0 | 0.074388 | 0.076998 | 0.050571 | 0.063948 | 0.076823 | 0.085286 | 0.059896 | 0.076172 |
| 13 | 6.5 | 0.074388 | 0.076998 | 0.050245 | 0.064600 | 0.075521 | 0.083984 | 0.058594 | 0.076172 |
| 14 | 7.0 | 0.074388 | 0.077651 | 0.050245 | 0.064600 | 0.076172 | 0.084635 | 0.058594 | 0.076172 |
| 15 | 7.5 | 0.074388 | 0.078303 | 0.050897 | 0.064600 | 0.076172 | 0.084635 | 0.057943 | 0.076172 |
| 16 | 8.0 | 0.074388 | 0.078303 | 0.051550 | 0.064927 | 0.076823 | 0.083984 | 0.057943 | 0.076172 |
| 17 | 8.5 | 0.074715 | 0.078630 | 0.051550 | 0.064927 | 0.077474 | 0.083984 | 0.059245 | 0.075521 |
| 18 | 9.0 | 0.075367 | 0.078630 | 0.051876 | 0.064600 | 0.076823 | 0.083984 | 0.059245 | 0.075521 |
| 19 | 9.5 | 0.074715 | 0.079282 | 0.051876 | 0.065253 | 0.076823 | 0.083984 | 0.059245 | 0.075521 |
| 20 | 10.0 | 0.074715 | 0.078956 | 0.051876 | 0.064600 | 0.076823 | 0.085286 | 0.059245 | 0.075521 |
| 21 | 10.5 | 0.074715 | 0.078956 | 0.052202 | 0.064600 | 0.076823 | 0.084635 | 0.059245 | 0.075521 |
| 22 | 11.0 | 0.074388 | 0.079282 | 0.051550 | 0.065253 | 0.077474 | 0.084635 | 0.059245 | 0.074219 |
| 23 | 11.5 | 0.075367 | 0.079282 | 0.051223 | 0.066232 | 0.077474 | 0.084635 | 0.059245 | 0.074870 |
| 24 | 12.0 | 0.075367 | 0.079608 | 0.051550 | 0.069168 | 0.077474 | 0.084635 | 0.059245 | 0.075521 |
| 25 | 12.5 | 0.075693 | 0.079608 | 0.051550 | 0.069821 | 0.077474 | 0.085286 | 0.059245 | 0.074219 |
| 26 | 13.0 | 0.075693 | 0.079608 | 0.051550 | 0.070147 | 0.076823 | 0.085286 | 0.059245 | 0.074219 |
| 27 | 13.5 | 0.075693 | 0.079608 | 0.051550 | 0.070147 | 0.076823 | 0.085286 | 0.059896 | 0.074219 |
| 28 | 14.0 | 0.076020 | 0.080587 | 0.052202 | 0.070147 | 0.076823 | 0.085286 | 0.059896 | 0.075521 |
| 29 | 14.5 | 0.076346 | 0.080587 | 0.052202 | 0.069821 | 0.077474 | 0.085286 | 0.060547 | 0.075521 |
| 30 | 15.0 | 0.076346 | 0.080587 | 0.051876 | 0.069821 | 0.079427 | 0.085286 | 0.060547 | 0.075521 |
| 31 | 15.5 | 0.076346 | 0.080914 | 0.051876 | 0.070473 | 0.079427 | 0.085286 | 0.060547 | 0.078125 |
| 32 | 16.0 | 0.076346 | 0.080914 | 0.052202 | 0.071778 | 0.080078 | 0.085286 | 0.061849 | 0.078776 |
| 33 | 16.5 | 0.076346 | 0.081240 | 0.052202 | 0.071778 | 0.080078 | 0.085286 | 0.061849 | 0.078776 |
| 34 | 17.0 | 0.077325 | 0.080914 | 0.051876 | 0.071452 | 0.080078 | 0.085938 | 0.061849 | 0.078776 |
| 35 | 17.5 | 0.077651 | 0.080914 | 0.051876 | 0.071452 | 0.080729 | 0.085938 | 0.062500 | 0.078776 |
| 36 | 18.0 | 0.077651 | 0.080914 | 0.051550 | 0.071452 | 0.080729 | 0.085938 | 0.062500 | 0.078776 |
| 37 | 18.5 | 0.077651 | 0.081240 | 0.051550 | 0.071452 | 0.080729 | 0.085938 | 0.062500 | 0.079427 |
| 38 | 19.0 | 0.077651 | 0.081566 | 0.051876 | 0.071452 | 0.081380 | 0.086589 | 0.062500 | 0.079427 |
| 39 | 19.5 | 0.077651 | 0.081566 | 0.052529 | 0.071778 | 0.081380 | 0.086589 | 0.062500 | 0.080078 |
| 40 | 20.0 | 0.077651 | 0.081566 | 0.052529 | 0.072431 | 0.081380 | 0.086589 | 0.062500 | 0.080078 |
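Rather than scanning 41 rows by eye, we can ask pandas for the best test error per transform. A quick sanity check on the table above, using the names defined in `In [3]` (note `idxmin` resolves ties to the first, i.e. smallest, $\lambda$):

```python
# smallest test misclassification rate per transform
for name in data_transform:
    col = name + ' Test'
    best = misclassification_rates.loc[misclassification_rates[col].idxmin()]
    print(name, best['Regularization'], best[col])
```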
Let's plot these results.

In [4]:

```python
import matplotlib.pyplot as plt

colors = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3']
plt.figure(figsize=(12, 8))
for i in range(len(data_transform)):
    plt.plot(misclassification_rates['Regularization'],
             misclassification_rates[data_transform[i] + ' Train'],
             color=colors[i], linestyle='--', linewidth=2, marker='.',
             label=data_transform[i] + ' Train')
    plt.plot(misclassification_rates['Regularization'],
             misclassification_rates[data_transform[i] + ' Test'],
             color=colors[i], linestyle='-', linewidth=2, marker='.',
             label=data_transform[i] + ' Test')
plt.grid(True)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), title="Dataset")
plt.ylabel("Misclassification Rate")
plt.xlabel(r"Regularization Parameter ($\lambda$)")
plt.title("Misclassification Rates for Various Transforms and Regularizations")
plt.show()
```

The problem tells us to select the regularization parameter with cross-validation, but that seems pointless here. We see that the log transform performs best, with a test misclassification rate of about 5.8% when $\lambda = 8$.
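For completeness, if we did want to follow the problem's suggestion, a minimal K-fold sketch using the hand-written class might look like this. The fold count, the splitting scheme, and the `cv_misclassification` helper are my assumptions, not part of the `classifiers` module:

```python
import numpy as np

def cv_misclassification(X, y, lam, n_folds=5, seed=0):
    """Mean held-out misclassification rate over n_folds random folds."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = LogisticRegression(regularization=lam)
        model.fit(X[train], y[train])
        errors.append(1 - model.score(X[val], y[val]))
    return np.mean(errors)

# e.g., pick lambda for the log-transformed features
lambdas = np.linspace(0.5, 20, num=40)
best_lambda = min(lambdas, key=lambda lam: cv_misclassification(Xtrain_log, ytrain, lam))
```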