scikit-learn tutorials

Copyright (c) 2017 Yu Ohori

Perform classification with scikit-learn.

Outline

  1. Launch Jupyter Notebook App
  2. Import related packages
  3. Load a dataset
  4. Preprocess data
  5. Train the model
  6. Predict class labels
  7. Evaluate the model
  8. Select the model
  9. Save the model
  10. Implement a new estimator

1. Launch Jupyter Notebook App

$ cd path/to/directory
$ jupyter notebook
2. Import related packages

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
  • Matplotlib: a plotting library
  • NumPy: a library used for scientific computing
  • Pandas: a data analysis library
  • Seaborn: a visualization library based on matplotlib
  • scikit-learn: a machine learning library

3. Load a dataset

In [2]:
# Load a dataset
iris = sns.load_dataset('iris')

# Return first 10 rows
iris.head(10)
Out[2]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
  • Number of instances: 150
  • Number of attributes: 4
  • Number of classes: 3
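
To double-check these numbers against the DataFrame itself (a quick addition, not part of the original notebook):

# Confirm the dataset dimensions and the number of distinct classes
print(iris.shape)                 # (150, 5): 150 instances, 4 features + 1 label column
print(iris['species'].nunique())  # 3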
In [3]:
from sklearn.model_selection import train_test_split

# Split arrays into random train and test subsets
feature_names                    = [
    'sepal_length', 'sepal_width', 'petal_length', 'petal_width'
]
X                                = iris[feature_names]
y                                = iris['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
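
With test_size=0.4, 40% of the 150 rows (60 instances) are held out for testing. A quick shape check (an addition to the original notebook):

print(X_train.shape, X_test.shape)  # (90, 4) (60, 4)
print(y_train.shape, y_test.shape)  # (90,) (60,)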

4. Preprocess data

In [4]:
from sklearn.preprocessing import StandardScaler

scaler         = StandardScaler()

# Learn the per-feature mean and standard deviation on the training data
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)
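
Under the hood, StandardScaler stores the per-feature mean and standard deviation of the training data and applies $z = (x - \mu) / \sigma$; the test set is transformed with the training statistics to avoid information leakage. A small verification sketch (assuming the cell above has been run):

# The fitted attributes hold the training statistics ...
assert np.allclose(scaler.mean_, X_train.mean(axis=0))

# ... and transform applies (x - mean) / std column-wise
assert np.allclose(X_train_scaled, (X_train - scaler.mean_) / scaler.scale_)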

5. Train the model

In [5]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1.0, random_state=0)

clf.fit(X_train_scaled, y_train)
Out[5]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

6. Predict class labels

In [6]:
y_pred = pd.Series(clf.predict(X_test_scaled), name='predicted_label')

# Return first 10 rows
y_pred.head(10)
Out[6]:
0     virginica
1    versicolor
2        setosa
3     virginica
4        setosa
5     virginica
6        setosa
7     virginica
8    versicolor
9    versicolor
Name: predicted_label, dtype: object
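
LogisticRegression can also report class membership probabilities instead of hard labels. A short sketch (an addition to the original notebook):

# Per-class probabilities for the first test samples
proba = pd.DataFrame(clf.predict_proba(X_test_scaled), columns=clf.classes_)
proba.head()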

7. Evaluate the model

  1. Accuracy
  2. Precision, recall and F1-score
In [7]:
# Return the accuracy on the given test data and class labels
clf.score(X_test_scaled, y_test)
Out[7]:
0.80000000000000004

Confusion matrix

                     prediction positive   prediction negative
condition positive   True Positive (TP)    False Negative (FN)
condition negative   False Positive (FP)   True Negative (TN)
  • $\textrm{Accuracy} = \frac{\textrm{TP} + \textrm{TN}}{\textrm{TP} + \textrm{FP} + \textrm{FN} + \textrm{TN}}$
  • $\textrm{Precision} = \frac{\textrm{TP}}{\textrm{TP} + \textrm{FP}}$
  • $\textrm{Recall} = \frac{\textrm{TP}}{\textrm{TP} + \textrm{FN}}$
  • $\textrm{F1-score} = \frac{2 \cdot \textrm{Precision} \cdot \textrm{Recall}}{\textrm{Precision} + \textrm{Recall}}$
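
scikit-learn also ships these formulas as ready-made functions; a sketch computing the same metrics directly (macro averaging over the three classes is assumed here, one of several possible choices):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred, average='macro'))
print(recall_score(y_test, y_pred, average='macro'))
print(f1_score(y_test, y_pred, average='macro'))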
In [8]:
from sklearn.metrics import confusion_matrix

# Compute a confusion matrix to evaluate the accuracy of a classification
target_names = np.unique(y)
cnf_matrix   = pd.DataFrame(
    confusion_matrix(y_test, y_pred), columns=target_names, index=target_names
)

# Create a figure
f, ax        = plt.subplots()

# Plot rectangular data as a color-encoded matrix
sns.heatmap(cnf_matrix, annot=True, cmap='Blues', ax=ax)

ax.set_title('Confusion matrix')
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
plt.show()

# Clear the current figure
plt.clf()
[Figure: heatmap of the confusion matrix (true label vs. predicted label)]
In [9]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=target_names))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        16
 versicolor       0.74      0.74      0.74        23
  virginica       0.71      0.71      0.71        21

avg / total       0.80      0.80      0.80        60

8. Select the model

  1. Cross validation
  2. Validation curve
  3. Grid search
In [10]:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline        import Pipeline

pipe   = Pipeline([('scale', scaler), ('lr', clf)])

# Evaluate a score by cross validation
scores = cross_val_score(pipe, X_train, y_train, cv=5)

print('CV accuracy {0:.3f} +/- {1:.3f}'.format(np.mean(scores), np.std(scores)))
CV accuracy 0.901 +/- 0.040
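
For a classifier and an integer cv, cross_val_score stratifies the folds by class under the hood; the sketch below spells the splitting out explicitly and should reproduce the same scores (assuming default settings):

from sklearn.model_selection import StratifiedKFold

# Equivalent to cv=5 for a classifier: stratified 5-fold splitting
skf             = StratifiedKFold(n_splits=5)
scores_explicit = cross_val_score(pipe, X_train, y_train, cv=skf)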
In [11]:
from sklearn.model_selection import validation_curve

param_name                 = 'lr__C'
param_range                = [
    0.001, 0.003, 0.01, 0.03,
    0.1,   0.3,   1,    3,
    10,    30,    100,  300,
    1000
]
train_scores, valid_scores = validation_curve(
    pipe, X_train, y_train, param_name=param_name, param_range=param_range, cv=5
)
train_mean                 = np.mean(train_scores, axis=1)
train_std                  = np.std(train_scores, axis=1)
valid_mean                 = np.mean(valid_scores, axis=1)
valid_std                  = np.std(valid_scores, axis=1)

# Create a figure
f, ax                      = plt.subplots()

ax.plot(
    param_range,               train_mean,
    color='#ff2800',           linestyle='-',
    marker='o',                markersize=5,
    label='Training accuracy'
)
ax.fill_between(
    param_range,            train_mean + train_std,
    train_mean - train_std, alpha=0.2,
    color='#ff2800'
)

ax.plot(
    param_range,         valid_mean,
    color='#0041ff',     linestyle='--',
    marker='s',          markersize=5,
    label='CV accuracy'
)
ax.fill_between(
    param_range,            valid_mean + valid_std,
    valid_mean - valid_std, alpha=0.2,
    color='#0041ff'
)

ax.set_title('Validation curve')
ax.set_xlabel('Parameter C')
ax.set_ylabel('Accuracy')
ax.set_xlim(1e-03, 1e+03)
ax.set_ylim(0.75, 1)
ax.set_xscale('log')
ax.legend(loc='lower right')
plt.show()

# Clear the current figure
plt.clf()
[Figure: validation curve of training and CV accuracy versus the parameter C]
In [12]:
from sklearn.model_selection import GridSearchCV

param_grid  = {param_name: param_range}
grid_search = GridSearchCV(pipe, param_grid, cv=5, verbose=1)

# Tune the hyperparameter
grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 13 candidates, totalling 65 fits
[Parallel(n_jobs=1)]: Done  65 out of  65 | elapsed:    0.3s finished
Out[12]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lr', LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'lr__C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
In [13]:
best_pipe = grid_search.best_estimator_

# Return the accuracy on the given test data and class labels
best_pipe.score(X_test, y_test)
Out[13]:
0.93333333333333335
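
The fitted GridSearchCV object also records which value of C won and its mean cross-validation accuracy; a quick inspection sketch:

# Best hyperparameter value and its mean CV accuracy
print(grid_search.best_params_)
print(grid_search.best_score_)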

9. Save the model

In [14]:
# Note: in scikit-learn >= 0.23 use `from joblib import dump` instead
from sklearn.externals.joblib import dump

# Serialize the model
dump(best_pipe, 'lr.pkl')
Out[14]:
['lr.pkl']
In [15]:
# Note: in scikit-learn >= 0.23 use `from joblib import load` instead
from sklearn.externals.joblib import load

# Deserialize the model
another_pipe = load('lr.pkl')
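
A sanity check (an addition to the original notebook) that the deserialized pipeline behaves exactly like the one that was saved:

# Both pipelines should yield identical predictions and accuracy
assert np.array_equal(another_pipe.predict(X_test), best_pipe.predict(X_test))
assert another_pipe.score(X_test, y_test) == best_pipe.score(X_test, y_test)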

10. Implement a new estimator

Perceptron [Rosenblatt, 1957]

Update the model whenever an observed sample is misclassified.

\begin{align} \boldsymbol{w}_{t + 1} = \begin{cases} \boldsymbol{w}_{t} + \eta y_{t} \boldsymbol{x}_{t} & \text{if} \quad y_{t} \langle \boldsymbol{w}_{t}, \boldsymbol{x}_{t} \rangle \leq 0 \\ \boldsymbol{w}_{t} & \text{otherwise} \end{cases} \end{align}
  • feature vector $\boldsymbol{x} \in \mathbb{R}^{m + 1}$
  • label $y \in \{ \pm 1 \}$
  • weight vector $\boldsymbol{w} \in \mathbb{R}^{m + 1}$
  • learning rate $\eta$
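
Before the full estimator below, here is the update rule applied to a single toy sample (a minimal NumPy sketch with made-up numbers and $\eta = 1$):

eta = 1.0
w   = np.zeros(3)                 # weight vector, bias folded in as the last component
x   = np.array([2.0, -1.0, 1.0])  # feature vector with a constant 1 appended
y   = 1                           # true label in {-1, +1}

# Misclassified: y * <w, x> = 0 <= 0, so apply w <- w + eta * y * x
if y * np.dot(w, x) <= 0:
    w += eta * y * x

print(w)  # [ 2. -1.  1.]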

In [16]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils import shuffle
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class MyPerceptron(BaseEstimator, ClassifierMixin):
    """Perceptron.

    Parameters
    ----------
    fit_intercept : bool
        Whether the intercept should be estimated or not.

    learning_rate : float
        The learning rate.

    n_iter : int
        The number of passes over the training data (aka epochs).

    random_state : int, RandomState instance, or None
        The seed of the pseudo random number generator to use when shuffling the data.

    shuffle : bool
        Whether or not the training data should be shuffled after each epoch.
    """

    def __init__(
        self,              fit_intercept=True,
        learning_rate=1.0, n_iter=5,
        random_state=None, shuffle=True
    ):
        self.fit_intercept = fit_intercept
        self.learning_rate = learning_rate
        self.n_iter        = n_iter
        self.random_state  = random_state
        self.shuffle       = shuffle

    def decision_function(self, X):
        """Predict confidence scores for samples.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Samples.

        Returns
        -------
        y_score: array-like, shape = [n_samples]
            Confidence scores.
        """

        return X @ self.coef_ + self.intercept_

    def fit(self, X, y):
        """Fit the model according to the given training data.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Training data.

        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object
            Returns self.
        """

        # Check that X and y have correct shape
        X, y                  = check_X_y(X, y, accept_sparse='csr')

        n_samples, n_features = X.shape
        func                  = np.vectorize(lambda elm: 2 * elm - 1)
        y                     = func(np.copy(y))

        # Initialize the model
        self.coef_            = np.zeros(n_features)
        self.intercept_       = 0.0

        for epoch in range(self.n_iter):
            if self.shuffle:
                X, y          = shuffle(X, y, random_state=self.random_state)

            for i in range(n_samples):
                self._update(X[i:i + 1], y[i:i + 1])

        return self

    def predict(self, X):
        """Predict class labels for samples in X.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Samples.

        Returns
        -------
        y_pred : array, shape = [n_samples]
            Predictions for input data.
        """

        # Check if fit has been called
        check_is_fitted(self, ['coef_', 'intercept_'])

        # Input validation
        X            = check_array(X, accept_sparse='csr')

        inverse_func = np.vectorize(lambda elm: (elm + 1) // 2)
        y_score      = self.decision_function(X)
        y_pred       = inverse_func(np.sign(y_score))

        return y_pred

    def _update(self, X, y):
        """Update the model.

        Parameters
        ----------
        X : array-like, shape = [1, n_features]
            The sample observed on round t.

        y : array-like, shape = [1]
            The target value observed on round t.
        """

        if y * self.decision_function(X) <= 0.0:
            self.coef_          += self.learning_rate * X.T @ y

            if self.fit_intercept:
                self.intercept_ += self.learning_rate * y
In [17]:
from sklearn.multiclass import OneVsRestClassifier

my_pipe = Pipeline([
    ('scaler',  scaler),
    ('prcptrn', OneVsRestClassifier(MyPerceptron(random_state=0)))
])

my_pipe.fit(X_train, y_train)
Out[17]:
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('prcptrn', OneVsRestClassifier(estimator=MyPerceptron(fit_intercept=True, learning_rate=1.0, n_iter=5, random_state=0,
       shuffle=True),
          n_jobs=1))])
In [18]:
my_pipe.score(X_test, y_test)
Out[18]:
0.73333333333333328
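
OneVsRestClassifier handles the three-class problem by fitting one binary perceptron per class and predicting the class whose perceptron reports the highest decision score. A quick look under the hood (assuming the pipeline above has been fitted):

# One fitted MyPerceptron per class
ovr = my_pipe.named_steps['prcptrn']
print(len(ovr.estimators_))  # 3
print(ovr.classes_)          # ['setosa' 'versicolor' 'virginica']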

References

  1. M. Lichman, "UCI Machine Learning Repository", 2013.
  2. W. McKinney, "Python for Data Analysis", O'Reilly Media, 2012.
  3. S. Raschka, "Python Machine Learning", Packt Publishing, 2015.
  4. W. Richert and L. P. Coelho, "Building Machine Learning Systems with Python", Packt Publishing, 2013.
  5. F. Rosenblatt, "The Perceptron: A Perceiving and Recognizing Automaton", Cornell Aeronautical Laboratory, 1957.