Advanced scikit-learn

Agenda

  • StandardScaler
  • Pipeline (bonus content)

StandardScaler

What is the problem we're trying to solve?

In [1]:
# fake data
import pandas as pd
train = pd.DataFrame({'id':[0,1,2], 'length':[0.9,0.3,0.6], 'mass':[0.1,0.2,0.8], 'rings':[40,50,60]})
test = pd.DataFrame({'length':[0.59], 'mass':[0.79], 'rings':[54]})
In [2]:
# training data
train
Out[2]:
   id  length  mass  rings
0   0     0.9   0.1     40
1   1     0.3   0.2     50
2   2     0.6   0.8     60
In [3]:
# testing data
test
Out[3]:
   length  mass  rings
0    0.59  0.79     54
In [4]:
# define X and y
feature_cols = ['length', 'mass', 'rings']
X = train[feature_cols]
y = train.id
In [5]:
# KNN with K=1
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
Out[5]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=1, p=2, weights='uniform')
In [6]:
# what "should" it predict?
knn.predict(test)
Out[6]:
array([1], dtype=int64)
In [7]:
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14
plt.rcParams['figure.figsize'] = (5, 5)
In [8]:
# create a "colors" array for plotting
import numpy as np
colors = np.array(['red', 'green', 'blue'])
In [9]:
# scatter plot of training data, colored by id (0=red, 1=green, 2=blue)
plt.scatter(train.mass, train.rings, c=colors[train.id], s=50)

# testing data
plt.scatter(test.mass, test.rings, c='white', s=50)

# add labels
plt.xlabel('mass')
plt.ylabel('rings')
plt.title('How we interpret the data')
Out[9]:
<matplotlib.text.Text at 0x1709c198>
In [10]:
# adjust the x-limits
plt.scatter(train.mass, train.rings, c=colors[train.id], s=50)
plt.scatter(test.mass, test.rings, c='white', s=50)
plt.xlabel('mass')
plt.ylabel('rings')
plt.title('How KNN interprets the data')
plt.xlim(0, 30)
Out[10]:
(0, 30)
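
To make the problem concrete, you can compute the raw Euclidean distances from the test point to each training point. This is a quick sketch (not a cell from the original notebook) that assumes the X and test objects defined above: because the rings column is measured on a much larger scale than length and mass, it dominates the distances, so the nearest neighbor ends up being the row with rings=50 (id 1) even though the test point's length and mass are nearly identical to id 2.

# Euclidean distance from the test point to each training point,
# computed on the original (unscaled) features
import numpy as np
print np.linalg.norm(X.values - test.values, axis=1)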

How does StandardScaler solve the problem?

StandardScaler is used for the "standardization" of features, also known as "center and scale" or "z-score normalization": each feature is rescaled by subtracting its mean and dividing by its standard deviation, so that every feature ends up with a mean of 0 and a standard deviation of 1.

In [11]:
# standardize the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
In [12]:
# original values
X.values
Out[12]:
array([[  0.9,   0.1,  40. ],
       [  0.3,   0.2,  50. ],
       [  0.6,   0.8,  60. ]])
In [13]:
# standardized values
X_scaled
Out[13]:
array([[ 1.22474487, -0.86266219, -1.22474487],
       [-1.22474487, -0.53916387,  0.        ],
       [ 0.        ,  1.40182605,  1.22474487]])
In [14]:
# figure out how it standardized
print scaler.mean_
print scaler.std_
[  0.6          0.36666667  50.        ]
[ 0.24494897  0.30912062  8.16496581]
In [15]:
# manually standardize
(X.values - scaler.mean_) / scaler.std_
Out[15]:
array([[ 1.22474487, -0.86266219, -1.22474487],
       [-1.22474487, -0.53916387,  0.        ],
       [ 0.        ,  1.40182605,  1.22474487]])
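
As a side note, the fit and transform steps can be combined into a single call with fit_transform, which produces the same array; a minimal sketch:

# equivalent one-step standardization (same result as fit followed by transform)
X_scaled_alt = StandardScaler().fit_transform(X)

In more recent scikit-learn releases, the per-feature standard deviations are exposed as scaler.scale_ rather than scaler.std_.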

Applying StandardScaler to a real dataset

  • Wine dataset from the UCI Machine Learning Repository (the raw data file is read directly from the UCI URL below; the repository also provides a data dictionary describing each column)
  • Goal: Predict the origin of wine using chemical analysis
In [16]:
# read three columns from the dataset into a DataFrame
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
col_names = ['label', 'color', 'proline']
wine = pd.read_csv(url, header=None, names=col_names, usecols=[0, 10, 13])
In [17]:
wine.head()
Out[17]:
   label  color  proline
0      1   5.64     1065
1      1   4.38     1050
2      1   5.68     1185
3      1   7.80     1480
4      1   4.32      735
In [18]:
wine.describe()
Out[18]:
            label       color      proline
count  178.000000  178.000000   178.000000
mean     1.938202    5.058090   746.893258
std      0.775035    2.318286   314.907474
min      1.000000    1.280000   278.000000
25%      1.000000    3.220000   500.500000
50%      2.000000    4.690000   673.500000
75%      3.000000    6.200000   985.000000
max      3.000000   13.000000  1680.000000
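
Notice the difference in scale between the two features: proline's standard deviation (about 315) is more than a hundred times larger than color's (about 2.3), so unscaled Euclidean distances will be dominated almost entirely by proline. A quick sketch (not from the original notebook) to confirm the ratio:

# how many times larger is the spread of proline than the spread of color?
print wine.proline.std() / wine.color.std()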
In [19]:
# define X and y
feature_cols = ['color', 'proline']
X = wine[feature_cols]
y = wine.label
In [20]:
# split into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
In [21]:
# standardize X_train
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
In [22]:
# check that it standardized properly
print X_train_scaled[:, 0].mean()
print X_train_scaled[:, 0].std()
print X_train_scaled[:, 1].mean()
print X_train_scaled[:, 1].std()
-3.90664944003e-16
1.0
1.6027279754e-16
1.0
In [23]:
# standardize X_test
X_test_scaled = scaler.transform(X_test)
In [24]:
# is this right?
print X_test_scaled[:, 0].mean()
print X_test_scaled[:, 0].std()
print X_test_scaled[:, 1].mean()
print X_test_scaled[:, 1].std()
0.0305898576303
0.866822198488
0.0546533341088
1.14955947533
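
Yes, this is right: the scaled testing data doesn't have a mean of exactly 0 or a standard deviation of exactly 1 because the scaler was fit on the training data only, and the testing data was transformed using the training data's means and standard deviations. That is the correct procedure, since new observations must be scaled with parameters learned from the training data. For contrast, here is a sketch (not from the original notebook) of the wrong approach, which would produce exact zeros and ones but put the testing data on a different scale from the training data:

# the WRONG approach: fitting a separate scaler on the testing data
X_test_wrong = StandardScaler().fit_transform(X_test)
print X_test_wrong.mean(axis=0)
print X_test_wrong.std(axis=0)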
In [25]:
# KNN accuracy on original data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)
from sklearn import metrics
print metrics.accuracy_score(y_test, y_pred_class)
0.644444444444
In [26]:
# KNN accuracy on scaled data
knn.fit(X_train_scaled, y_train)
y_pred_class = knn.predict(X_test_scaled)
print metrics.accuracy_score(y_test, y_pred_class)
0.866666666667

Pipeline (bonus content)

What is the problem we're trying to solve?

In [27]:
# define X and y
feature_cols = ['color', 'proline']
X = wine[feature_cols]
y = wine.label
In [28]:
# proper cross-validation on the original (unscaled) data
knn = KNeighborsClassifier(n_neighbors=3)
from sklearn.cross_validation import cross_val_score
cross_val_score(knn, X, y, cv=5, scoring='accuracy').mean()
Out[28]:
0.71983168041991563
In [29]:
# why is this improper cross-validation on the scaled data?
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
cross_val_score(knn, X_scaled, y, cv=5, scoring='accuracy').mean()
Out[29]:
0.90104247104247115

How does Pipeline solve the problem?

The cross-validation score above is overly optimistic: the scaler was fit on the entire dataset before cross-validation, so information from each held-out fold leaked into the data used for training. Pipeline solves this by chaining the preprocessing and modeling steps into a single object. When a pipeline is passed to cross_val_score, the scaler is refit using only the training portion of each fold, and the held-out fold is transformed with those parameters before prediction:

In [30]:
# fix the cross-validation process using Pipeline
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
Out[30]:
0.89516011810129448
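
make_pipeline names each step automatically after its lowercased class name (which is where the kneighborsclassifier prefix used below comes from). The same pipeline can be built explicitly with the Pipeline class if you prefer to choose the step names yourself; a sketch with hypothetical names:

# equivalent explicit construction (step names 'scaler' and 'knn' are arbitrary choices)
from sklearn.pipeline import Pipeline
pipe_explicit = Pipeline([('scaler', StandardScaler()),
                          ('knn', KNeighborsClassifier(n_neighbors=3))])

With these names, the grid search parameter below would be written knn__n_neighbors instead of kneighborsclassifier__n_neighbors.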

Pipeline can also be used with GridSearchCV for parameter searching. Parameters of a pipeline step are referenced with the step name followed by two underscores (here, kneighborsclassifier__n_neighbors):

In [31]:
# search for an optimal n_neighbors value using GridSearchCV
neighbors_range = range(1, 21)
param_grid = dict(kneighborsclassifier__n_neighbors=neighbors_range)
from sklearn.grid_search import GridSearchCV
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print grid.best_score_
print grid.best_params_
0.910112359551
{'kneighborsclassifier__n_neighbors': 1}
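
Since GridSearchCV refits the best pipeline on all of the data by default, the fitted grid object can be used directly to make predictions; a sketch using a made-up wine sample (the color and proline values below are hypothetical):

# predict the label for a hypothetical new wine sample
new_wine = pd.DataFrame({'color': [5.0], 'proline': [900]})
print grid.predict(new_wine[feature_cols])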