%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(
    style='white',
    context='talk',
    palette='Set1'
)
Scikit-learn is a Python package for machine learning. We will work through one of the many tutorials from the scikit-learn website. You can install scikit-learn with conda install scikit-learn.
Supervised learning consists of learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.
All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.
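For example, the basic pattern looks like this (a minimal sketch; X_train, y_train, and X_test are placeholders for arrays we will create below):
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()   # any supervised estimator follows the same API
estimator.fit(X_train, y_train)      # learn the link between X and y
y_pred = estimator.predict(X_test)   # predict labels for unseen observations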
Fisher's Iris dataset poses a classification task: identify 3 different types of irises (Setosa, Versicolour, and Virginica) from their petal and sepal length and width.
Let's start by loading the dataset.
import sklearn.datasets
iris = sklearn.datasets.load_iris()
print("Features:", iris.feature_names)
print("Types:", iris.target_names)
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target_names[iris.target]
df.head()
The simplest possible classifier is the nearest neighbor: given a new observation X_test, find in the training set (i.e. the data used to train the estimator) the observation with the closest feature vector.
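To make the idea concrete, here is a minimal sketch of a 1-nearest-neighbor prediction written directly with NumPy (assuming X_train, y_train, and a single new observation x_new are arrays; scikit-learn's implementation is more general and more efficient):
def nearest_neighbor_predict(X_train, y_train, x_new):
    # Euclidean distance from x_new to every training observation
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # return the label of the closest training observation
    return y_train[np.argmin(distances)]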
Training set and testing set. While experimenting with any learning algorithm, it is important not to test the prediction of an estimator on the data used to fit the estimator as this would not be evaluating the performance of the estimator on new data. This is why datasets are often split into train and test data.
Split the dataset into train and test data using a random permutation. This is easily done with functions from the model_selection module, which has many methods for splitting datasets. We'll use a very simple one, train_test_split, which splits the data by sampling a fraction of the rows (without replacement) into the training set and putting the rest in the test set.
from sklearn.model_selection import train_test_split
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
Import the nearest-neighbor classifier, then create and fit it:
import sklearn.neighbors as nb
knn = nb.KNeighborsClassifier()
knn.fit(X_train, y_train)
Predict the labels (Iris species) for the test data and compare with the real labels:
y_hat = knn.predict(X_test)
print(y_hat)
print(y_test)
print('Accuracy:', (y_hat == y_test).mean())
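Classifiers in scikit-learn also provide a score(X, y) method that computes this mean accuracy directly, so the following should print the same number:
print('Accuracy:', knn.score(X_test, y_test))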
To reduce the dimensionality of the problem (4 features, i.e. 4D), we can use Seaborn's PairGrid plot to look at the joint distributions of each pair of features and choose a more informative subset.
sns.pairplot(df, hue='target', plot_kws={'s': 20});
From this figure it seems that using just the petal measurements (the last two columns in our features matrix) will produce a good separation between blue and the others, and a decent one between green and red.
Let's try it.
X_train = X_train[:, 2:]
X_test = X_test[:, 2:]
Fit and predict:
knn = nb.KNeighborsClassifier()
knn.fit(X_train, y_train)
y_hat = knn.predict(X_test)
print(y_hat)
print(y_test)
print('Accuracy:', (y_hat == y_test).mean())
We didn't gain any accuracy, but that's expected as the test set size is just 50. However, now that we are in 2D we can plot the classifier fit:
from matplotlib.colors import ListedColormap
h = .02 # step size in the mesh
X = iris.data[:,2:]
y = iris.target
# Create color maps
cmap_light = ListedColormap(sorted(sns.color_palette('Pastel1', 3)))
cmap_bold = ListedColormap(sorted(sns.color_palette('Set1', 3)))
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(
    np.arange(x_min, x_max, h),
    np.arange(y_min, y_max, h)
)
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
fig, ax = plt.subplots()
ax.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=20)
ax.set(
    xlim=(xx.min(), xx.max()),
    ylim=(yy.min(), yy.max()),
    xlabel=iris.feature_names[2],
    ylabel=iris.feature_names[3]
);
Learning to recognize handwritten digits with a K-nearest neighbors classifier, inspired by the IPython Interactive Computing and Visualization Cookbook.
Start by looking at the data. We'll use IPython's widgets to create a slider so we can move between the more than 1,500 digit images in scikit-learn's datasets package.
X, y = sklearn.datasets.load_digits(return_X_y=True)
X.shape, y.shape
from ipywidgets import interact
@interact(idx=(0, X.shape[0] - 1))
def show_digit(idx):
    fig, ax = plt.subplots(figsize=(1, 1))
    ax.matshow(X[idx].reshape(8, 8), cmap='gray_r')
    ax.set(xticks=[], yticks=[])
    sns.despine(left=True, bottom=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
knn = nb.KNeighborsClassifier()
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
Can we do any better with a different classifier? Let's try the logistic regression classifier. In this model, we try to estimate the probability $p_k$ that a data sample (image) belongs to class $k$ (digit $k$).
from sklearn import linear_model
logistic = linear_model.LogisticRegression()
logistic.fit(X_train, y_train)
logistic.score(X_test, y_test)
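Since logistic regression estimates the class probabilities $p_k$ directly, we can inspect them with predict_proba; for example, for the first few test images (each row of probabilities sums to 1):
probs = logistic.predict_proba(X_test[:3])
print(probs.round(3))
print('Predicted:', logistic.predict(X_test[:3]))
print('True:     ', y_test[:3])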
How about a neural network?
from sklearn import neural_network
nn = neural_network.MLPClassifier(hidden_layer_sizes=(1000, 500))
nn.fit(X_train, y_train)
nn.score(X_test, y_test)
As you can see, the models all have the same API, which allows us to use them like this:
from sklearn import ensemble
from sklearn import svm
models = [
    nb.KNeighborsClassifier(),
    linear_model.LogisticRegression(),
    neural_network.MLPClassifier(),
    neural_network.MLPClassifier(hidden_layer_sizes=(1000, 500)),
    ensemble.RandomForestClassifier(n_estimators=10),
    ensemble.RandomForestClassifier(n_estimators=100),
    ensemble.RandomForestClassifier(n_estimators=1000),
    svm.SVC()
]
for model in models:
    model.fit(X_train, y_train)
    print(model)
    print(model.score(X_test, y_test))
    print()
The wine dataset contains 13 features and 3 target labels. Apply one of the classifiers in scikit-learn to this dataset. Fit the classifier and score it.
dataset = sklearn.datasets.load_wine()
print(dataset['DESCR'])
X = dataset.data
y = dataset.target
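One possible solution, using a random forest (any of the classifiers above would work with the same fit/score pattern):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = ensemble.RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
model.score(X_test, y_test)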
We'll work with the diabetes dataset:
Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.
diabetes = sklearn.datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
Let's look at the features (X):
df = pd.DataFrame(data=X, columns=diabetes.feature_names)
sns.pairplot(df, plot_kws=dict(alpha=0.25));
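Since the target here is a quantitative measure rather than a class label, this is a regression problem. As a first, minimal sketch (using the same split-fit-score pattern as above), we could fit an ordinary linear regression and check its R² score on held-out data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
regressor = linear_model.LinearRegression()
regressor.fit(X_train, y_train)
regressor.score(X_test, y_test)  # R² on the test set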