View this IPython Notebook:
j.mp/sklearn
Everything is in a GitHub repo:
github.com/tdhopper/
View slides with:
ipython nbconvert Intro\ to\ Scikit-Learn.ipynb --to slides --post serve
Research Triangle Analysts (1/16/13)
Software Engineer at parse.ly
@tdhopper
tdhopper@gmail.com
"Machine Learning in Python"
Six reasons why Ben Lorica (@bigdata) recommends scikit-learn
One: Commitment to documentation and usability
One of the reasons I started using scikit-learn was because of its nice documentation (which I hold up as an example for other communities and projects to emulate).
Six reasons why Ben Lorica (@bigdata) recommends scikit-learn
Two: Models are chosen and implemented by a dedicated team of experts
Scikit-learn’s stable of contributors includes experts in machine learning and software development.
Six reasons why Ben Lorica (@bigdata) recommends scikit-learn
Three: Covers most machine-learning tasks
Scan the list of things available in scikit-learn and you quickly realize that it includes tools for most of the standard machine-learning tasks (clustering, classification, regression, and so on).
Six reasons why Ben Lorica (@bigdata) recommends scikit-learn
Four: Python and Pydata
An impressive set of Python data tools (PyData) has emerged over the last few years.
Six reasons why Ben Lorica (@bigdata) recommends scikit-learn
Five: Focus
Scikit-learn is a machine-learning library. Its goal is to provide a set of common algorithms to Python users through a consistent interface.
Six reasons why Ben Lorica (@bigdata) recommends scikit-learn
Six: scikit-learn scales to most data problems
Many problems can be tackled using a single (big memory) server, and well-designed software that runs on a single machine can blow away distributed systems.
This talk is not...
...an introduction to Python
...an introduction to machine learning
from sklearn import datasets
from numpy import logical_or
from sklearn.lda import LDA
from sklearn.metrics import confusion_matrix
# Load the iris data and keep only the first two classes (setosa, versicolor)
iris = datasets.load_iris()
subset = logical_or(iris.target == 0, iris.target == 1)
X = iris.data[subset]
y = iris.target[subset]
print X[0:5,:]
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
print y[0:5]
[0 0 0 0 0]
# Linear Discriminant Analysis
lda = LDA(2)
lda.fit(X, y)
confusion_matrix(y, lda.predict(X))
array([[50,  0],
       [ 0, 50]])
The main "interfaces" in scikit-learn are (one class can implement multiple interfaces):
Estimator:
estimator = obj.fit(data, targets)
Predictor:
prediction = obj.predict(data)
Transformer:
new_data = obj.transform(data)
Model:
score = obj.score(data)
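For example, KMeans (used again below) implements all four interfaces; a quick sketch on the X defined above:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 2)
kmeans.fit(X)        # Estimator: fit the model to the data
kmeans.predict(X)    # Predictor: assign each sample to a cluster
kmeans.transform(X)  # Transformer: distance from each sample to each cluster center
kmeans.score(X)      # Model: the opposite of the within-cluster sum of squares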
All estimators implement the fit method:
estimator.fit(X, y)
An estimator is an object that fits a model to some training data and is capable of inferring properties of new data.
from sklearn.linear_model import LogisticRegression
# Create Model
model = LogisticRegression()
# Fit Model
model.fit(X, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
from sklearn.cluster import KMeans
# Create Model
kmeans = KMeans(n_clusters = 2)
# Fit Model
kmeans.fit(X)
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10, n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001, verbose=0)
from sklearn.decomposition import PCA
# Create Model
pca = PCA(n_components=2)
# Fit Model
pca.fit(X)
PCA(copy=True, n_components=2, whiten=False)
The fit method accepts a $y$ parameter even when it isn't needed (in which case $y$ is ignored). This will be important later.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X, y)
PCA(copy=True, n_components=2, whiten=False)
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import matthews_corrcoef
# Create Model
kbest = SelectKBest(k = 3)
# Fit Model
kbest.fit(X, y)
SelectKBest(k=3, score_func=<function f_classif at 0x1139f3398>)
model = LogisticRegression()
model.fit(X, y)
kbest = SelectKBest(k = 1)
kbest.fit(X, y)
kmeans = KMeans(n_clusters = 2)
kmeans.fit(X, y)
pca = PCA(n_components=2)
pca.fit(X, y)
PCA(copy=True, n_components=2, whiten=False)
What can we do with an estimator?
Inference!
model = LogisticRegression()
model.fit(X, y)
print model.coef_
[[-0.40731745 -1.46092371 2.24004724 1.00841492]]
kmeans = KMeans(n_clusters = 2)
kmeans.fit(X)
print kmeans.cluster_centers_
[[ 5.936  2.77   4.26   1.326]
 [ 5.006  3.418  1.464  0.244]]
pca = PCA(n_components=2)
pca.fit(X, y)
print pca.explained_variance_
[ 2.73946394 0.22599044]
kbest = SelectKBest(k = 1)
kbest.fit(X, y)
print kbest.get_support()
[False False True False]
Is that it?
model = LogisticRegression()
model.fit(X, y)
X_test = [[ 5.006, 3.418, 1.464, 0.244], [ 5.936, 2.77 , 4.26 , 1.326]]  # the two cluster centers found by k-means above
model.predict(X_test)
array([0, 1])
print model.predict_proba(X_test)
[[ 0.97741151  0.02258849]
 [ 0.01544837  0.98455163]]
pca = PCA(n_components=2)
pca.fit(X)
print pca.transform(X)[0:5,:]
[[-1.65441341 -0.20660719]
 [-1.63509488  0.2988347 ]
 [-1.82037547  0.27141696]
 [-1.66207305  0.43021683]
 [-1.70358916 -0.21574051]]
fit_transform is also available (and is sometimes faster).
pca = PCA(n_components=2)
print pca.fit_transform(X)[0:5,:]
[[-1.65441341 -0.20660719]
 [-1.63509488  0.2988347 ]
 [-1.82037547  0.27141696]
 [-1.66207305  0.43021683]
 [-1.70358916 -0.21574051]]
kbest = SelectKBest(k = 1)
kbest.fit(X, y)
print kbest.transform(X)[0:5,:]
[[ 1.4]
 [ 1.4]
 [ 1.3]
 [ 1.5]
 [ 1.4]]
from sklearn.cross_validation import KFold
from numpy import arange
from random import shuffle
from sklearn.dummy import DummyClassifier
model = DummyClassifier()
model.fit(X, y)
model.score(X, y)
0.48999999999999999
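The KFold import above points at the natural next step: rather than scoring on the training data, cross-validate. A minimal sketch, assuming the same old-style sklearn.cross_validation API:
from sklearn.cross_validation import cross_val_score
# Score the dummy baseline with 5-fold cross-validation
scores = cross_val_score(DummyClassifier(), X, y, cv = KFold(len(y), n_folds = 5, shuffle = True))
print scores.mean()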
from sklearn.pipeline import Pipeline
pipe = Pipeline([
("select", SelectKBest(k = 3)),
("pca", PCA(n_components = 1)),
("classify", LogisticRegression())
])
pipe.fit(X, y)
pipe.predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1])
Intermediate steps of the pipeline must be both Estimators and Transformers.
The final step need only be an Estimator.
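Because the interfaces are just methods, any class with fit and transform can sit in the middle of a pipeline. A sketch (the Doubler class is made up for illustration):
from sklearn.base import BaseEstimator, TransformerMixin

class Doubler(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        return self      # fit must return self
    def transform(self, X):
        return 2 * X     # any array-in, array-out function will do

pipe = Pipeline([
    ("double", Doubler()),
    ("classify", LogisticRegression())
])
pipe.fit(X, y)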
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
news = fetch_20newsgroups()
data = news.data
category = news.target
len(data)
11314
print " ".join(news.target_names)
alt.atheism comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey sci.crypt sci.electronics sci.med sci.space soc.religion.christian talk.politics.guns talk.politics.mideast talk.politics.misc talk.religion.misc
print data[8]
From: holmes7000@iscsvax.uni.edu
Subject: WIn 3.0 ICON HELP PLEASE!
Organization: University of Northern Iowa
Lines: 10

I have win 3.0 and downloaded several icons and BMP's but I can't figure out how to change the "wallpaper" or use the icons. Any help would be appreciated.

Thanx,

-Brando

PS Please E-mail me
pipe = Pipeline([
('vect', CountVectorizer(max_features = 100)),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier()),
])
pipe.fit(data, category)
Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None, charset_error=None, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=100, min_df=1, ngram_range=(1, 1), prepr..., penalty='l2', power_t=0.5, random_state=None, shuffle=False, verbose=0, warm_start=False))])
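Once fitted, the pipeline maps raw text straight to a predicted category. A quick sketch (the example document is made up):
new_doc = ["I am having trouble with my graphics card drivers"]
predicted = pipe.predict(new_doc)
print news.target_names[predicted[0]]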
import pandas as pd
import numpy as np
import sklearn.preprocessing, sklearn.decomposition, sklearn.linear_model, sklearn.pipeline, sklearn.metrics
from sklearn_pandas import DataFrameMapper, cross_val_score
data = pd.DataFrame({
'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
'children': [4., 6, 3, 3, 2, 3, 5, 4],
'salary': [90, 24, 44, 27, 32, 59, 36, 27]
})
mapper = DataFrameMapper([
('pet', sklearn.preprocessing.LabelBinarizer()),
('children', sklearn.preprocessing.StandardScaler()),
('salary', None)
])
mapper.fit_transform(data)
array([[  1.        ,   0.        ,   0.        ,   0.20851441,  90.        ],
       [  0.        ,   1.        ,   0.        ,   1.87662973,  24.        ],
       [  0.        ,   1.        ,   0.        ,  -0.62554324,  44.        ],
       [  0.        ,   0.        ,   1.        ,  -0.62554324,  27.        ],
       [  1.        ,   0.        ,   0.        ,  -1.4596009 ,  32.        ],
       [  0.        ,   1.        ,   0.        ,  -0.62554324,  59.        ],
       [  1.        ,   0.        ,   0.        ,   1.04257207,  36.        ],
       [  0.        ,   0.        ,   1.        ,   0.20851441,  27.        ]])
mapper = DataFrameMapper([
('pet', sklearn.preprocessing.LabelBinarizer()),
('children', sklearn.preprocessing.StandardScaler()),
('salary', None)
])
pipe = Pipeline([
("mapper", mapper),
("pca", PCA(n_components=2))
])
pipe.fit_transform(data) # 'data' is a data frame, not a numpy array!
array([[ -4.76269151e+01,   4.25991055e-01],
       [  1.83856756e+01,   1.86178138e+00],
       [ -1.62747544e+00,  -5.06199939e-01],
       [  1.53796381e+01,  -8.10331853e-01],
       [  1.03575109e+01,  -1.52528125e+00],
       [ -1.66260441e+01,  -4.27845667e-01],
       [  6.37295205e+00,   9.68066902e-01],
       [  1.53846579e+01,   1.38193738e-02]])
Pandas pipelines require sklearn-pandas module by @paulgb.
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
# Create sample dataset
X, y = datasets.make_classification(n_samples = 1000, n_features = 40, n_informative = 6, n_classes = 2)
# Pipeline for Feature Selection to Random Forest
pipe = Pipeline([
("select", SelectKBest()),
("classify", RandomForestClassifier())
])
# Define parameter grid
param_grid = {
"select__k" : [1, 6, 20, 40],
"classify__n_estimators" : [1, 10, 100],
}
gs = GridSearchCV(pipe, param_grid)
# Search over grid
gs.fit(X, y)
gs.best_params_
{'classify__n_estimators': 10, 'select__k': 6}
print gs.best_estimator_.predict(X.mean(axis = 0))
[1]
The search space grows exponentially with the number of parameters.
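A quick way to count the combinations (ParameterGrid lives next to GridSearchCV in sklearn.grid_search):
from sklearn.grid_search import ParameterGrid
print len(ParameterGrid(param_grid))  # 4 values of k times 3 values of n_estimators = 12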
gs.grid_scores_
[mean: 0.72600, std: 0.02773, params: {'classify__n_estimators': 1, 'select__k': 1},
 mean: 0.78200, std: 0.00631, params: {'classify__n_estimators': 1, 'select__k': 6},
 mean: 0.74400, std: 0.02580, params: {'classify__n_estimators': 1, 'select__k': 20},
 mean: 0.70600, std: 0.05772, params: {'classify__n_estimators': 1, 'select__k': 40},
 mean: 0.73800, std: 0.02372, params: {'classify__n_estimators': 10, 'select__k': 1},
 mean: 0.90000, std: 0.01539, params: {'classify__n_estimators': 10, 'select__k': 6},
 mean: 0.86400, std: 0.01047, params: {'classify__n_estimators': 10, 'select__k': 20},
 mean: 0.81200, std: 0.02247, params: {'classify__n_estimators': 10, 'select__k': 40},
 mean: 0.73600, std: 0.02229, params: {'classify__n_estimators': 100, 'select__k': 1},
 mean: 0.89200, std: 0.01520, params: {'classify__n_estimators': 100, 'select__k': 6},
 mean: 0.89000, std: 0.01769, params: {'classify__n_estimators': 100, 'select__k': 20},
 mean: 0.87000, std: 0.02366, params: {'classify__n_estimators': 100, 'select__k': 40}]
GridSearch on 1 core:
param_grid = {
"select__k" : [1, 5, 10, 15, 20, 25, 30, 35, 40],
"classify__n_estimators" : [1, 5, 10, 25, 50, 75, 100],
}
gs = GridSearchCV(pipe, param_grid, n_jobs = 1)
%timeit gs.fit(X, y)
print
1 loops, best of 3: 6.31 s per loop
GridSearch on 7 cores:
gs = GridSearchCV(pipe, param_grid, n_jobs = 7)
%timeit gs.fit(X, y)
print
1 loops, best of 3: 1.81 s per loop
GridSearchCV can be very slow when the grid is large; this one has 39 × 99 = 3,861 parameter combinations:
param_grid = {
"select__k" : range(1, 40),
"classify__n_estimators" : range(1, 100),
}
gs = GridSearchCV(pipe, param_grid, n_jobs = 7)
gs.fit(X, y)
print "Best CV score", gs.best_score_
print gs.best_params_
Best CV score 0.924
{'classify__n_estimators': 59, 'select__k': 9}
We can instead randomly sample from the parameter space with RandomizedSearchCV:
gs = RandomizedSearchCV(pipe, param_grid, n_jobs = 7, n_iter = 10)
gs.fit(X, y)
print "Best CV score", gs.best_score_
print gs.best_params_
Best CV score 0.894
{'classify__n_estimators': 58, 'select__k': 7}