This notebook was put together by Jake Vanderplas for PyCon 2015. Source and license info is on GitHub.

An Introduction to scikit-learn: Machine Learning in Python

Goals of this Tutorial

  • Introduce the basics of Machine Learning, and some skills useful in practice.
  • Introduce the syntax of scikit-learn, so that you can make use of the rich toolset available.



9:00 - 9:15 Preliminaries: Setup & introduction

  • Making sure your computer is set up

9:15 - 10:00 Basic Principles of Machine Learning and the Scikit-learn Interface

  • What is Machine Learning?
  • Machine learning data layout
  • Supervised Learning
    • Classification
    • Regression
    • Measuring performance
  • Unsupervised Learning
    • Clustering
    • Dimensionality Reduction
    • Density Estimation
  • Evaluation of Learning Models
  • Choosing the right algorithm for your dataset

10:00 - 10:45 Supervised learning in-depth

  • Support Vector Machines
  • Decision Trees and Random Forests

10:45 - 11:00 Break

11:00 - 11:45 Unsupervised learning in-depth

  • Dimensionality Reduction: Principal Component Analysis
  • Clustering: K Means
  • Density Estimation: Gaussian Mixture Models
  • Application: image color compression

11:45 - 12:20 Validation and Model Selection

  • Overfitting, underfitting, bias, and variance
  • Improving your fit: validation curves and learning curves
  • Application: facial recognition


This tutorial requires the following packages:

  • numpy
  • scipy
  • matplotlib
  • scikit-learn
  • IPython, with notebook support
  • seaborn

The easiest way to get these is to use the conda environment manager. I suggest downloading and installing Miniconda.

The following command will install all required packages:

$ conda install numpy scipy matplotlib scikit-learn ipython-notebook seaborn

Alternatively, you can download and install the (very large) Anaconda software distribution.

Checking your installation

You can run the following code to check the versions of the packages on your system:

(In the IPython notebook, press Shift and Return together to execute the contents of a cell.)

In [1]:
from __future__ import print_function

import IPython
print('IPython:', IPython.__version__)

import numpy
print('numpy:', numpy.__version__)

import scipy
print('scipy:', scipy.__version__)

import matplotlib
print('matplotlib:', matplotlib.__version__)

import sklearn
print('scikit-learn:', sklearn.__version__)

import seaborn
print('seaborn:', seaborn.__version__)

IPython: 2.4.1
numpy: 1.9.2
scipy: 0.15.1
matplotlib: 1.4.3
scikit-learn: 0.15.2
seaborn: 0.5.1
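
The check above stops at the first package that fails to import. If you would rather see the whole picture at once, here is a minimal sketch of a more forgiving variant. The helper name `report_versions` and the `'NOT INSTALLED'` and `'unknown'` labels are my own choices for illustration, not part of any of these libraries.

```python
import importlib


def report_versions(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Record the failure and keep going instead of stopping here.
            versions[name] = None
            continue
        # Not every module defines __version__, so fall back to 'unknown'.
        versions[name] = getattr(module, '__version__', 'unknown')
    return versions


# Import names of the packages checked above
# (note that scikit-learn is imported as 'sklearn').
required = ['IPython', 'numpy', 'scipy', 'matplotlib', 'sklearn', 'seaborn']

for name, version in sorted(report_versions(required).items()):
    label = version if version is not None else 'NOT INSTALLED'
    print('{0}: {1}'.format(name, label))
```

Any package reported as NOT INSTALLED can then be added with conda before the tutorial starts.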

Useful Resources