Pandas and the Data Ecosystem

with Some Things I Learned from PyData

Today I'm assuming you're familiar with Python and Pandas. I want to talk about some of the high-level pieces of the SciPy ecosystem.

In [1]:
from IPython.display import YouTubeVideo
In [2]:
YouTubeVideo("RrPZza_vZ3w")
Out[2]:
In [3]:
YouTubeVideo("p8hle-ni-DM")
Out[3]:

Anaconda - you should be using it!

In [4]:
from IPython.core.display import HTML
HTML('<iframe src=http://continuum.io/downloads width=700 height=350></iframe>')
Out[4]:

Some conda commands

conda update conda

conda create -n demo python pip

source activate demo

conda info

conda info -e

Conda is also used for building and distributing packages. It is a full replacement for virtualenv and fills in the gaps where pip falls short, like installing the compiled pieces of the SciPy stack.
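For example, to pull the core scientific stack into the demo environment (a sketch; this particular package list is just a common starting point):

conda install -n demo numpy scipy matplotlib pandas scikit-learn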

NumPy

NumPy is the basis for the SciPy stack. Let's take a look at some of its features.

In [5]:
from IPython.core.display import Image 
Image(filename='stack.png') 
Out[5]:
In [6]:
import numpy as np
The basic data structure in NumPy is the ndarray, an N-dimensional array.
In [7]:
x = np.array([[1, 2, 3], 
              [4, 5, 6],
              [7, 8, 9]])

I = np.eye(3)
z = np.zeros((3,3))
In [8]:
print "x = \n", x, '\n\nI = \n', I, '\n\n z = \n', z
x = 
[[1 2 3]
 [4 5 6]
 [7 8 9]] 

I = 
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]] 

 z = 
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]
In a lot of ways, an ndarray can be treated like a list, but there are some interesting differences. You can make an ndarray from a list, but an ndarray can hold only one type of data (its dtype).
In [9]:
integer_list = [1, 2, 3]
integer_array = np.array(integer_list)
print integer_array
[1 2 3]
In [10]:
my_list = [1, 1.0, None]
print my_list
my_list[-1] = "one"
print my_list
[1, 1.0, None]
[1, 1.0, 'one']
In [11]:
my_array = np.array([1, 2, 3])
print my_array
my_array[-1] = "three"
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-a1dd1105d9ae> in <module>()
      1 my_array = np.array([1, 2, 3])
      2 print my_array
----> 3 my_array[-1] = "three"

ValueError: invalid literal for long() with base 10: 'three'
[1 2 3]
The other interesting difference is that the ndarray supports "fancy indexing"!
In [12]:
print "x = \n", x
y = x[x > 3]
print "y = \n", y
x = 
[[1 2 3]
 [4 5 6]
 [7 8 9]]
y = 
[4 5 6 7 8 9]
In [13]:
x[1,:]
Out[13]:
array([4, 5, 6])
Many of the functions that operate on ndarrays are written in C or Fortran, which makes them very fast. One common type of function is the ufunc (universal function), a class of functions that operates elementwise on the array.
In [14]:
np.add(x,x)
Out[14]:
array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])
In [15]:
np.add(x,1)
Out[15]:
array([[ 2,  3,  4],
       [ 5,  6,  7],
       [ 8,  9, 10]])
In [16]:
np.log(x)
Out[16]:
array([[ 0.        ,  0.69314718,  1.09861229],
       [ 1.38629436,  1.60943791,  1.79175947],
       [ 1.94591015,  2.07944154,  2.19722458]])
In [17]:
np.greater(x, 3)
Out[17]:
array([[False, False, False],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)

NumPy also provides a matrix data structure. It's mostly there to make MATLAB users feel more at home: with a matrix, the * operator does matrix multiplication, while on an ndarray it works elementwise, as the next few cells show.

In [18]:
x = np.matrix([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
x
Out[18]:
matrix([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
In [19]:
x * x
Out[19]:
matrix([[ 30,  36,  42],
        [ 66,  81,  96],
        [102, 126, 150]])
In [20]:
y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
y * y
Out[20]:
array([[ 1,  4,  9],
       [16, 25, 36],
       [49, 64, 81]])
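If you stick with plain ndarrays, matrix multiplication is spelled explicitly with np.dot; a quick check that it matches the matrix product above:

In [ ]:
np.dot(y, y)  # matrix product of plain ndarrays; same values as x * x above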

SciPy

SciPy is the core library for scientific computing. It has modules for:

Integration (scipy.integrate)

Optimization (scipy.optimize)

Fourier Transforms (scipy.fftpack)

Linear Algebra (scipy.linalg)

Spatial data structures and algorithms (scipy.spatial)

Statistics (scipy.stats)

And more!

Linear Algebra

SciPy and NumPy both provide linear algebra functions, and where they overlap they are mostly identical in name, behavior, and performance. scipy.linalg covers the core functionality and adds routines numpy.linalg lacks (such as the LU decomposition used below), while numpy.linalg keeps a few conveniences of its own (like np.linalg.cond).

In [21]:
import numpy as np
from scipy import linalg as LA

#Solve Ax = b
A = np.random.rand(5,5)
b = np.random.rand(5)

x = LA.solve(A,b)
print A
print x
print b
[[ 0.92922464  0.49781215  0.01937578  0.62212312  0.35897636]
 [ 0.96401938  0.80590361  0.91703878  0.79958259  0.36575606]
 [ 0.18171917  0.97812798  0.61268685  0.07005332  0.24989442]
 [ 0.75688208  0.31959706  0.49975434  0.84763035  0.78576249]
 [ 0.54993672  0.0328287   0.48166429  0.73207162  0.57560584]]
[-20.4381976   10.82454188  -9.06526981  29.23856419 -10.32956975]
[ 0.70318052  0.30805896  0.78656003  0.1266843   0.20818588]
In [22]:
# Find eigenvalues of A
e = LA.eig(A)
print e
(array([ 2.80077761+0.j, -0.40934814+0.j, -0.01012884+0.j,  0.85934335+0.j,
        0.53040732+0.j]), array([[ 0.39087844,  0.32950893,  0.51420274, -0.52548141, -0.77733774],
       [ 0.60909269, -0.57222447, -0.25281505,  0.32953648, -0.00487729],
       [ 0.35910487,  0.60007381,  0.21969001,  0.7588856 ,  0.43043274],
       [ 0.48035397, -0.00928752, -0.73993232, -0.16311976,  0.27844853],
       [ 0.34135596, -0.4514512 ,  0.27552959, -0.11295835,  0.36457691]]))
In [23]:
# QR Decomposition
Q, R = LA.qr(A)
print Q
print R
[[ 0.56539143 -0.08272076 -0.78802579  0.02830417 -0.22738521]
 [ 0.58656246  0.21663769  0.33182691 -0.69157816  0.14360758]
 [ 0.11056795  0.90417212  0.05219594  0.3626531  -0.18975197]
 [ 0.46052872 -0.15712373  0.18134556  0.56206705  0.64375449]
 [ 0.33461177 -0.32250627  0.48300685  0.2710754  -0.69083079]]
[[ 1.64350677  1.02048963  1.00792072  1.46381176  0.99960193]
 [-0.          0.957002    0.51717374 -0.18418304 -0.03360911]
 [-0.         -0.          0.64428468  0.28604006  0.27204465]
 [-0.          0.          0.          0.16491158  0.4455204 ]
 [-0.          0.         -0.         -0.          0.03167336]]
In [24]:
# LU Decomposition
p, l, u = LA.lu(A)
print p  # Permutation Matrix
print l  # Lower triangular matrix 
print u  # Upper triangular matrix
[[ 0.  0.  1.  0.  0.]
 [ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]]
[[ 1.          0.          0.          0.          0.        ]
 [ 0.18850158  1.          0.          0.          0.        ]
 [ 0.9639066  -0.33768939  1.          0.          0.        ]
 [ 0.7851316  -0.37901001  0.07477825  1.          0.        ]
 [ 0.57046231 -0.51670511 -0.25946494  0.93185554  1.        ]]
[[ 0.96401938  0.80590361  0.91703878  0.79958259  0.36575606]
 [ 0.          0.82621388  0.4398236  -0.08066926  0.18094882]
 [ 0.          0.         -0.71604019 -0.17584096  0.06752617]
 [ 0.          0.          0.          0.20242742  0.56212777]
 [ 0.          0.          0.          0.         -0.04584822]]
In [48]:
# Cholesky Factorization
B = np.tril(A)  # NB: this does NOT actually make B positive definite -- see the note after this cell
L = LA.cholesky(B)
print B
print L
[[ 0.92922464  0.          0.          0.          0.        ]
 [ 0.96401938  0.80590361  0.          0.          0.        ]
 [ 0.18171917  0.97812798  0.61268685  0.          0.        ]
 [ 0.75688208  0.31959706  0.49975434  0.84763035  0.        ]
 [ 0.54993672  0.0328287   0.48166429  0.73207162  0.57560584]]
[[ 0.96396299  0.          0.          0.          0.        ]
 [ 0.          0.89772135  0.          0.          0.        ]
 [ 0.          0.          0.78274316  0.          0.        ]
 [ 0.          0.          0.          0.92066842  0.        ]
 [ 0.          0.          0.          0.          0.75868692]]
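A note on the cell above: np.tril(A) is neither symmetric nor guaranteed positive definite. The call still "works" because LA.cholesky reads only the upper triangle by default, so what actually gets factored is just the diagonal of A — which is why the result above is diagonal. A sketch of a sounder example, building a symmetric positive definite matrix from A first:

In [ ]:
B = A.dot(A.T)                    # symmetric positive definite whenever A has full rank
L = LA.cholesky(B, lower=True)    # lower-triangular factor with B = L L^T
print np.allclose(B, L.dot(L.T))  # should print True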
In [25]:
U, s, Vh = LA.svd(A)
print U
print s
print Vh
[[-0.39813995  0.20643095 -0.84627344 -0.17386545  0.22904785]
 [-0.60336264 -0.25342941  0.01908682  0.73012951 -0.19563655]
 [-0.29637964 -0.79807961  0.10000792 -0.46554494  0.22021438]
 [-0.50019004  0.32392179  0.24745742 -0.46897601 -0.6030845 ]
 [-0.37337265  0.38897739  0.46067364  0.00333472  0.7050235 ]]
[ 2.89926195  1.020993    0.58722569  0.35114502  0.00867173]
[[-0.54820462 -0.39543335 -0.4043862  -0.49950798 -0.36064904]
 [ 0.25618887 -0.75005975 -0.36057043  0.4203798   0.25504396]
 [-0.52648798 -0.3642084   0.69468631  0.0728504   0.31979116]
 [ 0.29781095 -0.29410851  0.42201493  0.13653362 -0.79250918]
 [-0.5176466   0.24878758 -0.21395113  0.74150117 -0.27303419]]
In [30]:
# Calculate the reciprocal of the 2-norm condition number (smallest singular value over largest)
min(LA.svd(A, compute_uv=0))*min(LA.svd(LA.pinv(A), compute_uv=0))
Out[30]:
0.0029910123570078698
In [32]:
# Or, an easier way:
np.linalg.cond(A, -2)  # with p=-2, cond returns the smallest singular value divided by the largest
Out[32]:
0.0029910123570078616
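If the conventional 2-norm condition number (largest over smallest singular value) is what you actually want, the default call gives it directly; a quick check against the value above:

In [ ]:
print np.linalg.cond(A)            # sigma_max / sigma_min
print 1.0 / np.linalg.cond(A, -2)  # reciprocal of the value computed above; should agree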

Statistics

Unlike the linear algebra functions, there is not much overlap between SciPy and NumPy when it comes to statistics.

NumPy mostly provides summary and order statistics that are applied to an ndarray.

SciPy's stats package is much richer.

In [53]:
from scipy import stats
import matplotlib.pyplot as plt

numargs = stats.lognorm.numargs
[ s ] = [0.9,] * numargs
rv = stats.lognorm(s)

#Display frozen pdf
x = np.linspace(0, np.minimum(rv.dist.b, 3))
h = plt.plot(x, rv.pdf(x))
/Users/johndowns/Applications/anaconda/lib/python2.7/site-packages/scipy/stats/distributions.py:4658: RuntimeWarning: divide by zero encountered in log
  return -log(x)**2 / (2*s**2) + np.where(x == 0, 0, -log(s*x*sqrt(2*pi)))
In [60]:
stats.describe?

stats.describe(A)
Out[60]:
(5,
 (array([ 0.18171917,  0.0328287 ,  0.01937578,  0.07005332,  0.24989442]),
  array([ 0.96401938,  0.97812798,  0.91703878,  0.84763035,  0.78576249])),
 array([ 0.6763564 ,  0.5268539 ,  0.50610401,  0.6142922 ,  0.46719903]),
 array([ 0.10345616,  0.14234419,  0.10444232,  0.09972712,  0.04561471]),
 array([-0.69335701, -0.08259186, -0.37031295, -1.24696622,  0.6033557 ]),
 array([-0.92424084, -1.31969207, -0.57151457, -0.1071196 , -1.06108566]))
In [61]:
z = stats.zscore(A)  # Calculates the z score of each value in the sample, relative to the sample mean and standard deviation.
print z
[[ 0.87896406 -0.0860613  -1.68384956  0.02772435 -0.56652728]
 [ 0.99990975  0.82692613  1.42164003  0.6559961  -0.53103669]
 [-1.71934735  1.33728977  0.36872625 -1.92680571 -1.13755266]
 [ 0.27990536 -0.61417765 -0.02196684  0.82610283  1.66762543]
 [-0.43943183 -1.46397695 -0.08454988  0.41698244  0.5674912 ]]
In [73]:
x = np.random.randn(10)
y = np.random.randn(10)
coef, p = stats.pearsonr(x, y)
print coef
print p
0.585566174753
0.0753042348576

Scikits

Scikits are extensions to SciPy that are not appropriate for the core library for various reasons. Two of the most interesting packages are scikit-learn and scikit-image.

In [26]:
from IPython.display import Image
Image(url="http://scikit-learn.org/stable/_static/ml_map.png")
Out[26]:

Let's take a look at how scikit-learn supports supervised and unsupervised learning.

Scikit-learn estimators have a uniform interface

  • Available in all Estimators
    • model.fit() : fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
  • Available in supervised estimators
    • model.predict() : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
    • model.predict_proba() : For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
    • model.score() : for classification or regression problems, most (all?) estimators implement a score method. Scores are typically between 0 and 1, with a larger score indicating a better fit. (predict_proba and score both appear in the short sketch after this list.)
  • Available in unsupervised estimators
    • model.transform() : given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
    • model.fit_transform() : some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
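A minimal, self-contained sketch of predict_proba and score on a toy problem. LogisticRegression and load_iris are my choice of example here, not part of the original notebook; any classifier with predict_proba would do.

In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

model = LogisticRegression().fit(X, y)  # supervised: fit(X, y)
print model.predict(X[:3])              # hard labels for new data
print model.predict_proba(X[:3])        # per-class probabilities
print model.score(X, y)                 # mean accuracy on (X, y)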
In [220]:
# The famous Iris dataset
from sklearn import neighbors, datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

print X[:10], "..."
print y
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]] ...
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
In [234]:
# Visualize the data
import numpy as np
import matplotlib.pyplot as plt

x_index = 2
y_index = 3

# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])

plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
            c=iris.target)
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index]);
In [226]:
# Make a prediction
# create the model
knn = neighbors.KNeighborsClassifier(n_neighbors=3)

# fit the model
knn.fit(X, y)

# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
# call the "predict" method:
result = knn.predict([[3, 5, 4, 2],])

print iris.target_names[result]
['versicolor']
In [221]:
#Exercise: try this with a SVC classifier
from sklearn.svm import SVC
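One possible sketch for the exercise (SVC shares the same fit/predict interface, so the code mirrors the k-NN cell above; the prediction you get depends on the default kernel settings):

In [ ]:
svc = SVC().fit(X, y)
print iris.target_names[svc.predict([[3, 5, 4, 2],])]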
In [235]:
# Dimensionality Reduction - Principal Component Analysis
X, y = iris.data, iris.target
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)
print "Reduced dataset shape:", X_reduced.shape

import pylab as pl
pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)

print "Meaning of the 2 components:"
for component in pca.components_:
    print " + ".join("%.3f x %s" % (value, name)
                     for value, name in zip(component,
                                            iris.feature_names))
Reduced dataset shape: (150, 2)
Meaning of the 2 components:
0.362 x sepal length (cm) + -0.082 x sepal width (cm) + 0.857 x petal length (cm) + 0.359 x petal width (cm)
-0.657 x sepal length (cm) + -0.730 x sepal width (cm) + 0.176 x petal length (cm) + 0.075 x petal width (cm)
In [236]:
# Clustering
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(X)
y_pred = k_means.predict(X)

pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred);
In [237]:
# Validation
# Generate an un-balanced 2D dataset
np.random.seed(0)
X = np.vstack([np.random.normal(0, 1, (950, 2)),
               np.random.normal(-1.8, 0.8, (50, 2))])
y = np.hstack([np.zeros(950), np.ones(50)])

plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='none',
            cmap=plt.cm.Accent)
Out[237]:
<matplotlib.collections.PathCollection at 0x11556b210>
In [239]:
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print "accuracy:", metrics.accuracy_score(y_test, y_pred)
print "precision:", metrics.precision_score(y_test, y_pred)
print "recall:", metrics.recall_score(y_test, y_pred)
print "f1 score:", metrics.f1_score(y_test, y_pred)
accuracy: 0.968
precision: 0.833333333333
recall: 0.625
f1 score: 0.714285714286

What do these mean?

These metrics take into account not just the predicted labels, but how those predictions line up with the true labels.

  • $$ {\rm accuracy} \equiv \frac{\rm correct~labels}{\rm total~samples} $$

  • $$ {\rm precision} \equiv \frac{\rm true~positives}{\rm true~positives + false~positives} $$

  • $$ {\rm recall} \equiv \frac{\rm true~positives}{\rm true~positives + false~negatives} $$

  • $$ F_1 \equiv 2 \cdot \frac{\rm precision \cdot recall}{\rm precision + recall} $$

The accuracy, precision, recall, and F1 score all range from 0 to 1, with 1 being optimal; a small hand computation of each follows the definitions below. Here we've used the following definitions:

  • True Positives are those which are labeled 1 which are actually 1
  • False Positives are those which are labeled 1 which are actually 0
  • True Negatives are those which are labeled 0 which are actually 0
  • False Negatives are those which are labeled 0 which are actually 1
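For concreteness, here's a hand computation of precision, recall, and F1 from those definitions, reusing y_test and y_pred from the cell above; it should reproduce the sklearn values.

In [ ]:
TP = np.sum((y_pred == 1) & (y_test == 1))
FP = np.sum((y_pred == 1) & (y_test == 0))
FN = np.sum((y_pred == 0) & (y_test == 1))

precision = TP / float(TP + FP)
recall = TP / float(TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print precision, recall, f1   # should match the metrics above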
In [240]:
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5)
print X1.shape
print X2.shape

y2_pred = SVC().fit(X1, y1).predict(X2)
y1_pred = SVC().fit(X2, y2).predict(X1)

print np.mean([metrics.precision_score(y1, y1_pred),
               metrics.precision_score(y2, y2_pred)])

from sklearn.cross_validation import cross_val_score

# Let's do a 2-fold cross-validation of the SVC estimator
print cross_val_score(SVC(), X, y, cv=2, scoring='precision')
(500, 2)
(500, 2)
0.794117647059
[ 0.90909091  0.76190476]

Matplotlib

Matplotlib is the plotting library for the SciPy ecosystem.

More interesting: there is a new library, prettyplotlib, that wraps matplotlib with some sensible defaults to make plots more readable.

https://github.com/olgabot/prettyplotlib

A boxplot example

In [66]:
Image("https://raw.github.com/olgabot/prettyplotlib/master/examples/boxplot_matplotlib_default.png")
Out[66]:
In [67]:
Image("https://raw.github.com/olgabot/prettyplotlib/master/examples/boxplot_prettyplotlib_default.png")
Out[67]:
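If you want to generate a comparable default-matplotlib boxplot locally, a minimal sketch on some random samples (the data here is made up purely for illustration):

In [ ]:
np.random.seed(12)
samples = [np.random.normal(0, std, 100) for std in range(1, 4)]
plt.boxplot(samples);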

Pandas

In [49]:
from IPython.display import VimeoVideo
VimeoVideo("79562736")
Out[49]:

Let's try something!

I want to try out scikit-learn. Let's get a dataset and experiment with some of the algorithms.

I'm getting the MovieLens data set, which is a collection of movie ratings from a number of users. We're going to try to classify each user and predict whether they will like a particular movie, using a few classifiers to see which ones perform better.

Some of this tutorial comes from http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/

In [83]:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip ml-100k.zip
!cd ml-100k
--2013-11-16 22:20:45--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org... 128.101.34.146
Connecting to files.grouplens.org|128.101.34.146|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4945826 (4.7M) [application/zip]
Saving to: `ml-100k.zip'

100%[======================================>] 4,945,826   1.13M/s   in 4.5s    

2013-11-16 22:20:50 (1.05 MB/s) - `ml-100k.zip' saved [4945826/4945826]

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test         
  inflating: ml-100k/ua.base         
  inflating: ml-100k/ua.test         
  inflating: ml-100k/ub.base         
  inflating: ml-100k/ub.test         
In [88]:
txt = open('ml-100k/README').read()
print txt
SUMMARY & USAGE LICENSE
=============================================

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
 
This data set consists of:
	* 100,000 ratings (1-5) from 943 users on 1682 movies. 
	* Each user has rated at least 20 movies. 
        * Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th, 
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.

Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set.  The data set may be used for any research
purposes under the following conditions:

     * The user may not state or imply any endorsement from the
       University of Minnesota or the GroupLens Research Group.

     * The user must acknowledge the use of the data set in
       publications resulting from the use of the data set, and must
       send us an electronic or paper copy of those publications.

     * The user may not redistribute the data without separate
       permission.

     * The user may not use this information for any commercial or
       revenue-bearing purposes without first obtaining permission
       from a faculty member of the GroupLens Research Project at the
       University of Minnesota.

If you have any further questions or comments, please contact Jon Herlocker
<[email protected]>. 

ACKNOWLEDGEMENTS
==============================================

Thanks to Al Borchers for cleaning up this data and writing the
accompanying scripts.

PUBLISHED WORK THAT HAS USED THIS DATASET
==============================================

Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
Framework for Performing Collaborative Filtering. Proceedings of the
1999 Conference on Research and Development in Information
Retrieval. Aug. 1999.

FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
==============================================

The GroupLens Research Project is a research group in the Department
of Computer Science and Engineering at the University of Minnesota.
Members of the GroupLens Research Project are involved in many
research projects related to the fields of information filtering,
collaborative filtering, and recommender systems. The project is lead
by professors John Riedl and Joseph Konstan. The project began to
explore automated collaborative filtering in 1992, but is most well
known for its world wide trial of an automated collaborative filtering
system for Usenet news in 1996.  The technology developed in the
Usenet trial formed the base for the formation of Net Perceptions,
Inc., which was founded by members of GroupLens Research. Since then
the project has expanded its scope to research overall information
filtering solutions, integrating in content-based methods as well as
improving current collaborative filtering technology.

Further information on the GroupLens Research project, including
research publications, can be found at the following web site:
        
        http://www.grouplens.org/

GroupLens Research currently operates a movie recommender based on
collaborative filtering:

        http://www.movielens.org/

DETAILED DESCRIPTIONS OF DATA FILES
==============================================

Here are brief descriptions of the data.

ml-data.tar.gz   -- Compressed tar file.  To rebuild the u data files do this:
                gunzip ml-data.tar.gz
                tar xvf ml-data.tar
                mku.sh

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of 
	         user id | item id | rating | timestamp. 
              The time stamps are unix seconds since 1/1/1970 UTC   

u.info     -- The number of users, items, and ratings in the u data set.

u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.

u.genre    -- A list of the genres.

u.user     -- Demographic information about the users; this is a tab
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

u1.base    -- The data sets u1.base and u1.test through u5.base and u5.test
u1.test       are 80%/20% splits of the u data into training and test data.
u2.base       Each of u1, ..., u5 have disjoint test sets; this if for
u2.test       5 fold cross validation (where you repeat your experiment
u3.base       with each training and test set and average the results).
u3.test       These data sets can be generated from u.data by mku.sh.
u4.base
u4.test
u5.base
u5.test

ua.base    -- The data sets ua.base, ua.test, ub.base, and ub.test
ua.test       split the u data into a training set and a test set with
ub.base       exactly 10 ratings per user in the test set.  The sets
ub.test       ua.test and ub.test are disjoint.  These data sets can
              be generated from u.data by mku.sh.

allbut.pl  -- The script that generates training and test sets where
              all but n of a users ratings are in the training data.

mku.sh     -- A shell script to generate all the u data sets from u.data.

In [158]:
import pandas as pd

user_columns = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', names=user_columns, sep='|')

rating_columns = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', names=rating_columns, delim_whitespace=True)

movie_columns = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('ml-100k/u.item', names=movie_columns, sep='|', usecols=range(5))
In [159]:
users
Out[159]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 943 entries, 0 to 942
Data columns (total 5 columns):
user_id       943  non-null values
age           943  non-null values
sex           943  non-null values
occupation    943  non-null values
zip_code      943  non-null values
dtypes: int64(2), object(3)
In [160]:
ratings
Out[160]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 4 columns):
user_id           100000  non-null values
movie_id          100000  non-null values
rating            100000  non-null values
unix_timestamp    100000  non-null values
dtypes: int64(4)
In [161]:
movies
Out[161]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1682 entries, 0 to 1681
Data columns (total 5 columns):
movie_id              1682  non-null values
title                 1682  non-null values
release_date          1681  non-null values
video_release_date    0  non-null values
imdb_url              1679  non-null values
dtypes: float64(1), int64(1), object(3)
In [162]:
movie_ratings = pd.merge(movies, ratings)
movie_ratings
Out[162]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 8 columns):
movie_id              100000  non-null values
title                 100000  non-null values
release_date          99991  non-null values
video_release_date    0  non-null values
imdb_url              99987  non-null values
user_id               100000  non-null values
rating                100000  non-null values
unix_timestamp        100000  non-null values
dtypes: float64(1), int64(4), object(3)
In [163]:
data = pd.merge(movie_ratings, users)
data
Out[163]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 12 columns):
movie_id              100000  non-null values
title                 100000  non-null values
release_date          99991  non-null values
video_release_date    0  non-null values
imdb_url              99987  non-null values
user_id               100000  non-null values
rating                100000  non-null values
unix_timestamp        100000  non-null values
age                   100000  non-null values
sex                   100000  non-null values
occupation            100000  non-null values
zip_code              100000  non-null values
dtypes: float64(1), int64(5), object(6)
In [164]:
data.head()
Out[164]:
movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp age sex occupation zip_code
0 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 308 4 887736532 60 M retired 95076
1 4 Get Shorty (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Get%20Shorty%... 308 5 887737890 60 M retired 95076
2 5 Copycat (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Copycat%20(1995) 308 4 887739608 60 M retired 95076
3 7 Twelve Monkeys (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Twelve%20Monk... 308 4 887738847 60 M retired 95076
4 8 Babe (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Babe%20(1995) 308 5 887736696 60 M retired 95076

Some questions

In [165]:
# How are ages distributed?
data.age.hist(bins=30)
plt.title("Distribution of users' ages")
plt.ylabel('count of users')
plt.xlabel('age');
In [166]:
# How are occupations distributed?
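One possible answer, mirroring the age histogram above (note that, like the age plot, this counts one row per rating rather than per user):

In [ ]:
data.occupation.value_counts().plot(kind='bar')
plt.title("Distribution of users' occupations")
plt.ylabel('count of rating rows');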
In [168]:
# Which movies have been rated the most times?
most_rated = data.title.value_counts()[0:10]
print most_rated
Star Wars (1977)                 583
Contact (1997)                   509
Fargo (1996)                     508
Return of the Jedi (1983)        507
Liar Liar (1997)                 485
English Patient, The (1996)      481
Scream (1996)                    478
Toy Story (1995)                 452
Air Force One (1997)             431
Independence Day (ID4) (1996)    429
dtype: int64
In [ ]:
# Which movie is the most popular (say, by average rating)?
In [ ]:
# How many movies have fewer than 100 ratings?
In [ ]:
# Remove movies with fewer than 100 ratings from the dataset
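One way to approach the last two questions (a sketch, not the only answer): count the ratings per title, then filter on that count.

In [ ]:
ratings_per_title = data.title.value_counts()
print (ratings_per_title < 100).sum()              # movies with fewer than 100 ratings

keep = ratings_per_title[ratings_per_title >= 100].index
data_filtered = data[data.title.isin(keep)]
print data_filtered.shape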
In [216]:
# Extra Credit: How would we go about making a recommender system based on this data set?
#Hint:
VimeoVideo('64445499')
Out[216]:

Generators

Watch this! It was my favorite talk from PyData.

In [74]:
VimeoVideo("79535180")
Out[74]:
Some material taken from: https://github.com/jakevdp/2013_fall_ASTR599/ and http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/