How to Build a Minimal Recommendation System (1 hr)¶

The recommendation problem¶

Recommenders have been around since at least 1992. Today we see different flavours of recommenders, deployed across different verticals:

Amazon
Netflix
Facebook
Last.fm.

What exactly do they do?

Definitions from the literature¶

In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997

Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al, 1992

In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005

Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012

Notation¶

$U$ is the set of users in our domain. Its size is $|U|$.
$I$ is the set of items in our domain. Its size is $|I|$.
$I(u)$ is the set of items that user $u$ has rated.
$-I(u)$ is the complement of $I(u)$ i.e., the set of items not yet seen by user $u$.
$U(i)$ is the set of users that have rated item $i$.
$-U(i)$ is the complement of $U(i)$.

Goal of a recommendation system¶

$$ \newcommand{\argmax}{\mathop{\rm argmax}\nolimits} \forall{u \in U},\; i^* = \argmax_{i \in -I(u)} [S(u,i)] $$

Problem statement¶

The recommendation problem in its most basic form is quite simple to define:

|-------------------+-----+-----+-----+-----+-----|
| user_id, movie_id | m_1 | m_2 | m_3 | m_4 | m_5 |
|-------------------+-----+-----+-----+-----+-----|
| u_1               | ?   | ?   | 4   | ?   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_2               | 3   | ?   | ?   | 2   | 2   |
|-------------------+-----+-----+-----+-----+-----|
| u_3               | 3   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_4               | ?   | 1   | 2   | 1   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_5               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_6               | 2   | ?   | 2   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_7               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_8               | 3   | 1   | 5   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_9               | ?   | ?   | ?   | ?   | 2   |
|-------------------+-----+-----+-----+-----+-----|

Given a partially filled matrix of ratings ($|U|x|I|$), estimate the missing values.

Content-based filtering¶

Generic expression (notice how this is kind of a 'row-based' approach):

$$ \newcommand{\aggr}{\mathop{\rm aggr}\nolimits} r_{u,i} = \aggr_{i' \in I(u)} [r_{u,i'}] $$

Collaborative filtering¶

Generic expression (notice how this is kind of a 'col-based' approach):

$$ \newcommand{\aggr}{\mathop{\rm aggr}\nolimits} r_{u,i} = \aggr_{u' \in U(i)} [r_{u',i}] $$

Collaborative filtering: simple ratings-based recommendations¶

Also based solely on ratings information.

$$ r_{u,i} = \bar r_i = \frac{\sum_{u' \in U(i)} r_{u',i}}{|U(i)|} $$

Hybrid solutions¶

The literature has lots of examples of systems that try to combine the strengths of the two main approaches. This can be done in a number of ways:

Combine the predictions of a content-based system and a collaborative system.
Incorporate content-based techniques into a collaborative approach.
Incorporarte collaborative techniques into a content-based approach.
Unifying model.

Challenges¶

Availability of item metadata¶

Content-based techniques are limited by the amount of metadata that is available to describe an item. There are domains in which feature extraction methods are expensive or time consuming, e.g., processing multimedia data such as graphics, audio/video streams. In the context of grocery items for example, it's often the case that item information is only partial or completely missing. Examples include:

Ingredients
Nutrition facts
Brand
Description
County of origin

New user problem¶

A user has to have rated a sufficient number of items before a recommender system can have a good idea of what their preferences are. In a content-based system, the aggregation function needs ratings to aggregate.

New item problem¶

Collaborative filters rely on an item being rated by many users to compute aggregates of those ratings. Think of this as the exact counterpart of the new user problem for content-based systems.

Data sparsity¶

When looking at the more general versions of content-based and collaborative systems, the success of the recommender system depends on the availability of a critical mass of user/item iteractions. We get a first glance at the data sparsity problem by quantifying the ratio of existing ratings vs $|U|x|I|$. A highly sparse matrix of interactions makes it difficult to compute similarities between users and items. As an example, for a user whose tastes are unusual compared to the rest of the population, there will not be any other users who are particularly similar, leading to poor recommendations.

Dataset¶

MovieLens from GroupLens Research: grouplens.org

The MovieLens 1M data set contains 1 million ratings collected from 6000 users on 4000 movies.

Flow chart: the big picture¶

In [2]:

from IPython.core.display import Image 
Image(filename='./recsys_arch.png')

Out[2]:

The MovieLens dataset: loading and first look¶

Loading of the MovieLens dataset here is based on the intro chapter of 'Python for Data Analysis".

The MovieLens data is spread across three files. Using the pd.read_table method we load each file:

In [1]:

import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('../data/ml-1m/users.dat',
                      sep='::', header=None, names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('../data/ml-1m/ratings.dat',
                        sep='::', header=None, names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('../data/ml-1m/movies.dat',
                       sep='::', header=None, names=mnames)

# show how one of them looks
ratings.head(5)

Out[1]:

	user_id	movie_id	rating	timestamp
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291

In [10]:

# show how one of them looks
users[:5]

Out[10]:

	user_id	gender	age	occupation	zip
0	1	F	1	10	48067
1	2	M	56	16	70072
2	3	M	25	15	55117
3	4	M	45	7	02460
4	5	M	25	20	55455

In [20]:

movies[:5]

Out[20]:

	movie_id	title	genres
0	1	Toy Story (1995)	Animation\|Children's\|Comedy
1	2	Jumanji (1995)	Adventure\|Children's\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama
4	5	Father of the Bride Part II (1995)	Comedy

Using pd.merge we get it all into one big DataFrame.

In [2]:

movielens = pd.merge(pd.merge(ratings, users), movies)
movielens

Out[2]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
user_id       1000209  non-null values
movie_id      1000209  non-null values
rating        1000209  non-null values
timestamp     1000209  non-null values
gender        1000209  non-null values
age           1000209  non-null values
occupation    1000209  non-null values
zip           1000209  non-null values
title         1000209  non-null values
genres        1000209  non-null values
dtypes: int64(6), object(4)

In [12]:

movielens.ix[0]

Out[12]:

user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object

Collaborative filtering: generalizations of the aggregation function¶

Non-personalized recommendations¶

Groupby¶

The idea of groupby is that of split-apply-combine:

split data in an object according to a given key;
apply a function to each subset;
combine results into a new object.

To get mean movie ratings for each film grouped by gender, we can use the pivot_table method:

In [57]:

mean_ratings = movielens.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
mean_ratings[:5]

Out[57]:

gender	F	M
title
$1,000,000 Duck (1971)	NaN	2.0
'Til There Was You (1997)	1.0	NaN
'burbs, The (1989)	3.0	3.5
10 Things I Hate About You (1999)	3.5	3.0
101 Dalmatians (1961)	4.0	2.8

Now let's filter down to movies that received at least 250 ratings (a completely arbitrary number);

To do this, I group the data by title and use size() to get a Series of group sizes for each title:

In [24]:

ratings_by_title = movielens.groupby('title').size()

ratings_by_title[:10]

Out[24]:

title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64

In [27]:

active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles[:10]

Out[27]:

Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)', u'101 Dalmatians (1961)', u'101 Dalmatians (1996)', u'12 Angry Men (1957)', u'13th Warrior, The (1999)', u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)', u'2001: A Space Odyssey (1968)', u'2010 (1984)'], dtype=object)

The index of titles receiving at least 250 ratings can then be used to select rows from mean_ratings above:

In [30]:

mean_ratings = mean_ratings.ix[active_titles]
mean_ratings

Out[30]:

<class 'pandas.core.frame.DataFrame'>
Index: 1216 entries, 'burbs, The (1989) to eXistenZ (1999)
Data columns (total 2 columns):
F    1216  non-null values
M    1216  non-null values
dtypes: float64(2)

To see the top films among female viewers, we can sort by the F column in descending order:

In [34]:

top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
top_female_ratings[:10]

Out[34]:

gender	F	M
title
Close Shave, A (1995)	4.644444	4.473795
Wrong Trousers, The (1993)	4.588235	4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)	4.572650	4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)	4.563107	4.385075
Schindler's List (1993)	4.562602	4.491415
Shawshank Redemption, The (1994)	4.539075	4.560625
Grand Day Out, A (1992)	4.537879	4.293255
To Kill a Mockingbird (1962)	4.536667	4.372611
Creature Comforts (1990)	4.513889	4.272277
Usual Suspects, The (1995)	4.513317	4.518248

Measuring rating disagreement¶

Suppose you wanted to find the movies that are most divisive between male and female viewers.

One way is to add a column to mean_ratings containing the difference in means, then sort by that:

In [37]:

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_index(by='diff')
sorted_by_diff[:15]

Out[37]:

gender	F	M	diff
title
Dirty Dancing (1987)	3.790378	2.959596	-0.830782
Jumpin' Jack Flash (1986)	3.254717	2.578358	-0.676359
Grease (1978)	3.975265	3.367041	-0.608224
Little Women (1994)	3.870588	3.321739	-0.548849
Steel Magnolias (1989)	3.901734	3.365957	-0.535777
Anastasia (1997)	3.800000	3.281609	-0.518391
Rocky Horror Picture Show, The (1975)	3.673016	3.160131	-0.512885
Color Purple, The (1985)	4.158192	3.659341	-0.498851
Age of Innocence, The (1993)	3.827068	3.339506	-0.487561
Free Willy (1993)	2.921348	2.438776	-0.482573
French Kiss (1995)	3.535714	3.056962	-0.478752
Little Shop of Horrors, The (1960)	3.650000	3.179688	-0.470312
Guys and Dolls (1955)	4.051724	3.583333	-0.468391
Mary Poppins (1964)	4.197740	3.730594	-0.467147
Patch Adams (1998)	3.473282	3.008746	-0.464536

Reversing the order of the rows and again slicing off the top 15 rows, we get the movies preferred by men that women didn’t rate as highly:

In [39]:

# Reverser order of rows, take first 15 rows 
sorted_by_diff[::-1][:15]

Out[39]:

gender	F	M	diff
title
Good, The Bad and The Ugly, The (1966)	3.494949	4.221300	0.726351
Kentucky Fried Movie, The (1977)	2.878788	3.555147	0.676359
Dumb & Dumber (1994)	2.697987	3.336595	0.638608
Longest Day, The (1962)	3.411765	4.031447	0.619682
Cable Guy, The (1996)	2.250000	2.863787	0.613787
Evil Dead II (Dead By Dawn) (1987)	3.297297	3.909283	0.611985
Hidden, The (1987)	3.137931	3.745098	0.607167
Rocky III (1982)	2.361702	2.943503	0.581801
Caddyshack (1980)	3.396135	3.969737	0.573602
For a Few Dollars More (1965)	3.409091	3.953795	0.544704
Porky's (1981)	2.296875	2.836364	0.539489
Animal House (1978)	3.628906	4.167192	0.538286
Exorcist, The (1973)	3.537634	4.067239	0.529605
Fright Night (1985)	2.973684	3.500000	0.526316
Barb Wire (1996)	1.585366	2.100386	0.515020

Suppose instead you wanted the movies that elicited the most disagreement among viewers, independent of gender. Disagreement can be measured by the variance or standard deviation of the ratings:

In [41]:

# Standard deviation of rating grouped by title
rating_std_by_title = movielens.groupby('title')['rating'].std()
# Filter down to active_titles
rating_std_by_title = rating_std_by_title.ix[active_titles]
# Order Series by value in descending order 
rating_std_by_title.order(ascending=False)[:10]

Out[41]:

title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64

Evaluation¶

Before we attempt to express the basic equations for content-based or collaborative filtering we need a basic mechanism to evaluate the performance of our engine.

Evaluation: split ratings into train and test sets¶

This subsection will generate training and testing sets for evaluation. You do not need to understand every single line of code, just the general gist:

take a smaller sample from the full 1M dataset for speed reasons;
make sure that we have at least 2 ratings per user in that subset;
split the result into training and testing sets.

In [4]:

# let's work with a smaller subset for speed reasons
import numpy as np
movielens = movielens.ix[np.random.choice(movielens.index, size=10000, replace=False)]
print movielens.shape
print movielens.user_id.nunique()
print movielens.movie_id.nunique()

(10000, 10)
3670
2274

In [6]:

user_ids_larger_1 = pd.value_counts(movielens.user_id, sort=False) > 1
user_ids_larger_1

Out[6]:

4098     True
8       False
2057    False
24       True
2073     True
2081    False
4130     True
2089    False
48       True
2097    False
2105     True
4162    False
72      False
2121     True
4170    False
...
4022     True
4030     True
1983     True
1999    False
4054     True
4062     True
2015     True
4070    False
2023     True
4078     True
2031     True
4086     True
2039     True
4094     True
2047     True
Length: 3670, dtype: bool

In [8]:

movielens = movielens[user_ids_larger_1[movielens.user_id].values]
print movielens.shape
np.all(movielens.user_id.value_counts() > 1)

                  

(8491, 10)

Out[8]:

True

We now generate train and test subsets using groupby and apply.

In [9]:

import numpy as np
def assign_to_set(df):
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.ix[sampled_ids, 'for_testing'] = True
    return df

movielens['for_testing'] = False
grouped = movielens.groupby('user_id', group_keys=False).apply(assign_to_set)
movielens_train = movielens[grouped.for_testing == False]
movielens_test = movielens[grouped.for_testing == True]
print movielens_train.shape
print movielens_test.shape
print movielens_train.index & movielens_test.index

(5843, 11)
(2648, 11)
Int64Index([], dtype=int64)

Store these two sets in text files:

In [10]:

movielens_train.to_csv('../data/movielens_train.csv')
movielens_test.to_csv('../data/movielens_test.csv')

Evaluation: performance criterion¶

Performance evaluation of recommendation systems is an entire topic all in itself. Some of the options include:

RMSE: $\sqrt{\frac{\sum(\hat y - y)^2}{n}}$
Precision / Recall / F-scores
ROC curves
Cost curves

In [11]:

def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

Evaluation: the 'evaluate' method¶

In [12]:

def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    ids_to_estimate = zip(movielens_test.user_id, movielens_test.movie_id)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = movielens_test.rating.values
    return compute_rmse(estimated, real)

Minimal reco engine v1.0: simple mean ratings¶

Content-based filtering using mean ratings¶

With this table-like representation of the ratings data, a basic content-based filter becomes a one-liner function.

In [13]:

def estimate1(user_id, item_id):
    """ Simple content-filtering based on mean ratings. """
    return movielens_train.ix[movielens_train.user_id == user_id, 'rating'].mean()

print 'RMSE for estimate1: %s' % evaluate(estimate1)

RMSE for estimate1: 1.28728396133

In [14]:

def estimate2(user_id, movie_id):
    """ Simple collaborative filter based on mean ratings. """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty: return 3.0
    return ratings_by_others.rating.mean()

print 'RMSE for estimate2: %s' % evaluate(estimate2)

RMSE for estimate2: 1.14189918845

Pivoting¶

Let's start with a simple pivoting example that does not involve any aggregation. We can extract a ratings matrix as follows:

In [15]:

# transform the ratings frame into a ratings matrix
ratings_mtx_df = movielens_train.pivot_table(values='rating',
                                             rows='user_id',
                                             cols='movie_id')
ratings_mtx_df

Out[15]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2161 entries, 2 to 6040
Columns: 1972 entries, 1 to 3952
dtypes: float64(1972)

In [16]:

# with an integer axis index only label-based indexing is possible
ratings_mtx_df.ix[ratings_mtx_df.index[-15:],ratings_mtx_df.columns[:15]]

Out[16]:

movie_id	1	2	3	4	6	7	10	11	12	14	15	16	17	18	19
user_id
6003	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6007	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6010	NaN	NaN	NaN	NaN	NaN	NaN	4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6011	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6015	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6016	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6018	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6021	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6025	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6028	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6033	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6035	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6036	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6037	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6040	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Collaborative-based filtering using custom sim functions¶

Pearson correlation

$$ sim(x,y) = \frac{(x - \bar x).(y - \bar y)}{\sqrt{(x - \bar x).(x - \bar x) * (y - \bar y)(y - \bar y)}} $$

Let's go by a simple example first... Another movies ratings.¶

In [19]:

import numpy as np; import pandas as pd; from pandas import Series, DataFrame
rating = pd.read_csv('../data/movie_rating.csv')
rp = rating.pivot_table(cols=['critic'],rows=['title'],values='rating')
rp

Out[19]:

critic	Claudia Puig	Gene Seymour	Jack Matthews	Lisa Rose	Mick LaSalle	Toby
title
Just My Luck	3.0	1.5	NaN	3.0	2	NaN
Lady in the Water	NaN	3.0	3.0	2.5	3	NaN
Snakes on a Plane	3.5	3.5	4.0	3.5	4	4.5
Superman Returns	4.0	5.0	5.0	3.5	3	4.0
The Night Listener	4.5	3.0	3.0	3.0	3	NaN
You Me and Dupree	2.5	3.5	3.5	2.5	2	1.0

Pandas has nicely filled in NaN in the cells for movies not reviewed by a critic.

The next step is to find the similarity score between the critics. The author Toby is used as an example. We introduced a somewhat involving formula the Pearson correlation score. Turn out this is simply the correlation coefficient supported in most statistical packages. In Pandas, you can use corrwith() to calculate the correlation. A score close to 1 means their tastes are very similar. As you can see in the result below, Lisa Rose's taste is very similar to Toby but it is not so much with Gene Seymour.

In [20]:

rating_toby = rp['Toby']
sim_toby = rp.corrwith(rating_toby)
sim_toby

Out[20]:

critic
Claudia Puig     0.893405
Gene Seymour     0.381246
Jack Matthews    0.662849
Lisa Rose        0.991241
Mick LaSalle     0.924473
Toby             1.000000
dtype: float64

To make recommendation for Toby, we calculate a rating of others weighted by the similarity. Note that we only need to calculate rating for movies Toby has not yet seen. The first line below filter out irrelevant data. It then assign the similarity score and the weighted rating.

In [27]:

criteria = ((rating_toby[rating.title].isnull()) & (rating.critic != 'Toby')).values
rating_c = rating[criteria]
rating_c['similarity'] = rating_c['critic'].map(sim_toby.get)
rating_c['sim_rating'] = rating_c.similarity * rating_c.rating

rating_c

Out[27]:

	critic	title	rating	similarity	sim_rating
0	Jack Matthews	Lady in the Water	3.0	0.662849	1.988547
4	Jack Matthews	The Night Listener	3.0	0.662849	1.988547
5	Mick LaSalle	Lady in the Water	3.0	0.924473	2.773420
7	Mick LaSalle	Just My Luck	2.0	0.924473	1.848947
10	Mick LaSalle	The Night Listener	3.0	0.924473	2.773420
12	Claudia Puig	Just My Luck	3.0	0.893405	2.680215
15	Claudia Puig	The Night Listener	4.5	0.893405	4.020323
16	Lisa Rose	Lady in the Water	2.5	0.991241	2.478102
18	Lisa Rose	Just My Luck	3.0	0.991241	2.973722
20	Lisa Rose	The Night Listener	3.0	0.991241	2.973722
25	Gene Seymour	Lady in the Water	3.0	0.381246	1.143739
27	Gene Seymour	Just My Luck	1.5	0.381246	0.571870
30	Gene Seymour	The Night Listener	3.0	0.381246	1.143739

Lastly we add up the score for each title using groupby(). We also normalize the score by dividing it with the sum of the weights. Base on other critics' similarity and their rating, we have made a movie recommendation for Toby. The number matches the result of the book.

In [28]:

recommendation = rating_c.groupby('title').apply(lambda s: s.sim_rating.sum() / s.similarity.sum())
recommendation.order(ascending=False)

Out[28]:

title
The Night Listener    3.347790
Lady in the Water     2.832550
Just My Luck          2.530981
dtype: float64

Going back to our movielens data.¶

In [29]:

def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))

In [47]:

class CollabFiltering:
    """ Collaborative filtering using a custom sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.all_user_profiles = movielens.pivot_table('rating', rows='movie_id', cols='user_id')

    def estimate(self, user_id, movie_id):
        """ Ratings weighted by correlation similarity. """
        ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
        if ratings_by_others.empty: return 3.0
        ratings_by_others.set_index('user_id', inplace=True)
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[user_id]
        sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ ratings_sims.sim > 0]
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.rating, weights=ratings_sims.sim)
        
reco = CollabFiltering()
reco.learn()
print 'RMSE for CollabFiltering: %s' % evaluate(reco.estimate)

RMSE for CollabFiltering: 1.10156350645

What are we going to do next ?¶

Introduction to Scikit-Learn

References and further reading¶

William Wesley McKinney. Python for Data Analysis. O’Reilly, 2012.
Goldberg, D., D. Nichols, B. M. Oki, and D. Terry. “Using Collaborative Filtering to Weave an Information Tapestry.” Communications of the ACM 35, no. 12 (1992): 61–70.
Resnick, Paul, and Hal R. Varian. “Recommender Systems.” Commun. ACM 40, no. 3 (March 1997): 56–58. doi:10.1145/245108.245121.
Adomavicius, Gediminas, and Alexander Tuzhilin. “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions.” IEEE Transactions on Knowledge and Data Engineering 17, no. 6 (2005): 734–749. doi:http://doi.ieeecomputersociety.org/10.1109/TKDE.2005.99.
Adomavicius, Gediminas, Ramesh Sankaranarayanan, Shahana Sen, and Alexander Tuzhilin. “Incorporating Contextual Information in Recommender Systems Using a Multidimensional Approach.” ACM Trans. Inf. Syst. 23, no. 1 (2005): 103–145. doi:10.1145/1055709.1055714.
Koren, Y., R. Bell, and C. Volinsky. “Matrix Factorization Techniques for Recommender Systems.” Computer 42, no. 8 (2009): 30–37.
Toby Segaran. Programming Collective Intelligence. O’Reilly, 2007.

In [ ]: