Recommenders have been around since at least 1992. Today we see different flavours of recommenders, deployed across different verticals.
What exactly do they do?
In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997
Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al., 1992
In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005
Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012
The recommendation problem in its most basic form is quite simple to define:
user_id \ movie_id | m_1 | m_2 | m_3 | m_4 | m_5
---|---|---|---|---|---
u_1 | ? | ? | 4 | ? | 1
u_2 | 3 | ? | ? | 2 | 2
u_3 | 3 | ? | ? | ? | ?
u_4 | ? | 1 | 2 | 1 | 1
u_5 | ? | ? | ? | ? | ?
u_6 | 2 | ? | 2 | ? | ?
u_7 | ? | ? | ? | ? | ?
u_8 | 3 | 1 | 5 | ? | ?
u_9 | ? | ? | ? | ? | 2
Given a partially filled $|U| \times |I|$ matrix of ratings, estimate the missing values.
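As a warm-up, here is a minimal sketch of that setup on the first three rows of the toy matrix above (NaN standing in for '?'); the row and column means computed here are the simplest versions of the aggregates formalized just below:

import numpy as np
import pandas as pd

# first three rows of the toy |U| x |I| matrix above; NaN plays the role of '?'
toy = pd.DataFrame([[np.nan, np.nan, 4, np.nan, 1],
                    [3, np.nan, np.nan, 2, 2],
                    [3, np.nan, np.nan, np.nan, np.nan]],
                   index=['u_1', 'u_2', 'u_3'],
                   columns=['m_1', 'm_2', 'm_3', 'm_4', 'm_5'])

# two of the simplest ways to fill in a missing r_{u,i}:
user_means = toy.mean(axis=1)  # aggregate over the items each user rated
item_means = toy.mean(axis=0)  # aggregate over the users who rated each item

print user_means['u_1']  # 2.5: estimate for any movie u_1 hasn't rated
print item_means['m_1']  # 3.0: estimate for any user who hasn't rated m_1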
Generic expression (notice how this is kind of a 'row-based' approach):

$$ \newcommand{\aggr}{\mathop{\rm aggr}\nolimits} r_{u,i} = \aggr_{i' \in I(u)} [r_{u,i'}] $$

Generic expression (notice how this is kind of a 'col-based' approach):

$$ \newcommand{\aggr}{\mathop{\rm aggr}\nolimits} r_{u,i} = \aggr_{u' \in U(i)} [r_{u',i}] $$

A simple instance, also based solely on ratings information, is the mean rating of item $i$:

$$ r_{u,i} = \bar r_i = \frac{\sum_{u' \in U(i)} r_{u',i}}{|U(i)|} $$

The literature has many examples of systems that try to combine the strengths of the two main approaches. This can be done in a number of ways.
Content-based techniques are limited by the amount of metadata that is available to describe an item. There are domains in which feature-extraction methods are expensive or time consuming, e.g., processing multimedia data such as graphics or audio/video streams. In the context of grocery items, for example, it's often the case that item information is only partial or completely missing.
A user has to have rated a sufficient number of items before a recommender system can have a good idea of what their preferences are. In a content-based system, the aggregation function needs ratings to aggregate.
Collaborative filters rely on an item being rated by many users to compute aggregates of those ratings. Think of this as the exact counterpart of the new user problem for content-based systems.
When looking at the more general versions of content-based and collaborative systems, the success of the recommender system depends on the availability of a critical mass of user/item interactions. We get a first glance at the data sparsity problem by quantifying the ratio of observed ratings to the size of the full $|U| \times |I|$ matrix. A highly sparse matrix of interactions makes it difficult to compute similarities between users and items. As an example, for a user whose tastes are unusual compared to the rest of the population, there will not be any other users who are particularly similar, leading to poor recommendations.
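As a quick check, that ratio can be computed directly from a ratings frame; a sketch, assuming a frame with user_id and movie_id columns like the ratings frame we load below (one row per observed rating):

# fraction of the |U| x |I| matrix that is actually observed
n_users = ratings.user_id.nunique()
n_items = ratings.movie_id.nunique()
density = float(len(ratings)) / (n_users * n_items)
print 'matrix density: %.4f (sparsity: %.4f)' % (density, 1 - density)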
MovieLens from GroupLens Research: grouplens.org
The MovieLens 1M data set contains 1 million ratings collected from 6000 users on 4000 movies.
from IPython.core.display import Image
Image(filename='./recsys_arch.png')
Loading of the MovieLens dataset here is based on the intro chapter of 'Python for Data Analysis'.
The MovieLens data is spread across three files. Using the pd.read_table
method we load each file:
import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('../data/ml-1m/users.dat',
                      sep='::', header=None, names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('../data/ml-1m/ratings.dat',
                        sep='::', header=None, names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('../data/ml-1m/movies.dat',
                       sep='::', header=None, names=mnames)
# show how one of them looks
ratings.head(5)
  | user_id | movie_id | rating | timestamp
---|---|---|---|---
0 | 1 | 1193 | 5 | 978300760 |
1 | 1 | 661 | 3 | 978302109 |
2 | 1 | 914 | 3 | 978301968 |
3 | 1 | 3408 | 4 | 978300275 |
4 | 1 | 2355 | 5 | 978824291 |
# show how one of them looks
users[:5]
  | user_id | gender | age | occupation | zip
---|---|---|---|---|---
0 | 1 | F | 1 | 10 | 48067 |
1 | 2 | M | 56 | 16 | 70072 |
2 | 3 | M | 25 | 15 | 55117 |
3 | 4 | M | 45 | 7 | 02460 |
4 | 5 | M | 25 | 20 | 55455 |
movies[:5]
  | movie_id | title | genres
---|---|---|---
0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy
1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy
2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance
3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama
4 | 5 | Father of the Bride Part II (1995) | Comedy
Using pd.merge
we get it all into one big DataFrame.
movielens = pd.merge(pd.merge(ratings, users), movies)
movielens
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
user_id       1000209 non-null values
movie_id      1000209 non-null values
rating        1000209 non-null values
timestamp     1000209 non-null values
gender        1000209 non-null values
age           1000209 non-null values
occupation    1000209 non-null values
zip           1000209 non-null values
title         1000209 non-null values
genres        1000209 non-null values
dtypes: int64(6), object(4)
movielens.ix[0]
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
The idea of groupby is that of split-apply-combine:
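For example, mean ratings by title and gender can be computed with an explicit split-apply-combine; this groupby form is equivalent in result to the pivot_table call that follows:

# split the rows by (title, gender), apply mean() to each group's
# ratings, then combine; unstack pivots the gender level into columns
movielens.groupby(['title', 'gender'])['rating'].mean().unstack()[:5]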
To get mean movie ratings for each film grouped by gender, we can use the pivot_table method:
mean_ratings = movielens.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
mean_ratings[:5]
title | F | M
---|---|---
$1,000,000 Duck (1971) | NaN | 2.0 |
'Til There Was You (1997) | 1.0 | NaN |
'burbs, The (1989) | 3.0 | 3.5 |
10 Things I Hate About You (1999) | 3.5 | 3.0 |
101 Dalmatians (1961) | 4.0 | 2.8 |
Now let's filter down to movies that received at least 250 ratings (a completely arbitrary number).
To do this, I group the data by title and use size() to get a Series of group sizes for each title:
ratings_by_title = movielens.groupby('title').size()
ratings_by_title[:10]
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles[:10]
Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
       u'2 Days in the Valley (1996)',
       u'20,000 Leagues Under the Sea (1954)',
       u'2001: A Space Odyssey (1968)', u'2010 (1984)'], dtype=object)
The index of titles receiving at least 250 ratings can then be used to select rows from mean_ratings above:
mean_ratings = mean_ratings.ix[active_titles]
mean_ratings
<class 'pandas.core.frame.DataFrame'>
Index: 1216 entries, 'burbs, The (1989) to eXistenZ (1999)
Data columns (total 2 columns):
F    1216 non-null values
M    1216 non-null values
dtypes: float64(2)
To see the top films among female viewers, we can sort by the F column in descending order:
top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
top_female_ratings[:10]
title | F | M
---|---|---
Close Shave, A (1995) | 4.644444 | 4.473795 |
Wrong Trousers, The (1993) | 4.588235 | 4.478261 |
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.572650 | 4.464589 |
Wallace & Gromit: The Best of Aardman Animation (1996) | 4.563107 | 4.385075 |
Schindler's List (1993) | 4.562602 | 4.491415 |
Shawshank Redemption, The (1994) | 4.539075 | 4.560625 |
Grand Day Out, A (1992) | 4.537879 | 4.293255 |
To Kill a Mockingbird (1962) | 4.536667 | 4.372611 |
Creature Comforts (1990) | 4.513889 | 4.272277 |
Usual Suspects, The (1995) | 4.513317 | 4.518248 |
Suppose you wanted to find the movies that are most divisive between male and female viewers.
One way is to add a column to mean_ratings containing the difference in means, then sort by that:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_index(by='diff')
sorted_by_diff[:15]
title | F | M | diff
---|---|---|---
Dirty Dancing (1987) | 3.790378 | 2.959596 | -0.830782 |
Jumpin' Jack Flash (1986) | 3.254717 | 2.578358 | -0.676359 |
Grease (1978) | 3.975265 | 3.367041 | -0.608224 |
Little Women (1994) | 3.870588 | 3.321739 | -0.548849 |
Steel Magnolias (1989) | 3.901734 | 3.365957 | -0.535777 |
Anastasia (1997) | 3.800000 | 3.281609 | -0.518391 |
Rocky Horror Picture Show, The (1975) | 3.673016 | 3.160131 | -0.512885 |
Color Purple, The (1985) | 4.158192 | 3.659341 | -0.498851 |
Age of Innocence, The (1993) | 3.827068 | 3.339506 | -0.487561 |
Free Willy (1993) | 2.921348 | 2.438776 | -0.482573 |
French Kiss (1995) | 3.535714 | 3.056962 | -0.478752 |
Little Shop of Horrors, The (1960) | 3.650000 | 3.179688 | -0.470312 |
Guys and Dolls (1955) | 4.051724 | 3.583333 | -0.468391 |
Mary Poppins (1964) | 4.197740 | 3.730594 | -0.467147 |
Patch Adams (1998) | 3.473282 | 3.008746 | -0.464536 |
Reversing the order of the rows and again slicing off the top 15 rows, we get the movies preferred by men that women didn’t rate as highly:
# Reverse order of rows, take first 15 rows
sorted_by_diff[::-1][:15]
title | F | M | diff
---|---|---|---
Good, The Bad and The Ugly, The (1966) | 3.494949 | 4.221300 | 0.726351 |
Kentucky Fried Movie, The (1977) | 2.878788 | 3.555147 | 0.676359 |
Dumb & Dumber (1994) | 2.697987 | 3.336595 | 0.638608 |
Longest Day, The (1962) | 3.411765 | 4.031447 | 0.619682 |
Cable Guy, The (1996) | 2.250000 | 2.863787 | 0.613787 |
Evil Dead II (Dead By Dawn) (1987) | 3.297297 | 3.909283 | 0.611985 |
Hidden, The (1987) | 3.137931 | 3.745098 | 0.607167 |
Rocky III (1982) | 2.361702 | 2.943503 | 0.581801 |
Caddyshack (1980) | 3.396135 | 3.969737 | 0.573602 |
For a Few Dollars More (1965) | 3.409091 | 3.953795 | 0.544704 |
Porky's (1981) | 2.296875 | 2.836364 | 0.539489 |
Animal House (1978) | 3.628906 | 4.167192 | 0.538286 |
Exorcist, The (1973) | 3.537634 | 4.067239 | 0.529605 |
Fright Night (1985) | 2.973684 | 3.500000 | 0.526316 |
Barb Wire (1996) | 1.585366 | 2.100386 | 0.515020 |
Suppose instead you wanted the movies that elicited the most disagreement among viewers, independent of gender. Disagreement can be measured by the variance or standard deviation of the ratings:
# Standard deviation of rating grouped by title
rating_std_by_title = movielens.groupby('title')['rating'].std()
# Filter down to active_titles
rating_std_by_title = rating_std_by_title.ix[active_titles]
# Order Series by value in descending order
rating_std_by_title.order(ascending=False)[:10]
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
Before we attempt to express the basic equations for content-based or collaborative filtering, we need a basic mechanism to evaluate the performance of our engine.
This subsection will generate training and testing sets for evaluation. You do not need to understand every single line of code, just the general gist:
# let's work with a smaller subset for speed reasons
import numpy as np
movielens = movielens.ix[np.random.choice(movielens.index, size=10000, replace=False)]
print movielens.shape
print movielens.user_id.nunique()
print movielens.movie_id.nunique()
(10000, 10)
3670
2274
# keep only users with more than one rating, so every user can
# contribute rows to both the training and the testing set
user_ids_larger_1 = pd.value_counts(movielens.user_id, sort=False) > 1
user_ids_larger_1
4098     True
8       False
2057    False
24       True
2073     True
2081    False
4130     True
2089    False
48       True
2097    False
2105     True
4162    False
72      False
2121     True
4170    False
...
4022     True
4030     True
1983     True
1999    False
4054     True
4062     True
2015     True
4070    False
2023     True
4078     True
2031     True
4086     True
2039     True
4094     True
2047     True
Length: 3670, dtype: bool
movielens = movielens[user_ids_larger_1[movielens.user_id].values]
print movielens.shape
np.all(movielens.user_id.value_counts() > 1)
(8491, 10)
True
We now generate train and test subsets using groupby and apply.
import numpy as np

def assign_to_set(df):
    """ Mark a random ~20% of this user's ratings as test rows. """
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.ix[sampled_ids, 'for_testing'] = True
    return df
movielens['for_testing'] = False
grouped = movielens.groupby('user_id', group_keys=False).apply(assign_to_set)
movielens_train = movielens[grouped.for_testing == False]
movielens_test = movielens[grouped.for_testing == True]
print movielens_train.shape
print movielens_test.shape
print movielens_train.index & movielens_test.index
(5843, 11)
(2648, 11)
Int64Index([], dtype=int64)
Store these two sets in text files:
movielens_train.to_csv('../data/movielens_train.csv')
movielens_test.to_csv('../data/movielens_test.csv')
Performance evaluation of recommendation systems is an entire topic all in itself. Here we use one of the most common options: the root mean squared error (RMSE) between predicted and actual ratings on the test set.
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    ids_to_estimate = zip(movielens_test.user_id, movielens_test.movie_id)
    estimated = np.array([estimate_f(u, i) for (u, i) in ids_to_estimate])
    real = movielens_test.rating.values
    return compute_rmse(estimated, real)
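As a sanity check of the harness, we can first evaluate a trivial constant predictor (an illustrative baseline, not one of the estimators that follow):

def estimate_constant(user_id, movie_id):
    """ Baseline: always predict the midpoint of the 1-5 rating scale. """
    return 3.0

print 'RMSE for constant baseline: %s' % evaluate(estimate_constant)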
With this table-like representation of the ratings data, a basic content-based filter becomes a one-liner function.
def estimate1(user_id, item_id):
    """ Simple content-filtering based on mean ratings. """
    return movielens_train.ix[movielens_train.user_id == user_id, 'rating'].mean()
print 'RMSE for estimate1: %s' % evaluate(estimate1)
RMSE for estimate1: 1.28728396133
def estimate2(user_id, movie_id):
    """ Simple collaborative filter based on mean ratings. """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty:
        return 3.0
    return ratings_by_others.rating.mean()
print 'RMSE for estimate2: %s' % evaluate(estimate2)
RMSE for estimate2: 1.14189918845
Let's start with a simple pivoting example that does not involve any aggregation. We can extract a ratings matrix as follows:
# transform the ratings frame into a ratings matrix
ratings_mtx_df = movielens_train.pivot_table(values='rating',
                                             rows='user_id',
                                             cols='movie_id')
ratings_mtx_df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2161 entries, 2 to 6040
Columns: 1972 entries, 1 to 3952
dtypes: float64(1972)
# with an integer axis index only label-based indexing is possible
ratings_mtx_df.ix[ratings_mtx_df.index[-15:],ratings_mtx_df.columns[:15]]
user_id \ movie_id | 1 | 2 | 3 | 4 | 6 | 7 | 10 | 11 | 12 | 14 | 15 | 16 | 17 | 18 | 19
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
6003 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6007 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6010 | NaN | NaN | NaN | NaN | NaN | NaN | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6011 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6015 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6016 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6018 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6021 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6025 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6028 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6033 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6035 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6036 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6037 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6040 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
rating = pd.read_csv('../data/movie_rating.csv')
rp = rating.pivot_table(cols=['critic'], rows=['title'], values='rating')
rp
title | Claudia Puig | Gene Seymour | Jack Matthews | Lisa Rose | Mick LaSalle | Toby
---|---|---|---|---|---|---
Just My Luck | 3.0 | 1.5 | NaN | 3.0 | 2 | NaN |
Lady in the Water | NaN | 3.0 | 3.0 | 2.5 | 3 | NaN |
Snakes on a Plane | 3.5 | 3.5 | 4.0 | 3.5 | 4 | 4.5 |
Superman Returns | 4.0 | 5.0 | 5.0 | 3.5 | 3 | 4.0 |
The Night Listener | 4.5 | 3.0 | 3.0 | 3.0 | 3 | NaN |
You Me and Dupree | 2.5 | 3.5 | 3.5 | 2.5 | 2 | 1.0 |
Pandas has nicely filled in NaN for the cells of movies not reviewed by a critic.
The next step is to find the similarity scores between the critics; the critic Toby is used as an example. We introduced a somewhat involved formula, the Pearson correlation score. It turns out this is simply the correlation coefficient supported by most statistical packages. In pandas, you can use corrwith() to calculate the correlation. A score close to 1 means two critics' tastes are very similar. As you can see in the result below, Lisa Rose's taste is very similar to Toby's, while Gene Seymour's is much less so.
rating_toby = rp['Toby']
sim_toby = rp.corrwith(rating_toby)
sim_toby
critic
Claudia Puig     0.893405
Gene Seymour     0.381246
Jack Matthews    0.662849
Lisa Rose        0.991241
Mick LaSalle     0.924473
Toby             1.000000
dtype: float64
To make recommendations for Toby, we calculate the ratings of the other critics weighted by their similarity to him. Note that we only need to calculate ratings for movies Toby has not yet seen. The first line below filters out the irrelevant data; we then assign to each row its critic's similarity score and the similarity-weighted rating.
criteria = ((rating_toby[rating.title].isnull()) & (rating.critic != 'Toby')).values
rating_c = rating[criteria]
rating_c['similarity'] = rating_c['critic'].map(sim_toby.get)
rating_c['sim_rating'] = rating_c.similarity * rating_c.rating
rating_c
  | critic | title | rating | similarity | sim_rating
---|---|---|---|---|---
0 | Jack Matthews | Lady in the Water | 3.0 | 0.662849 | 1.988547 |
4 | Jack Matthews | The Night Listener | 3.0 | 0.662849 | 1.988547 |
5 | Mick LaSalle | Lady in the Water | 3.0 | 0.924473 | 2.773420 |
7 | Mick LaSalle | Just My Luck | 2.0 | 0.924473 | 1.848947 |
10 | Mick LaSalle | The Night Listener | 3.0 | 0.924473 | 2.773420 |
12 | Claudia Puig | Just My Luck | 3.0 | 0.893405 | 2.680215 |
15 | Claudia Puig | The Night Listener | 4.5 | 0.893405 | 4.020323 |
16 | Lisa Rose | Lady in the Water | 2.5 | 0.991241 | 2.478102 |
18 | Lisa Rose | Just My Luck | 3.0 | 0.991241 | 2.973722 |
20 | Lisa Rose | The Night Listener | 3.0 | 0.991241 | 2.973722 |
25 | Gene Seymour | Lady in the Water | 3.0 | 0.381246 | 1.143739 |
27 | Gene Seymour | Just My Luck | 1.5 | 0.381246 | 0.571870 |
30 | Gene Seymour | The Night Listener | 3.0 | 0.381246 | 1.143739 |
Lastly, we add up the scores for each title using groupby(). We also normalize each score by dividing it by the sum of the similarity weights. Based on the other critics' similarities and ratings, we have made movie recommendations for Toby. The numbers match the results in the book.
recommendation = rating_c.groupby('title').apply(lambda s: s.sim_rating.sum() / s.similarity.sum())
recommendation.order(ascending=False)
title
The Night Listener    3.347790
Lady in the Water     2.832550
Just My Luck          2.530981
dtype: float64
def pearson(s1, s2):
    """ Take two pd.Series objects and return a Pearson correlation. """
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
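A quick hand check of the helper on two fully observed toy Series (illustrative values only; note that a NaN in either Series propagates to a NaN correlation, which the class below silently discards via its sim > 0 filter):

s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = pd.Series([2.0, 4.0, 6.0, 8.0])
print pearson(s1, s2)    # 1.0: perfectly correlated
print pearson(s1, -s2)   # -1.0: perfectly anti-correlated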
class CollabFiltering:
    """ Collaborative filtering using a custom sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.all_user_profiles = movielens.pivot_table('rating', rows='movie_id', cols='user_id')

    def estimate(self, user_id, movie_id):
        """ Ratings weighted by correlation similarity. """
        ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
        if ratings_by_others.empty:
            return 3.0
        ratings_by_others.set_index('user_id', inplace=True)
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[user_id]
        sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ratings_sims.sim > 0]
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.rating, weights=ratings_sims.sim)
reco = CollabFiltering()
reco.learn()
print 'RMSE for CollabFiltering: %s' % evaluate(reco.estimate)
RMSE for CollabFiltering: 1.10156350645