In this notebook, I'll detail a basic version of model-based collaborative filtering for recommendations by applying it to the MovieLens 1M dataset.
In my previous attempt, I used user-based and item-based collaborative filtering to make movie recommendations from users' ratings data. I could only try them on a very small data sample (20,000 ratings), and ended up with a pretty high Root Mean Squared Error (i.e., bad recommendations). Memory-based collaborative filtering approaches that compute distance relationships between items or users have two major issues: they don't scale well to large datasets, and they struggle with sparse ratings matrices.
Thus I'd need to apply a dimensionality reduction technique to derive the tastes and preferences from the raw data, a process otherwise known as low-rank matrix factorization. Why reduce dimensions?
With that goal in mind, I'll introduce Singular Value Decomposition (SVD), a powerful dimensionality reduction technique that is used heavily in modern model-based CF recommender systems.
Let's load the 3 data files just like last time.
# Import libraries
import numpy as np
import pandas as pd
# Reading ratings file
ratings = pd.read_csv('ratings.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating', 'timestamp'])
# Reading users file
users = pd.read_csv('users.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])
# Reading movies file
movies = pd.read_csv('movies.csv', sep='\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])
Let's take a look at the movies and ratings dataframes.
movies.head()
movie_id | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
ratings.head()
user_id | movie_id | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 1193 | 5 | 978300760 |
1 | 1 | 661 | 3 | 978302109 |
2 | 1 | 914 | 3 | 978301968 |
3 | 1 | 3408 | 4 | 978300275 |
4 | 1 | 2355 | 5 | 978824291 |
Also let's count the number of unique users and movies.
n_users = ratings.user_id.unique().shape[0]
n_movies = ratings.movie_id.unique().shape[0]
print 'Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies)
Number of users = 6040 | Number of movies = 3706
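As a quick aside, `unique().shape[0]` and pandas' `nunique()` count the same thing; a toy check (the values are made up):

```python
import pandas as pd

# Toy series: three distinct values among six entries
s = pd.Series([1, 1, 2, 3, 3, 3])
print(s.unique().shape[0], s.nunique())  # both report 3
```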
Now I want the format of my ratings matrix to be one row per user and one column per movie. To do so, I'll pivot ratings and call the new variable Ratings (with a capital R).
Ratings = ratings.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
Ratings.head()
movie_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 3943 | 3944 | 3945 | 3946 | 3947 | 3948 | 3949 | 3950 | 3951 | 3952 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
1 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 3706 columns
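To illustrate what `pivot` does here, a minimal sketch on a tiny hypothetical ratings frame (the column names mirror the real data; the values are made up):

```python
import pandas as pd

# Toy ratings: 2 users, 2 movies, one user/movie pair unrated
toy = pd.DataFrame({
    'user_id':  [1, 1, 2],
    'movie_id': [10, 20, 10],
    'rating':   [5, 3, 4],
})

# One row per user, one column per movie; unrated pairs become 0
toy_matrix = toy.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)
print(toy_matrix)
```

User 2 never rated movie 20, so that cell is filled with 0, exactly as in the big `Ratings` matrix above.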
Last but not least, I need to de-mean the data (normalize it by each user's mean rating) and convert it from a dataframe to a numpy array.
R = Ratings.as_matrix()
user_ratings_mean = np.mean(R, axis = 1)
Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)
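The de-meaning step works like this on a tiny made-up matrix; note that, as in the notebook, the unrated cells (zeros) are included when computing each user's mean, which is a simplification this approach accepts:

```python
import numpy as np

# Tiny made-up ratings matrix: 2 users, 3 movies
R_toy = np.array([[5.0, 3.0, 1.0],
                  [4.0, 0.0, 2.0]])

toy_means = np.mean(R_toy, axis=1)                 # per-row means: [3.0, 2.0]
R_toy_demeaned = R_toy - toy_means.reshape(-1, 1)  # broadcast subtract over columns
print(R_toy_demeaned)  # each row now averages to 0
```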
With my ratings matrix properly formatted and normalized, I'm ready to do some dimensionality reduction. But first, let's go over the math.
Model-based collaborative filtering is based on matrix factorization (MF), which has received wide exposure mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used in recommender systems because it handles scalability and sparsity better than memory-based CF.
For example, let's check the sparsity of the ratings dataset:
sparsity = round(1.0 - len(ratings) / float(n_users * n_movies), 3)
print 'The sparsity level of MovieLens1M dataset is ' + str(sparsity * 100) + '%'
The sparsity level of MovieLens1M dataset is 95.5%
A well-known matrix factorization method is singular value decomposition (SVD). At a high level, SVD is an algorithm that decomposes a matrix $A$ into the best lower-rank (i.e. smaller/simpler) approximation of the original matrix. Mathematically, it decomposes $A$ into two unitary matrices and a diagonal matrix:

$$A = U \Sigma V^{T}$$
where $A$ is the input data matrix (users' ratings), $U$ contains the left singular vectors (the user "features" matrix), $\Sigma$ is the diagonal matrix of singular values (essentially weights/strengths of each concept), and $V^{T}$ contains the right singular vectors (the movie "features" matrix). $U$ and $V^{T}$ are column-orthonormal and represent different things: $U$ represents how much each user "likes" each feature, and $V^{T}$ represents how relevant each feature is to each movie.
To get the lower rank approximation, I take these matrices and keep only the top $k$ features, which can be thought of as the underlying tastes and preferences vectors.
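In symbols, keeping only the top $k$ singular values and their vectors gives the best rank-$k$ approximation in the least-squares sense (by the Eckart–Young theorem):

```latex
A \;\approx\; A_k \;=\; U_k \Sigma_k V_k^{T}
```

where $U_k$ is $m \times k$, $\Sigma_k$ is $k \times k$, and $V_k^{T}$ is $k \times n$, for an $m \times n$ ratings matrix $A$.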
Scipy and Numpy both have functions to do singular value decomposition. I'm going to use the Scipy function svds because it lets me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it afterwards).
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(Ratings_demeaned, k = 50)
As I'm going to leverage matrix multiplication to get predictions, I'll convert $\Sigma$ (which svds returns as a 1-D array of singular values) to diagonal matrix form.
sigma = np.diag(sigma)
I now have everything I need to make movie ratings predictions for every user. I can do it all at once by following the math and matrix multiply $U$, $\Sigma$, and $V^{T}$ back to get the rank $k=50$ approximation of $A$.
But first, I need to add the user means back to get the actual star ratings prediction.
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
With the predictions matrix for every user, I can build a function to recommend movies for any user. I return the list of movies the user has already rated, for the sake of comparison.
preds = pd.DataFrame(all_user_predicted_ratings, columns = Ratings.columns)
preds.head()
movie_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 3943 | 3944 | 3945 | 3946 | 3947 | 3948 | 3949 | 3950 | 3951 | 3952 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4.288861 | 0.143055 | -0.195080 | -0.018843 | 0.012232 | -0.176604 | -0.074120 | 0.141358 | -0.059553 | -0.195950 | ... | 0.027807 | 0.001640 | 0.026395 | -0.022024 | -0.085415 | 0.403529 | 0.105579 | 0.031912 | 0.050450 | 0.088910 |
1 | 0.744716 | 0.169659 | 0.335418 | 0.000758 | 0.022475 | 1.353050 | 0.051426 | 0.071258 | 0.161601 | 1.567246 | ... | -0.056502 | -0.013733 | -0.010580 | 0.062576 | -0.016248 | 0.155790 | -0.418737 | -0.101102 | -0.054098 | -0.140188 |
2 | 1.818824 | 0.456136 | 0.090978 | -0.043037 | -0.025694 | -0.158617 | -0.131778 | 0.098977 | 0.030551 | 0.735470 | ... | 0.040481 | -0.005301 | 0.012832 | 0.029349 | 0.020866 | 0.121532 | 0.076205 | 0.012345 | 0.015148 | -0.109956 |
3 | 0.408057 | -0.072960 | 0.039642 | 0.089363 | 0.041950 | 0.237753 | -0.049426 | 0.009467 | 0.045469 | -0.111370 | ... | 0.008571 | -0.005425 | -0.008500 | -0.003417 | -0.083982 | 0.094512 | 0.057557 | -0.026050 | 0.014841 | -0.034224 |
4 | 1.574272 | 0.021239 | -0.051300 | 0.246884 | -0.032406 | 1.552281 | -0.199630 | -0.014920 | -0.060498 | 0.450512 | ... | 0.110151 | 0.046010 | 0.006934 | -0.015940 | -0.050080 | -0.052539 | 0.507189 | 0.033830 | 0.125706 | 0.199244 |
5 rows × 3706 columns
Now I write a function to return the movies with the highest predicted rating that the specified user hasn't already rated. Though I didn't use any explicit movie content features (such as genre or title), I'll merge in that information to get a more complete picture of the recommendations.
def recommend_movies(predictions, userID, movies, original_ratings, num_recommendations):
    # Get and sort the user's predictions
    user_row_number = userID - 1  # User IDs start at 1, not 0
    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False)

    # Get the user's data and merge in the movie information.
    user_data = original_ratings[original_ratings.user_id == (userID)]
    user_full = (user_data.merge(movies, how = 'left', left_on = 'movie_id', right_on = 'movie_id').
                 sort_values(['rating'], ascending=False)
                 )

    print 'User {0} has already rated {1} movies.'.format(userID, user_full.shape[0])
    print 'Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations)

    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies[~movies['movie_id'].isin(user_full['movie_id'])].
                       merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
                             left_on = 'movie_id',
                             right_on = 'movie_id').
                       rename(columns = {user_row_number: 'Predictions'}).
                       sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                       )

    return user_full, recommendations
Let's try to recommend 20 movies for user with ID 1310.
already_rated, predictions = recommend_movies(preds, 1310, movies, ratings, 20)
User 1310 has already rated 24 movies. Recommending highest 20 predicted ratings movies not already rated.
# Top 20 movies that User 1310 has rated
already_rated.head(20)
user_id | movie_id | rating | timestamp | title | genres | |
---|---|---|---|---|---|---|
5 | 1310 | 2248 | 5 | 974781573 | Say Anything... (1989) | Comedy|Drama|Romance |
6 | 1310 | 2620 | 5 | 974781573 | This Is My Father (1998) | Drama|Romance |
7 | 1310 | 3683 | 5 | 974781935 | Blood Simple (1984) | Drama|Film-Noir |
15 | 1310 | 1704 | 5 | 974781573 | Good Will Hunting (1997) | Drama |
1 | 1310 | 1293 | 5 | 974781839 | Gandhi (1982) | Drama |
12 | 1310 | 3101 | 4 | 974781573 | Fatal Attraction (1987) | Thriller |
11 | 1310 | 1343 | 4 | 974781534 | Cape Fear (1991) | Thriller |
20 | 1310 | 2000 | 4 | 974781892 | Lethal Weapon (1987) | Action|Comedy|Crime|Drama |
18 | 1310 | 3526 | 4 | 974781892 | Parenthood (1989) | Comedy|Drama |
17 | 1310 | 3360 | 4 | 974781935 | Hoosiers (1986) | Drama |
13 | 1310 | 3111 | 4 | 974782001 | Places in the Heart (1984) | Drama |
23 | 1310 | 1097 | 4 | 974781534 | E.T. the Extra-Terrestrial (1982) | Children's|Drama|Fantasy|Sci-Fi |
10 | 1310 | 1196 | 4 | 974781701 | Star Wars: Episode V - The Empire Strikes Back... | Action|Adventure|Drama|Sci-Fi|War |
9 | 1310 | 1185 | 4 | 974781839 | My Left Foot (1989) | Drama |
8 | 1310 | 3685 | 4 | 974781935 | Prizzi's Honor (1985) | Comedy|Drama|Romance |
4 | 1310 | 2243 | 4 | 974782001 | Broadcast News (1987) | Comedy|Drama|Romance |
3 | 1310 | 1299 | 4 | 974781701 | Killing Fields, The (1984) | Drama|War |
16 | 1310 | 144 | 3 | 974781573 | Brothers McMullen, The (1995) | Comedy |
19 | 1310 | 1960 | 3 | 974782001 | Last Emperor, The (1987) | Drama|War |
0 | 1310 | 2988 | 3 | 974781935 | Melvin and Howard (1980) | Drama |
# Top 20 movies that User 1310 hopefully will enjoy
predictions
movie_id | title | genres | |
---|---|---|---|
1618 | 1674 | Witness (1985) | Drama|Romance|Thriller |
1880 | 1961 | Rain Man (1988) | Drama |
1187 | 1210 | Star Wars: Episode VI - Return of the Jedi (1983) | Action|Adventure|Romance|Sci-Fi|War |
1216 | 1242 | Glory (1989) | Action|Drama|War |
1202 | 1225 | Amadeus (1984) | Drama |
1273 | 1302 | Field of Dreams (1989) | Drama |
1220 | 1246 | Dead Poets Society (1989) | Drama |
1881 | 1962 | Driving Miss Daisy (1989) | Drama |
1877 | 1957 | Chariots of Fire (1981) | Drama |
1938 | 2020 | Dangerous Liaisons (1988) | Drama|Romance |
1233 | 1259 | Stand by Me (1986) | Adventure|Comedy|Drama |
3011 | 3098 | Natural, The (1984) | Drama |
2112 | 2194 | Untouchables, The (1987) | Action|Crime|Drama |
1876 | 1956 | Ordinary People (1980) | Drama |
1268 | 1296 | Room with a View, A (1986) | Drama|Romance |
2267 | 2352 | Big Chill, The (1983) | Comedy|Drama |
1278 | 1307 | When Harry Met Sally... (1989) | Comedy|Romance |
1165 | 1186 | Sex, Lies, and Videotape (1989) | Drama |
1199 | 1222 | Full Metal Jacket (1987) | Action|Drama|War |
2833 | 2919 | Year of Living Dangerously (1982) | Drama|Romance |
These look like pretty good recommendations. It's good to see that, although I didn't actually use the genre of the movie as a feature, the truncated matrix factorization features "picked up" on the underlying tastes and preferences of the user. The model recommended some comedy, drama, and romance movies, all of which were genres of some of this user's top-rated movies.
Can't forget to evaluate our model, can we?
Instead of doing it manually like last time, I'll use the Surprise library, a Python scikit for building and analyzing recommender systems that provides several ready-to-use prediction algorithms (including SVD), to evaluate the model's RMSE (Root Mean Squared Error) on the MovieLens dataset.
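For reference, RMSE is just the square root of the mean squared difference between predicted and actual ratings. A minimal version (the function name is mine, not Surprise's):

```python
import numpy as np

def rmse(predicted, actual):
    """Root Mean Squared Error between two rating arrays."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Errors of -1, 0, +1 give sqrt(2/3)
print(rmse([4.0, 3.0, 5.0], [5.0, 3.0, 4.0]))  # ≈ 0.8165
```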
# Import libraries from Surprise package
from surprise import Reader, Dataset, SVD, evaluate
# Load Reader library
reader = Reader()
# Load ratings dataset with Dataset library
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)
# Split the dataset for 5-fold evaluation
data.split(n_folds=5)
# Use the SVD algorithm.
svd = SVD()
# Compute the RMSE of the SVD algorithm.
evaluate(svd, data, measures=['RMSE'])
/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/evaluate.py:66: UserWarning: The evaluate() method is deprecated. Please use model_selection.cross_validate() instead.
/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/dataset.py:193: UserWarning: Using data.split() or using load_from_folds() without using a CV iterator is now deprecated.
Evaluating RMSE of algorithm SVD.

------------
Fold 1
RMSE: 0.8731
------------
Fold 2
RMSE: 0.8743
------------
Fold 3
RMSE: 0.8742
------------
Fold 4
RMSE: 0.8724
------------
Fold 5
RMSE: 0.8737
------------
------------
Mean RMSE: 0.8736
------------
------------
CaseInsensitiveDefaultDict(list, {'rmse': [0.873147136065789, 0.8742574033899374, 0.8742215891785942, 0.8724048167991113, 0.8737344531771998]})
I get a mean Root Mean Squared Error of 0.8736, which is pretty good. Let's now train on the full dataset and produce predictions.
trainset = data.build_full_trainset()
svd.train(trainset)
/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/prediction_algorithms/algo_base.py:51: UserWarning: train() is deprecated. Use fit() instead warnings.warn('train() is deprecated. Use fit() instead', UserWarning)
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x10df4b1d0>
I'll pick the user with ID 1310 again and check the ratings he has given.
ratings[ratings['user_id'] == 1310]
user_id | movie_id | rating | timestamp | |
---|---|---|---|---|
215928 | 1310 | 2988 | 3 | 974781935 |
215929 | 1310 | 1293 | 5 | 974781839 |
215930 | 1310 | 1295 | 2 | 974782001 |
215931 | 1310 | 1299 | 4 | 974781701 |
215932 | 1310 | 2243 | 4 | 974782001 |
215933 | 1310 | 2248 | 5 | 974781573 |
215934 | 1310 | 2620 | 5 | 974781573 |
215935 | 1310 | 3683 | 5 | 974781935 |
215936 | 1310 | 3685 | 4 | 974781935 |
215937 | 1310 | 1185 | 4 | 974781839 |
215938 | 1310 | 1196 | 4 | 974781701 |
215939 | 1310 | 1343 | 4 | 974781534 |
215940 | 1310 | 3101 | 4 | 974781573 |
215941 | 1310 | 3111 | 4 | 974782001 |
215942 | 1310 | 2313 | 2 | 974781839 |
215943 | 1310 | 1704 | 5 | 974781573 |
215944 | 1310 | 144 | 3 | 974781573 |
215945 | 1310 | 3360 | 4 | 974781935 |
215946 | 1310 | 3526 | 4 | 974781892 |
215947 | 1310 | 1960 | 3 | 974782001 |
215948 | 1310 | 2000 | 4 | 974781892 |
215949 | 1310 | 1231 | 2 | 974781963 |
215950 | 1310 | 1090 | 2 | 974781839 |
215951 | 1310 | 1097 | 4 | 974781534 |
Now let's use SVD to predict the rating that the user with ID 1310 would give to a random movie (say, with movie ID 1994).
svd.predict(1310, 1994)
Prediction(uid=1310, iid=1994, r_ui=None, est=3.348657218865324, details={u'was_impossible': False})
For the movie with ID 1994, I get an estimated rating of 3.349. The recommender works purely on the basis of the assigned user and movie IDs: it predicts ratings based on how other users have rated the movie, without using any content features.
In this notebook, I attempted to build a model-based collaborative filtering movie recommendation system based on latent features from a low-rank matrix factorization method called SVD. Because it captures the underlying features driving the raw data, it can scale significantly better to massive datasets and make better recommendations based on users' tastes.
However, we still likely lose some meaningful signal by using a low-rank approximation. Specifically, there's an interpretability problem, since each singular vector specifies a linear combination of all input columns or rows. There's also a lack of sparsity, because the singular vectors are quite dense. Finally, the SVD approach is limited to linear projections.