SVD for Movie Recommendations

In this notebook, I'll walk through a basic version of model-based collaborative filtering for recommendations, applying it to the MovieLens 1M dataset.

In my previous attempt, I used user-based and item-based collaborative filtering to make movie recommendations from users' ratings data. I could only try them on a very small data sample (20,000 ratings), and ended up with a pretty high Root Mean Squared Error (i.e., bad recommendations). Memory-based collaborative filtering approaches that compute distance relationships between items or users have two major issues:

  1. They don't scale particularly well to massive datasets, especially for real-time recommendations based on user behavior similarities, which take a lot of computation.
  2. Raw ratings matrices are noisy representations of user tastes and preferences. When we use distance-based "neighborhood" approaches on raw data, we match on sparse low-level details that we assume represent the user's preference vector, instead of the preference vector itself.

Thus I'd need to apply a dimensionality reduction technique to derive the tastes and preferences from the raw data, otherwise known as doing low-rank matrix factorization. Why reduce dimensions?

  • I can discover hidden correlations / features in the raw data.
  • I can remove redundant and noisy features that are not useful.
  • I can interpret and visualize the data more easily.
  • I can store and process the data more efficiently.

With that goal in mind, I'll introduce Singular Value Decomposition (SVD), a powerful dimensionality reduction technique that is used heavily in modern model-based CF recommender systems.

[Figure: illustration of dimensionality reduction]

Loading the Dataset

Let's load the 3 data files just like last time.

In [1]:
# Import libraries
import numpy as np
import pandas as pd

# Reading ratings file
ratings = pd.read_csv('ratings.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating', 'timestamp'])

# Reading users file
users = pd.read_csv('users.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

# Reading movies file
movies = pd.read_csv('movies.csv', sep='\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])

Let's take a look at the movies and ratings dataframes.

In [2]:
movies.head()
Out[2]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
In [3]:
ratings.head()
Out[3]:
user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291

Also let's count the number of unique users and movies.

In [4]:
n_users = ratings.user_id.unique().shape[0]
n_movies = ratings.movie_id.unique().shape[0]
print 'Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies)
Number of users = 6040 | Number of movies = 3706

Now I want the format of my ratings matrix to be one row per user and one column per movie. To do so, I'll pivot ratings to get that shape and call the new variable Ratings (with a capital R).

In [12]:
Ratings = ratings.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
Ratings.head()
Out[12]:
movie_id 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
user_id
1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 3706 columns

Last but not least, I need to normalize the data (by subtracting each user's mean rating) and convert it from a dataframe to a numpy array.

In [13]:
R = Ratings.as_matrix()
user_ratings_mean = np.mean(R, axis = 1)
Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)
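As a sanity check on this demeaning step, here's a standalone sketch on a tiny made-up matrix. Note that, like the notebook, it treats unrated (0) entries as actual ratings when computing the mean, which is a simplification:

```python
import numpy as np

# Toy 3-user x 4-movie ratings matrix (hypothetical data; 0 = unrated)
R_toy = np.array([[5.0, 3.0, 0.0, 1.0],
                  [4.0, 0.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0, 5.0]])

# Subtract each user's row mean, exactly as done above
row_means = R_toy.mean(axis=1)
R_toy_demeaned = R_toy - row_means.reshape(-1, 1)

# Every row of the demeaned matrix now averages to zero
print(np.allclose(R_toy_demeaned.mean(axis=1), 0))  # → True
```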

With my ratings matrix properly formatted and normalized, I'm ready to do some dimensionality reduction. But first, let's go over the math.

Model-Based Collaborative Filtering

Model-based Collaborative Filtering is based on matrix factorization (MF), which has received wide exposure mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used in recommender systems, where it handles scalability and sparsity better than memory-based CF:

  • The goal of MF is to learn the latent preferences of users and the latent attributes of items from the known ratings (i.e., learn features that describe the characteristics of the ratings), and then predict the unknown ratings through the dot product of the latent features of users and items.
  • When you have a very sparse matrix with many dimensions, matrix factorization lets you restructure the user-item matrix into a low-rank structure: you represent the matrix as the product of two low-rank matrices, whose rows contain the latent vectors.
  • You fit this factorization so that it approximates the original matrix as closely as possible on the known entries; multiplying the low-rank matrices back together then fills in the entries missing from the original matrix.
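To make the fitting idea concrete, here's a minimal standalone sketch on a made-up 4×4 ratings matrix: gradient descent on the observed entries only. This is the general MF recipe, not the SVD approach used later in this notebook:

```python
import numpy as np

np.random.seed(0)

# Toy ratings matrix; 0 marks a missing rating (hypothetical data)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0                        # fit only the observed entries

k = 2                               # number of latent features
P = 0.1 * np.random.randn(R.shape[0], k)   # user factor matrix (m x k)
Q = 0.1 * np.random.randn(k, R.shape[1])   # item factor matrix (k x n)

lr, reg = 0.01, 0.01
for _ in range(5000):
    E = mask * (R - P.dot(Q))       # error on observed entries only
    P += lr * (E.dot(Q.T) - reg * P)
    Q += lr * (P.T.dot(E) - reg * Q)

R_hat = P.dot(Q)                    # dense approximation fills the gaps
rmse = np.sqrt((E[mask] ** 2).mean())
print(round(rmse, 3))               # small error on the known ratings
```

The previously-missing entries of `R_hat` are the model's predictions, computed as dot products of the learned latent user and item vectors.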

For example, let's check the sparsity of the ratings dataset:

In [7]:
sparsity = round(1.0 - len(ratings) / float(n_users * n_movies), 3)
print 'The sparsity level of MovieLens1M dataset is ' +  str(sparsity * 100) + '%'
The sparsity level of MovieLens1M dataset is 95.5%

Singular Value Decomposition (SVD)

A well-known matrix factorization method is Singular Value Decomposition (SVD). At a high level, SVD is an algorithm that decomposes a matrix $A$ into the best lower rank (i.e. smaller/simpler) approximation of the original matrix $A$. Mathematically, it decomposes $A$ into two unitary matrices and a diagonal matrix:

$$A = U \Sigma V^{T}$$

where $A$ is the input data matrix (users' ratings), $U$ is the left singular vectors (user "features" matrix), $\Sigma$ is the diagonal matrix of singular values (essentially weights/strengths of each concept), and $V^{T}$ is the right singular vectors (movie "features" matrix). $U$ and $V^{T}$ are column orthonormal, and represent different things: $U$ represents how much users "like" each feature, and $V^{T}$ represents how relevant each feature is to each movie.

To get the lower rank approximation, I take these matrices and keep only the top $k$ features, which can be thought of as the underlying tastes and preferences vectors.
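To make the truncation concrete, here's a standalone numpy sketch (on a small made-up matrix, not the actual ratings data) of keeping only the top $k$ singular values and vectors:

```python
import numpy as np

# A small demeaned-ratings-style matrix (hypothetical data)
A = np.array([[ 2.0, -1.0,  0.5, -1.5],
              [ 1.5,  0.5, -1.0, -1.0],
              [-2.0,  1.0,  0.0,  1.0]])

# Full SVD: A = U @ diag(s) @ Vt, with s sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top k singular values/vectors
k = 2
A_k = U[:, :k].dot(np.diag(s[:k])).dot(Vt[:k, :])

# A_k has the same shape as A but rank at most k; by the
# Eckart-Young theorem it is the best rank-k approximation of A
print(A_k.shape)                              # → (3, 4)
print(np.linalg.matrix_rank(A_k) <= k)        # → True
```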

Setting Up SVD

Scipy and Numpy both have functions to do the singular value decomposition. I'm going to use the Scipy function svds because it lets me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it afterwards).

In [14]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(Ratings_demeaned, k = 50)

As I'm going to leverage matrix multiplication to get predictions, I'll convert $\Sigma$ (currently returned as a 1-D array of singular values) to diagonal matrix form.

In [15]:
sigma = np.diag(sigma)

Making Predictions from the Decomposed Matrices

I now have everything I need to make movie ratings predictions for every user. I can do it all at once by following the math and matrix multiply $U$, $\Sigma$, and $V^{T}$ back to get the rank $k=50$ approximation of $A$.

But first, I need to add the user means back to get the actual star ratings prediction.

In [16]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)

With the predictions matrix for every user, I can build a function to recommend movies for any user. I return the list of movies the user has already rated, for the sake of comparison.

In [17]:
preds = pd.DataFrame(all_user_predicted_ratings, columns = Ratings.columns)
preds.head()
Out[17]:
movie_id 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
0 4.288861 0.143055 -0.195080 -0.018843 0.012232 -0.176604 -0.074120 0.141358 -0.059553 -0.195950 ... 0.027807 0.001640 0.026395 -0.022024 -0.085415 0.403529 0.105579 0.031912 0.050450 0.088910
1 0.744716 0.169659 0.335418 0.000758 0.022475 1.353050 0.051426 0.071258 0.161601 1.567246 ... -0.056502 -0.013733 -0.010580 0.062576 -0.016248 0.155790 -0.418737 -0.101102 -0.054098 -0.140188
2 1.818824 0.456136 0.090978 -0.043037 -0.025694 -0.158617 -0.131778 0.098977 0.030551 0.735470 ... 0.040481 -0.005301 0.012832 0.029349 0.020866 0.121532 0.076205 0.012345 0.015148 -0.109956
3 0.408057 -0.072960 0.039642 0.089363 0.041950 0.237753 -0.049426 0.009467 0.045469 -0.111370 ... 0.008571 -0.005425 -0.008500 -0.003417 -0.083982 0.094512 0.057557 -0.026050 0.014841 -0.034224
4 1.574272 0.021239 -0.051300 0.246884 -0.032406 1.552281 -0.199630 -0.014920 -0.060498 0.450512 ... 0.110151 0.046010 0.006934 -0.015940 -0.050080 -0.052539 0.507189 0.033830 0.125706 0.199244

5 rows × 3706 columns

Now I write a function to return the movies with the highest predicted rating that the specified user hasn't already rated. Though I didn't use any explicit movie content features (such as genre or title), I'll merge in that information to get a more complete picture of the recommendations.

In [18]:
def recommend_movies(predictions, userID, movies, original_ratings, num_recommendations):
    
    # Get and sort the user's predictions
    user_row_number = userID - 1 # User ID starts at 1, not 0
    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False)
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings[original_ratings.user_id == (userID)]
    user_full = (user_data.merge(movies, how = 'left', left_on = 'movie_id', right_on = 'movie_id').
                     sort_values(['rating'], ascending=False)
                 )

    print 'User {0} has already rated {1} movies.'.format(userID, user_full.shape[0])
    print 'Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations)
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies[~movies['movie_id'].isin(user_full['movie_id'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movie_id',
               right_on = 'movie_id').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

Let's try to recommend 20 movies for user with ID 1310.

In [19]:
already_rated, predictions = recommend_movies(preds, 1310, movies, ratings, 20)
User 1310 has already rated 24 movies.
Recommending highest 20 predicted ratings movies not already rated.
In [20]:
# Top 20 movies that User 1310 has rated 
already_rated.head(20)
Out[20]:
user_id movie_id rating timestamp title genres
5 1310 2248 5 974781573 Say Anything... (1989) Comedy|Drama|Romance
6 1310 2620 5 974781573 This Is My Father (1998) Drama|Romance
7 1310 3683 5 974781935 Blood Simple (1984) Drama|Film-Noir
15 1310 1704 5 974781573 Good Will Hunting (1997) Drama
1 1310 1293 5 974781839 Gandhi (1982) Drama
12 1310 3101 4 974781573 Fatal Attraction (1987) Thriller
11 1310 1343 4 974781534 Cape Fear (1991) Thriller
20 1310 2000 4 974781892 Lethal Weapon (1987) Action|Comedy|Crime|Drama
18 1310 3526 4 974781892 Parenthood (1989) Comedy|Drama
17 1310 3360 4 974781935 Hoosiers (1986) Drama
13 1310 3111 4 974782001 Places in the Heart (1984) Drama
23 1310 1097 4 974781534 E.T. the Extra-Terrestrial (1982) Children's|Drama|Fantasy|Sci-Fi
10 1310 1196 4 974781701 Star Wars: Episode V - The Empire Strikes Back... Action|Adventure|Drama|Sci-Fi|War
9 1310 1185 4 974781839 My Left Foot (1989) Drama
8 1310 3685 4 974781935 Prizzi's Honor (1985) Comedy|Drama|Romance
4 1310 2243 4 974782001 Broadcast News (1987) Comedy|Drama|Romance
3 1310 1299 4 974781701 Killing Fields, The (1984) Drama|War
16 1310 144 3 974781573 Brothers McMullen, The (1995) Comedy
19 1310 1960 3 974782001 Last Emperor, The (1987) Drama|War
0 1310 2988 3 974781935 Melvin and Howard (1980) Drama
In [21]:
# Top 20 movies that User 1310 hopefully will enjoy
predictions
Out[21]:
movie_id title genres
1618 1674 Witness (1985) Drama|Romance|Thriller
1880 1961 Rain Man (1988) Drama
1187 1210 Star Wars: Episode VI - Return of the Jedi (1983) Action|Adventure|Romance|Sci-Fi|War
1216 1242 Glory (1989) Action|Drama|War
1202 1225 Amadeus (1984) Drama
1273 1302 Field of Dreams (1989) Drama
1220 1246 Dead Poets Society (1989) Drama
1881 1962 Driving Miss Daisy (1989) Drama
1877 1957 Chariots of Fire (1981) Drama
1938 2020 Dangerous Liaisons (1988) Drama|Romance
1233 1259 Stand by Me (1986) Adventure|Comedy|Drama
3011 3098 Natural, The (1984) Drama
2112 2194 Untouchables, The (1987) Action|Crime|Drama
1876 1956 Ordinary People (1980) Drama
1268 1296 Room with a View, A (1986) Drama|Romance
2267 2352 Big Chill, The (1983) Comedy|Drama
1278 1307 When Harry Met Sally... (1989) Comedy|Romance
1165 1186 Sex, Lies, and Videotape (1989) Drama
1199 1222 Full Metal Jacket (1987) Action|Drama|War
2833 2919 Year of Living Dangerously (1982) Drama|Romance

These look like pretty good recommendations. It's good to see that, although I didn't actually use the genre of the movie as a feature, the truncated matrix factorization features "picked up" on the underlying tastes and preferences of the user. I've recommended some comedy, drama, and romance movies - all of which were genres of some of this user's top rated movies.

Model Evaluation

Can't forget to evaluate our model, can we?

Instead of doing it manually like last time, I will use the Surprise library, which provides various ready-to-use prediction algorithms (including SVD), to evaluate the model's RMSE (Root Mean Squared Error) on the MovieLens dataset. Surprise is a Python scikit for building and analyzing recommender systems.

In [22]:
# Import libraries from Surprise package
from surprise import Reader, Dataset, SVD, evaluate

# Load Reader library
reader = Reader()

# Load ratings dataset with Dataset library
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

# Split the dataset for 5-fold evaluation
data.split(n_folds=5)
In [23]:
# Use the SVD algorithm.
svd = SVD()

# Compute the RMSE of the SVD algorithm.
evaluate(svd, data, measures=['RMSE'])
/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/evaluate.py:66: UserWarning: The evaluate() method is deprecated. Please use model_selection.cross_validate() instead.
  'model_selection.cross_validate() instead.', UserWarning)
/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/dataset.py:193: UserWarning: Using data.split() or using load_from_folds() without using a CV iterator is now deprecated. 
  UserWarning)
Evaluating RMSE of algorithm SVD.

------------
Fold 1
RMSE: 0.8731
------------
Fold 2
RMSE: 0.8743
------------
Fold 3
RMSE: 0.8742
------------
Fold 4
RMSE: 0.8724
------------
Fold 5
RMSE: 0.8737
------------
------------
Mean RMSE: 0.8736
------------
------------
Out[23]:
CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.873147136065789,
                             0.8742574033899374,
                             0.8742215891785942,
                             0.8724048167991113,
                             0.8737344531771998]})

I get a mean Root Mean Squared Error of 0.8736, which is pretty good. Let's now train on the full dataset and arrive at predictions.
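For reference, RMSE is just the square root of the mean squared difference between predicted and actual ratings; a minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical true vs. predicted ratings for a handful of test pairs
y_true = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
y_pred = np.array([4.2, 3.4, 3.6, 1.9, 2.3])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(round(rmse, 4))  # → 0.6099
```

Lower is better; a mean RMSE of 0.8736 means the model's predictions are off by a bit less than one star on average.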

In [24]:
trainset = data.build_full_trainset()
svd.train(trainset)
/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/prediction_algorithms/algo_base.py:51: UserWarning: train() is deprecated. Use fit() instead
  warnings.warn('train() is deprecated. Use fit() instead', UserWarning)
Out[24]:
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x10df4b1d0>

I'll pick again user with ID 1310 and check the ratings he has given.

In [25]:
ratings[ratings['user_id'] == 1310]
Out[25]:
user_id movie_id rating timestamp
215928 1310 2988 3 974781935
215929 1310 1293 5 974781839
215930 1310 1295 2 974782001
215931 1310 1299 4 974781701
215932 1310 2243 4 974782001
215933 1310 2248 5 974781573
215934 1310 2620 5 974781573
215935 1310 3683 5 974781935
215936 1310 3685 4 974781935
215937 1310 1185 4 974781839
215938 1310 1196 4 974781701
215939 1310 1343 4 974781534
215940 1310 3101 4 974781573
215941 1310 3111 4 974782001
215942 1310 2313 2 974781839
215943 1310 1704 5 974781573
215944 1310 144 3 974781573
215945 1310 3360 4 974781935
215946 1310 3526 4 974781892
215947 1310 1960 3 974782001
215948 1310 2000 4 974781892
215949 1310 1231 2 974781963
215950 1310 1090 2 974781839
215951 1310 1097 4 974781534

Now let's use SVD to predict the rating that User with ID 1310 will give to a random movie (let's say with Movie ID 1994).

In [26]:
svd.predict(1310, 1994)
Out[26]:
Prediction(uid=1310, iid=1994, r_ui=None, est=3.348657218865324, details={u'was_impossible': False})

For the movie with ID 1994, I get an estimated rating of 3.349. This recommender system works purely on the basis of an assigned movie ID and tries to predict the rating based on how other users have rated the movie.

Conclusion

In this notebook, I attempted to build a model-based Collaborative Filtering movie recommendation system based on latent features from a low-rank matrix factorization method called SVD. Because it captures the underlying features driving the raw data, it scales significantly better to massive datasets and makes better recommendations based on users' tastes.

However, we still likely lose some meaningful signals by using a low-rank approximation. Specifically, there's an interpretability problem: each singular vector specifies a linear combination of all input columns or rows. There's also a lack of sparsity, since the singular vectors are quite dense. Finally, the SVD approach is limited to linear projections.