#!/usr/bin/env python
# coding: utf-8

# # SVD for Movie Recommendations

# In this notebook, I'll detail a basic version of model-based collaborative filtering for recommendations by employing it on the MovieLens 1M dataset.
#
# [In my previous attempt](https://github.com/khanhnamle1994/movielens/blob/master/Content_Based_and_Collaborative_Filtering_Models.ipynb), I used user-based and item-based collaborative filtering to make movie recommendations from users' ratings data. I could only try them on a very small data sample (20,000 ratings), and ended up with a pretty high Root Mean Squared Error (i.e., bad recommendations). Memory-based collaborative filtering approaches that compute distance relationships between items or users have two major issues:
#
# 1. They don't scale particularly well to massive datasets, especially for real-time recommendations based on user behavior similarities, which require heavy computation.
# 2. Ratings matrices may overfit noisy representations of user tastes and preferences. When we use distance-based "neighborhood" approaches on raw data, we match against sparse low-level details that we assume represent the user's preference vector, rather than the preference vector itself.
#
# Thus I'd need to apply a **Dimensionality Reduction** technique to derive the tastes and preferences from the raw data, otherwise known as low-rank matrix factorization. Why reduce dimensions?
#
# * I can discover hidden correlations/features in the raw data.
# * I can remove redundant and noisy features that are not useful.
# * I can interpret and visualize the data more easily.
# * I can store and process the data more cheaply.
#
# With that goal in mind, I'll introduce Singular Value Decomposition (SVD), a powerful dimensionality reduction technique that is used heavily in modern model-based CF recommender systems.
#
# ![dim-red](images/dimensionality-reduction.jpg)

# ## Loading the Dataset

# Let's load the 3 data files just like last time.

# In[1]:

# Import libraries
import numpy as np
import pandas as pd

# Reading ratings file
ratings = pd.read_csv('ratings.csv', sep='\t', encoding='latin-1',
                      usecols=['user_id', 'movie_id', 'rating', 'timestamp'])

# Reading users file
users = pd.read_csv('users.csv', sep='\t', encoding='latin-1',
                    usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

# Reading movies file
movies = pd.read_csv('movies.csv', sep='\t', encoding='latin-1',
                     usecols=['movie_id', 'title', 'genres'])

# Let's take a look at the movies and ratings dataframes.

# In[2]:

movies.head()

# In[3]:

ratings.head()

# Also, let's count the number of unique users and movies.

# In[4]:

n_users = ratings.user_id.unique().shape[0]
n_movies = ratings.movie_id.unique().shape[0]
print('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies))

# Now I want the format of my ratings matrix to be one row per user and one column per movie. To do so, I'll pivot *ratings* to get that and call the new variable *Ratings* (with a capital *R*).

# In[12]:

Ratings = ratings.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)
Ratings.head()

# Last but not least, I need to normalize the data by each user's mean rating (i.e., de-mean it) and convert it from a dataframe to a numpy array.

# In[13]:

R = Ratings.to_numpy()  # .as_matrix() was removed in recent pandas versions
user_ratings_mean = np.mean(R, axis=1)
Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)

# With my ratings matrix properly formatted and normalized, I'm ready to do some dimensionality reduction. But first, let's go over the math.
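# As a quick optional sanity check (my own aside, not in the original notebook): the pivot should yield one row per user and one column per movie, and after de-meaning, each user's row should average to roughly zero.

# In[ ]:

# Optional sanity check: shape and per-user mean of the de-meaned matrix.
print(Ratings_demeaned.shape)                          # expect (n_users, n_movies)
print(np.allclose(Ratings_demeaned.mean(axis=1), 0))   # expect True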
# ## Model-Based Collaborative Filtering

# *Model-based Collaborative Filtering* is based on *matrix factorization (MF)*, which has received wide exposure mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used for recommender systems, where it handles scalability and sparsity better than memory-based CF:
#
# * The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (that is, learn features that describe the characteristics of ratings), and then predict the unknown ratings through the dot product of the latent features of users and items.
# * When you have a very sparse matrix with a lot of dimensions, matrix factorization lets you restructure the user-item matrix into a low-rank structure: you represent the matrix as the product of two low-rank matrices whose rows contain the latent vectors.
# * You fit this product to approximate your original matrix as closely as possible, which fills in the entries missing from the original matrix.
#
# For example, let's check the sparsity of the ratings dataset:

# In[7]:

sparsity = round(1.0 - len(ratings) / float(n_users * n_movies), 3)
print('The sparsity level of the MovieLens 1M dataset is ' + str(sparsity * 100) + '%')

# ## Singular Value Decomposition (SVD)

# A well-known matrix factorization method is *Singular Value Decomposition (SVD)*. At a high level, SVD is an algorithm that decomposes a matrix $A$ into the best lower-rank (i.e., smaller/simpler) approximation of the original matrix $A$. Mathematically, it decomposes $A$ into two unitary matrices and a diagonal matrix:
#
# ![svd](images/svd.png)
#
# where $A$ is the input data matrix (users' ratings), $U$ holds the left singular vectors (the user "features" matrix), $\Sigma$ is the diagonal matrix of singular values (essentially the weights/strengths of each concept), and $V^{T}$ holds the right singular vectors (the movie "features" matrix). $U$ and $V^{T}$ are column-orthonormal and represent different things: $U$ represents how much users "like" each feature, and $V^{T}$ represents how relevant each feature is to each movie.
#
# To get the lower-rank approximation, I take these matrices and keep only the top $k$ singular values and their corresponding singular vectors, which can be thought of as the underlying taste and preference vectors.

# ### Setting Up SVD

# Scipy and Numpy both have functions to do singular value decomposition. I'm going to use the Scipy function *svds* because it lets me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it afterwards).

# In[14]:

from scipy.sparse.linalg import svds
U, sigma, Vt = svds(Ratings_demeaned, k=50)

# As I'm going to leverage matrix multiplication to get predictions, I'll convert $\Sigma$ (currently a 1-D array of singular values) to diagonal matrix form.

# In[15]:

sigma = np.diag(sigma)

# ### Making Predictions from the Decomposed Matrices

# I now have everything I need to make movie rating predictions for every user. I can do it all at once by following the math and multiplying $U$, $\Sigma$, and $V^{T}$ back together to get the rank $k=50$ approximation of $A$.
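# To make that reconstruction step concrete, here's a minimal standalone sketch (my own aside, on a small random matrix rather than the MovieLens data): take the full SVD, keep the top $k$ singular values, and multiply back. The rank-$k$ product $U_k \Sigma_k V_k^{T}$ is the closest rank-$k$ matrix to the original in the least-squares (Frobenius) sense.

# In[ ]:

# Toy illustration (assumed example, not part of the original pipeline):
# full SVD of a small random matrix, truncated to rank k, then reconstructed.
A_toy = np.random.rand(10, 8)
U_t, s_t, Vt_t = np.linalg.svd(A_toy, full_matrices=False)
k = 3
A_k = U_t[:, :k] @ np.diag(s_t[:k]) @ Vt_t[:k, :]
print('Rank-{} reconstruction error (Frobenius norm): {:.4f}'.format(
    k, np.linalg.norm(A_toy - A_k)))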
# Back on the MovieLens matrix: I also need to add the user means back to get the actual star rating predictions.

# In[16]:

all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)

# With the predictions matrix for every user, I can build a function to recommend movies for any user. I also return the list of movies the user has already rated, for the sake of comparison.

# In[17]:

preds = pd.DataFrame(all_user_predicted_ratings, columns=Ratings.columns)
preds.head()

# Now I write a function to return the movies with the highest predicted rating that the specified user hasn't already rated. Though I didn't use any explicit movie content features (such as genre or title), I'll merge in that information to get a more complete picture of the recommendations.

# In[18]:

def recommend_movies(predictions, userID, movies, original_ratings, num_recommendations):
    # Get and sort the user's predictions (user IDs start at 1, not 0)
    user_row_number = userID - 1
    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False)

    # Get the user's data and merge in the movie information.
    user_data = original_ratings[original_ratings.user_id == userID]
    user_full = (user_data.merge(movies, how='left', left_on='movie_id', right_on='movie_id').
                 sort_values(['rating'], ascending=False))

    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending the {0} movies with the highest predicted ratings not already rated.'.format(num_recommendations))

    # Recommend the highest-predicted-rating movies that the user hasn't seen yet.
    recommendations = (movies[~movies['movie_id'].isin(user_full['movie_id'])].
                       merge(pd.DataFrame(sorted_user_predictions).reset_index(),
                             how='left', left_on='movie_id', right_on='movie_id').
                       rename(columns={user_row_number: 'Predictions'}).
                       sort_values('Predictions', ascending=False).
                       iloc[:num_recommendations, :-1])

    return user_full, recommendations

# Let's try to recommend 20 movies for the user with ID 1310.

# In[19]:

already_rated, predictions = recommend_movies(preds, 1310, movies, ratings, 20)

# In[20]:

# Top 20 movies that User 1310 has rated
already_rated.head(20)

# In[21]:

# Top 20 movies that User 1310 hopefully will enjoy
predictions

# These look like pretty good recommendations. It's good to see that, although I didn't actually use the genre of the movie as a feature, the truncated matrix factorization features "picked up" on the underlying tastes and preferences of the user. I've recommended some comedy, drama, and romance movies - all of which were genres of some of this user's top-rated movies.

# ### Model Evaluation

# Can't forget to evaluate our model, can we?
#
# Instead of doing this manually like last time, I'll use the *[Surprise](https://pypi.python.org/pypi/scikit-surprise)* library, a Python scikit for building and analyzing recommender systems, which provides various ready-to-use prediction algorithms (including SVD). I'll use it to evaluate the model's RMSE (Root Mean Squared Error) on the MovieLens dataset.

# In[22]:

# Import libraries from the Surprise package.
# Note: the old evaluate()/data.split() API was removed in newer Surprise
# releases; cross_validate from surprise.model_selection is the current
# equivalent of 5-fold evaluation.
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

# Load Reader library
reader = Reader()

# Load ratings dataset with Dataset library
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

# In[23]:

# Use the SVD algorithm.
svd = SVD()

# Compute the RMSE of the SVD algorithm with 5-fold cross-validation.
cross_validate(svd, data, measures=['RMSE'], cv=5, verbose=True)

# I get a mean *Root Mean Squared Error* of 0.8736, which is pretty good.
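# As a rough point of comparison (my own aside, not part of the original workflow), the in-sample RMSE of the rank-50 reconstruction from earlier can be computed by hand over the known ratings. Since it's measured on the very data the factorization was fit to, it should look optimistic next to the cross-validated number above.

# In[ ]:

# In-sample RMSE of the rank-50 SVD reconstruction over the known (non-zero)
# entries of R. The factorization has already "seen" these ratings, so this
# is an optimistic figure, not a substitute for cross-validation.
mask = R > 0
rmse_insample = np.sqrt(np.mean((all_user_predicted_ratings[mask] - R[mask]) ** 2))
print('In-sample RMSE of the rank-50 approximation: {:.4f}'.format(rmse_insample))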
# Let's now train on the full dataset and make some predictions.

# In[24]:

trainset = data.build_full_trainset()
svd.fit(trainset)  # train() was renamed fit() in newer Surprise releases

# I'll pick the user with ID 1310 again and check the ratings he has given.

# In[25]:

ratings[ratings['user_id'] == 1310]

# Now let's use SVD to predict the rating that the user with ID 1310 would give to a random movie (let's say the one with Movie ID 1994).

# In[26]:

svd.predict(1310, 1994)

# For the movie with ID 1994, I get an estimated prediction of 3.349. The recommender system works purely on the basis of an assigned movie ID and tries to predict the rating based on how the other users have rated the movie.

# ## Conclusion

# In this notebook, I attempted to build a model-based collaborative filtering movie recommendation system based on latent features from a low-rank matrix factorization method called SVD. Because it captures the underlying features driving the raw data, it can scale significantly better to massive datasets and make better recommendations based on users' tastes.
#
# However, we still likely lose some meaningful signal by using a low-rank approximation. Specifically, there's an interpretability problem, since each singular vector specifies a linear combination of all input columns or rows. There's also a lack of sparsity, since the singular vectors are quite dense. Finally, the SVD approach is limited to linear projections.
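# ### Appendix: Top-N Recommendations from the Trained Surprise Model

# As a closing aside (my own sketch, not part of the original notebook), the trained Surprise model can also drive a top-N recommendation list directly: score every movie the user hasn't rated with *svd.predict* and keep the highest estimates. The choice of user 1310 mirrors the examples above; the cutoff of 10 is arbitrary.

# In[ ]:

# Score all unseen movies for user 1310 and keep the 10 highest estimates.
rated_ids = set(ratings.loc[ratings['user_id'] == 1310, 'movie_id'])
candidates = [mid for mid in movies['movie_id'] if mid not in rated_ids]
scores = [(mid, svd.predict(1310, mid).est) for mid in candidates]
top10 = sorted(scores, key=lambda pair: pair[1], reverse=True)[:10]
print(pd.DataFrame(top10, columns=['movie_id', 'estimate']).merge(movies, on='movie_id'))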