Note: if you are viewing this notebook directly on GitHub, some mathematical symbols might display incorrectly or not at all. The same notebook can be rendered with nbviewer by following this link.
This project recommends movies to users based on their demographic information (age, gender, occupation and geographical region) and on movie information (year of production, genres and user tags). It uses the MovieLens 1M dataset, which contains 1,000,209 ratings of 3,900 movies from 6,040 users, along with the users’ demographic information and some basic movie information. The movie information is enhanced with the MovieLens 20M dataset, which contains more detailed movie information in the form of tag genomes, as described in Vig, J., Sen, S., & Riedl, J. (2012). The tag genome: Encoding community knowledge to support novel interaction. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3), 13.
The formula used is an implementation, with some slight modifications, of what’s described in Park, S. T., & Chu, W. (2009, October). Pairwise preference regression for cold-start recommendation. In Proceedings of the third ACM conference on Recommender systems (pp. 21-28). ACM. The general idea is to fit a regression on the differences between the ratings that the same user gave to two movies, using the outer products of the user and movie attribute vectors as features.
In comparison to recommendations based on past user ratings, these kinds of models are able to provide quality recommendations to new users (for whom there is demographic information available), and are able to recommend both old and new movies (as long as there is information about them).
The idea implemented here differs from what the paper above describes in that:
This model ranks movies for a user by a linear combination of the entries of the outer product of movie and user features. The weights are chosen to minimize the squared differences between the observed and the predicted rating difference for each pair of movies rated by the same user, plus a regularization term. This is given by the following formula:
$$ \min_w \sum_{u \in users} \bigg( \frac{1}{\left\vert M_u \right\vert} \sum_{i \in M_u} \sum_{j \in M_u} \Big( (R_{ui} - R_{uj}) - \big( w^\mathsf{T}(z_i \otimes x_u) - w^\mathsf{T}(z_j \otimes x_u) \big) \Big)^2 \bigg) + \lambda \lVert w \rVert^2_2 $$

Where $R_{ui}$ is the rating given by user $u$ to movie $i$, $x_u$ are the user features, $z_i$ are the movie features, $M_u$ is the set of movies that have been rated by user $u$, $\lambda$ is the regularization parameter, and $w$ are the coefficients assigned to each combination of user-movie feature. Note that the optimization problem is convex with respect to $w$.
Recommendations are then produced for each user by calculating, for each movie, $w^\mathsf{T}(z_j \otimes x_u)$ and taking the movies with highest such score for that user.
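The scoring step can be sketched with NumPy as follows. This is only an illustration on random toy data, assuming Python 3 and a recent NumPy; the sizes and the names `Z`, `x_u`, `W` and `top_k` are assumptions made for this example and do not come from the notebook's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration
n_movies, n_movie_feats, n_user_feats = 6, 4, 3

Z = rng.normal(size=(n_movies, n_movie_feats))      # one row z_j per movie
x_u = rng.normal(size=n_user_feats)                 # features of a single user
w = rng.normal(size=n_movie_feats * n_user_feats)   # one weight per feature pair

# Score of each movie for this user: w^T (z_j kron x_u)
scores = np.array([w @ np.kron(z_j, x_u) for z_j in Z])

# Same scores written as a bilinear form z_j^T W x_u, with w reshaped row by row
W = w.reshape(n_movie_feats, n_user_feats)
scores_bilinear = Z @ W @ x_u

# Indices of the movies with the highest scores
top_k = 3
recommended = np.argsort(-scores)[:top_k]
```

The bilinear form avoids building the Kronecker product explicitly, which may be convenient when scoring many movies for one user.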
Intuitively, the parameters that minimize such a loss function should be essentially the same as those that minimize a regression on the centered ratings of each user (the match is exact when the movie features are also centered per user), that is:
$$ \min_w \sum_{u \in users} \frac{1}{\left\vert M_u \right\vert} \sum_{i \in M_u} \bigg( (R_{ui} - \overline{R_u}) - w^\mathsf{T}(z_i \otimes x_u) \bigg)^2 + \lambda \lVert w \rVert^2_2 $$

This is far easier and faster to work with, and can be computed with standard libraries. In my tests, the formula without the weights by number of movies rated per user was slightly more accurate after optimizing the coefficients for half an hour.
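The link between the two losses can be checked with a standard identity for sums of pairwise squared differences. The shorthand $d_i = R_{ui} - w^\mathsf{T}(z_i \otimes x_u)$, $n = \left\vert M_u \right\vert$ and the per-user means $\overline{d}$ and $\overline{z_u}$ are notation assumed for this sketch, not taken from the paper:

$$ \sum_{i \in M_u} \sum_{j \in M_u} (d_i - d_j)^2 = 2n \sum_{i \in M_u} d_i^2 - 2 \Big( \sum_{i \in M_u} d_i \Big)^2 = 2n \sum_{i \in M_u} (d_i - \overline{d})^2 $$

$$ d_i - \overline{d} = (R_{ui} - \overline{R_u}) - w^\mathsf{T}\big( (z_i - \overline{z_u}) \otimes x_u \big) $$

Under this reading, the pairwise loss of one user is a multiple of a squared-error loss on centered ratings in which the movie features are centered per user too, so a regression on centered ratings with uncentered features differs from it by each user's mean predicted score. The factor $n$ also cancels the $1/\left\vert M_u \right\vert$ weight of the pairwise formula, which suggests that the weighted pairwise loss corresponds to an unweighted sum over individual ratings.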
3. Loading ratings and generating test set
4. Fitting the model with Spark
5. Evaluating the model and checking some recommendations
The movie data needs some processing in order to put it in the right format. As explained before, the movie data can be enhanced with the tag genome information, which contains 1128 tags for each movie on a relative scale from 0 to 1 (with values closer to 1 indicating that the movie has more of that tag). Although these are far too many tags to use with this small ratings data, many tags are related to each other, so it’s possible to reduce the number of features considerably using principal components.
A small look at the data:
import pandas as pd, numpy as np, re
from collections import defaultdict
movies=pd.read_csv('/home/david/movielens/ml-latest/ml-latest/movies.csv')
movies_humanreadable=movies.copy()
movies['hasYear']=movies.title.map(lambda x: bool(re.search(r"\s\((\d{4})\)$",x.strip())))
movies['Year']='unknown'
# assign through a single .loc call to avoid chained assignment on a copy
movies.loc[movies.hasYear,'Year']=movies.title.loc[movies.hasYear].map(lambda x: re.search(r"\s\((\d{4})\)$",x.strip()).group(1))
del movies['hasYear']
movies.head()
| | movieId | title | genres | Year |
---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 1995 |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 1995 |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | 1995 |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | 1995 |
4 | 5 | Father of the Bride Part II (1995) | Comedy | 1995 |
movies['genres']=movies.genres.map(lambda x: set(x.split('|')))
present_genres=set()
for movie in movies.itertuples():
    present_genres=present_genres.union(movie.genres)
for genre in present_genres:
    movies['genre'+genre]=movies.genres.map(lambda x: 1.0*(genre in x))
present_genres
{'(no genres listed)', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'}
Processing the tag genome info:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
%matplotlib inline
tags=pd.read_csv('/home/david/movielens/ml-latest/ml-latest/genome-scores.csv')
tags_wide=tags.pivot(index='movieId', columns='tagId', values='relevance')
tags_wide=tags_wide.fillna(0)
pca=PCA(svd_solver='full')
pca.fit(tags_wide)
plt.figure(figsize=(20,8))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.yticks(np.arange(0.2, 1.1, .1))
plt.xticks(np.arange(0, 1128, 50))
plt.grid()
From this figure, it seems that 50 principal components or so would be a good number to include.
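As an alternative to reading the number off the plot, the cut-off could be derived from the cumulative explained variance. The sketch below uses only NumPy on synthetic data (Python 3 and a recent NumPy assumed); the matrix `X` is a stand-in for the pivoted tag table, and the 90% `threshold` is a hypothetical value, not the criterion used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the movie-by-tag matrix: 5 latent factors plus noise
latent = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 40))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 40))

# Share of variance explained by each principal component, obtained from the
# singular values of the column-centered matrix
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.90  # hypothetical cut-off
n_components = int(np.searchsorted(cumulative, threshold) + 1)
```

With scikit-learn's `PCA`, the same cumulative curve is what `np.cumsum(pca.explained_variance_ratio_)` plots above.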
tags_pca=pd.DataFrame(pca.transform(tags_wide)[:,:50])
tags_pca.columns=["pc"+str(x) for x in tags_pca.columns.values]
tags_pca['movieId']=tags_wide.index
movies=pd.merge(movies,tags_pca,how='inner',on='movieId')
The year is converted into a discrete variable using the same criteria as in the original paper - the ratings were taken around the year 2000 so it makes sense to use these limits, in order to identify what were more recent movies at that time (which comprise the majority of the ratings).
## these criteria for making year discrete were taken from the same paper describing the method
def discretize_year(x):
    if x=='unknown':
        return x
    x=int(x)
    if x>=2000:
        return '>=2000'
    if x>=1995 and x<=1999:
        return str(x)
    if x>=1990 and x<=1994:
        return 'low90s'
    if x>=1980 and x<=1989:
        return '80s'
    if x>=1970 and x<=1979:
        return '70s'
    if x>=1960 and x<=1969:
        return '60s'
    if x>=1950 and x<=1959:
        return '50s'
    # upper bound corrected from 1959 to 1949 (1950-1959 is already handled above)
    if x>=1940 and x<=1949:
        return '40s'
    return '<1940'
movies_features=movies.copy()
del movies_features['title']
del movies_features['genres']
del movies_features['genre(no genres listed)']
movies_features['Year']=movies_features.Year.map(lambda x: discretize_year(x))
movies_features=pd.get_dummies(movies_features, columns=['Year'])
movies_features.set_index('movieId',inplace=True)
movies_features.head()
movieId | genreMystery | genreSci-Fi | genreCrime | genreDrama | genreAnimation | genreIMAX | genreAction | genreComedy | genreDocumentary | genreWar | ... | Year_1999 | Year_40s | Year_50s | Year_60s | Year_70s | Year_80s | Year_<1940 | Year_>=2000 | Year_low90s | Year_unknown |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 83 columns
The dataset contains demographic info with zip codes. As there are way too many of them, I’ll try to guess the US region from these zipcodes. In order to do so, I’m using a publicly available table mapping zip codes to states, another one mapping state names to their abbreviations, and finally classifying the states into regions according to usual definitions.
zipcode_abbs=pd.read_csv("/home/david/movielens/zips/states.csv")
zipcode_abbs_dct={z.State:z.Abbreviation for z in zipcode_abbs.itertuples()}
us_regs_table=[
('New England', 'Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont'),
('Middle Atlantic', 'Delaware, Maryland, New Jersey, New York, Pennsylvania'),
('South', 'Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, Missouri, North Carolina, South Carolina, Tennessee, Virginia, West Virginia'),
('Midwest', 'Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin'),
('Southwest', 'Arizona, New Mexico, Oklahoma, Texas'),
('West', 'Alaska, California, Colorado, Hawaii, Idaho, Montana, Nevada, Oregon, Utah, Washington, Wyoming')
]
us_regs_table=[(x[0],[i.strip() for i in x[1].split(",")]) for x in us_regs_table]
us_regs_dct=dict()
for r in us_regs_table:
    for s in r[1]:
        us_regs_dct[zipcode_abbs_dct[s]]=r[0]
# low_memory=False avoids the mixed-type DtypeWarning when reading this file
zipcode_info=pd.read_csv("/home/david/movielens/free-zipcode-database.csv",low_memory=False)
zipcode_info=zipcode_info.groupby('Zipcode').first().reset_index()
# assign through a single .loc call instead of chained indexing
zipcode_info.loc[zipcode_info.Country!="US",'State']='UnknownOrNonUS'
zipcode_info['Region']=zipcode_info['State'].copy()
zipcode_info.loc[zipcode_info.Country=="US",'Region']=zipcode_info.Region.loc[zipcode_info.Country=="US"].map(lambda x: us_regs_dct[x] if x in us_regs_dct else 'UsOther')
zipcode_info=zipcode_info[['Zipcode', 'Region']]
zipcode_info.head()
| | Zipcode | Region |
---|---|---|
0 | 501 | Middle Atlantic |
1 | 544 | Middle Atlantic |
2 | 601 | UsOther |
3 | 602 | UsOther |
4 | 603 | UsOther |
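The lookup chain above (zip code to state, state to region, with fallbacks) can be summarized in a few lines of plain Python. The two miniature dictionaries and the function name `region_for_zip` are hypothetical and only stand in for the tables built from the CSV files.

```python
# Hypothetical miniature lookup tables; the notebook builds the real ones from CSV files
zip_to_state = {2460: "MA", 48067: "MI", 70072: "LA", 601: "PR"}
state_to_region = {"MA": "New England", "MI": "Midwest", "LA": "South"}

def region_for_zip(zipcode):
    """Return a coarse region label for a zip code."""
    state = zip_to_state.get(zipcode)
    if state is None:
        # zip code absent from the lookup table
        return "UnknownOrNonUS"
    # states or territories outside the region table fall back to 'UsOther'
    return state_to_region.get(state, "UsOther")

labels = [region_for_zip(z) for z in (2460, 48067, 601, 99999999)]
```

Keeping two separate fallback labels makes it possible to tell apart zip codes that are unknown from US locations that simply are not covered by the region table.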
A quick look at the demographic data provided in the dataset:
users=pd.read_table("/home/david/movielens/ml-1m/ml-1m/users.dat",sep='::',names=["userId","Gender","Age","Occupation","Zipcode"], engine='python')
users["Zipcode"]=users.Zipcode.map(lambda x: int(re.sub("-.*","",x)))
users=pd.merge(users,zipcode_info,on='Zipcode',how='left')
users['Region']=users.Region.fillna('UnknownOrNonUS')
users.head()
| | userId | Gender | Age | Occupation | Zipcode | Region |
---|---|---|---|---|---|---|
0 | 1 | F | 1 | 10 | 48067 | Midwest |
1 | 2 | M | 56 | 16 | 70072 | South |
2 | 3 | M | 25 | 15 | 55117 | Midwest |
3 | 4 | M | 45 | 7 | 2460 | New England |
4 | 5 | M | 25 | 20 | 55455 | Midwest |
users.Region.value_counts()
West 1652
Midwest 1546
South 887
Middle Atlantic 872
New England 507
Southwest 462
UnknownOrNonUS 73
UsOther 41
Name: Region, dtype: int64
users_features=users.copy()
users_features['Gender']=users_features.Gender.map(lambda x: 1.0*(x=='M'))
del users_features['Zipcode']
users_features['Age']=users_features.Age.map(lambda x: str(x))
users_features['Occupation']=users_features.Occupation.map(lambda x: str(x))
users_features=pd.get_dummies(users_features, columns=['Age', 'Occupation', 'Region'])
users_features.set_index('userId',inplace=True)
users_features.head()
userId | Gender | Age_1 | Age_18 | Age_25 | Age_35 | Age_45 | Age_50 | Age_56 | Occupation_0 | Occupation_1 | ... | Occupation_8 | Occupation_9 | Region_Middle Atlantic | Region_Midwest | Region_New England | Region_South | Region_Southwest | Region_UnknownOrNonUS | Region_UsOther | Region_West |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 1.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
5 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 37 columns
ratings=pd.read_table("/home/david/movielens/ml-1m/ml-1m/ratings.dat", sep="::", names=["userId","movieId","Rating","Timestamp"], engine='python')
movies_w_sideinfo=set(list(movies.movieId))
ratings=ratings.loc[ratings.movieId.map(lambda x: x in movies_w_sideinfo)]
ratings.head()
| | userId | movieId | Rating | Timestamp |
---|---|---|---|---|
0 | 1 | 1193 | 5 | 978300760 |
1 | 1 | 661 | 3 | 978302109 |
2 | 1 | 914 | 3 | 978301968 |
3 | 1 | 3408 | 4 | 978300275 |
4 | 1 | 2355 | 5 | 978824291 |
Generating a train and test set - for computational reasons I'll just take 100 random users as test with all the movies they rated:
userids_present=list(set(list(ratings.userId)))
np.random.seed(1)
users_testset=set(list(np.random.choice(userids_present,replace=False,size=100)))
ratings_train=ratings.loc[ratings.userId.map(lambda x: x not in users_testset)]
ratings_test=ratings.loc[ratings.userId.map(lambda x: x in users_testset)]
users_trainset=set(list(ratings.userId.loc[ratings.userId.map(lambda x: x not in users_testset)]))
# now centering the ratings
avg_rating_by_user=ratings_train.groupby('userId')['Rating'].mean().to_frame().rename(columns={'Rating':'AvgRating'})
ratings_train=pd.merge(ratings_train, avg_rating_by_user, left_on='userId',right_index=True)
ratings_train['RatingCentered']=ratings_train.Rating-ratings_train.AvgRating
print(ratings_train.shape[0])
print(ratings_test.shape[0])
978017
18866
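The same kind of split, where whole users are held out and the training ratings are then centered per user, can be illustrated on a tiny array with NumPy alone (Python 3 and a recent NumPy assumed). The ratings below are made up for the example and the variable names are not the notebook's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy ratings as (userId, movieId, rating) rows; values are made up
ratings = np.array([
    (1, 10, 5), (1, 11, 3), (2, 10, 4), (2, 12, 2),
    (3, 11, 4), (3, 12, 5), (4, 10, 1), (4, 11, 2),
])

# Hold out whole users, so that no rating of a test user is seen in training
user_ids = np.unique(ratings[:, 0])
test_users = rng.choice(user_ids, size=1, replace=False)

is_test = np.isin(ratings[:, 0], test_users)
ratings_train, ratings_test = ratings[~is_test], ratings[is_test]

# Center the training ratings by each user's own mean rating
centered = ratings_train[:, 2].astype(float)
for u in np.unique(ratings_train[:, 0]):
    mask = ratings_train[:, 0] == u
    centered[mask] -= centered[mask].mean()
```

Splitting by user rather than by rating is what makes the test users behave like new users with demographic information only.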
Once the outer products are expanded, the data is very high-dimensional and doesn't fit in a typical computer's RAM, so Spark comes in very handy for the computations, even when run locally. As it takes a long time to compute the coefficients, this will be done without any hyperparameter tuning.
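A back-of-the-envelope calculation shows the scale involved. It takes the feature counts and the number of training ratings from the outputs shown earlier, and assumes a dense float64 layout; sparse storage, as used in the model-fitting code, needs less.

```python
n_user_feats = 37       # columns of users_features
n_movie_feats = 83      # columns of movies_features
n_train_rows = 978017   # number of training ratings

# Each rating becomes one outer-product vector with this many entries
n_columns = n_user_feats * n_movie_feats

# Size of the full design matrix if stored densely as float64 (8 bytes per entry)
n_entries = n_train_rows * n_columns
dense_gb = n_entries * 8 / 1e9
```

This gives 3,071 columns and roughly 24 GB for the dense matrix.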
Starting Spark:
import findspark
findspark.init("/home/david/Downloads/spark-2.1.1-bin-hadoop2.7/")
import pyspark
sc = pyspark.SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
Now fitting the model:
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.regression import LinearRegression
from scipy.sparse import csc_matrix
def generate_features(user,movie,users_features_bc,movies_features_bc):
    # outer product of user and movie features, flattened into a sparse column vector
    user_feats=users_features_bc.value.loc[user].values
    movie_feats=movies_features_bc.value.loc[movie].values
    return csc_matrix(np.kron(user_feats,movie_feats).reshape(-1,1))
# broadcast the feature tables so that each worker receives them only once
users_features_bc=sc.broadcast(users_features)
movies_features_bc=sc.broadcast(movies_features)
trainset=sc.parallelize([(i.userId,i.movieId,i.RatingCentered) for i in ratings_train.itertuples()])\
    .map(lambda x: LabeledPoint(x[2],generate_features(x[0],x[1],users_features_bc,movies_features_bc)))\
    .map(lambda x: (float(x.label),x.features.asML())).toDF(['label','features'])
# repartition returns a new DataFrame, so the result has to be assigned back
trainset=trainset.repartition(50)
recommender=LinearRegression(regParam=1e-4).fit(trainset)
formula_coeffs=recommender.coefficients.toArray()
Finally, evaluating what this system recommends to users. Due to the computational time it takes, the results won’t be evaluated with the metrics proposed in the paper at the beginning. I’ll just take average ratings for top-5 recommendations for each user in the test set and compare them to average ratings (the expected value for random recommendations) and to the maximum possible ratings from 5 movies each.
This is not a really good measure, but it’s a useful sanity check to see whether the recommendations make sense and whether the system is better than nothing.
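The measure described above can be written down in a few lines of plain Python. The toy rows and the helper name `mean_rating_of_top_k` are assumptions for illustration; the notebook computes the same quantity with pandas further below.

```python
from collections import defaultdict

# (userId, rating given by the user, model score); toy values for illustration
rows = [
    (1, 5, 0.9), (1, 3, 0.1), (1, 4, 0.5),
    (2, 2, 0.7), (2, 5, 0.8), (2, 1, -0.3),
]

def mean_rating_of_top_k(rows, k):
    """Mean rating over each user's k highest-scored movies, pooled across users."""
    by_user = defaultdict(list)
    for user, rating, score in rows:
        by_user[user].append((score, rating))
    picked = []
    for scored in by_user.values():
        # highest model score first
        scored.sort(key=lambda pair: pair[0], reverse=True)
        picked.extend(rating for _, rating in scored[:k])
    return sum(picked) / len(picked)

top2 = mean_rating_of_top_k(rows, k=2)
```

Here the top-2 picks are rated 5 and 4 by the first user and 5 and 2 by the second, so `top2` is 4.0.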
Getting scores for the test set:
def generate_features_series(user,movie):
    user_feats=users_features.loc[user].values
    movie_feats=movies_features.loc[movie].values
    return pd.Series(np.kron(user_feats,movie_feats).astype('float64'))
X_test=ratings_test.apply(lambda x: generate_features_series(x['userId'],x['movieId']), axis=1)
# work on an explicit copy so that the new column is not set on a slice of ratings
ratings_test=ratings_test.copy()
ratings_test['score']=X_test.dot(formula_coeffs)
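Comparing the model to recommending the best-rated movies too (the average rating of each movie is computed here from all ratings, including those of the test users):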
avg_ratings=ratings.groupby('movieId')['Rating'].mean().to_frame().rename(columns={"Rating":"AvgRating"})
ratings_test=pd.merge(ratings_test,avg_ratings,left_on='movieId',right_index=True)
Now comparing it to no model (random recommendation) and best possible recommendations (in terms of ratings):
print 'Average movie rating:',ratings_test.groupby('userId')['Rating'].mean().mean()
print 'Average rating for top-5 rated by each user:',ratings_test.sort_values(['userId','Rating'],ascending=False).groupby('userId')['Rating'].head(5).mean()
print 'Average rating for bottom-5 rated by each user:',ratings_test.sort_values(['userId','Rating'],ascending=True).groupby('userId')['Rating'].head(5).mean()
print 'Average rating for top-5 recommendations of best-rated movies:',ratings_test.sort_values(['userId','AvgRating'],ascending=False).groupby('userId')['Rating'].head(5).mean()
print '----------------------'
print 'Average rating for top-5 recommendations from this model:',ratings_test.sort_values(['userId','score'],ascending=False).groupby('userId')['Rating'].head(5).mean()
print 'Average rating for bottom-5 (non-)recommendations from this model:',ratings_test.sort_values(['userId','score'],ascending=True).groupby('userId')['Rating'].head(5).mean()
Average movie rating: 3.68455396497
Average rating for top-5 rated by each user: 4.96
Average rating for bottom-5 rated by each user: 1.61
Average rating for top-5 recommendations of best-rated movies: 4.416
----------------------
Average rating for top-5 recommendations from this model: 4.338
Average rating for bottom-5 (non-)recommendations from this model: 2.554
Examining some recommendations (3 per user):
top3=ratings_test.sort_values(['userId','score'],ascending=False).groupby('userId').head(3)
top3=top3[['userId','movieId','Rating']]
top3=pd.merge(top3,users,on='userId',how='left')
top3=pd.merge(top3,movies_humanreadable,on='movieId',how='left')
top3.rename(columns={'title':'Recommended Movie', 'genres':"Movie's Genres", 'Rating':'Rating by user'},inplace=True)
age_mapping={
1: "Under 18",
18: "18-24",
25: "25-34",
35: "35-44",
45: "45-49",
50: "50-55",
56: "56+"
}
top3['Age']=top3.Age.map(lambda x: age_mapping[x])
occupations_mapping={
0: "other or not specified",
1: "academic/educator",
2: "artist",
3: "clerical/admin",
4: "college/grad student",
5: "customer service",
6: "doctor/health care",
7: "executive/managerial",
8: "farmer",
9: "homemaker",
10: "K-12 student",
11: "lawyer",
12: "programmer",
13: "retired",
14: "sales/marketing",
15: "scientist",
16: "self-employed",
17: "technician/engineer",
18: "tradesman/craftsman",
19: "unemployed",
20: "writer"
}
top3['Occupation']=top3.Occupation.map(lambda x: occupations_mapping[x])
del top3['Zipcode']
del top3['movieId']
top3[['userId','Recommended Movie','Rating by user', 'Gender','Age','Occupation','Region',"Movie's Genres"]]
| | userId | Recommended Movie | Rating by user | Gender | Age | Occupation | Region | Movie's Genres |
---|---|---|---|---|---|---|---|---|
0 | 5938 | Raiders of the Lost Ark (Indiana Jones and the... | 5 | M | 25-34 | academic/educator | South | Action|Adventure |
1 | 5938 | Usual Suspects, The (1995) | 5 | M | 25-34 | academic/educator | South | Crime|Mystery|Thriller |
2 | 5938 | North by Northwest (1959) | 5 | M | 25-34 | academic/educator | South | Action|Adventure|Mystery|Romance|Thriller |
3 | 5798 | Raiders of the Lost Ark (Indiana Jones and the... | 4 | M | 35-44 | other or not specified | West | Action|Adventure |
4 | 5798 | Saving Private Ryan (1998) | 5 | M | 35-44 | other or not specified | West | Action|Drama|War |
5 | 5798 | Godfather, The (1972) | 5 | M | 35-44 | other or not specified | West | Crime|Drama |
6 | 5693 | North by Northwest (1959) | 4 | F | 25-34 | college/grad student | West | Action|Adventure|Mystery|Romance|Thriller |
7 | 5693 | Schindler's List (1993) | 3 | F | 25-34 | college/grad student | West | Drama|War |
8 | 5693 | Notorious (1946) | 3 | F | 25-34 | college/grad student | West | Film-Noir|Romance|Thriller |
9 | 5692 | Shawshank Redemption, The (1994) | 5 | F | 25-34 | executive/managerial | South | Crime|Drama |
10 | 5692 | Schindler's List (1993) | 5 | F | 25-34 | executive/managerial | South | Drama|War |
11 | 5692 | Raiders of the Lost Ark (Indiana Jones and the... | 4 | F | 25-34 | executive/managerial | South | Action|Adventure |
12 | 5582 | Schindler's List (1993) | 5 | M | 45-49 | academic/educator | Midwest | Drama|War |
13 | 5582 | To Kill a Mockingbird (1962) | 5 | M | 45-49 | academic/educator | Midwest | Drama |
14 | 5582 | Third Man, The (1949) | 5 | M | 45-49 | academic/educator | Midwest | Film-Noir|Mystery|Thriller |
15 | 5560 | Sting, The (1973) | 5 | F | 35-44 | technician/engineer | West | Comedy|Crime |
16 | 5560 | Guess Who's Coming to Dinner (1967) | 5 | F | 35-44 | technician/engineer | West | Drama |
17 | 5560 | Sixth Sense, The (1999) | 5 | F | 35-44 | technician/engineer | West | Drama|Horror|Mystery |
18 | 5392 | Fargo (1996) | 4 | M | 50-55 | academic/educator | West | Comedy|Crime|Drama|Thriller |
19 | 5392 | Close Encounters of the Third Kind (1977) | 5 | M | 50-55 | academic/educator | West | Adventure|Drama|Sci-Fi |
20 | 5392 | American Beauty (1999) | 4 | M | 50-55 | academic/educator | West | Drama|Romance |
21 | 5352 | It Happened One Night (1934) | 5 | F | 35-44 | customer service | Midwest | Comedy|Romance |
22 | 5352 | Thin Man, The (1934) | 4 | F | 35-44 | customer service | Midwest | Comedy|Crime |
23 | 5352 | My Man Godfrey (1936) | 5 | F | 35-44 | customer service | Midwest | Comedy|Romance |
24 | 5235 | Wallace & Gromit: A Close Shave (1995) | 5 | M | 25-34 | college/grad student | Middle Atlantic | Animation|Children|Comedy |
25 | 5235 | Monty Python and the Holy Grail (1975) | 5 | M | 25-34 | college/grad student | Middle Atlantic | Adventure|Comedy|Fantasy |
26 | 5235 | Dr. Strangelove or: How I Learned to Stop Worr... | 5 | M | 25-34 | college/grad student | Middle Atlantic | Comedy|War |
27 | 5203 | Forrest Gump (1994) | 2 | F | 45-49 | doctor/health care | New England | Comedy|Drama|Romance|War |
28 | 5203 | Shakespeare in Love (1998) | 5 | F | 45-49 | doctor/health care | New England | Comedy|Drama|Romance |
29 | 5203 | Braveheart (1995) | 5 | F | 45-49 | doctor/health care | New England | Action|Drama|War |
... | ... | ... | ... | ... | ... | ... | ... | ... |
270 | 286 | Raiders of the Lost Ark (Indiana Jones and the... | 4 | M | 25-34 | academic/educator | Midwest | Action|Adventure |
271 | 286 | Rear Window (1954) | 4 | M | 25-34 | academic/educator | Midwest | Mystery|Thriller |
272 | 286 | North by Northwest (1959) | 5 | M | 25-34 | academic/educator | Midwest | Action|Adventure|Mystery|Romance|Thriller |
273 | 277 | Princess Bride, The (1987) | 4 | F | 35-44 | academic/educator | West | Action|Adventure|Comedy|Fantasy|Romance |
274 | 277 | Sixth Sense, The (1999) | 5 | F | 35-44 | academic/educator | West | Drama|Horror|Mystery |
275 | 277 | Shakespeare in Love (1998) | 4 | F | 35-44 | academic/educator | West | Comedy|Drama|Romance |
276 | 249 | Shawshank Redemption, The (1994) | 5 | F | 18-24 | sales/marketing | Midwest | Crime|Drama |
277 | 249 | Princess Bride, The (1987) | 5 | F | 18-24 | sales/marketing | Midwest | Action|Adventure|Comedy|Fantasy|Romance |
278 | 249 | Usual Suspects, The (1995) | 4 | F | 18-24 | sales/marketing | Midwest | Crime|Mystery|Thriller |
279 | 235 | M (1931) | 5 | M | 25-34 | other or not specified | UnknownOrNonUS | Crime|Film-Noir|Thriller |
280 | 235 | Reservoir Dogs (1992) | 5 | M | 25-34 | other or not specified | UnknownOrNonUS | Crime|Mystery|Thriller |
281 | 235 | Princess Mononoke (Mononoke-hime) (1997) | 4 | M | 25-34 | other or not specified | UnknownOrNonUS | Action|Adventure|Animation|Drama|Fantasy |
282 | 180 | Shadow of a Doubt (1943) | 4 | M | 45-49 | programmer | New England | Crime|Drama|Thriller |
283 | 180 | Silence of the Lambs, The (1991) | 4 | M | 45-49 | programmer | New England | Crime|Horror|Thriller |
284 | 180 | One Flew Over the Cuckoo's Nest (1975) | 5 | M | 45-49 | programmer | New England | Drama |
285 | 170 | Dr. Strangelove or: How I Learned to Stop Worr... | 5 | M | 25-34 | lawyer | Middle Atlantic | Comedy|War |
286 | 170 | North by Northwest (1959) | 5 | M | 25-34 | lawyer | Middle Atlantic | Action|Adventure|Mystery|Romance|Thriller |
287 | 170 | Raiders of the Lost Ark (Indiana Jones and the... | 5 | M | 25-34 | lawyer | Middle Atlantic | Action|Adventure |
288 | 148 | Raiders of the Lost Ark (Indiana Jones and the... | 5 | M | 50-55 | technician/engineer | Midwest | Action|Adventure |
289 | 148 | Schindler's List (1993) | 5 | M | 50-55 | technician/engineer | Midwest | Drama|War |
290 | 148 | North by Northwest (1959) | 5 | M | 50-55 | technician/engineer | Midwest | Action|Adventure|Mystery|Romance|Thriller |
291 | 126 | Charade (1963) | 5 | M | 18-24 | homemaker | West | Comedy|Crime|Mystery|Romance|Thriller |
292 | 126 | Gladiator (2000) | 4 | M | 18-24 | homemaker | West | Action|Adventure|Drama |
293 | 126 | Die Hard (1988) | 1 | M | 18-24 | homemaker | West | Action|Crime|Thriller |
294 | 80 | Schindler's List (1993) | 5 | M | 56+ | academic/educator | Midwest | Drama|War |
295 | 80 | One Flew Over the Cuckoo's Nest (1975) | 4 | M | 56+ | academic/educator | Midwest | Drama |
296 | 80 | Green Mile, The (1999) | 5 | M | 56+ | academic/educator | Midwest | Crime|Drama |
297 | 46 | Evil Dead II (Dead by Dawn) (1987) | 5 | M | 18-24 | unemployed | Southwest | Action|Comedy|Fantasy|Horror |
298 | 46 | Rosemary's Baby (1968) | 5 | M | 18-24 | unemployed | Southwest | Drama|Horror|Thriller |
299 | 46 | Night on Earth (1991) | 1 | M | 18-24 | unemployed | Southwest | Comedy|Drama |
300 rows × 8 columns
Note that this was a complete-user hold-out test set (the model's parameters were not calculated using any information from these users and might not even have used any information from the movies that are being recommended), but the recommendations shown here are limited to the movies that each of these users has rated, in order to be able to see how they would have rated the recommendations.
As the top-5 recommendations for each user were rated well by them on average (4.338, against an overall average of about 3.68), we might guess they are good. They are not rated quite as highly as simply recommending the best-rated movies (4.416), but the recommendations are personalized and can include newer movies too (as long as they have tags).
As for implementing such a model, since it generates the same recommendations for any user with the same combination of gender, age, occupation and location, the recommendations could be pre-computed for each of the $2 \times 7 \times 21 \times 8=2352$ theoretical buckets (in reality, not all of them make sense, as people of certain ages, such as under 18, shouldn't fall into certain occupations).
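The bucket enumeration can be sketched with `itertools.product`. The category values mirror the codes and region labels used in this notebook, while `recommend_for_bucket` is a hypothetical placeholder: a real version would build the user feature vector for the bucket and score every movie with the fitted coefficients.

```python
from itertools import product

genders = ["F", "M"]
age_codes = [1, 18, 25, 35, 45, 50, 56]
occupation_codes = list(range(21))
regions = ["Middle Atlantic", "Midwest", "New England", "South",
           "Southwest", "UnknownOrNonUS", "UsOther", "West"]

# Every theoretical combination of demographic attributes
buckets = list(product(genders, age_codes, occupation_codes, regions))

def recommend_for_bucket(bucket):
    """Hypothetical placeholder that would return the top movies for this bucket."""
    return []

# Lookup table from demographic bucket to its pre-computed recommendations
precomputed = {bucket: recommend_for_bucket(bucket) for bucket in buckets}
```

At serving time, a new user's recommendations would then be a single dictionary lookup on their `(gender, age, occupation, region)` tuple.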