# Recommending movies (based on user demographics and movie info)

Note: if you are viewing this notebook directly on GitHub, some mathematical symbols might display incorrectly or not display at all. This same notebook can be rendered from nbviewer by following this link.

This project consists of recommending movies to users based on their demographic information (in this case: age, gender, occupation, and geographical region) and on movie information (in this case: year of production, genres, and user tags). It uses the MovieLens 1M dataset, which contains 1,000,209 ratings of 3,900 movies from 6,040 users, along with the users’ demographic information and some basic movie information. The movie information is enhanced with the MovieLens 20M dataset, which contains more detailed movie information in the form of tag genomes, as described in Vig, J., Sen, S., & Riedl, J. (2012). The tag genome: Encoding community knowledge to support novel interaction. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3), 13.

The formula used is an implementation of what’s described in Park, S. T., & Chu, W. (2009, October). Pairwise preference regression for cold-start recommendation. In Proceedings of the third ACM conference on Recommender systems (pp. 21-28). ACM., with some slight modifications. The general idea is to fit a regression on the differences between ratings of pairs of movies rated by the same user, using the outer products of the user and movie attribute vectors as features.

In comparison to recommendations based on past user ratings, these kinds of models are able to provide quality recommendations to new users (for whom there is demographic information available), and are able to recommend both old and new movies (as long as there is information about them).

The idea implemented here differs from what the paper above describes in that:

• More movie information is added through the use of the tag genome info.
• No rating bots are used to enhance movie features.
• For computational reasons, the results will only be evaluated with rating averages of the movies that would be recommended for a hold-out user set.

## Recommendation formula

This model consists of ranking movies for a user by a linear function of the outer product of movie and user features, with weights chosen to minimize the squared differences between each pair of a user’s ratings and the corresponding predicted differences, plus a regularization term. This is given by the following formula:

$$\min_w \sum_{u \in users} \bigg( \frac{1}{\left\vert M_u \right\vert} \sum_{i \in M_u} \sum_{j \in M_u} \Big( (R_{ui} - R_{uj}) - \big(w^\mathsf{T}(z_i \otimes x_u) - w^\mathsf{T}(z_j \otimes x_u)\big) \Big)^2 \bigg) + \lambda \lVert w \rVert^2_2$$

Where $R_{ui}$ is the rating given by user $u$ to movie $i$, $x_u$ are the user features, $z_i$ are the movie features, $M_u$ is the set of movies rated by user $u$, $\lambda$ is a regularization parameter, and $w$ are the coefficients assigned to each combination of user feature and movie feature. Note that the optimization problem is convex in $w$.

Recommendations are then produced for each user by calculating the score $w^\mathsf{T}(z_j \otimes x_u)$ for each movie $j$ and taking the movies with the highest scores for that user.
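As a toy illustration of this scoring step (random data and made-up dimensions, not the actual fitted model), ranking movies for one user amounts to:

```python
import numpy as np

# toy dimensions and random data -- NOT the fitted model, just the mechanics
rng = np.random.RandomState(0)
n_user_feats, n_movie_feats, n_movies = 4, 6, 10

w = rng.randn(n_user_feats * n_movie_feats)   # stand-in for the learned coefficients
x_u = rng.randn(n_user_feats)                 # one user's feature vector
Z = rng.randn(n_movies, n_movie_feats)        # one row of movie features per movie

# score each movie j as w^T (z_j outer x_u), then rank by score
scores = np.array([w.dot(np.kron(Z[j], x_u)) for j in range(n_movies)])
top5 = np.argsort(-scores)[:5]                # indices of the 5 recommended movies
```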

Intuitively, the parameters that minimize this loss function are essentially the same as those that minimize a regression on the centered ratings for each user, that is:

$$\min_w \sum_{u \in users} \frac{1}{\left\vert\ M_u \right\vert} \sum_{i \in M_u} \bigg( (R_{ui} - \overline{R_u}) - w^\mathsf{T}(z_i \otimes x_u) \bigg)^2 + \lambda \lVert w \rVert^2_2$$

This is far easier and faster to work with, and can be computed with standard libraries. I found the version of the formula without the per-user weights (the $1/\vert M_u \vert$ factors) to be slightly more accurate after letting the coefficient optimization run for half an hour.
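The link between the two losses rests on the identity $\frac{1}{\vert M_u \vert}\sum_{i,j \in M_u}(d_i-d_j)^2 = 2\sum_{i \in M_u}(d_i-\bar{d})^2$ for the per-user residuals $d_i$. A quick numerical check of that identity with toy residuals (random numbers, not the real data):

```python
import numpy as np

rng = np.random.RandomState(1)
d = rng.randn(20)   # toy residuals d_i for one user's rated movies
n = len(d)

# pairwise form of the per-user loss, divided by |M_u|
pairwise = sum((d[i] - d[j])**2 for i in range(n) for j in range(n)) / float(n)

# centered form: twice the sum of squared deviations from the mean residual
centered = 2.0 * np.sum((d - d.mean())**2)

assert np.isclose(pairwise, centered)
```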

## Sections

1. Processing the movie data

2. Processing the user data

4. Fitting the model with Spark

5. Evaluating the model and checking some recommendations

## 1. Processing the movie data

The movie data needs some processing in order to put it in the right format. As explained before, the movie data can be enhanced with the tag genome information, which contains 1,128 tags for each movie on a relative scale from 0 to 1 (with values closer to 1 indicating that the movie has more of that tag). Although these are far too many tags to use with this small ratings dataset, many tags are related to each other, and it’s possible to achieve a good feature reduction using principal components.

A small look at the data:

In [1]:
import pandas as pd, numpy as np, re
from collections import defaultdict

movies=pd.read_csv("/home/david/movielens/ml-20m/movies.csv")  # (assumed path) the cell that loaded the ML-20M movie list was not preserved

movies['hasYear']=movies.title.map(lambda x: bool(re.search("\s\((\d{4})\)$",x.strip())))
movies['Year']='unknown'
movies['Year'].loc[movies.hasYear]=movies.title.loc[movies.hasYear].map(lambda x: re.search("\s\((\d{4})\)$",x.strip()).group(1))
del movies['hasYear']
movies.head()

/home/david/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.py:179: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)

Out[1]:
movieId title genres Year
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1995
1 2 Jumanji (1995) Adventure|Children|Fantasy 1995
2 3 Grumpier Old Men (1995) Comedy|Romance 1995
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 1995
4 5 Father of the Bride Part II (1995) Comedy 1995
In [2]:
movies['genres']=movies.genres.map(lambda x: set(x.split('|')))
present_genres=set()
for movie in movies.itertuples():
    present_genres=present_genres.union(movie.genres)
for genre in present_genres:
    movies['genre'+genre]=movies.genres.map(lambda x: 1.0*(genre in x))
present_genres

Out[2]:
{'(no genres listed)',
'Action',
'Animation',
'Children',
'Comedy',
'Crime',
'Documentary',
'Drama',
'Fantasy',
'Film-Noir',
'Horror',
'IMAX',
'Musical',
'Mystery',
'Romance',
'Sci-Fi',
'Thriller',
'War',
'Western'}

Processing the tag genome info:

In [3]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
%matplotlib inline

tags=pd.read_csv("/home/david/movielens/ml-20m/genome-scores.csv")  # (assumed path) tag-genome relevance scores with columns movieId, tagId, relevance
tags_wide=tags.pivot(index='movieId', columns='tagId', values='relevance')
tags_wide=tags_wide.fillna(0)
pca=PCA(svd_solver='full')
pca.fit(tags_wide)

plt.figure(figsize=(20,8))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.yticks(np.arange(0.2, 1.1, .1))
plt.xticks(np.arange(0, 1128, 50))
plt.grid()


From this figure, it seems that around 50 principal components would be a good number to include.

In [4]:
tags_pca=pd.DataFrame(pca.transform(tags_wide)[:,:50])
tags_pca.columns=["pc"+str(x) for x in tags_pca.columns.values]
tags_pca['movieId']=tags_wide.index
movies=pd.merge(movies,tags_pca,how='inner',on='movieId')


The year is converted into a discrete variable using the same criteria as in the original paper. The ratings were collected around the year 2000, so it makes sense to use these cut-offs in order to identify which movies were recent at that time (these receive the majority of the ratings).

In [5]:
## these criteria for making year discrete were taken from the same paper describing the method
def discretize_year(x):
    if x=='unknown':
        return x
    else:
        x=int(x)
        if x>=2000:
            return '>=2000'
        if x>=1995 and x<=1999:
            return str(x)
        if x>=1990 and x<=1994:
            return 'low90s'
        if x>=1980 and x<=1989:
            return '80s'
        if x>=1970 and x<=1979:
            return '70s'
        if x>=1960 and x<=1969:
            return '60s'
        if x>=1950 and x<=1959:
            return '50s'
        if x>=1940 and x<=1949:
            return '40s'
        if x<1940:
            return '<1940'
        else:
            return 'unknown'

movies_features=movies.copy()
del movies_features['title']
del movies_features['genres']
del movies_features['genre(no genres listed)']
movies_features['Year']=movies_features.Year.map(lambda x: discretize_year(x))
movies_features=pd.get_dummies(movies_features, columns=['Year'])
movies_features.set_index('movieId',inplace=True)

Out[5]:
genreMystery genreSci-Fi genreCrime genreDrama genreAnimation genreIMAX genreAction genreComedy genreDocumentary genreWar ... Year_1999 Year_40s Year_50s Year_60s Year_70s Year_80s Year_<1940 Year_>=2000 Year_low90s Year_unknown
movieId
1 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0
4 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 83 columns

## 2. Processing the user data

The dataset contains demographic info with zip codes. As there are way too many of them, I’ll try to guess the US region from these zipcodes. In order to do so, I’m using a publicly available table mapping zip codes to states, another one mapping state names to their abbreviations, and finally classifying the states into regions according to usual definitions.

In [6]:
zipcode_abbs=pd.read_csv("/home/david/movielens/zips/states.csv")
zipcode_abbs_dct={z.State:z.Abbreviation for z in zipcode_abbs.itertuples()}
us_regs_table=[
    ('New England', 'Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont'),
    ('Middle Atlantic', 'Delaware, Maryland, New Jersey, New York, Pennsylvania'),
    ('South', 'Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, Missouri, North Carolina, South Carolina, Tennessee, Virginia, West Virginia'),
    ('Midwest', 'Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin'),
    ('Southwest', 'Arizona, New Mexico, Oklahoma, Texas'),
    ('West', 'Alaska, California, Colorado, Hawaii, Idaho, Montana, Nevada, Oregon, Utah, Washington, Wyoming')  # restored entry: the 'West' region appears in the outputs below but was lost from this table
]
us_regs_table=[(x[0],[i.strip() for i in x[1].split(",")]) for x in us_regs_table]
us_regs_dct=dict()
for r in us_regs_table:
    for s in r[1]:
        us_regs_dct[zipcode_abbs_dct[s]]=r[0]

In [7]:
zipcode_info=pd.read_csv("/home/david/movielens/free-zipcode-database.csv")
zipcode_info=zipcode_info.groupby('Zipcode').first().reset_index()
zipcode_info['State'].loc[zipcode_info.Country!="US"]='UnknownOrNonUS'
zipcode_info['Region']=zipcode_info['State'].copy()
zipcode_info['Region'].loc[zipcode_info.Country=="US"]=zipcode_info.Region.loc[zipcode_info.Country=="US"].map(lambda x: us_regs_dct[x] if x in us_regs_dct else 'UsOther')
zipcode_info=zipcode_info[['Zipcode', 'Region']]

/home/david/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (11) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)

Out[7]:
Zipcode Region
0 501 Middle Atlantic
1 544 Middle Atlantic
2 601 UsOther
3 602 UsOther
4 603 UsOther

A small look at the demographic data provided in the dataset:

In [8]:
users=pd.read_table("/home/david/movielens/ml-1m/ml-1m/users.dat",sep='::',names=["userId","Gender","Age","Occupation","Zipcode"], engine='python')
users["Zipcode"]=users.Zipcode.map(lambda x: np.int(re.sub("-.*","",x)))
users=pd.merge(users,zipcode_info,on='Zipcode',how='left')
users['Region']=users.Region.fillna('UnknownOrNonUS')

Out[8]:
userId Gender Age Occupation Zipcode Region
0 1 F 1 10 48067 Midwest
1 2 M 56 16 70072 South
2 3 M 25 15 55117 Midwest
3 4 M 45 7 2460 New England
4 5 M 25 20 55455 Midwest
In [9]:
users.Region.value_counts()

Out[9]:
West               1652
Midwest            1546
South               887
Middle Atlantic     872
New England         507
Southwest           462
UnknownOrNonUS       73
UsOther              41
Name: Region, dtype: int64
In [10]:
users_features=users.copy()
users_features['Gender']=users_features.Gender.map(lambda x: 1.0*(x=='M'))
del users_features['Zipcode']
users_features['Age']=users_features.Age.map(lambda x: str(x))
users_features['Occupation']=users_features.Occupation.map(lambda x: str(x))
users_features=pd.get_dummies(users_features, columns=['Age', 'Occupation', 'Region'])
users_features.set_index('userId',inplace=True)

Out[10]:
Gender Age_1 Age_18 Age_25 Age_35 Age_45 Age_50 Age_56 Occupation_0 Occupation_1 ... Occupation_8 Occupation_9 Region_Middle Atlantic Region_Midwest Region_New England Region_South Region_Southwest Region_UnknownOrNonUS Region_UsOther Region_West
userId
1 0.0 1 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
2 1.0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 1 0 0 0 0
3 1.0 0 0 1 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
4 1.0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
5 1.0 0 0 1 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0

5 rows × 37 columns

A small look at the ratings provided:

In [11]:
ratings=pd.read_table("/home/david/movielens/ml-1m/ml-1m/ratings.dat", sep="::", names=["userId","movieId","Rating","Timestamp"], engine='python')
movies_w_sideinfo=set(list(movies.movieId))
ratings=ratings.loc[ratings.movieId.map(lambda x: x in movies_w_sideinfo)]

Out[11]:
userId movieId Rating Timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291

Generating a train and test set - for computational reasons I'll just take 100 random users as test with all the movies they rated:

In [12]:
userids_present=list(set(list(ratings.userId)))
np.random.seed(1)
users_testset=set(list(np.random.choice(userids_present,replace=False,size=100)))

ratings_train=ratings.loc[ratings.userId.map(lambda x: x not in users_testset)]
ratings_test=ratings.loc[ratings.userId.map(lambda x: x in users_testset)]
users_trainset=set(list(ratings.userId.loc[ratings.userId.map(lambda x: x not in users_testset)]))

# now centering the ratings
avg_rating_by_user=ratings_train.groupby('userId')['Rating'].mean().to_frame().rename(columns={'Rating':'AvgRating'})
ratings_train=pd.merge(ratings_train, avg_rating_by_user, left_on='userId',right_index=True)
ratings_train['RatingCentered']=ratings_train.Rating-ratings_train.AvgRating

print(ratings_train.shape[0])
print(ratings_test.shape[0])

978017
18866


## 4. Fitting the model with Spark

The data is very high-dimensional and doesn't fit in a computer's RAM, so Spark comes in very handy for the computations, even when run locally. As it takes a long time to compute the coefficients, this will be done without any hyperparameter tuning.

Starting Spark:

In [13]:
import findspark
findspark.init()  # locate the local Spark installation

import pyspark
sc = pyspark.SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)


Now fitting the model:

In [14]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors, VectorUDT
from scipy.sparse import csc_matrix

# (assumed) the feature tables need to be broadcast to the workers;
# the cell creating these broadcast variables was not preserved
users_features_bc=sc.broadcast(users_features)
movies_features_bc=sc.broadcast(movies_features)

def generate_features(user,movie,users_features_bc,movies_features_bc):
    user_feats=users_features_bc.value.loc[user].as_matrix()
    movie_feats=movies_features_bc.value.loc[movie].as_matrix()
    return csc_matrix(np.kron(user_feats,movie_feats).reshape(-1,1))

trainset=sc.parallelize([(i.userId,i.movieId,i.RatingCentered) for i in ratings_train.itertuples()])\
    .map(lambda x: LabeledPoint(x[2],generate_features(x[0],x[1],users_features_bc,movies_features_bc)))\
    .map(lambda x: (float(x.label),x.features.asML())).toDF(['label','features'])
trainset=trainset.repartition(50)

recommender=LinearRegression(regParam=1e-4).fit(trainset)
formula_coeffs=recommender.coefficients.toArray()


## 5. Evaluating the model and checking some recommendations

Finally, evaluating what this system recommends to users. Due to the computational time required, the results won’t be evaluated with the metrics proposed in the paper cited at the beginning. Instead, I’ll take the average rating of the top-5 recommendations for each user in the test set and compare it to the overall average rating (the expected value for random recommendations) and to the maximum achievable average from 5 movies per user.

This is not a rigorous measure, but it’s a good sanity check to see whether the recommendations make sense and whether the system is better than nothing.

Getting scores for the test set:

In [15]:
def generate_features_series(user,movie):
    user_feats=users_features.loc[user].as_matrix()
    movie_feats=movies_features.loc[movie].as_matrix()
    return pd.Series(np.kron(user_feats,movie_feats).astype('float64'))

X_test=ratings_test.apply(lambda x: generate_features_series(x['userId'],x['movieId']), axis=1)
ratings_test['score']=X_test.dot(formula_coeffs)

/home/david/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Adding each movie's average rating, in order to also compare the model against simply recommending the best-rated movies:

In [16]:
avg_ratings=ratings.groupby('movieId')['Rating'].mean().to_frame().rename(columns={"Rating":"AvgRating"})
ratings_test=pd.merge(ratings_test,avg_ratings,left_on='movieId',right_index=True)


Now comparing it to no model (random recommendation) and best possible recommendations (in terms of ratings):

In [17]:
print 'Average movie rating:',ratings_test.groupby('userId')['Rating'].mean().mean()
print 'Average rating for top-5 rated by each user:',ratings_test.sort_values(['userId','Rating'],ascending=False).groupby('userId')['Rating'].head(5).mean()
print 'Average rating for bottom-5 rated by each user:',ratings_test.sort_values(['userId','Rating'],ascending=True).groupby('userId')['Rating'].head(5).mean()
print 'Average rating for top-5 recommendations of best-rated movies:',ratings_test.sort_values(['userId','AvgRating'],ascending=False).groupby('userId')['Rating'].head(5).mean()
print '----------------------'
print 'Average rating for top-5 recommendations from this model:',ratings_test.sort_values(['userId','score'],ascending=False).groupby('userId')['Rating'].head(5).mean()
print 'Average rating for bottom-5 (non-)recommendations from this model:',ratings_test.sort_values(['userId','score'],ascending=True).groupby('userId')['Rating'].head(5).mean()

Average movie rating: 3.68455396497
Average rating for top-5 rated by each user: 4.96
Average rating for bottom-5 rated by each user: 1.61
Average rating for top-5 recommendations of best-rated movies: 4.416
----------------------
Average rating for top-5 recommendations from this model: 4.338
Average rating for bottom-5 (non-)recommendations from this model: 2.554


Examining some recommendations (3 per user):

In [18]:
top3=ratings_test.sort_values(['userId','score'],ascending=False).groupby('userId').head(3)
top3=top3[['userId','movieId','Rating']]
top3=pd.merge(top3,users,on='userId',how='left')
top3.rename(columns={'title':'Recommended Movie', 'genres':"Movie's Genres", 'Rating':'Rating by user'},inplace=True)
age_mapping={
1:  "Under 18",
18:  "18-24",
25:  "25-34",
35:  "35-44",
45:  "45-49",
50:  "50-55",
56:  "56+"
}
top3['Age']=top3.Age.map(lambda x: age_mapping[x])
occupations_mapping={
0:  "other or not specified",
1:  "academic/educator",
2:  "artist",
3:  "clerical/admin",
4:  "college/grad student",
5:  "customer service",
6:  "doctor/health care",
7:  "executive/managerial",
8:  "farmer",
9:  "homemaker",
10:  "K-12 student",
11:  "lawyer",
12:  "programmer",
13:  "retired",
14:  "sales/marketing",
15:  "scientist",
16:  "self-employed",
17:  "technician/engineer",
18:  "tradesman/craftsman",
19:  "unemployed",
20:  "writer"
}
top3['Occupation']=top3.Occupation.map(lambda x: occupations_mapping[x])
del top3['Zipcode']
del top3['movieId']
top3[['userId','Recommended Movie','Rating by user', 'Gender','Age','Occupation','Region',"Movie's Genres"]]

Out[18]:
userId Recommended Movie Rating by user Gender Age Occupation Region Movie's Genres
0 5938 Raiders of the Lost Ark (Indiana Jones and the... 5 M 25-34 academic/educator South Action|Adventure
1 5938 Usual Suspects, The (1995) 5 M 25-34 academic/educator South Crime|Mystery|Thriller
2 5938 North by Northwest (1959) 5 M 25-34 academic/educator South Action|Adventure|Mystery|Romance|Thriller
3 5798 Raiders of the Lost Ark (Indiana Jones and the... 4 M 35-44 other or not specified West Action|Adventure
4 5798 Saving Private Ryan (1998) 5 M 35-44 other or not specified West Action|Drama|War
5 5798 Godfather, The (1972) 5 M 35-44 other or not specified West Crime|Drama
6 5693 North by Northwest (1959) 4 F 25-34 college/grad student West Action|Adventure|Mystery|Romance|Thriller
7 5693 Schindler's List (1993) 3 F 25-34 college/grad student West Drama|War
8 5693 Notorious (1946) 3 F 25-34 college/grad student West Film-Noir|Romance|Thriller
9 5692 Shawshank Redemption, The (1994) 5 F 25-34 executive/managerial South Crime|Drama
10 5692 Schindler's List (1993) 5 F 25-34 executive/managerial South Drama|War
11 5692 Raiders of the Lost Ark (Indiana Jones and the... 4 F 25-34 executive/managerial South Action|Adventure
12 5582 Schindler's List (1993) 5 M 45-49 academic/educator Midwest Drama|War
13 5582 To Kill a Mockingbird (1962) 5 M 45-49 academic/educator Midwest Drama
14 5582 Third Man, The (1949) 5 M 45-49 academic/educator Midwest Film-Noir|Mystery|Thriller
15 5560 Sting, The (1973) 5 F 35-44 technician/engineer West Comedy|Crime
16 5560 Guess Who's Coming to Dinner (1967) 5 F 35-44 technician/engineer West Drama
17 5560 Sixth Sense, The (1999) 5 F 35-44 technician/engineer West Drama|Horror|Mystery
18 5392 Fargo (1996) 4 M 50-55 academic/educator West Comedy|Crime|Drama|Thriller
19 5392 Close Encounters of the Third Kind (1977) 5 M 50-55 academic/educator West Adventure|Drama|Sci-Fi
20 5392 American Beauty (1999) 4 M 50-55 academic/educator West Drama|Romance
21 5352 It Happened One Night (1934) 5 F 35-44 customer service Midwest Comedy|Romance
22 5352 Thin Man, The (1934) 4 F 35-44 customer service Midwest Comedy|Crime
23 5352 My Man Godfrey (1936) 5 F 35-44 customer service Midwest Comedy|Romance
24 5235 Wallace & Gromit: A Close Shave (1995) 5 M 25-34 college/grad student Middle Atlantic Animation|Children|Comedy
25 5235 Monty Python and the Holy Grail (1975) 5 M 25-34 college/grad student Middle Atlantic Adventure|Comedy|Fantasy
26 5235 Dr. Strangelove or: How I Learned to Stop Worr... 5 M 25-34 college/grad student Middle Atlantic Comedy|War
27 5203 Forrest Gump (1994) 2 F 45-49 doctor/health care New England Comedy|Drama|Romance|War
28 5203 Shakespeare in Love (1998) 5 F 45-49 doctor/health care New England Comedy|Drama|Romance
29 5203 Braveheart (1995) 5 F 45-49 doctor/health care New England Action|Drama|War
... ... ... ... ... ... ... ... ...
270 286 Raiders of the Lost Ark (Indiana Jones and the... 4 M 25-34 academic/educator Midwest Action|Adventure
271 286 Rear Window (1954) 4 M 25-34 academic/educator Midwest Mystery|Thriller
272 286 North by Northwest (1959) 5 M 25-34 academic/educator Midwest Action|Adventure|Mystery|Romance|Thriller
273 277 Princess Bride, The (1987) 4 F 35-44 academic/educator West Action|Adventure|Comedy|Fantasy|Romance
274 277 Sixth Sense, The (1999) 5 F 35-44 academic/educator West Drama|Horror|Mystery
275 277 Shakespeare in Love (1998) 4 F 35-44 academic/educator West Comedy|Drama|Romance
276 249 Shawshank Redemption, The (1994) 5 F 18-24 sales/marketing Midwest Crime|Drama
277 249 Princess Bride, The (1987) 5 F 18-24 sales/marketing Midwest Action|Adventure|Comedy|Fantasy|Romance
278 249 Usual Suspects, The (1995) 4 F 18-24 sales/marketing Midwest Crime|Mystery|Thriller
279 235 M (1931) 5 M 25-34 other or not specified UnknownOrNonUS Crime|Film-Noir|Thriller
280 235 Reservoir Dogs (1992) 5 M 25-34 other or not specified UnknownOrNonUS Crime|Mystery|Thriller
281 235 Princess Mononoke (Mononoke-hime) (1997) 4 M 25-34 other or not specified UnknownOrNonUS Action|Adventure|Animation|Drama|Fantasy
282 180 Shadow of a Doubt (1943) 4 M 45-49 programmer New England Crime|Drama|Thriller
283 180 Silence of the Lambs, The (1991) 4 M 45-49 programmer New England Crime|Horror|Thriller
284 180 One Flew Over the Cuckoo's Nest (1975) 5 M 45-49 programmer New England Drama
285 170 Dr. Strangelove or: How I Learned to Stop Worr... 5 M 25-34 lawyer Middle Atlantic Comedy|War
286 170 North by Northwest (1959) 5 M 25-34 lawyer Middle Atlantic Action|Adventure|Mystery|Romance|Thriller
287 170 Raiders of the Lost Ark (Indiana Jones and the... 5 M 25-34 lawyer Middle Atlantic Action|Adventure
288 148 Raiders of the Lost Ark (Indiana Jones and the... 5 M 50-55 technician/engineer Midwest Action|Adventure
289 148 Schindler's List (1993) 5 M 50-55 technician/engineer Midwest Drama|War
290 148 North by Northwest (1959) 5 M 50-55 technician/engineer Midwest Action|Adventure|Mystery|Romance|Thriller
291 126 Charade (1963) 5 M 18-24 homemaker West Comedy|Crime|Mystery|Romance|Thriller
293 126 Die Hard (1988) 1 M 18-24 homemaker West Action|Crime|Thriller
294 80 Schindler's List (1993) 5 M 56+ academic/educator Midwest Drama|War
295 80 One Flew Over the Cuckoo's Nest (1975) 4 M 56+ academic/educator Midwest Drama
296 80 Green Mile, The (1999) 5 M 56+ academic/educator Midwest Crime|Drama
297 46 Evil Dead II (Dead by Dawn) (1987) 5 M 18-24 unemployed Southwest Action|Comedy|Fantasy|Horror
298 46 Rosemary's Baby (1968) 5 M 18-24 unemployed Southwest Drama|Horror|Thriller
299 46 Night on Earth (1991) 1 M 18-24 unemployed Southwest Comedy|Drama

300 rows × 8 columns

Note that this was a complete-user hold-out test set: the model's parameters were not calculated using any information from these users (and might not even have used any information from the movies being recommended). However, the recommendations shown here are limited to movies that each of these users actually rated, so that we can see how they would have rated the recommendations.

As the top-5 recommendations for each user seem to have been rated highly by them, we can guess they are reasonable. They might not be rated as highly as simply recommending the overall best-rated movies, but these recommendations are personalized and can cover newer movies too (as long as they have tags).

As for implementing such a model: since it generates the same recommendations for any users with the same combination of gender, age, occupation, and region, the recommendations could be pre-computed for each of the $2 \times 7 \times 21 \times 8 = 2352$ theoretical buckets (in reality, not all of them make sense, as people of certain ages - such as under 18 - shouldn't fall into certain occupations).
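A minimal sketch of enumerating those buckets (the helper functions in the final comment are placeholders, not defined in this notebook):

```python
import itertools

# feature levels as used in this notebook (occupation codes are 0-20)
genders = ['M', 'F']
ages = ['1', '18', '25', '35', '45', '50', '56']
occupations = [str(i) for i in range(21)]
regions = ['New England', 'Middle Atlantic', 'South', 'Midwest',
           'Southwest', 'West', 'UsOther', 'UnknownOrNonUS']

# every demographic combination gets the same recommendations
buckets = list(itertools.product(genders, ages, occupations, regions))

# hypothetical cache: score every movie once per bucket, e.g.
# precomputed = {b: rank_movies(build_user_features(*b)) for b in buckets}
# (rank_movies / build_user_features are illustrative placeholders)
```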