Learning from Data: Movie Recommendation in the IPython Notebook¶

Neil D. Lawrence and the Sheffield Machine Learning Research Group¶

8th October 2014¶

This notebook has been made available as part of our Open Data Science agenda. If you want to read more about this agenda there is a position paper/blog post available on it here.

This session is about 'learning from data'. How do we take the information on the internet and make sense of it? The answer, as you might expect, is by using computers and mathematics. Luckily we also have a suite of tools to help. The first tool is a way of programming in Python that really facilitates interaction with data. It is known as the "IPython Notebook", or more recently as the "Jupyter Project".

Welcome to the IPython Notebook¶

The notebook is a great way of interacting with computers. In particular it allows me to integrate text descriptions, maths and code all together in the same place. For me, that's what my research is all about. I try to take concepts that people can describe, then I try to capture the essence of the concept in a mathematical model. Then I try and implement the model on a computer, often combining it with data, to try and do something fun, useful or, ideally, both. Today we'll be looking at recommender systems.

Is Our Map Enough? Are Our Data Enough?¶

Are two dimensions really enough to capture the complexity of humans and their artforms? Is that how shallow we are? On second thoughts, don't answer that. We would certainly like to think that we need more than two dimensions to capture our complexity.

Let's extend our books and libraries analogy further: consider how we should place books that have a historical timeframe as well as some geographical location. Do we really want books from the 2nd World War to sit alongside books from the Roman Empire? Books on the Allied invasion of Sicily in 1943 are perhaps less related to books about Carthage than those books that study the Jewish Revolt of 66-70 (in the Roman Province of Judaea---more History!). So books that relate to subjects which are closer in time should probably be stored together. However, a student of 'rebellion against empire' may also be interested in the relationship between the Jewish Revolt of 66-70 and the Indian Rebellion of 1857 (against the British Empire), nearly 1800 years later (they might also like the Star Wars movies ...). Whilst the technologies involved in these revolts would be different, they still involve people (who we argued could all be summarised by sets of numbers), and the psychology of those people is shared: a rebellious nation rising against its imperial masters, triggered by misrule with a religious and cultural background.

To capture such nuances we would need further dimensions in our latent representation. But are further dimensions justified by the amount of data we have? Can we really understand the facets of a film that only has at most three or four ratings? One answer is to collect more data to justify extra dimensions. If you have got this far, maybe you'd like to 'fight on' against the imperial misrule of data, and consider the movielens 100k data below: many more ratings from many more users!

Going Further: Play Some More¶

If you want to take this model further then you'll need more data. One possible source of data is the movielens data set. They have data sets containing up to ten million movie ratings. The few ratings we were able to collect in the class are not enough to capture the rich structure underlying these films. Imagine, for instance, that the ratings are uniformly distributed across the values 1 to 5. If you know something about information theory then you could use that to work out the maximum number of bits of information we could gain per rating.
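As a quick sanity check on that information-theory aside (a sketch, not part of the class material): a uniform distribution over five values has entropy log2(5) bits, which bounds the information a single rating can carry.

```python
import math

# If a rating is uniformly distributed over 5 possible values,
# its entropy (the maximum information per rating) is log2(5) bits.
num_values = 5
max_bits = math.log2(num_values)
print(round(max_bits, 3))  # 2.322 bits per rating
```

So even under the most generous assumption, each rating tells us less than two and a half bits about the film.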

Now we'll download the movielens 100k data and see if we can extract information about these movies.

In [ ]:
import pods
d = pods.datasets.movielens100k()
film_info = d['film_info']
Y_lens = d['Y']
Y_lens.describe()


Can you do stochastic gradient descent on this data and make the movie map? Let's start with the preprocessing.

In [ ]:
Y_lens.rating = Y_lens.rating - Y_lens.rating.mean()
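Mean-centring like this removes the global bias from the ratings before fitting the latent factors. A toy illustration (with made-up numbers, not the movielens data):

```python
import pandas as pd

# three toy ratings with mean 3.0
Y = pd.DataFrame({'rating': [1.0, 3.0, 5.0]})

# subtract the mean, exactly as done for Y_lens above
Y.rating = Y.rating - Y.rating.mean()
print(list(Y.rating))  # [-2.0, 0.0, 2.0]
```

After centring, the ratings sum to zero, so the inner products u'v only need to model deviations from the average rating.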


Now let's create the data frames for U and V.

In [ ]:
import numpy as np
import pandas as pd

q = 2  # number of latent dimensions, as in the earlier example

user_index = Y_lens.user.unique()
item_index = Y_lens.item.unique()

# initialise the latent matrices with small random values
init_u = np.random.normal(size=(len(user_index), q))*0.001
init_v = np.random.normal(size=(len(item_index), q))*0.001
U_lens = pd.DataFrame(init_u, index=user_index)
V_lens = pd.DataFrame(init_v, index=item_index)


Once again let's set the learning rate, and reset the counter.

In [ ]:
learn_rate = 0.01
counter = 0


Now, if you have a lot of time, you can run it across the movielens data!

In [ ]:
Y_lens

In [ ]:
# warning, this will take a lot longer than the example above!
# if you start it accidentally and want it to stop, use Kernel->Interrupt from the menu above.
epochs = 5
for i in range(epochs):
    # loop across the ratings
    for ind, series in Y_lens.iterrows():
        # get which film and user the rating is for, and its value
        film = series['item']
        user = series['user']
        y = series['rating']
        # get the u and v vectors out from storage
        u = U_lens.loc[user].values
        v = V_lens.loc[film].values
        # compute the update for u
        u = u + learn_rate*v*(y - np.dot(u, v))
        # compute the update for v
        v = v + learn_rate*u*(y - np.dot(u, v))
        # put u and v back in storage
        U_lens.loc[user] = u
        V_lens.loc[film] = v
        counter += 1
        if not counter % 100000:  # true whenever counter is a multiple of 100000
            # computing the full objective would take too long: subsample every 4000th rating
            obj = objective(Y_lens.loc[Y_lens.index[::4000]], U_lens, V_lens)
            print("Update:", counter, "Objective:", obj)
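
The `objective` function used here is defined earlier in the notebook. As a reminder, a minimal sketch of it, assuming it computes the sum of squared residuals between the ratings and their low-rank reconstruction (the notebook's own definition may differ in detail):

```python
import numpy as np
import pandas as pd

def objective(Y, U, V):
    """Sum of squared residuals between ratings and the inner products
    of the corresponding user and item latent vectors."""
    obj = 0.
    for ind, series in Y.iterrows():
        u = U.loc[series['user']].values
        v = V.loc[series['item']].values
        obj += (series['rating'] - np.dot(u, v))**2
    return obj

# tiny example: one user, one item, one rating
Y = pd.DataFrame({'user': [1], 'item': [10], 'rating': [2.0]})
U = pd.DataFrame([[1.0, 0.0]], index=[1])
V = pd.DataFrame([[1.0, 0.0]], index=[10])
print(objective(Y, U, V))  # (2.0 - 1.0)**2 = 1.0
```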


Now let's try plotting the new map of the movielens movies.

In [ ]:
# create a new figure axis
fig, axes = plt.subplots(figsize=(12,12))

for index in V_lens.index:
# plot the movie name if it was rated more than 2000 times.
if np.sum(Y_lens['item']==index)>2000:
axes.plot(V_lens[0][index], V_lens[1][index], 'rx')
axes.text(V_lens[0][index], V_lens[1][index], film_info['title'][index])


Doing this at Home¶

The great thing about data science is that there is nothing special about the University environment. All the tools we have used here are freely available, which means you can do all this at home! There are some web pages that help with installing Python, and there are commercial packages, such as Enthought Canopy and Anaconda, which pull everything together for you.

The movielens 100k example above was pretty slow. One way we can try to make bigger moves with each iteration is to add momentum. Do you think you can manage that? There is a description of what momentum is from a university-level CS course here. I'll consider anyone who manages to do this for work experience in my research group!
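
As a hint (a sketch only: the buffer name `mom_u` and the coefficient 0.9 are my own choices, not from the course), momentum keeps a running, exponentially decaying average of past gradients and steps along that average instead of the raw gradient:

```python
import numpy as np

learn_rate = 0.01
momentum = 0.9  # decay factor for the running gradient average

# illustrative single update for one user vector u against item vector v
u = np.array([0.1, -0.2])
v = np.array([0.3, 0.05])
y = 4.0
mom_u = np.zeros_like(u)  # momentum buffer, carried between iterations

# gradient of the squared error 0.5*(y - u.v)^2 with respect to u
grad_u = -(y - np.dot(u, v))*v
# blend the new gradient into the running average, then step along it
mom_u = momentum*mom_u + grad_u
u = u - learn_rate*mom_u
```

With `mom_u` carried between iterations (one buffer per user, and similarly for the items), consecutive updates in the same direction accumulate, giving larger effective steps than the plain stochastic gradient rule used above.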