Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: March 20, 2020
Julia version: 1.3.1
using DataFrames, CSV, Plots, LinearAlgebra, SparseArrays
We would like to make movie recommendations based on prior user ratings. For this purpose, we will use the MovieLens dataset. There are two files. The first one contains a list of `movieId`'s and `title`'s. The second one is a list of user ratings: `userId` is the user providing the rating, `movieId` is the movie being rated, and `rating` is the rating on a $1$ to $5$ scale (allowing half-points).
dfm = CSV.read("./movielens-small-movies.csv")
first(dfm, 5)
| | movieId | title |
|---|---|---|
| | Int64 | String |
| 1 | 1 | Toy Story (1995) |
| 2 | 2 | Jumanji (1995) |
| 3 | 3 | Grumpier Old Men (1995) |
| 4 | 4 | Waiting to Exhale (1995) |
| 5 | 5 | Father of the Bride Part II (1995) |
dfr = CSV.read("./movielens-small-ratings.csv")
first(dfr, 5)
| | userId | movieId | rating |
|---|---|---|---|
| | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 4.0 |
| 2 | 1 | 3 | 4.0 |
| 3 | 1 | 6 | 4.0 |
| 4 | 1 | 44 | 5.0 |
| 5 | 1 | 47 | 5.0 |
Here's a summary of the data. There are $9742$ movies and $100836$ ratings made by $610$ users.
describe(dfm)
| | variable | mean | min | median | max |
|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any |
| 1 | movieId | 4871.5 | 1 | 4871.5 | 9742 |
| 2 | title | | '71 (2014) | | À nous la liberté (Freedom for Us) (1931) |
describe(dfr)
| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Float64 | Real | Float64 | Real | Nothing | Nothing | DataType |
| 1 | userId | 326.128 | 1 | 325.0 | 610 | | | Int64 |
| 2 | movieId | 3108.38 | 1 | 2255.0 | 9742 | | | Int64 |
| 3 | rating | 3.50156 | 0.5 | 3.5 | 5.0 | | | Float64 |
nrow(dfr)
100836
For instance, rating $1000$ was made by user:
dfr[1000,:userId]
7
who watched the movie:
dfm[dfr[1000,:movieId],:title]
"Phantom of the Opera, The (2004)"
and gave it rating:
dfr[1000,:rating]
2.0
He or she was not a fan...
On the other hand, that same user really enjoyed:
[dfm[dfr[i,:movieId],:title] for i=1:nrow(dfr)
    if dfr[i,:userId] == dfr[1000,:userId] && dfr[i,:rating] == 5]
13-element Array{String,1}:
 "Star Wars: Episode IV - A New Hope (1977)"
 "Forrest Gump (1994)"
 "Hot Shots! Part Deux (1993)"
 "Jurassic Park (1993)"
 "Silence of the Lambs, The (1991)"
 "Psycho (1960)"
 "Terminator, The (1984)"
 "Back to the Future (1985)"
 "Contact (1997)"
 "Seven Samurai (Shichinin no samurai) (1954)"
 "Planet of the Apes (1968)"
 "Naked Gun 2 1/2: The Smell of Fear, The (1991)"
 "Spirited Away (Sen to Chihiro no kamikakushi) (2001)"
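As an aside, the same query can be written with `filter` from DataFrames, selecting the matching rows of the ratings table first and then looking up the titles. The sketch below uses tiny made-up stand-ins for the two tables so it is self-contained; like the comprehension above, it relies on `movieId` coinciding with the row number in the movies table:

```julia
using DataFrames

# Hypothetical miniature versions of the two MovieLens tables:
dfm = DataFrame(movieId = 1:3, title = ["A", "B", "C"])
dfr = DataFrame(userId = [7, 7, 8], movieId = [1, 3, 2], rating = [5.0, 2.0, 5.0])

# All movies that user 7 rated 5:
user = 7
favs = filter(r -> r.userId == user && r.rating == 5, dfr)
dfm[favs.movieId, :title]
```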
We will convert the ratings data into a matrix. Each movie is rated by only a handful of users. More specifically, out of the
length(unique(dfr[:,:userId])) * length(unique(dfr[:,:movieId]))
5931640
possible ratings, we only have
nrow(dfr)
100836
or, in percentage,
nrow(dfr) / (length(unique(dfr[:,:userId])) * length(unique(dfr[:,:movieId])))
0.016999683055613623
We construct a matrix whose rows are the movies and whose columns are the users. Because this matrix will contain a lot of zeros, which we use to encode the absence of a rating, we first define it using a *sparse* array (that is, we only list the non-zero entries), and then convert it to a dense matrix:
A = Matrix(sparse(dfr[:,:movieId],dfr[:,:userId],dfr[:,:rating]));
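To see how the `sparse(I, J, V)` constructor from `SparseArrays` works, here is a minimal sketch on made-up indices and values: entry $(I[k], J[k])$ of the resulting matrix is set to $V[k]$, and all other entries are structural zeros that are never stored:

```julia
using SparseArrays

I = [1, 3, 2]        # row indices (think: movie ids)
J = [1, 1, 2]        # column indices (think: user ids)
V = [4.0, 5.0, 3.5]  # values (think: ratings)
S = sparse(I, J, V)  # 3x2 sparse matrix storing only the 3 listed entries
Matrix(S)            # dense version; unlisted entries show up as 0.0
```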
The sparsity of this dataset is an issue. Two users may have very similar tastes, but may have reviewed different movies, making it difficult to identify the similarity between them. One way to deal with the high dimensionality of this data is to assume that it has a low-dimensional structure. Here it is natural to assume that there are a small number of "idealized" movies -- think action, comedy, horror, etc. -- and a small number of "idealized" moviegoers -- think people who like only action, people who like only comedy, people who like only horror, etc. -- and that every movie or moviegoer is a linear combination of these idealized entities. Mathematically, we will see that we are looking for a low-rank approximation of our matrix $A$.
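To make the low-rank idea concrete, here is a toy sketch with entirely made-up numbers and two "idealized" genres: each movie is described by its weight on the two genres, each user by his or her affinity to them, and the resulting rating matrix has rank $2$ no matter how many movies and users we include:

```julia
using LinearAlgebra

# Hypothetical genre profiles: rows are 4 movies, columns are 2 idealized genres.
M = [5.0 0.0;   # pure action
     4.0 1.0;   # mostly action
     1.0 4.0;   # mostly comedy
     0.0 5.0]   # pure comedy

# Hypothetical user affinities: rows are the 2 genres, columns are 3 users.
U = [1.0 0.0 0.5;   # user 1 likes only action
     0.0 1.0 0.5]   # user 2 likes only comedy, user 3 likes both

A = M * U   # 4x3 rating matrix: each column is a combination of the 2 profiles
rank(A)     # 2, independent of the number of movies and users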
Parts of this topic's notebooks are based on the following references.
[Axl] S. Axler, Linear Algebra Done Right, Springer, 2015
[BHK] A. Blum, J. Hopcroft, R. Kannan, Foundations of Data Science, Cambridge University Press, 2020
[HJ] R. A. Horn, C. R. Johnson, Matrix Analysis, Cambridge University Press, 2013
[Mor] F. Morgan, Real Analysis, AMS, 2005
[TB] L. N. Trefethen, D. Bau, III, Numerical Linear Algebra, SIAM, 1997
To learn more about principal components analysis, you may want to take one of the following courses: