Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: March 20, 2020
Julia version: 1.3.1
using DataFrames, CSV, Plots, LinearAlgebra, SparseArrays
We would like to make movie recommendations based on prior user ratings. For this purpose, we will use the MovieLens dataset. There are two files. The first one contains a list of `movieId`'s and `title`'s. The second one is a list of user ratings: `userId` is the user providing the rating, `movieId` is the movie being rated, and `rating` is the rating on a $1$ to $5$ scale (allowing half-points).
dfm = CSV.read("./movielens-small-movies.csv")
first(dfm, 5)
| | movieId | title |
|---|---|---|
| | Int64 | String |
| 1 | 1 | Toy Story (1995) |
| 2 | 2 | Jumanji (1995) |
| 3 | 3 | Grumpier Old Men (1995) |
| 4 | 4 | Waiting to Exhale (1995) |
| 5 | 5 | Father of the Bride Part II (1995) |
dfr = CSV.read("./movielens-small-ratings.csv")
first(dfr, 5)
| | userId | movieId | rating |
|---|---|---|---|
| | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 4.0 |
| 2 | 1 | 3 | 4.0 |
| 3 | 1 | 6 | 4.0 |
| 4 | 1 | 44 | 5.0 |
| 5 | 1 | 47 | 5.0 |
Here's a summary of the data. There are $9742$ movies and $100836$ ratings made by $610$ users.
describe(dfm)
| | variable | mean | min | median | max |
|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any |
| 1 | movieId | 4871.5 | 1 | 4871.5 | 9742 |
| 2 | title | | '71 (2014) | | À nous la liberté (Freedom for Us) (1931) |
describe(dfr)
| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Float64 | Real | Float64 | Real | Nothing | Nothing | DataType |
| 1 | userId | 326.128 | 1 | 325.0 | 610 | | | Int64 |
| 2 | movieId | 3108.38 | 1 | 2255.0 | 9742 | | | Int64 |
| 3 | rating | 3.50156 | 0.5 | 3.5 | 5.0 | | | Float64 |
nrow(dfr)
100836
For instance, rating $1000$ was made by user:
dfr[1000,:userId]
7
who watched the movie:
dfm[dfr[1000,:movieId],:title]
"Phantom of the Opera, The (2004)"
and gave it rating:
dfr[1000,:rating]
2.0
He or she was not a fan...
On the other hand, that same user really enjoyed:
[dfm[dfr[i,:movieId],:title] for i=1:nrow(dfr)
    if dfr[i,:userId] == dfr[1000,:userId] && dfr[i,:rating] == 5]
13-element Array{String,1}:
 "Star Wars: Episode IV - A New Hope (1977)"
 "Forrest Gump (1994)"
 "Hot Shots! Part Deux (1993)"
 "Jurassic Park (1993)"
 "Silence of the Lambs, The (1991)"
 "Psycho (1960)"
 "Terminator, The (1984)"
 "Back to the Future (1985)"
 "Contact (1997)"
 "Seven Samurai (Shichinin no samurai) (1954)"
 "Planet of the Apes (1968)"
 "Naked Gun 2 1/2: The Smell of Fear, The (1991)"
 "Spirited Away (Sen to Chihiro no kamikakushi) (2001)"
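As an aside, the same query can be written with `filter` from DataFrames, selecting the matching rows of the ratings table first and then looking up the titles. The sketch below uses tiny made-up stand-ins for the two tables so it is self-contained; like the comprehension above, it relies on `movieId` coinciding with the row number in the movies table:

```julia
using DataFrames

# Hypothetical miniature versions of the two MovieLens tables:
dfm = DataFrame(movieId = 1:3, title = ["A", "B", "C"])
dfr = DataFrame(userId = [7, 7, 8], movieId = [1, 3, 2], rating = [5.0, 2.0, 5.0])

# All movies that user 7 rated 5:
user = 7
favs = filter(r -> r.userId == user && r.rating == 5, dfr)
dfm[favs.movieId, :title]
```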
We will convert the ratings data into a matrix. Each movie is rated by only a handful of users. More specifically, out of the
length(unique(dfr[:,:userId])) * length(unique(dfr[:,:movieId]))
5931640
possible ratings, we only have
nrow(dfr)
100836
or, in percentage,
nrow(dfr) / (length(unique(dfr[:,:userId])) * length(unique(dfr[:,:movieId])))
0.016999683055613623
We construct a matrix whose rows are the movies and whose columns are the users. Because this matrix will contain a lot of zeros, which we use to encode the absence of a rating, we first define it using a *sparse* array (that is, we only list the non-zero entries), and then convert it to a dense matrix:
A = Matrix(sparse(dfr[:,:movieId],dfr[:,:userId],dfr[:,:rating]));
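To see how the `sparse(I, J, V)` constructor from `SparseArrays` works, here is a minimal sketch on made-up indices and values: entry $(I[k], J[k])$ of the resulting matrix is set to $V[k]$, and all other entries are structural zeros that are never stored:

```julia
using SparseArrays

I = [1, 3, 2]        # row indices (think: movie ids)
J = [1, 1, 2]        # column indices (think: user ids)
V = [4.0, 5.0, 3.5]  # values (think: ratings)
S = sparse(I, J, V)  # 3x2 sparse matrix storing only the 3 listed entries
Matrix(S)            # dense version; unlisted entries show up as 0.0
```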
The sparsity of this dataset is an issue. Two users may have very similar tastes, but may have reviewed different movies, making it difficult to identify the similarity between them. One way to deal with the high dimensionality of this data is to assume that it has a low-dimensional structure. Here it is natural to assume that there are a small number of "idealized" movies -- think action, comedy, horror, etc. -- and a small number of "idealized" moviegoers -- think people who like only action, people who like only comedy, people who like only horror, etc. -- and that every movie or moviegoer is a linear combination of these idealized entities. Mathematically, we will see that we are looking for a low-rank approximation of our matrix $A$.
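To make the low-rank idea concrete, here is a toy sketch with entirely made-up numbers and two "idealized" genres: each movie is described by its weight on the two genres, each user by his or her affinity to them, and the resulting rating matrix has rank $2$ no matter how many movies and users we include:

```julia
using LinearAlgebra

# Hypothetical genre profiles: rows are 4 movies, columns are 2 idealized genres.
M = [5.0 0.0;   # pure action
     4.0 1.0;   # mostly action
     1.0 4.0;   # mostly comedy
     0.0 5.0]   # pure comedy

# Hypothetical user affinities: rows are the 2 genres, columns are 3 users.
U = [1.0 0.0 0.5;   # user 1 likes only action
     0.0 1.0 0.5]   # user 2 likes only comedy, user 3 likes both

A = M * U   # 4x3 rating matrix: each column is a combination of the 2 profiles
rank(A)     # 2, independent of the number of movies and users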
Parts of this topic's notebooks are based on the following references.
[Axl] S. Axler, Linear Algebra Done Right, Springer, 2015
[BHK] A. Blum, J. Hopcroft, R. Kannan, Foundations of Data Science, Cambridge University Press, 2020
[HJ] R. A. Horn, C. R. Johnson, Matrix Analysis, Cambridge University Press, 2013
[Mor] F. Morgan, Real Analysis, AMS, 2005
[TB] L. N. Trefethen, D. Bau, III, Numerical Linear Algebra, SIAM, 1997
To learn more about principal components analysis, you may want to take one of the following courses: