Problem 1 (Theoretical tasks)

40 pts

Ranks and skeleton decomposition

  • What is the rank of the matrix $\begin{pmatrix}1 & 1 & 1 \\ 2 & 2 & 2 \\ 1 & 2 & 1\end{pmatrix}$? Why? Find its skeleton decomposition.

  • Find the rank of the matrix $a_{ij} = (i+j)^2$, $i,j=1,\dots,n$, and find its skeleton decomposition. Note: use the interpretation of the rank as the minimal number of summands with separated variables $i$ and $j$.
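A quick numerical sanity check of the second item (this does not replace the argument with separated variables; the size $n=6$ is arbitrary):

```python
import numpy as np

n = 6
i, j = np.meshgrid(np.arange(1, n + 1), np.arange(1, n + 1), indexing='ij')
A = (i + j) ** 2  # a_ij = (i+j)^2 = i^2 + 2*i*j + j^2: three terms with separated variables

print(np.linalg.matrix_rank(A))  # expect 3 for any n >= 3
```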

SVD

Interesting remark: matrix norms of the form $$\|A\|^{\text{Schatten}}_p = \left(\sigma_1^p(A) + \dots + \sigma_r^p(A)\right)^{1/p}$$ are called Schatten norms, where $\sigma_i(A)$ are the singular values of $A$. Important special cases are $p=2$ (the Frobenius norm), $p=\infty$ (the second norm) and $p=1$ (the nuclear norm).

  • Prove that $\|A\|_2 = \sigma_1(A)$ and $\|A\|_F = \sqrt{\sigma_1^2(A) + \dots + \sigma_r^2(A)}$. Note: the unitary invariance of the second and Frobenius norms might be helpful.

  • Find the distance between $A$ and its truncated SVD in the second and in the Frobenius norms.

  • Suppose $A^* = A$. Find the exact relation between the eigenvalues and singular values of $A$.
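The norm identities above, together with the truncated-SVD distances (the Eckart-Young theorem), can be checked numerically with numpy; this is only a sanity check, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s is sorted in descending order

# ||A||_2 = sigma_1 and ||A||_F = sqrt(sum of squared singular values)
assert np.isclose(np.linalg.norm(A, 2), s[0])
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s ** 2)))

# distance to the rank-r truncated SVD: sigma_{r+1} in the 2-norm,
# sqrt(sigma_{r+1}^2 + ... ) in the Frobenius norm
r = 2
A_r = (U[:, :r] * s[:r]) @ Vt[:r]                  # scale columns of U by s, then multiply
assert np.isclose(np.linalg.norm(A - A_r, 2), s[r])
assert np.isclose(np.linalg.norm(A - A_r, 'fro'), np.sqrt(np.sum(s[r:] ** 2)))
print('all four identities hold numerically')
```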

Eigenvalues

  • Is the matrix $\begin{pmatrix}1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 1\end{pmatrix}$ diagonalizable? Why?

  • Prove that $\lambda(A)\in \mathbb{R}$ if $A$ is Hermitian, that $\lambda(A)$ is purely imaginary if $A$ is skew-Hermitian, and that $|\lambda(A)|=1$ if $A$ is unitary.
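These three spectral properties are easy to verify numerically on random matrices before proving them (the construction of the test matrices below is our own):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))

H = (B + B.conj().T) / 2        # Hermitian part of B
K = (B - B.conj().T) / 2        # skew-Hermitian part of B
Q, _ = np.linalg.qr(B)          # Q from QR of a complex matrix is unitary

assert np.allclose(np.linalg.eigvals(H).imag, 0)     # real spectrum
assert np.allclose(np.linalg.eigvals(K).real, 0)     # purely imaginary spectrum
assert np.allclose(np.abs(np.linalg.eigvals(Q)), 1)  # eigenvalues on the unit circle
print('spectral properties confirmed numerically')
```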

Linear systems

  • Is it possible to solve a 1 000 000 $\times$ 1 000 000 dense linear system on your laptop within 1 month via LU decomposition, provided that the LU decomposition is not given in advance?

  • What is the complexity of solving a system with a matrix given by its LU or QR decomposition? Explain the answer.
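A back-of-the-envelope estimate for the first question can be scripted; the sustained rate of 50 GFLOP/s is an assumption about a typical laptop, not a measured value:

```python
# Rough cost model for dense LU, assuming the machine sustains ~50 GFLOP/s
# in a tuned LAPACK (an assumption; your hardware will differ).
n = 1_000_000
flops_lu = 2 * n**3 / 3          # LU factorization costs ~2n^3/3 flops
flops_solve = 2 * n**2           # two triangular solves cost ~2n^2 flops
gflops = 50e9                    # assumed sustained rate, flops per second

month_seconds = 30 * 24 * 3600
print(f'LU factorization: {flops_lu / gflops / month_seconds:.1f} months')
print(f'triangular solves: {flops_solve / gflops:.1f} seconds')
# memory is another obstacle: storing the matrix takes n*n*8 bytes = 8 TB in float64
```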

In [ ]:
 

Problem 2 (QR decomposition)

20 pts

  • Implement the QR factorization using Householder reflections and the modified Gram-Schmidt algorithm. Make sure that they work on a random matrix. To measure orthogonality, use $\|G - I\|$, where $G = Q^*Q$, as in Problem Set 1.

  • Compare the Gram-Schmidt (from Problem Set 1), modified Gram-Schmidt and Householder algorithms on $5\times5$, $10\times 10$ and $20\times 20$ Hilbert matrices.
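A possible skeleton for the comparison (the Householder and modified Gram-Schmidt routines themselves are yours to write; here np.linalg.qr stands in as a reference, and scipy.linalg.hilbert supplies the Hilbert matrices):

```python
import numpy as np
from scipy.linalg import hilbert

def orthogonality_error(Q):
    """||Q^T Q - I|| in the 2-norm: the loss-of-orthogonality measure."""
    G = Q.T @ Q
    return np.linalg.norm(G - np.eye(G.shape[0]), 2)

for n in (5, 10, 20):
    # plug your own Householder / MGS / GS factorizations in here;
    # np.linalg.qr (LAPACK Householder) gives a baseline to compare against
    Q, _ = np.linalg.qr(hilbert(n))
    print(n, orthogonality_error(Q))
```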

In [ ]:
 

Problem 3 (PageRank)

40 pts

Connected graph

  • First of all, create the PageRank matrix $A$ that corresponds to the following graph: Make sure that your matrix is stochastic. What is the largest eigenvalue of this matrix?
  • Implement the power method and plot the relative errors ${|\lambda_{k+1} - 1|}$ (since you know that the exact $\lambda$ is $1$) and $\frac{\|x_{k+1} - x_{k}\|}{\|x_{k+1}\|}$ (since you have no information about $x$) as a function of $k$, where $k$ is the iteration number. DO NOT FORGET TO USE LOGARITHMIC SCALE!

Disconnected graph: damping factor importance

  • Create PageRank matrix $A$ that corresponds to the following graph:

  • Run the power method and plot the relative errors ${|\lambda_{k+1} - 1|}$ and $\frac{\|x_{k+1} - x_{k}\|}{\|x_{k+1}\|}$ as a function of $k$. Why do you observe the absence of convergence?
    Hint: think about the multiplicity of $\lambda=1$

In order to avoid this problem, Larry Page and Sergey Brin proposed the following regularization technique: $$ A_d = dA + \frac{1-d}{N} \begin{pmatrix} 1 & \dots & 1 \\ \vdots & & \vdots \\ 1 & \dots & 1 \end{pmatrix}, $$ where $d\in(0,1)$ is called the damping factor (typically $d=0.85$) and $A$ is of size $N\times N$. For $A_d$ the largest eigenvalue has multiplicity 1.

Recall that computing the eigenvector of the PageRank matrix corresponding to the largest eigenvalue has the following interpretation: consider a person who starts at a random node of the graph (i.e. opens a random web page) and at each step follows one of the outgoing edges uniformly at random (i.e. opens one of the links). This person randomly walks through the graph, and the eigenvector we are looking for is exactly the stationary distribution of this walk: for each node it gives the probability of visiting that particular node. Therefore, if the walk starts in a part of the graph that is not connected to the rest, it will never reach the other part. In the regularized model the person at each step follows one of the outgoing links with probability $d$ OR jumps to a random node of the whole graph with probability $1-d$.

  • Now run the power method with $A_d$ and plot errors as a function of iteration for different $d$.
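A minimal sketch of the damped power method on a toy graph (the 3-node matrix below is made up for illustration; note that $A_d$ need not be formed explicitly, since multiplying the rank-1 correction by $x$ only requires the sum of the entries of $x$):

```python
import numpy as np

def pagerank_power(A, d=0.85, tol=1e-10, maxiter=1000):
    """Power method for the damped matrix A_d = d*A + (1-d)/N * (matrix of ones)."""
    N = A.shape[0]
    x = np.ones(N) / N
    for _ in range(maxiter):
        # A_d @ x without forming the dense rank-1 term: ones_matrix @ x = x.sum() * ones
        x_new = d * (A @ x) + (1 - d) / N * x.sum()
        x_new /= np.abs(x_new).sum()          # keep x a probability vector
        if np.abs(x_new - x).sum() < tol:
            x = x_new
            break
        x = x_new
    return x

# toy column-stochastic matrix: node 0 links to 1; node 1 links to 2; node 2 links to 0 and 1
A = np.array([[0.0, 0.0, 0.5],
              [1.0, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
x = pagerank_power(A)
print(x)  # stationary distribution of the damped walk
```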

Simple English Wiki

Let us now find the most significant articles on the Simple English Wikipedia according to the PageRank model. We provide you with the adjacency matrix of the Simple Wikipedia articles (file simple_wiki_matrix.mat; the matrix can be accessed by the key 'W') and the dictionary that maps an article id to its name on Wikipedia (file simple_wiki_dict.pickle).

  • Normalize each column of the adjacency matrix to get a matrix from the PageRank model. Check that the obtained matrix is stochastic.

  • Plot the relative errors ${|\lambda_{k+1} - 1|}$ and $\frac{\|x_{k+1} - x_{k}\|}{\|x_{k+1}\|}$ as a function of $k$ for different $d$. Note that your matrix contains a lot of zeros and is stored in a sparse format, so np.dot(A, x) will not work. However, sparse matrices have a method .dot(), so A.dot(x) works fine.

  • Print names of top-5 articles when $d=0.85$.
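Column normalization and the .dot() call can be sketched on a toy sparse matrix (the orientation of W, with columns holding out-links, is an assumption that should be checked against the provided file):

```python
import numpy as np
from scipy import sparse

# toy adjacency matrix in CSR format standing in for the 'W' matrix
W = sparse.csr_matrix(np.array([[0, 1, 1],
                                [1, 0, 0],
                                [1, 0, 0]], dtype=float))

col_sums = np.asarray(W.sum(axis=0)).ravel()
col_sums[col_sums == 0] = 1.0              # guard against dangling nodes (zero columns)
A = W @ sparse.diags(1.0 / col_sums)       # right-multiplying by a diagonal scales columns

print(np.asarray(A.sum(axis=0)).ravel())   # every column now sums to 1
x = np.ones(3) / 3
print(A.dot(x))                            # sparse matrices support .dot()
```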

In [ ]:
 

Problem 4 (Eigenfaces)

50 pts

The aim of this task is to build a face classifier. There are 40 persons in the database. Every person is represented by 10 photos with slightly different facial expressions.

  • Download the database of faces from here

  • Create training sample:

    Import the first 9 images for each face ($9\times 40$ images). Represent these pictures as a matrix $F$ with $9\times 40$ columns, where each column is a reshaped 2D picture. Note: use np.reshape to reshape a picture into a column.

  • Calculate and plot mean face. Subtract it from each column of the matrix $F$

  • Calculate the SVD of the shifted matrix $F$ and truncate the obtained representation: $F_r = U_r S_r V_r^T$.

    Here $U_r$ is a matrix with $r$ columns, a basis in the space of faces. $W_r = S_r V_r^T$ is the matrix of coefficients in the basis $U_r$. Note that every image is now represented by a small number of coefficients in the basis $U_r$.

  • Plot vectors in $U_r$ using subplots. Make sure that you get face-like images. Now you know what eigenfaces are =)

  • Import testing set which is the rest of photos. Find their coefficients in the basis $U_r$.

  • Compare the obtained vectors of coefficients to the vectors in $W_r$ using cosine similarity and classify the testing faces. As an output, give the indices of the faces that were misclassified when $r=5$.
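The whole pipeline can be rehearsed on random data standing in for the photos (the $64\times 64$ image size and the counts below are placeholders, not the actual database dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
# stand-in for the face matrix: 64x64 "images" flattened into 360 columns
F = rng.standard_normal((64 * 64, 360))

mean_face = F.mean(axis=1, keepdims=True)
F_centered = F - mean_face                  # subtract the mean face from every column

U, s, Vt = np.linalg.svd(F_centered, full_matrices=False)
r = 5
U_r = U[:, :r]                              # the "eigenface" basis
W_r = s[:r, None] * Vt[:r]                  # coefficients S_r V_r^T

# project a new (test) image onto the eigenface basis
test_image = rng.standard_normal((64 * 64, 1))
coeffs = U_r.T @ (test_image - mean_face)

# cosine similarity against every training column of W_r
sims = (coeffs.T @ W_r).ravel() / (np.linalg.norm(coeffs) * np.linalg.norm(W_r, axis=0))
print('closest training image:', sims.argmax())
```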

In [ ]:
 

Problem 5 (Building recommender systems with SVD)

50 pts

Recommender systems are gaining more and more popularity. They're used in a broad range of areas: music, movies, e-commerce, social and professional networks, and online news, to name a few. Web services like Pandora, Netflix and Amazon build and deploy their own recommender engines in order to make customers happier and generate additional revenue. Those companies, and many others, take the problem of building a high-quality recommender system very seriously. One of many examples of the great interest in this area coming from industry is the famous Netflix competition with a $1 million prize for the winners.

In this task you'll build a very simple yet powerful engine for a recommender system. Given the MovieLens 10M data, you'll implement an SVD-based model of a movie recommendation system. The SVD-based approach belongs to the family of collaborative filtering algorithms that use matrix factorization (MF). While there are many sophisticated algorithms for MF, pure SVD remains one of the top performers. The main idea behind these algorithms is to represent each user and each movie as a vector in some low-dimensional feature space. That low-dimensional space (or latent factor space) captures which features a user likes and which of them are present in a movie. Interpreting those features is a separate and hard task, and we will simply build the latent factors using SVD without focusing on their intrinsic meaning. With this model, the "likeability" of a movie for a particular user is estimated by a weighted inner product of their latent factors. The model should also respond quickly and produce recommendations right after a new user has demonstrated some preferences. Recomputing the SVD for this would take a prohibitively long time, which is why we'll use the folding-in technique for making fresh recommendations quickly.

Task 0. Preparing the data.

You're given convenience functions get_movielens_data and split_data.

  • Use these functions to download the data, load it into memory and split it into training and testing sets with the 80/20 rule (80% of the data goes into the training set and the rest into the test set).

Be aware that these sets are disjoint, i.e. users from the training set are not in the testing set and vice versa. This means that test users will be "new" (i.e. unseen before) for the trained model. In order to produce reasonable recommendations for these "new" users we will use folding-in (see Task 2).

Note: downloading the dataset may take a couple of minutes. If you already have a copy of the MovieLens 10M data on your computer, you may force the program to use it by specifying the local_file argument of the get_movielens_data function. The local_file argument should be a full path to the zip file.

In [1]:
import pandas as pd
from zipfile import ZipFile  # stdlib; pandas.io.common.ZipFile was only a re-export
from requests import get

import numpy as np
import scipy as sp
from scipy import sparse

from collections import namedtuple
import sys
In [2]:
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO 
In [3]:
from io import BytesIO  # the downloaded zip arrives as raw bytes

def get_movielens_data(local_file=None):
    '''Downloads MovieLens data, normalizes user and movie ids,
    returns data in sparse CSR format.
    '''
    if not local_file:
        print('Downloading data...')
        zip_file_url = 'http://files.grouplens.org/datasets/movielens/ml-10m.zip'
        zip_response = get(zip_file_url)
        zip_contents = BytesIO(zip_response.content)
        print('Done.')
    else:
        zip_contents = local_file

    print('Loading data into memory...')
    with ZipFile(zip_contents) as zfile:
        zdata = zfile.read('ml-10M100K/ratings.dat').decode('utf-8')
        delimiter = ';'
        zdata = zdata.replace('::', delimiter)  # makes the data compatible with the pandas c-engine
        ml_data = pd.read_csv(StringIO(zdata), sep=delimiter, header=None, engine='c',
                              names=['userid', 'movieid', 'rating', 'timestamp'],
                              usecols=['userid', 'movieid', 'rating'])

    # normalize indices to avoid gaps (ids numbered in order of appearance)
    ml_data['movieid'] = pd.factorize(ml_data['movieid'])[0]
    ml_data['userid'] = pd.factorize(ml_data['userid'])[0]

    # build sparse user-movie matrix
    data_shape = tuple(ml_data[['userid', 'movieid']].max() + 1)
    data_matrix = sp.sparse.csr_matrix((ml_data['rating'],
                                        (ml_data['userid'], ml_data['movieid'])),
                                       shape=data_shape, dtype=np.float64)

    print('Done.')
    return data_matrix
In [4]:
def split_data(data, test_ratio=0.2):
    '''Randomly splits data into training and testing datasets. Default ratio is 80%/20%.
    Returns datasets in namedtuple format for convenience. Usage:
    train_data, test_data = split_data(data_matrix)
    or
    movielens_data = split_data(data_matrix)
    and later in code: 
    do smth with movielens_data.train 
    do smth with movielens_data.test
    '''
    
    num_users = data.shape[0]
    idx = np.zeros((num_users,), dtype=bool)
    sel = np.random.choice(num_users, int(test_ratio * num_users), replace=False)  # size must be an integer
    np.put(idx, sel, True)
    
    Movielens_data = namedtuple('MovieLens10M', ['train', 'test'])
    movielens_data = Movielens_data(train=data[~idx, :], test=data[idx, :])
    return movielens_data
In [ ]:
 

Task 1. Building the core of recommender.

Build representation of users and movies in the latent factors space with help of SVD.

  • Calculate the data sparsity (i.e. the number of nonzero elements divided by the total size of the matrix)
  • Is it feasible to use regular SVD from numpy.linalg on your computer for this task?
  • Fix the rank of the approximation and compute the truncated SVD using scipy.sparse.linalg.svds.
    • Be aware that scipy returns singular values in ascending order (see the docs).
    • Sort all your SVD factors in descending order (by singular values) without breaking the result of the product (i.e. without messing up the low-rank approximation).
  • The data returned by the sparse SVD is also not contiguous in memory, which may affect performance. Use np.ascontiguousarray to fix that.
  • Plot singular values.
    • Can you tell from the graph whether the data has a low-rank structure?
    • Is it possible to estimate from the graph what SVD rank (or number of latent factors) will be sufficient for your model?
  • Pick several users at random from the training set. Calculate recommendations for these users using the truncated SVD. Compare the movies that the users rated with the top-10 recommendations produced by your latent factor model. What can you say about the produced recommendations?
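The svds-specific steps (ascending order of singular values, memory contiguity) can be sketched on a toy sparse matrix:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# toy sparse matrix standing in for the rating matrix
M = sparse.random(200, 100, density=0.05, random_state=3, format='csr')

rank = 10
u, s, vt = svds(M, k=rank)        # svds returns singular values in ascending order

# reverse to descending order consistently, keeping the product u @ diag(s) @ vt intact
order = np.argsort(s)[::-1]
u, s, vt = u[:, order], s[order], vt[order, :]

# sparse SVD output may be non-contiguous in memory; make it C-contiguous
u = np.ascontiguousarray(u)
vt = np.ascontiguousarray(vt)

print(s[:3])   # the largest singular values now come first
```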
In [ ]:
 

Task 2. Evaluating performance of the recommender.

Evaluation of the model is done by splitting test dataset into 2 subsets:

  • the user's behaviour history, which is used to produce recommendations with your trained model
  • evaluation data, considered as the ground truth and used to estimate the quality of the recommendations

Overall performance is measured by the total number of correct predictions made by the model on the test set.

Your tasks:

  • Set $N$, the size of the evaluation set, equal to 3 (for building top-3 recommendations).
  • Split the test dataset into history and evaluation subsets. The simplest way is to do it user-by-user (i.e. row-by-row) in a loop (see the pseudocode below). In each row the last $N$ rated movies are used as evaluation data and all remaining user ratings go into the history subset. The scipy methods .nonzero() or .indices might be helpful.
  • For each user from test set generate top-$N$ recommendations using the history subset and folding-in technique (described below).
    • What is the complexity of making recommendations for a new user? Compare it with calculation of full SVD.
    • Is it a good idea to use folding-in technique for all future users without recomputing SVD? Why?
  • Calculate the number of correctly predicted recommendations (a.k.a. the # of hits). This is done by counting the recommended items which are also present in the evaluation subset for the selected user. You may want to use the numpy.in1d function.
  • Report the total number of correct predictions over the full test set.

Folding-in technique

A new user can be considered as an update to the original matrix (appending a new row). Appending a row to the original matrix corresponds to appending a row to the users' latent factor matrix in the SVD decomposition. We can formulate the relation between these two updates as (see here for details and the picture above; for a single user $q = 1$): $$ p^T = x^TVS^{-1}, $$ where $p$ is the update to the latent factor matrix and $x$ is the update to the original user-movie matrix (i.e. the user's preferences or behaviour history). Then, to compute recommendations for the new user, we simply restore the part of the original matrix corresponding to the update: $$ r^T = p^TSV^T = x^TVV^T, $$ where $r$ is the recommendation vector we are looking for: $$ r = VV^Tx. $$

Note that the matrix $P = VV^T$ satisfies $P^2 = P$, i.e. $P$ is a projector. This means that our folding-in procedure can be naturally described as a projection of the user preferences onto the latent factor space.
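The folding-in formula $r = VV^Tx$ translates directly into code; the matrix sizes below are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 20))           # stand-in for the user-movie matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 5
V = Vt[:r].T                                # movie latent factors, shape (20, r)

# a new user's preference vector: 1s at the movies from the rating history
x = np.zeros(20)
x[[1, 4, 7]] = 1.0

rec = V @ (V.T @ x)                         # folding-in: project x onto the latent space

# P = V V^T is a projector: P @ P == P
P = V @ V.T
assert np.allclose(P @ P, P)

top3 = np.argsort(rec)[::-1][:3]
print('top-3 recommended movie ids:', top3)
```

Note that computing `V @ (V.T @ x)` instead of forming `P @ x` keeps the cost at $O(nr)$ per user rather than $O(n^2)$.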

Pseudocode for measuring recommender system quality

initialize total_score variable with 0
for each user in test-data:
    rating_history <-- all but the last N movies rated by user
    evaluation_set <-- the last N movies rated by user

    initialize user_preferences_vector with 1s using rating_history as indices of non-zero values
    top-N recommendations <-- folding-in applied to user_preferences_vector
    correct_predictions <-- # of intersections of recommendations and evaluation_set
    total_score <-- total_score + correct_predictions

return total_score

You may use a different implementation if you wish; the only two criteria are:

  • it should be no slower than the reference implementation
  • it should produce correct results
In [ ]:
 

Task 3. Fine-tuning your model.

  • Try to find the rank that produces the best evaluation score
    • Plot the dependency of the evaluation score on the SVD rank for all your trials in one graph
  • Report the best result and the corresponding SVD rank
  • Compare your model with a non-personalized recommender which simply recommends the top-3 movies with the highest average ratings.

Optionally: you may want to test your parameters with different data splits in order to minimize the risk of local effects. You're also free to modify your code to produce better results. Report what modifications you've made and what effect, if any, they had.

In [ ]: