### An example of Matrix Factorization used for recommendation

• Data set: Modified Joke Ratings Data from Jester
• Size: 1000 Users x 100 Jokes
In [2]:
import numpy as np
import pandas as pd

In [11]:
pd.set_option('display.max_colwidth', 120)


Out[11]:
1
0 A man visits the doctor. The doctor says "I have bad news for you.You have cancer and Alzheimer's disease". The man ...
1 This couple had an excellent relationship going until one day he came home from work to find his girlfriend packing....
2 Q. What's 200 feet long and has 4 teeth? A. The front row at a Willie Nelson Concert.
3 Q. What's the difference between a man and a toilet? A. A toilet doesn't follow you around after you use it.
4 Q. What's O. J. Simpson's Internet address? A.\tSlash slash backslash slash slash escape.
5 Bill & Hillary are on a trip back to Arkansas. They're almost out of gas so Bill pulls into a service station on the...
6 How many feminists does it take to screw in a light bulb?That's not funny.
7 Q. Did you hear about the dyslexic devil worshipper? A. He sold his soul to Santa.
8 A country guy goes into a city bar that has a dress code and the maitred' demands he wear a tie. Discouraged the guy...
9 Two cannibals are eating a clown one turns to other and says: "Does this taste funny to you?
In [34]:
def get_joke_text(jokes, id):
    return np.array(jokes)[id]

In [41]:
print(get_joke_text(jokes, 99))

["Q: What's the difference between greeting a Queen and greeting thePresident of the United  States?A: You only have to get on one knee to greet the queen."]


#### The rating matrix contains the ratings of 100 jokes by 1000 users (each row is a user profile). The ratings have been normalized to lie between 1 and 21 (a 20-point scale), with 1 being the lowest rating. A zero indicates a missing rating.

In [25]:
dataMat = pd.read_csv("http://facweb.cs.depaul.edu/mobasher/classes/csc478/data/modified_jester_data.csv", header=None)

dataMat.shape

Out[25]:
(1000, 100)
In [26]:
pd.set_option('display.max_colwidth', 40)
dataMat.head(10)

Out[26]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
0 3.18 19.79 1.34 2.84 3.48 2.50 1.15 15.17 2.02 6.24 ... 13.82 0.00 0.00 0.00 0.00 0.00 5.37 0.00 0.00 0.00
1 15.08 10.71 17.36 15.37 8.62 1.34 10.27 5.66 19.88 20.22 ... 13.82 6.05 10.71 18.86 10.81 8.86 14.06 11.34 6.68 12.07
2 0.00 0.00 0.00 0.00 20.03 20.27 20.03 20.27 0.00 0.00 ... 0.00 0.00 0.00 20.08 0.00 0.00 0.00 0.00 0.00 0.00
3 0.00 19.35 0.00 0.00 12.80 19.16 8.18 17.21 0.00 12.84 ... 0.00 0.00 0.00 11.53 0.00 0.00 0.00 0.00 0.00 0.00
4 19.50 15.61 6.83 5.61 12.36 12.60 18.04 15.61 10.56 16.73 ... 16.19 16.58 15.27 16.19 16.73 12.55 14.11 17.55 12.80 12.60
5 4.83 7.46 11.44 2.50 3.91 6.68 2.31 10.13 4.35 9.20 ... 7.46 4.11 10.32 8.04 8.82 7.65 11.05 1.92 5.95 7.55
6 0.00 0.00 0.00 0.00 19.59 1.15 18.72 19.79 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 13.33 0.00 0.00 0.00 0.00
7 17.84 14.16 20.17 4.79 2.84 9.30 20.27 12.41 5.81 6.58 ... 18.23 9.88 10.90 5.32 7.84 7.65 13.14 10.95 12.31 11.00
8 7.21 7.46 1.58 4.11 2.26 10.71 5.71 2.07 3.14 9.40 ... 15.37 10.71 15.17 10.71 10.71 10.71 10.71 10.71 7.60 6.05
9 14.01 16.15 16.15 14.01 17.41 16.15 19.93 13.52 14.01 19.16 ... 0.00 15.47 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

10 rows × 100 columns
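Because a zero stands for "not rated" rather than a real score, it helps to separate known from missing entries before computing any statistics. The following is a minimal sketch on a small made-up array; the name `toy` and its values are illustrative and are not taken from the data set.

```python
import numpy as np

# Hypothetical 3 users x 4 jokes matrix; 0 marks a missing rating
toy = np.array([[3.18, 0.00, 1.34, 20.27],
                [0.00, 0.00, 12.80, 19.16],
                [7.21, 7.46, 0.00, 4.11]])

known = toy > 0                        # True where a rating exists
ratings_per_user = known.sum(axis=1)   # number of known ratings in each row
density = known.mean()                 # fraction of cells that are rated

# Per-user mean over known ratings only (zeros are excluded)
user_means = (toy * known).sum(axis=1) / ratings_per_user

print(ratings_per_user, round(density, 3), user_means.round(2))
```

The same mask, `Ratings > 0`, is what the factorization and the error computation later in this notebook rely on when they skip missing entries.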

#### The following function uses gradient descent optimization, applied one known rating at a time, to factorize a rating matrix R into matrices P (the user feature matrix) and Q (the item feature matrix).

In [65]:
from time import time

def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):

    ### R = The user x item rating matrix (m x n)
    ### P = Initial user-factor matrix (m x k); updated in place
    ### Q = Initial item-factor matrix (n x k); updated in place
    ### K = The number of latent factors (features)
    ### steps = The number of epochs in gradient descent
    ### alpha = The learning rate for gradient descent
    ### beta = The regularization coefficient

    t_start = time()
    Q = Q.T
    for step in range(steps):
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    eij = R[i][j] - np.dot(P[i,:],Q[:,j])
                    for k in range(K):
                        ### update P and Q based on the partial derivatives
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        ### regularized squared error over the known ratings
        e = 0
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * ( pow(P[i][k],2) + pow(Q[k][j],2) )
        if e < 0.001:
            break
        ### Time is the number of seconds elapsed since the function was called
        print("Step %d of %d; Error: %0.5f; Time: %0.2f" %(step+1, steps, e, time() - t_start))
    return P, Q.T
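For reference, the error that the function accumulates and the updates it applies for each known rating can be written out as below. This is read directly off the code rather than taken from a separate derivation. Here $p_i$ is row $i$ of P, $q_j$ is row $j$ of Q, and $\mathcal{K}$ (a symbol introduced only for this note) is the set of $(i, j)$ pairs with $r_{ij} > 0$.

```latex
\begin{aligned}
e_{ij} &= r_{ij} - \sum_{k=1}^{K} p_{ik}\, q_{jk} \\
E &= \sum_{(i,j) \in \mathcal{K}} \left[ e_{ij}^{2}
     + \frac{\beta}{2} \sum_{k=1}^{K} \left( p_{ik}^{2} + q_{jk}^{2} \right) \right] \\
p_{ik} &\leftarrow p_{ik} + \alpha \left( 2\, e_{ij}\, q_{jk} - \beta\, p_{ik} \right) \\
q_{jk} &\leftarrow q_{jk} + \alpha \left( 2\, e_{ij}\, p_{ik} - \beta\, q_{jk} \right)
\end{aligned}
```

Note that in the code the update of $q_{jk}$ uses the value of $p_{ik}$ that has just been updated, and $e_{ij}$ is computed once per rating, before the loop over $k$.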

In [52]:
M = dataMat.shape[0]
N = dataMat.shape[1]
Ratings = np.array(dataMat)
K = 5
steps = 3000

In [46]:
### Initialize P and Q to random values
P = np.random.rand(M,K)
Q = np.random.rand(N,K)


#### Now let's factorize the Ratings matrix

In [56]:
from time import time
t0 = time()
fP, fQ = matrix_factorization(Ratings, P, Q, K, steps=steps)
print("done in %0.3fs." % (time() - t0))

done in 9465.043s.
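The run above took about 2.6 hours, and much of that time goes into pure-Python loops, including the innermost loop over the K factors. One possible way to reduce it is sketched below: all K factors of a rating are updated with a single vector operation, and only the known ratings are visited. This is a variant and not the function used above. The name `mf_sgd_vectorized` and the toy matrix are hypothetical, and the sketch uses the pre-update user vector in the item step, so its results will differ slightly from the original function.

```python
import numpy as np

def mf_sgd_vectorized(R, P, Q, steps=200, alpha=0.0002, beta=0.02):
    """SGD over known ratings, updating all K factors of a rating at once."""
    P, Q = P.copy(), Q.copy()             # leave the caller's arrays untouched
    users, items = np.nonzero(R > 0)      # row/column indices of known ratings
    for _ in range(steps):
        for i, j in zip(users, items):
            eij = R[i, j] - P[i] @ Q[j]   # prediction error for this rating
            p_old = P[i].copy()           # pre-update user vector for the Q step
            P[i] += alpha * (2 * eij * Q[j] - beta * P[i])
            Q[j] += alpha * (2 * eij * p_old - beta * Q[j])
    return P, Q

# Tiny made-up rating matrix (0 = missing), only to exercise the function
rng = np.random.default_rng(0)
R_toy = np.array([[5.0, 3.0, 0.0, 1.0],
                  [4.0, 0.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0, 5.0],
                  [0.0, 1.0, 5.0, 4.0]])
P0, Q0 = rng.random((4, 2)), rng.random((4, 2))
P1, Q1 = mf_sgd_vectorized(R_toy, P0, Q0, steps=2000, alpha=0.01)

mask = R_toy > 0
err_before = np.abs((P0 @ Q0.T - R_toy)[mask]).mean()
err_after = np.abs((P1 @ Q1.T - R_toy)[mask]).mean()
print("MAE before: %0.3f, after: %0.3f" % (err_before, err_after))
```

How much time this saves on the 1000 x 100 matrix has not been measured here; the per-rating loop is still in Python.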


#### We can write the P and Q factor matrices to disk for later use.

In [58]:
### Passing a file name lets np.savetxt open and close the file itself
np.savetxt("jokes_p.csv", fP, delimiter=',', fmt='%1.4f')
np.savetxt("jokes_q.csv", fQ, delimiter=',', fmt='%1.4f')
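Reading the factors back in a later session could look like the sketch below. To stay self-contained it writes a stand-in random array to a temporary directory; in this notebook the arrays would be `fP` and `fQ` and the files `jokes_p.csv` and `jokes_q.csv`. The names `P_demo` and `demo_p.csv` are placeholders.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
P_demo = rng.random((6, 5))   # stand-in for fP (users x factors)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_p.csv")
    np.savetxt(path, P_demo, delimiter=',', fmt='%1.4f')
    P_loaded = np.loadtxt(path, delimiter=',')

# Values survive the round trip only to the 4 decimals kept by fmt='%1.4f'
print(P_loaded.shape, np.abs(P_loaded - P_demo).max())
```

Since the format string keeps four decimal places, predictions computed from reloaded factors can differ slightly from those computed in memory.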


#### An individual prediction for a given user-item pair can now be obtained by computing the dot product of the user's row in the P matrix and the item's row in the Q matrix (equivalently, the item's column in Q.T).

In [66]:
### Compute the predicted rating for user 979 and joke 9
print(np.dot(fP[979],fQ[9].T))

11.672180288756982


#### We can compute all the predictions by multiplying P and Q.T. These predictions can also be saved for later use.

In [59]:
Preds = np.dot(fP,fQ.T)

In [60]:
### Passing a file name lets np.savetxt open and close the file itself
np.savetxt("jokes_predictions.csv", Preds, delimiter=',', fmt='%1.4f')
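With the full prediction matrix available, one natural use is to recommend to each user the jokes they have not yet rated, ordered by predicted rating. The helper below is a hypothetical sketch and not part of the original notebook. It takes one row of the rating matrix and the matching row of predictions, shown here with made-up numbers.

```python
import numpy as np

def top_n_unrated(ratings_row, preds_row, n=3):
    """Return indices of the n highest-predicted items the user has not rated."""
    # Hide already-rated items by giving them the lowest possible score
    scores = np.where(ratings_row > 0, -np.inf, preds_row)
    order = np.argsort(scores)[::-1]   # highest score first
    return [int(j) for j in order[:n] if np.isfinite(scores[j])]

# Made-up example: this user has rated items 0, 2 and 5
ratings_row = np.array([19.5, 0.0, 6.8, 0.0, 0.0, 12.6])
preds_row = np.array([18.1, 11.2, 7.0, 15.4, 9.9, 12.0])

print(top_n_unrated(ratings_row, preds_row, n=2))   # [3, 1]
```

In this notebook the call would be along the lines of `top_n_unrated(Ratings[u], Preds[u])`, and the returned indices could be passed to `get_joke_text` to show the joke text.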


#### To evaluate the performance of the algorithm, we will measure the Mean Absolute Error (MAE) by comparing the known ratings in the Ratings matrix with the predicted ratings from matrix factorization.

In [63]:
totCount = 0
totError = 0
for u in range(M):
    err_u = 0
    rateCount_u = 0
    for j in range(N):
        if (Ratings[u,j] > 0):   ### Only use known ratings when computing error
            rateCount_u += 1
            err_u += abs(np.dot(fP[u],fQ[j]) - Ratings[u,j])
    if rateCount_u > 0:   ### Avoid dividing by zero for a user with no ratings
        print("Mean Absolute Error for User %d = %0.3f" %(u, err_u/rateCount_u))
    totCount += rateCount_u
    totError += err_u
print()
print("Overall Mean Absolute Error = %0.3f" %(totError/totCount))

Overall Mean Absolute Error = 2.930
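The overall figure can also be computed without explicit loops by masking the known ratings. The sketch below uses a hypothetical helper, `masked_mae`, on a tiny made-up example; applied to this notebook's data it would be called as `masked_mae(Ratings, Preds)`.

```python
import numpy as np

def masked_mae(ratings, preds):
    """Mean absolute error over the cells where a rating is known (> 0)."""
    known = ratings > 0
    return np.abs(preds[known] - ratings[known]).mean()

# Made-up 2 x 2 example: only two cells carry a known rating
ratings = np.array([[10.0, 0.0],
                    [0.0, 4.0]])
preds = np.array([[12.0, 7.0],
                  [3.0, 3.0]])

print(masked_mae(ratings, preds))   # (|12 - 10| + |3 - 4|) / 2 = 1.5
```

Like the loop above, this pools all known ratings, so users with many ratings weigh more heavily than users with few.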
