import warnings,os
warnings.simplefilter('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline
from scipy.io import loadmat
In this part of the exercise, we will implement an anomaly detection algorithm to detect the anomalous behavior in server computers. The features measure the throughput (mb/s) and latency (ms) of response of each server. While our servers were operating, we collected m=307 examples of how they were behaving, and thus have an unlabeled dataset. We suspect that the vast majority of these examples are "Normal" (Non-anomalous) examples of servers operating normally, but there might also be some examples of servers acting anomalously within the dataset.
We will use a gaussian model to detect the anomalies in our dataset. On this dataset, we will fit a gaussian distribution and then find the values that have very low probability and hence can be considered anomalies.
# Loading dataset
mat=loadmat('ex8data1.mat')
X=mat['X']
print(*mat.keys(),sep='\n')
__header__ __version__ __globals__ X Xval yval
fig=sns.scatterplot(X[:,0],X[:,1],marker='x')
fig.set(xlabel='Latency (ms)',ylabel='Throughput (mb/s)',title='Original Dataset');
To perform anomaly detection, we will first need to fit a model to the data's distribution.
Given a training set, we want to estimate the gaussian distribution for each of the features. For each feature, we need to find the parameters $\mu_{i}$ and $\sigma_{i}^{2}$ that fit the data. The gaussian distribution is given by:
def mutlivariateGaussian(X,mu,sigma):
p=(1/(np.sqrt(2*np.pi*np.power(sigma,2))))*np.exp(-np.power(X-mu,2)/(2*np.power(sigma,2)))
p=np.product(p,axis=1).reshape((-1,1))
return p
We can estimate the parameters ($\mu_{i},\sigma_{i}^{2}$) of the i-th feature by using the following equations. To estimate the mean, we will use $$\mu_{i}=\frac{1}{m}\sum_{j=1}^{m}x_{i}^{(j)}$$
And for the variance, we will use: $$\sigma_{i}^{2}=\frac{1}{m}\sum_{j=1}^{m}(x_{i}^{(j)}-\mu_{i})^2$$
def estimateGaussian(X):
mu=np.mean(X,axis=0)
sigma=np.std(X,axis=0)
return (mu,sigma)
Let's look at the Gaussian distribution contours of the distribution fit to the dataset.
def plotGaussianContour(X,mu,sigma,title=None):
# Constructing grid around the min and max range of original data
temp=np.linspace(np.min(X)-5,np.max(X)+5)
[x1,x2]=np.meshgrid(temp,temp)
# Computing the Gaussian Density Probability for the grid
temp=np.array([x1.reshape(-1),x2.reshape(-1)]).T
z=mutlivariateGaussian(temp,mu,sigma).reshape(x1.shape)
# Plotting
levels=np.power(10,list(map(lambda x:float(x),list(range(-20,0,3)))))
plt.contour(x1,x2,z,levels=levels);
fig=sns.scatterplot(X[:,0],X[:,1],marker='x')
title='Gaussian Distribution Contours' if title==None else title
fig.set(xlabel='Latency (ms)',ylabel='Throughput (mb/s)',title=title);
# Estimating Gaussian parameters
mu,sigma=estimateGaussian(X)
# Computing Density Probability
p=mutlivariateGaussian(X,mu,sigma)
# Plotting
plotGaussianContour(X,mu,sigma);
Now that we have estimated the gaussian parameters, we can investigate which examples have a very high probability given this distribution and which examples have a very low probability. The low probability examples are more likely to be the anomalies in our dataset. One way to determine which examples are the anomalies is to select the threshold based on a cross validation set. In this part of the exericse, we will implement an algorithm to select the threshold using F1 score on a cross validation set.
$y=1$ corresponds to an anomalous examples and $y=0$ corresponds to a normal examples.
Xval=mat['Xval']
yval=mat['yval']
For many different values of $\varepsilon$, we will compute the resulting $F_1$ score by computing how many examples the current threshold classifies correctly and incorrectly. The $F_1$ score is computed using Precision (prec) and Recall (rec):
$$\boxed{F1=\frac{2.prec.rec}{prec+rec}}$$We can compute Precision (prec) and Recall (rec) using:
$$\boxed{prec=\frac{tp}{tp+fp}}$$$$\boxed{rec=\frac{tp}{tp+fn}}$$where,
def selectThreshold(yval,pval):
'''
Returns the threshold value of epsilon and the corresponding F1 score.
'''
bestEpsilon=bestF1=F1=0
epsilons=np.linspace(np.min(pval),np.max(pval),1000)
for epsilon in epsilons:
cvPrediction=pval<epsilon
tp=np.sum((cvPrediction==1) & (yval==1))
fp=np.sum((cvPrediction==1) & (yval==0))
fn=np.sum((cvPrediction==0) & (yval==1))
prec=rec=0
# Computing Precision
if tp+fp>0:
prec=tp/(tp+fp)
# Computing Recall
if tp+fn>0:
rec=tp/(tp+fn)
# Computing F1 Score
if prec+rec>0:
F1=(2*prec*rec)/(prec+rec)
if F1>bestF1:
bestF1=F1
bestEpsilon=epsilon
return (bestEpsilon,bestF1)
# Using sample mean and std. to compute density probability for cross validation set
pval=mutlivariateGaussian(Xval,mu,sigma)
# Calculating the threshold based on the best F1 Score
epsilon,f1=selectThreshold(yval,pval)
# Getting indices for anomalous examples
indices=np.where((p<epsilon)==True)[0]
# Plotting anomalies
title=f'Threshold : {epsilon:.5f} & F1 Score : {f1:.2f}'
plotGaussianContour(X,mu,sigma,title=title)
sns.scatterplot(X[indices,0],X[indices,1],marker='o',facecolors='none',edgecolor='r',label='Anomalies');
In this dataset, each example is described by 11 features, caputring many more properties of our compute servers.
# Loading dataset
mat=loadmat('ex8data2.mat')
X=mat['X']
print(*mat.keys(),sep='\n')
__header__ __version__ __globals__ X Xval yval
pd.DataFrame(X).sample(5)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
863 | -3.357801 | -6.063495 | 19.827681 | -2.843608 | -6.472805 | 13.165775 | -1.301804 | 11.396985 | -11.586898 | 3.164643 | 17.165776 |
377 | 2.541202 | 1.295366 | 2.816464 | -18.192452 | -5.445314 | 17.679813 | -15.325850 | 9.203726 | -3.411643 | -4.398754 | 14.293232 |
28 | 20.506733 | -1.746643 | 21.146559 | -27.537185 | -9.624690 | 24.728547 | -18.732147 | 8.330862 | -0.397158 | 9.197245 | 9.742638 |
369 | 7.149015 | -7.735763 | 6.911932 | -13.206095 | -30.845135 | -3.004759 | 2.893837 | -1.338351 | -10.686856 | -12.007350 | 7.673190 |
52 | 10.440863 | -20.122496 | 8.223160 | -2.154362 | 12.276639 | -4.383862 | -11.543852 | 3.127246 | -0.237430 | -1.694638 | 4.206813 |
# Centering and scaling the data matrix
xPCA=(X-X.mean(axis=0))/X.std(axis=0)
# Computing Covariance
C=np.cov(xPCA.T)
# Computing eigenvalues and eigenvectors
eigenValues,eigenVectors=np.linalg.eig(C)
# Sorting eigenvalues in descending order
idx=eigenValues.argsort()[::-1]
eigenValues=eigenValues[idx]
eigenVectors=eigenVectors[idx]
# Plotting elbow curve
chart=sns.lineplot(list(range(len(eigenValues))),np.cumsum(eigenValues));
chart.set(xlabel='Eigen Values',ylabel='Cumulative sum');
Looking at the graph above, we can conclude that most of the singular values matter for this dataset.
# Computing Gaussian Parameters
mu,sigma=estimateGaussian(X)
p=mutlivariateGaussian(X,mu,sigma)
# Using Cross Validation set to select a threshold
Xval=mat['Xval']
yval=mat['yval']
# Computing density probability
pval=mutlivariateGaussian(Xval,mu,sigma)
epsilon,f1=selectThreshold(yval,pval)
print(f'Epsilon : {epsilon}')
print(f'F1 Score : {f1}')
Epsilon : 1.3786074982000233e-18 F1 Score : 0.6153846153846154
Found 117 Anomalies
indices=np.where((p<epsilon)==True)[0]
pd.DataFrame(X[indices])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 15.107877 | -16.430766 | 19.707360 | -19.811888 | -1.644537 | -6.444184 | -6.121214 | 7.042283 | 7.232476 | 17.223479 | -2.956852 |
1 | 12.411706 | 3.150959 | 14.827734 | -10.482672 | -8.250082 | -7.982698 | -20.766918 | 30.689476 | 4.622547 | 12.234751 | 23.167294 |
2 | 20.946754 | 1.048170 | 8.296324 | -2.595042 | -14.061970 | 8.776611 | -21.886068 | 9.769264 | -20.071130 | 13.871906 | 4.197881 |
3 | 5.127033 | -18.517137 | 11.422480 | -28.993408 | -5.797960 | -15.989215 | -12.039753 | 31.296681 | 8.203208 | 8.035668 | 19.519680 |
4 | 11.622006 | -0.229723 | 10.005823 | -9.700729 | -14.765493 | 26.933578 | -9.299882 | 13.109171 | 8.976218 | 8.157170 | 18.663064 |
5 | -5.689178 | -9.707423 | 22.258180 | 7.890772 | -11.397641 | 34.445407 | -9.836411 | -0.179841 | -13.261052 | 15.200689 | 7.595782 |
6 | 16.452993 | -15.320026 | 17.283979 | -5.896654 | -2.940964 | 12.148073 | -3.422623 | 15.684343 | -20.903409 | -0.306900 | -8.481797 |
7 | -1.907518 | -10.191011 | 5.041437 | -31.539885 | 0.761368 | 2.671201 | 2.149600 | 11.920996 | -10.353145 | -5.472842 | -7.579770 |
8 | -2.665572 | -1.793277 | -3.312666 | -11.771151 | -22.948824 | 0.657702 | 4.040898 | 2.684116 | -4.217029 | -12.993472 | 6.657632 |
9 | -2.120839 | -22.573352 | 10.638703 | -2.064867 | 0.942871 | -11.136318 | 1.747898 | 22.794201 | -3.726747 | 4.053987 | -1.185641 |
10 | -12.692722 | -11.498905 | 0.026479 | -8.809048 | -16.233664 | -6.701840 | -10.635467 | -8.274253 | -12.428766 | -2.396440 | 9.134138 |
11 | 14.552377 | -38.593613 | 19.161298 | -11.379035 | -24.679817 | 29.394102 | -10.255991 | 4.409564 | -20.899036 | -6.997668 | 2.424330 |
12 | 3.968173 | -0.948445 | 16.390532 | 4.925583 | -30.252147 | 14.208280 | -11.775899 | 14.966458 | -10.949119 | 0.967746 | 22.231068 |
13 | 21.302835 | -22.478786 | 22.219274 | -10.333549 | -15.153760 | 22.316417 | 12.582181 | -0.541537 | -9.436337 | 10.168542 | 7.909259 |
14 | 12.981934 | 8.319031 | 19.570986 | -17.000717 | -23.493794 | 16.981319 | 9.996739 | 20.386708 | 3.952619 | -4.310215 | 6.554124 |
15 | -14.696577 | -17.209237 | 14.525584 | -20.475409 | 6.159271 | 12.879206 | -13.942417 | 4.008424 | -5.974088 | 17.808364 | 13.138762 |
16 | 6.104585 | 3.181527 | -3.318357 | -23.428609 | -14.559891 | 20.376025 | -8.434354 | 6.850325 | -17.518661 | -4.584083 | 5.415755 |
17 | 18.829449 | 1.171611 | -0.177123 | -3.604261 | -3.413298 | 19.523349 | 0.903542 | 18.084633 | -11.911412 | -7.921118 | 18.581649 |
18 | 8.511318 | -16.107225 | 29.176976 | -21.298302 | -29.855874 | 0.995455 | -7.266977 | 8.159963 | 0.985609 | 7.729459 | -12.966527 |
19 | -5.513919 | -6.838657 | 4.026831 | -7.059106 | 0.967883 | 15.205973 | 4.947855 | 5.362446 | -20.477818 | 11.958153 | 16.785018 |
20 | 19.861932 | -7.857506 | 5.643955 | -4.439531 | -0.846037 | 26.410503 | -28.741808 | 13.098772 | -14.189061 | -2.010233 | 10.575248 |
21 | -10.373139 | -33.577658 | 14.393455 | -3.373885 | -12.078383 | 1.415410 | -6.675364 | -7.337178 | -4.686000 | 3.700622 | 0.949668 |
22 | 6.766758 | -4.407579 | 10.305895 | -4.400896 | 6.754287 | 23.584760 | 14.644711 | 16.288997 | -17.195773 | 18.158690 | 2.366239 |
23 | 1.081479 | -12.690507 | 31.537406 | 9.477382 | -15.676929 | 1.106263 | -10.324035 | 4.418272 | -19.267674 | 6.174647 | 5.671273 |
24 | -15.600509 | -20.351230 | 18.334728 | 10.487489 | -16.037635 | 1.283141 | -8.879621 | 12.140937 | 4.844813 | 2.226241 | -4.970682 |
25 | 6.075045 | -23.857709 | 11.693187 | 4.866555 | -11.308274 | -14.698708 | -6.384952 | -10.841639 | -14.239166 | -9.703491 | 4.654031 |
26 | 17.438743 | -12.615899 | 0.789213 | 1.746310 | -13.324761 | -4.484684 | -8.728469 | 26.461376 | -13.228487 | 10.545975 | 8.663423 |
27 | -0.345422 | -9.868085 | 26.251767 | -25.748620 | -22.168480 | 0.075066 | -4.902324 | 2.772623 | 5.538813 | -10.652863 | 12.019222 |
28 | 8.782745 | -11.039724 | 19.792848 | -7.454397 | -11.136051 | -5.493029 | -26.636752 | 14.116266 | -9.001010 | 12.182347 | 21.790293 |
29 | 1.815232 | 6.536839 | 15.958974 | 6.231871 | 5.501488 | -4.844246 | 1.442609 | 26.645073 | -15.078784 | -14.621224 | 11.307183 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
87 | -7.107823 | -2.659451 | -9.913776 | 7.865090 | -15.062972 | -6.669139 | -10.886804 | -5.428388 | -6.145721 | -1.694927 | 36.544551 |
88 | 14.410571 | -22.477726 | 3.857208 | -2.952726 | -24.880941 | 26.884771 | -10.301260 | 14.843514 | -5.629550 | 15.834725 | 14.923468 |
89 | 2.024200 | -17.995892 | 18.120255 | 9.302085 | -20.657652 | -1.342614 | -18.167920 | -6.761627 | 1.510516 | -0.822743 | 12.133360 |
90 | 24.161473 | 2.150718 | 22.656128 | -13.834454 | 4.338249 | -5.126474 | -6.119224 | 3.414971 | 1.066510 | 9.997343 | 27.965546 |
91 | -11.113160 | -5.647850 | 23.260981 | 4.848569 | -6.399478 | 18.210016 | 13.164300 | 19.956570 | -0.345878 | 9.654331 | 10.193919 |
92 | -11.316527 | -8.600399 | 25.029113 | -2.779305 | -2.414419 | 18.946212 | 2.166333 | -5.778113 | 0.044183 | 4.849727 | 22.058928 |
93 | 10.463708 | -21.580878 | 23.946622 | 7.302745 | -22.052180 | 0.784292 | -7.572949 | 6.403820 | -1.567384 | 7.096635 | 21.298699 |
94 | -2.731858 | -1.160324 | -1.203266 | -9.844513 | -8.309690 | 27.409155 | 8.013834 | 10.136497 | -16.946208 | 5.646174 | 10.622073 |
95 | -0.343405 | -2.980590 | 3.574230 | 10.676147 | -3.662173 | 2.813027 | -21.867923 | -6.557764 | -2.575985 | 3.411594 | 16.251755 |
96 | -1.741997 | -1.487631 | 27.069244 | -7.585768 | 2.936218 | 24.616908 | 2.101370 | 41.978198 | -0.526588 | 8.414796 | -1.203203 |
97 | -9.546285 | -5.598474 | 16.340206 | -0.153478 | -14.982139 | 2.475574 | -14.791769 | 25.655002 | 2.786267 | -5.184959 | 18.274686 |
98 | 12.453219 | -4.065051 | 17.536215 | 7.543518 | 8.571374 | 20.703939 | -18.256704 | 7.374196 | -10.945239 | 0.359840 | -12.650669 |
99 | 13.419761 | 7.410299 | 23.401764 | -17.150200 | -8.987999 | 16.752096 | -7.503705 | 20.856191 | 1.770926 | 17.369485 | 14.560195 |
100 | -8.993098 | -24.137958 | 8.507993 | -28.898992 | -4.012547 | -2.715727 | -4.027505 | 6.212865 | -15.777059 | 8.403863 | -1.090873 |
101 | 7.663754 | -21.315959 | 10.085647 | -13.476459 | 1.869769 | 31.556190 | -4.207739 | 6.783527 | -9.069698 | 25.711498 | 4.205376 |
102 | -9.653924 | -10.097966 | 6.088028 | -17.224249 | -35.351822 | 2.677276 | -13.815382 | 10.687653 | -10.804356 | 5.571192 | -9.528244 |
103 | 5.591235 | 6.707069 | 25.261202 | 7.251960 | -15.335947 | 6.264382 | -1.493672 | 24.244968 | -4.643058 | 13.628094 | 9.853019 |
104 | 5.349885 | -12.919083 | 12.132417 | -30.097538 | -23.687714 | 15.293885 | -7.954287 | 16.382830 | -2.277219 | -12.926472 | 25.837356 |
105 | 11.459300 | -6.873557 | 24.064825 | 0.688340 | -14.256252 | -14.539024 | -26.296003 | 18.508660 | -6.114729 | 15.668279 | 0.323248 |
106 | -3.840912 | -12.800335 | 1.341582 | -5.303823 | -11.195030 | -0.078478 | 0.028865 | -8.083519 | -21.344629 | 10.631312 | -8.177534 |
107 | 12.687826 | -7.989794 | 15.536974 | 6.641290 | 1.062005 | 33.352865 | 6.122964 | 17.815628 | -17.525844 | -2.721404 | 3.446680 |
108 | 0.117894 | -24.108314 | 25.600026 | -2.184434 | -26.406851 | 9.524311 | -10.713197 | -1.100855 | -7.452232 | -16.072449 | -1.506024 |
109 | -14.471792 | -12.099130 | 17.744803 | -5.173237 | -13.187795 | -3.559686 | -6.379751 | -5.804498 | -10.552952 | 21.174772 | 4.976100 |
110 | -8.591807 | -7.044647 | 8.214113 | 4.041871 | -13.361453 | 8.930821 | 6.097335 | -11.421545 | -5.723956 | 12.705342 | 23.809446 |
111 | -11.788266 | -11.483627 | 12.438678 | -3.162811 | 0.121983 | 15.905273 | -14.988100 | 15.049161 | -19.249455 | -8.949271 | 16.943905 |
112 | -7.514689 | -11.633082 | 12.714199 | -9.659930 | -20.170026 | 46.574569 | -10.498804 | 20.522685 | -1.223742 | 4.478910 | -0.186369 |
113 | 8.610956 | -2.087595 | 13.609762 | -5.509457 | 4.565548 | -2.150133 | 6.652794 | 6.604923 | -0.554557 | -21.551409 | 5.522046 |
114 | 4.712890 | -5.468014 | 7.463785 | 15.156009 | -4.815854 | -10.092629 | 3.830391 | -0.360002 | -11.221351 | -3.262486 | 1.591322 |
115 | -12.120077 | -5.845913 | 28.951978 | -10.215184 | -18.070297 | 15.439734 | -4.961338 | 31.429439 | -7.406423 | 5.847000 | 20.797016 |
116 | 3.059507 | -5.926298 | 31.837423 | 0.245340 | 10.050776 | -1.997428 | 2.059449 | 12.571624 | 4.110159 | 14.815644 | 9.035330 |
117 rows × 11 columns
In this part of the exercise ,we will implement the Collaborative Filtering learnign algorithm and apply it to a dataset of movie ratings. This dataset consists of ratings on a scale of 1 to 5. This dataset has $n_u=943$ users, amd $n_m=1682$ movies. We will first write a cost function that computes the collaborative filtering objective function and gradient. After implmenting that, we will use gradient descent to learn the parameters for collaborative filtering.
The objective of collaborative filtering is to predict movie ratings for that movies that users have not yet rated.
The matrix $Y$ (num_movies X num_users) stores the ratings $y^{(i,j)}$ from 1 to 5. The matrix $R$ is an binary valued indicator matrix where $R(i,j)=1$ if user j gave a rating to movie i, and $R(i,j)=0$ otherwise.
The collaborative filtering algorithm in the setting of movie recommendation considers a set of $n$ dimensional parameter vectors $x^{(1)},...,x^{(n_m)}$ and $\theta^{(1)},...,\theta^{(n_u)}$, where the model predicts the rating for movie $i$ by user $j$ as $y^{(i,j)}=(\theta^{(j)})^Tx^{(i)}$. Given a dataset that consists of a set of ratings produced by some users on some movies, we wish to learn the parameter vectors that produce the best fit.
Given $x^{(1)},...,x^{(n_m)}$, estimate $\theta^{(1)},...,\theta^{(n_u)}$ : $$\min_{\theta^{(1)},...,\theta^{(n_u)}} \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2$$
Given $\theta^{(1)},...,\theta^{(n_u)}$, estimate $x^{(1)},...,x^{(n_m)}$ : $$\min_{x^{(1)},...,x^{(n_m)}} \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2$$
Minimizing $x^{(1)},...,x^{(n_m)}$ and $\theta^{(1)},...,\theta^{(n_u)}$ simultaneously:
$$\boxed{J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})=\frac{1}{2}\sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2}$$$$\min J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})$$$\frac{\partial{J}}{\partial{x_k^{(i)}}}=\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})\theta_k^{(j)}+\lambda x_k^{(j)}$
$x_k^{(i)}:x_k^{(i)}-\alpha\frac{\partial{J}}{\partial{x_k^{(i)}}}$
$$\boxed{x_k^{(i)}:x_k^{(i)}-\alpha[\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})\theta_k^{(j)}+\lambda x_k^{(i)}]}$$$\frac{\partial{J}}{\partial{\theta_k^{(j)}}}=\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda \theta_k^{(j)}$
$\theta_k^{(i)}:\theta_k^{(i)}-\alpha\frac{\partial{J}}{\partial{\theta_k^{(j)}}}$
$$\boxed{\theta_k^{(i)}:\theta_k^{(i)}-\alpha[\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda \theta_k^{(j)}]}$$def costFunction(params,Y,R,numUsers,numMovies,numFeatures,lmbda):
# Unrolling Parameters
Y=Y[:numMovies,:numUsers]
R=R[:numMovies,:numUsers]
params=params.reshape(-1)
X=params[:numMovies*numFeatures].reshape((numMovies,numFeatures))
theta=params[numMovies*numFeatures:].reshape((numUsers,numFeatures))
# Computing Cost
H=np.dot(X,theta.T)
J=(1/2)*np.power(np.multiply(R,H-Y),2).sum()
Jreg=(lmbda/2)*(np.power(X,2).sum()+np.power(theta,2).sum())
J=J+Jreg
# Computing Gradient
xGrad=np.dot(np.multiply(R,H-Y),theta)+lmbda*X
thetaGrad=np.dot(np.multiply(R,H-Y).T,X)+lmbda*theta
# Rolling into one variable
params=np.row_stack((xGrad.reshape((-1,1)),thetaGrad.reshape((-1,1))))
return (J,params)
def regularizedGradientDescent(initialParams,Y,R,numUsers,numMovies,numFeatures,lmbda,alpha,iterations):
jHistory=[]
params=initialParams
print('Training Collaborative Filtering...')
for i in range(iterations):
# Computing cost
J,grad=costFunction(params,Y,R,numUsers,numMovies,numFeatures,lmbda)
jHistory.append(J)
print(f'ITERATIONS : {i+1}\t\tCOST : {J:.3f}',end='\r')
# Parameters update rule
params=params-alpha*grad
df=pd.DataFrame({'Iterations':range(iterations),'Cost Function':jHistory})
sns.lineplot(data=df,x='Iterations',y='Cost Function');
return jHistory,params
Let's check our gradient from 'costFunction' function by computing the gradient using two point slope equation. (Gradient Checking)
def computeNumericalGradient(J,theta):
'''Computes the numerical gradient using two point slope equation'''
numGrad=np.zeros(theta.shape)
perturb=np.zeros(theta.shape)
epsilon=1e-4
for i in range(theta.size):
perturb[i,:]=epsilon
numGrad[i,:]=(J(theta+perturb)-J(theta-perturb))/(2*epsilon)
perturb[i,:]=0
return numGrad
def checkNNGradients(lmbda):
# Create small problem
X=np.random.randn(4,3)
theta=np.random.randn(5,3)
Y=np.dot(X,theta.T)
Y[np.random.rand(*Y.shape)>0.5]=0
R=np.zeros(Y.shape)
R[Y!=0]=1
numUsers=Y.shape[1]
numMovies=Y.shape[0]
numFeatures=theta.shape[1]
# Gradient from cost function derivative
params=np.row_stack((X.reshape((-1,1)),theta.reshape((-1,1))))
J,grad=costFunction(params,Y,R,numUsers,numMovies,numFeatures,lmbda)
# Computing numerical gradient
def cost(params):
J,grad=costFunction(params,Y,R,numUsers,numMovies,numFeatures,lmbda)
return J
numGrad=computeNumericalGradient(cost,params)
# Evaluating the norm of the differences between two solution
diff=np.linalg.norm(numGrad-grad)/np.linalg.norm(numGrad+grad)
return grad,numGrad,diff
lmbda=1.5
grad,numGrad,diff=checkNNGradients(lmbda)
print(f'DIFFERENCE : {diff}')
pd.DataFrame(data={'grad (derivative)':grad.reshape(-1),'grad (numerical)':numGrad.reshape(-1)})
DIFFERENCE : 4.8446728621542694e-12
grad (derivative) | grad (numerical) | |
---|---|---|
0 | 2.570436 | 2.570436 |
1 | -0.173870 | -0.173870 |
2 | 0.443513 | 0.443513 |
3 | -1.151438 | -1.151438 |
4 | 3.078315 | 3.078315 |
5 | 1.406108 | 1.406108 |
6 | 0.251307 | 0.251307 |
7 | 0.860391 | 0.860391 |
8 | -2.623469 | -2.623469 |
9 | -0.681626 | -0.681626 |
10 | -1.868502 | -1.868502 |
11 | 2.908619 | 2.908619 |
12 | -1.269194 | -1.269194 |
13 | -0.828064 | -0.828064 |
14 | -0.727420 | -0.727420 |
15 | 0.420340 | 0.420340 |
16 | 0.405181 | 0.405181 |
17 | -1.245199 | -1.245199 |
18 | 0.002056 | 0.002056 |
19 | 0.248965 | 0.248965 |
20 | -3.372456 | -3.372456 |
21 | 1.402044 | 1.402044 |
22 | -1.401868 | -1.401868 |
23 | 4.446794 | 4.446794 |
24 | 1.564790 | 1.564790 |
25 | 0.518527 | 0.518527 |
26 | -0.877819 | -0.877819 |
Now that we have finished the collaborative filtering cost function and gradient, we can now start training our learning algorithm to make movie recommendations for ourself. Later, we will also enter our own movie preferences, so that when the algorithm runs, we can get our own movie recommendations.
# Loading data
mat=loadmat('ex8_movies.mat')
Y=mat['Y']
R=mat['R']
numUsers=Y.shape[1]
numMovies=Y.shape[0]
numFeatures=10
print("Y : {0} X {1}".format(*Y.shape))
print("R : {0} X {1}".format(*R.shape))
Y : 1682 X 943 R : 1682 X 943
# Loading Movies
movies={'ID':[],'Movie':[],'Year':[]}
with open('movie_ids.txt') as f:
for line in f.readlines():
movies['ID'].append(int(line.split()[0]))
movies['Movie'].append(' '.join(line.split()[1:-1]))
movies['Year'].append(line.split()[-1][1:-1])
movies=pd.DataFrame(movies)
movies=movies.set_index('ID')
movies.sample(10)
Movie | Year | |
---|---|---|
ID | ||
1528 | Nowhere | 1997 |
141 | 20,000 Leagues Under the Sea | 1954 |
27 | Bad Boys | 1995 |
1060 | Adventures of Pinocchio, The | 1996 |
972 | Passion Fish | 1992 |
588 | Beauty and the Beast | 1991 |
159 | Basic Instinct | 1992 |
1454 | Angel and the Badman | 1947 |
1664 | 8 Heads in a Duffel Bag | 1997 |
239 | Sneakers | 1992 |
pd.DataFrame(Y,movies['Movie']).head(5)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 933 | 934 | 935 | 936 | 937 | 938 | 939 | 940 | 941 | 942 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Movie | |||||||||||||||||||||
Toy Story | 5 | 4 | 0 | 0 | 4 | 4 | 0 | 0 | 0 | 4 | ... | 2 | 3 | 4 | 0 | 4 | 0 | 0 | 5 | 0 | 0 |
GoldenEye | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
Four Rooms | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Get Shorty | 3 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 4 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
Copycat | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 943 columns
pd.DataFrame(R,movies['Movie']).head(5)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 933 | 934 | 935 | 936 | 937 | 938 | 939 | 940 | 941 | 942 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Movie | |||||||||||||||||||||
Toy Story | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
GoldenEye | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Four Rooms | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Get Shorty | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Copycat | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 943 columns
plt.imshow(Y,cmap='gray')
plt.xlabel('Users')
plt.ylabel('Movies');
print('Please enter the movie ratings(1-5 or 0 if not seen): \n')
userRatings=np.zeros((numMovies,1))
# Ask user ratings for set movies
indices=[1,98,7,12,55,64,66,69,183,225,355]
for index in indices:
movie=movies.loc[index]
try:
question=f"{movie['Movie']} ({movie['Year']})"
question=question.ljust(50)
rating=float(input(question))
if 1<=rating<=5:
userRatings[int(index)-1]=rating
except Exception as e:
break
Please enter the movie ratings(1-5 or 0 if not seen): Toy Story (1995) 4.1 Silence of the Lambs, The (1991) 4.5 Twelve Monkeys (1995) 1.5 Usual Suspects, The (1995) 4.7 Professional, The (1994) 3 Shawshank Redemption, The (1994) 5 While You Were Sleeping (1995) 3.5 Forrest Gump (1994) 5 Alien (1979) 3 101 Dalmatians (1996) 3.8 Sphere (1998) 0
# Adding new user ratings
Y=np.column_stack((userRatings,Y))
# Updating R with new user ratings
R=np.column_stack((np.zeros(numMovies),R))
indices=np.where(Y[:,0]!=0)[0]
R[indices,0]=1
# Normalize Ratings
Ymean=np.zeros((numMovies,1))
Ynorm=np.zeros(Y.shape)
for i in range(numMovies):
idx=np.where(R[i,:]==1)
Ymean[i]=Y[i,idx].mean()
Ynorm[i,idx]=Y[i,idx]-Ymean[i]
# Setting Initial Parameters
X=np.random.randn(numMovies,numFeatures)
theta=np.random.randn(numUsers,numFeatures)
initialParams=np.row_stack((X.reshape((-1,1)),theta.reshape((-1,1))))
# Setting Gradient descent options
lmbda=10
alpha=0.003
iterations=500
# Training Collaborative filtering
jHistory,params=regularizedGradientDescent(initialParams,Ynorm,R,numUsers,numMovies,numFeatures,lmbda,alpha,iterations)
# Unrolling the learned parameters into X and theta
X=params[:numMovies*numFeatures].reshape((numMovies,numFeatures))
theta=params[numMovies*numFeatures:].reshape((numUsers,numFeatures))
Training Collaborative Filtering... ITERATIONS : 500 COST : 38914.118
After training the model, we can now make the predictions by computing the prediction matrix.
P=np.dot(X,theta.T)
userPrediction=P[:,0]+Ymean.reshape(-1)
For Original Ratings provided by the user :
df=movies.copy()
df['User Ratings']=userRatings
df[df['User Ratings']!=0]
Movie | Year | User Ratings | |
---|---|---|---|
ID | |||
1 | Toy Story | 1995 | 4.1 |
7 | Twelve Monkeys | 1995 | 1.5 |
12 | Usual Suspects, The | 1995 | 4.7 |
55 | Professional, The | 1994 | 3.0 |
64 | Shawshank Redemption, The | 1994 | 5.0 |
66 | While You Were Sleeping | 1995 | 3.5 |
69 | Forrest Gump | 1994 | 5.0 |
98 | Silence of the Lambs, The | 1991 | 4.5 |
183 | Alien | 1979 | 3.0 |
225 | 101 Dalmatians | 1996 | 3.8 |
Top 10 Recommended Movies are :
limit=10
df=movies.copy()
df['Predicted Ratings']=userPrediction
df.sort_values(by=['Predicted Ratings'],ascending=False).head(limit)
Movie | Year | Predicted Ratings | |
---|---|---|---|
ID | |||
1201 | Marlene Dietrich: Shadow and Light | 1996 | 5.0 |
1122 | They Made Me a Criminal | 1939 | 5.0 |
1293 | Star Kid | 1997 | 5.0 |
1467 | Saint of Fort Washington, The | 1993 | 5.0 |
1599 | Someone Else's America | 1995 | 5.0 |
814 | Great Day in Harlem, A | 1994 | 5.0 |
1653 | Entertaining Angels: The Dorothy Day Story | 1996 | 5.0 |
1189 | Prefontaine | 1997 | 5.0 |
1536 | Aiqing wansui | 1994 | 5.0 |
1500 | Santa with Muscles | 1996 | 5.0 |