This example notebook illustrates the usage of the poismf package for recommender systems with implicit feedback data using the Last.FM 360k dataset. The model is described in more detail in Fast Non-Bayesian Poisson Factorization for Implicit-Feedback Recommendations.
The basic idea is to take a sparse input matrix of counts $\mathbf{X}_{m,n}$, which in this case is given by the number of times each user (row in the matrix) played each song (column in the matrix), and find an approximation as the product of two non-negative lower-dimensional latent factor matrices $\mathbf{A}_{m,k}$ and $\mathbf{B}_{n,k}$ by maximizing Poisson likelihood, i.e. fit a model: $$ \mathbf{X} \sim \text{Poisson}(\mathbf{A} \mathbf{B}^T) $$
This approximation is then used to make predictions for the missing (zero-valued) entries, with the highest-predicted items for each user being the best candidates to recommend.
The package offers different optimization methods that trade off speed and quality; depending on the settings, it is usually able to find good solutions in which the latent factor matrices $\mathbf{A}$ and $\mathbf{B}$ are sparse (i.e. most entries are exactly zero).
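As a minimal illustration of the model (using NumPy only, with made-up dimensions and random data rather than the Last.FM dataset), one can simulate counts from the product of two non-negative factor matrices and rank items for a user by the predicted intensities:

```python
import numpy as np

rng = np.random.default_rng(123)
m, n, k = 4, 6, 2                       # users, items, latent dimension

# Non-negative latent factor matrices (gamma-distributed, for illustration only)
A = rng.gamma(shape=1.0, scale=1.0, size=(m, k))
B = rng.gamma(shape=1.0, scale=1.0, size=(n, k))

# Counts drawn as X ~ Poisson(A @ B^T)
X = rng.poisson(A @ B.T)

# Predicted intensities; items for user 0 ranked from best to worst
scores = A @ B.T
ranking_user0 = np.argsort(-scores[0])
```

The fitting problem goes in the opposite direction: given only the observed counts $\mathbf{X}$, recover non-negative $\mathbf{A}$ and $\mathbf{B}$ that maximize the Poisson likelihood.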
import numpy as np, pandas as pd
from scipy.sparse import coo_matrix
lfm = pd.read_table('usersha1-artmbid-artname-plays.tsv',
sep='\t', header=None,
names=['UserId','ItemId', 'Artist','Count'])
lfm.head(3)
| | UserId | ItemId | Artist | Count |
|---|---|---|---|---|
| 0 | 00000c289a1829a808ac09c00daf10bc3c4e223b | 3bd73256-3905-4f3a-97e2-8b341527f805 | betty blowtorch | 2137 |
| 1 | 00000c289a1829a808ac09c00daf10bc3c4e223b | f2fb0ff0-5679-42ec-a55c-15109ce6e320 | die Ärzte | 1099 |
| 2 | 00000c289a1829a808ac09c00daf10bc3c4e223b | b3ae82c2-e60b-4551-a76d-6620f1b456aa | melissa etheridge | 897 |
lfm = lfm.drop('Artist', axis=1)
lfm = lfm.loc[(lfm.Count > 0) & (lfm.UserId.notnull()) & (lfm.ItemId.notnull())]
lfm['UserId'] = pd.Categorical(lfm.UserId).codes
lfm['ItemId'] = pd.Categorical(lfm.ItemId).codes
lfm.head(3)
| | UserId | ItemId | Count |
|---|---|---|---|
| 0 | 0 | 37425 | 2137 |
| 1 | 0 | 152039 | 1099 |
| 2 | 0 | 112365 | 897 |
X = coo_matrix((lfm.Count, (lfm.UserId, lfm.ItemId)))
X
<358858x160112 sparse matrix of type '<class 'numpy.int64'>' with 17309518 stored elements in COOrdinate format>
This section selects a random sample of 10,000 users for testing purposes. For each of these 10,000 users, 30% of their consumed items (chosen at random) will be held out as a test set, and the recommendation models will be fit to the rest of their data plus the full data of all the other users (see the package recometrics for more details).
import recometrics
X_train, X_test, users_test = \
recometrics.split_reco_train_test(
X, split_type = "joined",
users_test_fraction = None,
max_test_users = 10000,
items_test_fraction = 0.3,
min_pos_test = 2,
min_items_pool = 10,
seed = 123
)
X_train_coo = X_train.tocoo()
print("Number of entries to fit the model: {:,}".format(X_train.data.shape[0]))
print("Number of test users: {:,}".format(users_test.shape[0]))
print("Number of entries in training data for test users: {:,}".format(
X_train[:X_test.shape[0]].data.shape[0]))
print("Number of entries in test data: {:,}".format(X_test.data.shape[0]))
Number of entries to fit the model: 17,164,027
Number of test users: 10,000
Number of entries in training data for test users: 337,427
Number of entries in test data: 145,124
The models fit here will be evaluated by typical implicit-feedback recommendation quality metrics such as precision-at-k, recall-at-k, MAP, etc.
These metrics are calculated separately for each user by treating the entries in the hold-out test set as the positive class and the entries that appear in neither the training nor the test set as the negative class, then producing predictions for all entries that were not in the training set. The idea is that models which rank highest the songs the users actually went on to listen to are better.
def print_ranking_metrics(A, B, X_train, X_test, top_n=5):
metrics = recometrics.calc_reco_metrics(
X_train[:X_test.shape[0]], X_test,
A[:X_test.shape[0]], B,
k=top_n, all_metrics=True
)
return(metrics.mean(axis=0).to_frame().T)
This section will fit and evaluate the Poisson factorization model fit with different hyperparameters.
from poismf import PoisMF
Oriented towards speed:
%%time
model_fast = PoisMF(reindex=False, method="pg", use_float=False,
k=10, niter=10, maxupd=1, l2_reg=1e9)\
.fit(X_train_coo)
CPU times: user 1min 27s, sys: 113 ms, total: 1min 27s Wall time: 6.17 s
print_ranking_metrics(model_fast.A, model_fast.B, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.05946 | 0.059488 | 0.020351 | 0.01206 | 0.035354 | 0.043764 | 0.2455 | 0.141385 | 0.952828 | 0.029469 |
Slower, but with better quality:
%%time
model_balanced = PoisMF(reindex=False, method="cg", use_float=False,
k=50, niter=30, maxupd=5, l2_reg=1e4)\
.fit(X_train_coo)
CPU times: user 1h 1min 50s, sys: 792 ms, total: 1h 1min 51s Wall time: 3min 56s
print_ranking_metrics(model_balanced.A, model_balanced.B, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.12868 | 0.128743 | 0.044663 | 0.027316 | 0.078918 | 0.09678 | 0.4691 | 0.279227 | 0.981536 | 0.070436 |
Good quality and sparse factors, but slower still:
%%time
## Note: 'maxupd' for 'tncg' means 'maxfneval'
model_good = PoisMF(reindex=False, method="tncg", use_float=True,
early_stop=False, reuse_prev=True,
k=50, niter=10, maxupd=750, l2_reg=1e3)\
.fit(X_train_coo)
CPU times: user 1h 33min 19s, sys: 1 s, total: 1h 33min 20s Wall time: 5min 59s
print_ranking_metrics(model_good.A, model_good.B, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.15106 | 0.151118 | 0.05189 | 0.033128 | 0.096668 | 0.117803 | 0.5118 | 0.318382 | 0.969582 | 0.078752 |
%%time
## Note: 'maxupd' for 'tncg' means 'maxfneval'
model_good = PoisMF(reindex=False, method="tncg", use_float=False,
early_stop=False, reuse_prev=False,
k=50, niter=10, maxupd=750, l2_reg=1e3)\
.fit(X_train_coo)
CPU times: user 3h 13min 49s, sys: 10.8 s, total: 3h 14min Wall time: 12min 22s
print_ranking_metrics(model_good.A, model_good.B, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.15616 | 0.156175 | 0.053608 | 0.033359 | 0.097392 | 0.120085 | 0.5294 | 0.319528 | 0.96798 | 0.084333 |
(In this case, it's possible to increase P@5 at the expense of AUC by decreasing the regularization parameter.)
Verifying that the obtained latent factors are indeed sparse:
model_good.A[0]
array([0.35320237, 0. , 0. , 0.07327405, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.11740585, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.10553953, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.33448084, 0. , 0. , 0.38001133, 0. ])
print("Percent of zero-valued entries in A: %.2f%%" %
float((model_good.A == 0.).mean() * 100.))
Percent of zero-valued entries in A: 82.68%
print("Percent of zero-valued entries in B: %.2f%%" %
float((model_good.B == 0.).mean() * 100.))
Percent of zero-valued entries in B: 96.13%
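Since most entries are exactly zero, the factor matrices could also be stored in a sparse format to save memory. A small sketch with a toy stand-in matrix (hypothetical values, not the actual `model_good.A`):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for a mostly-zero factor matrix like model_good.A
# (hypothetical values; the real matrix is ~82% zeros)
A = np.array([[0.35, 0. , 0. , 0.07],
              [0.  , 0. , 0. , 0.  ],
              [0.  , 0.2, 0. , 0.  ]])

A_sparse = csr_matrix(A)
# Only the nonzero entries (plus their indices) are stored:
# 3 stored values vs. 12 dense entries
```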
model_good.topN(user = 2, n = 5,
exclude = X_train[2].indices)
array([110771, 2291, 1173, 105896, 7811], dtype=uint64)
(These numbers correspond to the integer IDs of the items in the data that was passed.)
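Note that the notebook overwrote `UserId` and `ItemId` with their categorical codes, so mapping these numbers back to the original IDs would require keeping the categories from the encoding step. A minimal sketch of how that round trip works (toy IDs, not the real data):

```python
import pandas as pd

# Hypothetical item IDs standing in for the 'ItemId' column
items = pd.Series(["b3ae8", "f2fb0", "3bd73", "f2fb0"])

cat = pd.Categorical(items)
codes = cat.codes            # integer codes as used by the model
item_ids = cat.categories    # code i maps back to item_ids[i]

decoded = item_ids[codes]    # recovers the original IDs
```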
The same can be done for a new user; note that the latent factors obtained this way will differ slightly, which might affect the ranking:
model_good.topN_new((X_train[2].indices, X_train[2].data), n = 5,
exclude = X_train[2].indices)
array([104609, 110771, 71951, 149, 40616], dtype=uint64)
Predicting new (user,item) combinations:
### Predicts the pairs (3,3), (3,4), (3,11)
model_good.predict(user=[3,3,3], item=[3,4,11])
array([0. , 0.0003477, 0. ])
Obtaining latent factors for new data:
model_good.predict_factors((X_train[2].indices, X_train[2].data))
array([0. , 0. , 0. , 0. , 0. , 0. , 0.0782565 , 0. , 0. , 0.06393394, 0. , 0. , 0. , 0. , 0. , 0. , 0.12811679, 0.11782422, 0.00386281, 0. , 0. , 0.5620569 , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.23295553, 0. , 0. , 0. , 0.04802638, 0. , 0. , 0. , 0. , 0. , 0.07198239])
For comparison, the same data will also be fit with other implicit-feedback models, along with a non-personalized most-popular baseline:

from implicit.als import AlternatingLeastSquares
from implicit.bpr import BayesianPersonalizedRanking
from hpfrec import HPF ### <- Bayesian version of poismf
from cmfrec import MostPopular
Xcsr_T = X_train.T
Xcoo_T = X_train_coo.T
%%time
non_personalized = MostPopular(implicit=True).fit(X_train_coo)
CPU times: user 210 ms, sys: 44 ms, total: 254 ms Wall time: 254 ms
print_ranking_metrics(np.ones((X_test.shape[0],1)),
non_personalized.item_bias_.reshape((-1,1)),
X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.05892 | 0.058943 | 0.020196 | 0.012146 | 0.035587 | 0.043724 | 0.2406 | 0.141 | 0.952892 | 0.029633 |
%%time
ials = AlternatingLeastSquares(factors=50, regularization=0.01,
dtype=np.float64, iterations=15)
ials.fit(Xcsr_T)
WARNING:root:OpenBLAS detected. Its highly recommend to set the environment variable 'export OPENBLAS_NUM_THREADS=1' to disable its internal multithreading
CPU times: user 7min, sys: 3.88 s, total: 7min 4s Wall time: 29.8 s
print_ranking_metrics(ials.user_factors, ials.item_factors, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.20192 | 0.202035 | 0.070324 | 0.045763 | 0.13171 | 0.155549 | 0.6225 | 0.388982 | 0.980055 | 0.120809 |
%%time
bpr = BayesianPersonalizedRanking(factors=50, learning_rate=0.01,
regularization=0.01, dtype=np.float64,
iterations=100)
bpr.fit(Xcoo_T)
CPU times: user 51min 34s, sys: 4.7 s, total: 51min 38s Wall time: 3min 27s
print_ranking_metrics(bpr.user_factors, bpr.item_factors, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.10156 | 0.101598 | 0.03548 | 0.021576 | 0.061884 | 0.080852 | 0.3787 | 0.223163 | 0.950915 | 0.051051 |
%%time
hpf = HPF(k=50, verbose=False, use_float=False,
stop_crit="maxiter", maxiter=100)\
.fit(X_train_coo)
CPU times: user 2h 29min 59s, sys: 22.5 s, total: 2h 30min 21s Wall time: 13min 4s
print_ranking_metrics(hpf.Theta, hpf.Beta, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.14354 | 0.143573 | 0.049147 | 0.031008 | 0.090658 | 0.112452 | 0.5003 | 0.307433 | 0.978796 | 0.079902 |