This example notebook illustrates the usage of the poismf package for recommender systems with implicit feedback data using the Last.FM 360k dataset. The model is described in more detail in Fast Non-Bayesian Poisson Factorization for Implicit-Feedback Recommendations.
The basic idea is to take a sparse input matrix of counts $\mathbf{X}_{m,n}$, which in this case is given by the number of times each user (row in the matrix) played each song (column in the matrix), and find an approximation as the product of two non-negative lower-dimensional latent factor matrices $\mathbf{A}_{m,k}$ and $\mathbf{B}_{n,k}$ by maximizing Poisson likelihood, i.e. fit a model: $$ \mathbf{X} \sim \text{Poisson}(\mathbf{A} \mathbf{B}^T) $$
This approximation is then used to make predictions for the missing (zero-valued) entries, with the highest-predicted items for each user being the best candidates to recommend.
The package offers different optimization methods that trade off speed and quality; depending on the settings, it is usually able to find good solutions in which the latent factor matrices $\mathbf{A}$ and $\mathbf{B}$ are sparse (i.e. most entries are exactly zero).
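As a minimal illustration of the model (using NumPy only, with made-up dimensions and random data rather than the Last.FM dataset), one can simulate counts from the product of two non-negative factor matrices and rank items for a user by the predicted intensities:

```python
import numpy as np

rng = np.random.default_rng(123)
m, n, k = 4, 6, 2                       # users, items, latent dimension

# Non-negative latent factor matrices (gamma-distributed, for illustration only)
A = rng.gamma(shape=1.0, scale=1.0, size=(m, k))
B = rng.gamma(shape=1.0, scale=1.0, size=(n, k))

# Counts drawn as X ~ Poisson(A @ B^T)
X = rng.poisson(A @ B.T)

# Predicted intensities; items for user 0 ranked from best to worst
scores = A @ B.T
ranking_user0 = np.argsort(-scores[0])
```

The fitting problem goes in the opposite direction: given only the observed counts $\mathbf{X}$, recover non-negative $\mathbf{A}$ and $\mathbf{B}$ that maximize the Poisson likelihood.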
import numpy as np, pandas as pd
from scipy.sparse import coo_matrix
lfm = pd.read_table('usersha1-artmbid-artname-plays.tsv',
sep='\t', header=None,
names=['UserId','ItemId', 'Artist','Count'])
lfm.head(3)
| | UserId | ItemId | Artist | Count |
|---|---|---|---|---|
| 0 | 00000c289a1829a808ac09c00daf10bc3c4e223b | 3bd73256-3905-4f3a-97e2-8b341527f805 | betty blowtorch | 2137 |
| 1 | 00000c289a1829a808ac09c00daf10bc3c4e223b | f2fb0ff0-5679-42ec-a55c-15109ce6e320 | die Ärzte | 1099 |
| 2 | 00000c289a1829a808ac09c00daf10bc3c4e223b | b3ae82c2-e60b-4551-a76d-6620f1b456aa | melissa etheridge | 897 |
lfm = lfm.drop('Artist', axis=1)
lfm = lfm.loc[(lfm.Count > 0) & (lfm.UserId.notnull()) & (lfm.ItemId.notnull())]
lfm['UserId'] = pd.Categorical(lfm.UserId).codes
lfm['ItemId'] = pd.Categorical(lfm.ItemId).codes
lfm.head(3)
| | UserId | ItemId | Count |
|---|---|---|---|
| 0 | 0 | 37425 | 2137 |
| 1 | 0 | 152039 | 1099 |
| 2 | 0 | 112365 | 897 |
X = coo_matrix((lfm.Count, (lfm.UserId, lfm.ItemId)))
X
<358858x160112 sparse matrix of type '<class 'numpy.int64'>' with 17309518 stored elements in COOrdinate format>
This section selects a random sample of 10,000 users for testing purposes. For each of these 10,000 users, 30% of their consumed items (chosen at random) will be held out as a test set, and the recommendation models will be fit to the rest of their data plus the full data of all the other users (see the package recometrics for more details).
import recometrics
X_train, X_test, users_test = \
recometrics.split_reco_train_test(
X, split_type = "joined",
users_test_fraction = None,
max_test_users = 10000,
items_test_fraction = 0.3,
min_pos_test = 2,
min_items_pool = 10,
seed = 123
)
X_train_coo = X_train.tocoo()
print("Number of entries to fit the model: {:,}".format(X_train.data.shape[0]))
print("Number of test users: {:,}".format(users_test.shape[0]))
print("Number of entries in training data for test users: {:,}".format(
X_train[:X_test.shape[0]].data.shape[0]))
print("Number of entries in test data: {:,}".format(X_test.data.shape[0]))
Number of entries to fit the model: 17,164,027
Number of test users: 10,000
Number of entries in training data for test users: 337,427
Number of entries in test data: 145,124
The models fit here will be evaluated by typical implicit-feedback recommendation quality metrics such as precision-at-k, recall-at-k, MAP, etc.
These metrics are calculated separately for each user by treating the entries in the hold-out test set as the positive class and the entries that appear in neither the training nor the test set as the negative class, then producing predictions for all entries that were not in the training set. The idea is that models which rank highest the songs the users actually went on to listen to are better.
def print_ranking_metrics(A, B, X_train, X_test, top_n=5):
metrics = recometrics.calc_reco_metrics(
X_train[:X_test.shape[0]], X_test,
A[:X_test.shape[0]], B,
k=top_n, all_metrics=True
)
return(metrics.mean(axis=0).to_frame().T)
This section will fit and evaluate the Poisson factorization model fit with different hyperparameters.
from poismf import PoisMF
Oriented towards speed:
%%time
model_fast = PoisMF(reindex=False, method="pg", use_float=False,
k=10, niter=10, maxupd=1, l2_reg=1e9)\
.fit(X_train_coo)
CPU times: user 1min 27s, sys: 113 ms, total: 1min 27s Wall time: 6.17 s
print_ranking_metrics(model_fast.A, model_fast.B, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.05946 | 0.059488 | 0.020351 | 0.01206 | 0.035354 | 0.043764 | 0.2455 | 0.141385 | 0.952828 | 0.029469 |
Slower, but with better quality:
%%time
model_balanced = PoisMF(reindex=False, method="cg", use_float=False,
k=50, niter=30, maxupd=5, l2_reg=1e4)\
.fit(X_train_coo)
CPU times: user 1h 1min 50s, sys: 792 ms, total: 1h 1min 51s Wall time: 3min 56s
print_ranking_metrics(model_balanced.A, model_balanced.B, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.12868 | 0.128743 | 0.044663 | 0.027316 | 0.078918 | 0.09678 | 0.4691 | 0.279227 | 0.981536 | 0.070436 |
Good quality and sparse factors, but slower still:
%%time
## Note: 'maxupd' for 'tncg' means 'maxfneval'
model_good = PoisMF(reindex=False, method="tncg", use_float=True,
early_stop=False, reuse_prev=True,
k=50, niter=10, maxupd=750, l2_reg=1e3)\
.fit(X_train_coo)
CPU times: user 1h 33min 19s, sys: 1 s, total: 1h 33min 20s Wall time: 5min 59s
print_ranking_metrics(model_good.A, model_good.B, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.15106 | 0.151118 | 0.05189 | 0.033128 | 0.096668 | 0.117803 | 0.5118 | 0.318382 | 0.969582 | 0.078752 |
%%time
## Note: 'maxupd' for 'tncg' means 'maxfneval'
model_good = PoisMF(reindex=False, method="tncg", use_float=False,
early_stop=False, reuse_prev=False,
k=50, niter=10, maxupd=750, l2_reg=1e3)\
.fit(X_train_coo)
CPU times: user 3h 13min 49s, sys: 10.8 s, total: 3h 14min Wall time: 12min 22s
print_ranking_metrics(model_good.A, model_good.B, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.15616 | 0.156175 | 0.053608 | 0.033359 | 0.097392 | 0.120085 | 0.5294 | 0.319528 | 0.96798 | 0.084333 |
(In this case, it's possible to increase P@5 at the expense of AUC by decreasing the regularization parameter.)
Verifying that the obtained latent factors are indeed sparse:
model_good.A[0]
array([0.35320237, 0. , 0. , 0.07327405, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.11740585, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.10553953, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.33448084, 0. , 0. , 0.38001133, 0. ])
print("Percent of zero-valued entries in A: %.2f%%" %
float((model_good.A == 0.).mean() * 100.))
Percent of zero-valued entries in A: 82.68%
print("Percent of zero-valued entries in B: %.2f%%" %
float((model_good.B == 0.).mean() * 100.))
Percent of zero-valued entries in B: 96.13%
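Since most entries are exactly zero, the factor matrices could also be stored in a sparse format to save memory. A small sketch with a toy stand-in matrix (hypothetical values, not the actual `model_good.A`):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for a mostly-zero factor matrix like model_good.A
# (hypothetical values; the real matrix is ~82% zeros)
A = np.array([[0.35, 0. , 0. , 0.07],
              [0.  , 0. , 0. , 0.  ],
              [0.  , 0.2, 0. , 0.  ]])

A_sparse = csr_matrix(A)
# Only the nonzero entries (plus their indices) are stored:
# 3 stored values vs. 12 dense entries
```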
model_good.topN(user = 2, n = 5,
exclude = X_train[2].indices)
array([110771, 2291, 1173, 105896, 7811], dtype=uint64)
(These numbers correspond to the integer IDs of the items in the data that was passed.)
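Note that the notebook overwrote `UserId` and `ItemId` with their categorical codes, so mapping these numbers back to the original IDs would require keeping the categories from the encoding step. A minimal sketch of how that round trip works (toy IDs, not the real data):

```python
import pandas as pd

# Hypothetical item IDs standing in for the 'ItemId' column
items = pd.Series(["b3ae8", "f2fb0", "3bd73", "f2fb0"])

cat = pd.Categorical(items)
codes = cat.codes            # integer codes as used by the model
item_ids = cat.categories    # code i maps back to item_ids[i]

decoded = item_ids[codes]    # recovers the original IDs
```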
The same can be done for a new user; note that the latent factors obtained this way will differ slightly, which might affect the ranking:
model_good.topN_new((X_train[2].indices, X_train[2].data), n = 5,
exclude = X_train[2].indices)
array([104609, 110771, 71951, 149, 40616], dtype=uint64)
Predicting new (user,item) combinations:
### Predicts the pairs (3,3), (3,4), (3,11)
model_good.predict(user=[3,3,3], item=[3,4,11])
array([0. , 0.0003477, 0. ])
Obtaining latent factors for new data:
model_good.predict_factors((X_train[2].indices, X_train[2].data))
array([0. , 0. , 0. , 0. , 0. , 0. , 0.0782565 , 0. , 0. , 0.06393394, 0. , 0. , 0. , 0. , 0. , 0. , 0.12811679, 0.11782422, 0.00386281, 0. , 0. , 0.5620569 , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.23295553, 0. , 0. , 0. , 0.04802638, 0. , 0. , 0. , 0. , 0. , 0.07198239])
For comparison, the same data will also be fit with other implicit-feedback models, along with a non-personalized most-popular baseline:

from implicit.als import AlternatingLeastSquares
from implicit.bpr import BayesianPersonalizedRanking
from hpfrec import HPF ### <- Bayesian version of poismf
from cmfrec import MostPopular
Xcsr_T = X_train.T
Xcoo_T = X_train_coo.T
%%time
non_personalized = MostPopular(implicit=True).fit(X_train_coo)
CPU times: user 210 ms, sys: 44 ms, total: 254 ms Wall time: 254 ms
print_ranking_metrics(np.ones((X_test.shape[0],1)),
non_personalized.item_bias_.reshape((-1,1)),
X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.05892 | 0.058943 | 0.020196 | 0.012146 | 0.035587 | 0.043724 | 0.2406 | 0.141 | 0.952892 | 0.029633 |
%%time
ials = AlternatingLeastSquares(factors=50, regularization=0.01,
dtype=np.float64, iterations=15)
ials.fit(Xcsr_T)
WARNING:root:OpenBLAS detected. Its highly recommend to set the environment variable 'export OPENBLAS_NUM_THREADS=1' to disable its internal multithreading
CPU times: user 7min, sys: 3.88 s, total: 7min 4s Wall time: 29.8 s
print_ranking_metrics(ials.user_factors, ials.item_factors, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.20192 | 0.202035 | 0.070324 | 0.045763 | 0.13171 | 0.155549 | 0.6225 | 0.388982 | 0.980055 | 0.120809 |
%%time
bpr = BayesianPersonalizedRanking(factors=50, learning_rate=0.01,
regularization=0.01, dtype=np.float64,
iterations=100)
bpr.fit(Xcoo_T)
CPU times: user 51min 34s, sys: 4.7 s, total: 51min 38s Wall time: 3min 27s
print_ranking_metrics(bpr.user_factors, bpr.item_factors, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.10156 | 0.101598 | 0.03548 | 0.021576 | 0.061884 | 0.080852 | 0.3787 | 0.223163 | 0.950915 | 0.051051 |
%%time
hpf = HPF(k=50, verbose=False, use_float=False,
stop_crit="maxiter", maxiter=100)\
.fit(X_train_coo)
CPU times: user 2h 29min 59s, sys: 22.5 s, total: 2h 30min 21s Wall time: 13min 4s
print_ranking_metrics(hpf.Theta, hpf.Beta, X_train, X_test, 5)
| | P@5 | TP@5 | R@5 | AP@5 | TAP@5 | NDCG@5 | Hit@5 | RR@5 | ROC_AUC | PR_AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.14354 | 0.143573 | 0.049147 | 0.031008 | 0.090658 | 0.112452 | 0.5003 | 0.307433 | 0.978796 | 0.079902 |