This IPython notebook illustrates the usage of the ctpfrec Python package for Collaborative Topic Poisson Factorization in recommender systems based on sparse count data. It uses the RetailRocket dataset, consisting of event logs (view, add to cart, purchase) from an online catalog of products, plus anonymized text descriptions of the items.
Collaborative Topic Poisson Factorization is a probabilistic model that jointly factorizes the user-item interaction matrix and the item-word matrix (the items' text descriptions as bag-of-words) into products of lower-dimensional matrices. The package can also extend this model to add user attributes in the same format as the items'.
Compared to competing methods such as BPR (Bayesian Personalized Ranking) or weighted-implicit NMF (non-probabilistic non-negative matrix factorization with squared loss), it only requires iterating over the data for which an interaction was observed, not over the data for which no interaction was observed (i.e. it doesn't iterate over items not clicked by a user). This makes it more scalable, while in general also producing better results when fit to sparse count data. The same applies to the word counts of items.
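To make the scalability point concrete, here's a minimal sketch (toy numbers, made up for illustration) of why iterating only over the observed triplets matters:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Tiny interaction matrix: 4 users x 5 items, with only 6 observed counts
rows = np.array([0, 0, 1, 2, 3, 3])
cols = np.array([1, 4, 0, 2, 2, 3])
vals = np.array([1, 3, 1, 7, 1, 4])
Y = coo_matrix((vals, (rows, cols)), shape=(4, 5))

# Poisson-likelihood models iterate over the 6 observed triplets only,
# whereas squared-loss models iterate over all 4*5 = 20 cells (zeros included)
print(Y.nnz, Y.shape[0] * Y.shape[1])  # prints: 6 20
```

On real data the gap is far larger: the RetailRocket interaction matrix has billions of cells but well under a million observed entries.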
The implementation here is based on the paper Content-based recommendations with poisson factorization (Gopalan, P.K., Charlin, L. and Blei, D., 2014).
For a similar package for explicit feedback data see also cmfrec. For Poisson factorization without side information see hpfrec.
Small note: if the TOC here is not clickable or the math symbols don't show properly, try visualizing this same notebook from nbviewer following this link.
The model produces a low-rank non-negative matrix factorization of the item-word matrix (a.k.a. bag-of-words: a matrix where each row represents an item and each column a word, with entries containing the number of times each word appeared in an item's text, ideally after some pre-processing of the words such as stemming or lemmatization) as the product of two lower-rank matrices
$$ W_{iw} \approx \Theta_{ik} \beta_{wk}^T $$

along with another low-rank matrix factorization of the user-item activity matrix (a matrix in which each entry counts how many times each user interacted with each item), which shares the same item-factor matrix as above, plus an offset driven by user activity rather than by the items' words:
$$ Y_{ui} \approx \eta_{uk} (\Theta_{ik} + \epsilon_{ik})^T $$

These matrices are assumed to come from a generative process as follows:

$$ \Theta_{ik} \sim \text{Gamma}(a, b) \qquad \beta_{wk} \sim \text{Gamma}(c, d) \qquad W_{iw} \sim \text{Poisson}(\Theta_i \beta_w^T) $$

$$ \eta_{uk} \sim \text{Gamma}(e, f) \qquad \epsilon_{ik} \sim \text{Gamma}(g, h) \qquad Y_{ui} \sim \text{Poisson}(\eta_u (\Theta_i + \epsilon_i)^T) $$

(Where $W$ is the item-word count matrix, $Y$ is the user-item interaction matrix, $k$ is the number of latent factors, $u$ is the number of users, $i$ is the number of items, and $w$ is the number of words)
The model is fit using mean-field variational inference with coordinate ascent. For more details see the paper in the references.
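As a toy illustration of the generative process above (a minimal sketch with made-up dimensions and arbitrary Gamma hyperparameters, not the package's internal code):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, n_words, k = 50, 40, 30, 5

# Gamma priors on all latent factors (hyperparameters here are arbitrary)
Theta = rng.gamma(0.3, 1.0, size=(n_items, k))  # item topic intensities
Beta = rng.gamma(0.3, 1.0, size=(n_words, k))   # per-topic word loadings
Eta = rng.gamma(0.3, 1.0, size=(n_users, k))    # user preferences
Eps = rng.gamma(0.3, 1.0, size=(n_items, k))    # item offsets not explained by words

# Observed counts are Poisson draws from the low-rank rates
W = rng.poisson(Theta @ Beta.T)         # item-word counts (bag-of-words)
Y = rng.poisson(Eta @ (Theta + Eps).T)  # user-item interaction counts
print(W.shape, Y.shape)
```

Note how the item factors $\Theta$ appear in both Poisson rates: this is what ties the text side and the interaction side of the model together.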
Reading and concatenating the data. First the event logs:
import numpy as np, pandas as pd
events = pd.read_csv("events.csv")
events.head()
| | timestamp | visitorid | event | itemid | transactionid |
|---|---|---|---|---|---|
| 0 | 1433221332117 | 257597 | view | 355908 | NaN |
| 1 | 1433224214164 | 992329 | view | 248676 | NaN |
| 2 | 1433221999827 | 111016 | view | 318965 | NaN |
| 3 | 1433221955914 | 483717 | view | 253185 | NaN |
| 4 | 1433221337106 | 951259 | view | 367447 | NaN |
events.event.value_counts()
view 2664312
addtocart 69332
transaction 22457
Name: event, dtype: int64
In order to put all user-item interactions on one scale, I will arbitrarily assign values as follows:

* view → 1
* addtocart → 3
* transaction → 3

Thus, if a user views an item, that (user, item) pair will have value = 1; if she later adds it to cart and purchases it, it will have value = 7 (plus any other views of the same item), and so on.

The reasoning behind this scale is that the distributions of counts and of summed counts still follow a nice exponential-looking shape with these values, but different values might give better results in the models fit to them.
%matplotlib inline
equiv = {
'view':1,
'addtocart':3,
'transaction':3
}
events['count']=events.event.map(equiv)
events.groupby('visitorid')['count'].sum().value_counts().hist(bins=200)
<matplotlib.axes._subplots.AxesSubplot at 0x7f9b3d5d2780>
events = events.groupby(['visitorid','itemid'])['count'].sum().to_frame().reset_index()
events.rename(columns={'visitorid':'UserId', 'itemid':'ItemId', 'count':'Count'}, inplace=True)
events.head()
| | UserId | ItemId | Count |
|---|---|---|---|
| 0 | 0 | 67045 | 1 |
| 1 | 0 | 285930 | 1 |
| 2 | 0 | 357564 | 1 |
| 3 | 1 | 72028 | 1 |
| 4 | 2 | 216305 | 2 |
Now creating a train and test split. For simplicity, and in order to make a fair comparison with a model that doesn't use item descriptions, I will try to take only users that had >= 3 items in the training data, and items that had >= 3 users.
Given the lack of user attributes and the fact that it will be compared later to a model without side information, the test set will only have users from the training data, but it's also possible to use user attributes if they follow the same format as the items', in which case the model can also recommend items to new users.
In order to compare it later to a model without items' text, I will also filter out the test set to have only items that were in the training set. This is however not a model limitation, as it can also recommend items that have descriptions but no user interactions.
from sklearn.model_selection import train_test_split
events_train, events_test = train_test_split(events, test_size=.2, random_state=1)
del events
## In order to find users and items with at least 3 interactions each,
## it's easier and faster to use a simple heuristic that first filters according to one criterion,
## then according to the other, and repeats.
## Finding a real subset of the data in which each item has strictly >= 3 users,
## and each user has strictly >= 3 items, is a harder graph partitioning or optimization
## problem. For a similar example of finding such subsets see also:
## http://nbviewer.ipython.org/github/david-cortes/datascienceprojects/blob/master/optimization/dataset_splitting.ipynb
users_filter_out = events_train.groupby('UserId')['ItemId'].size()
users_filter_out = np.array(users_filter_out.index[users_filter_out < 3])
items_filter_out = events_train.loc[~np.isin(events_train.UserId, users_filter_out)]\
                               .groupby('ItemId')['UserId'].size()
items_filter_out = np.array(items_filter_out.index[items_filter_out < 3])
users_filter_out = events_train.loc[~np.isin(events_train.ItemId, items_filter_out)]\
                               .groupby('UserId')['ItemId'].size()
users_filter_out = np.array(users_filter_out.index[users_filter_out < 3])

events_train = events_train.loc[~np.isin(events_train.UserId.values, users_filter_out)]
events_train = events_train.loc[~np.isin(events_train.ItemId.values, items_filter_out)]

events_test = events_test.loc[np.isin(events_test.UserId.values, events_train.UserId.values)]
events_test = events_test.loc[np.isin(events_test.ItemId.values, events_train.ItemId.values)]
print(events_train.shape)
print(events_test.shape)
(381963, 3)
(68490, 3)
Now processing the text descriptions of the items:
iteminfo = pd.read_csv("item_properties_part1.csv")
iteminfo2 = pd.read_csv("item_properties_part2.csv")
iteminfo = pd.concat([iteminfo, iteminfo2], ignore_index=True)
iteminfo.head()
| | timestamp | itemid | property | value |
|---|---|---|---|---|
| 0 | 1435460400000 | 460429 | categoryid | 1338 |
| 1 | 1441508400000 | 206783 | 888 | 1116713 960601 n277.200 |
| 2 | 1439089200000 | 395014 | 400 | n552.000 639502 n720.000 424566 |
| 3 | 1431226800000 | 59481 | 790 | n15360.000 |
| 4 | 1431831600000 | 156781 | 917 | 828513 |
The items' descriptions contain many fields and hold a mixture of words and numbers. The numeric values, as per the documentation, are prefixed with an "n" and have three-digit decimal precision - I will exclude them here since this model is insensitive to numeric attributes such as price. The words are already lemmatized, and since we only have their IDs, it's not possible to do any further pre-processing on them.
Although the documentation doesn't say anything about it, looking at the contents and the lengths of the different fields, I will assume here that field $283$ is the product title and field $888$ is the product description. I will simply concatenate them to obtain an overall item text, but there might be better ways of doing this (such as assigning different IDs to the same word depending on whether it appears in the title or the body, or multiplying the title counts by some number, etc.)
As the descriptions vary over time, I will only take the most recent version for each item:
iteminfo = iteminfo.loc[iteminfo.property.isin(('888','283'))]
iteminfo = iteminfo.loc[iteminfo.groupby(['itemid','property'])['timestamp'].idxmax()]
iteminfo.reset_index(drop=True, inplace=True)
iteminfo.head()
| | timestamp | itemid | property | value |
|---|---|---|---|---|
| 0 | 1431226800000 | 0 | 283 | 66094 372274 478989 |
| 1 | 1433041200000 | 0 | 888 | 478989 |
| 2 | 1435460400000 | 1 | 283 | 513325 1020281 1204938 172646 72261 30603 8980... |
| 3 | 1442113200000 | 1 | 888 | 172646 1154859 |
| 4 | 1431226800000 | 2 | 283 | 822092 325894 504272 147366 343631 648485 n600... |
Note that for simplicity I am completely ignoring the categories (these are easily incorporated, e.g. by adding a count of +1 for each category to which an item belongs), important factors such as the price, and all the other fields.
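As a side note, the category counts mentioned above could be folded into the bag-of-words along these lines (a hypothetical sketch: the `cats` frame and the ID offset are made up, and category IDs would need to be offset past the word-ID range so they don't collide with word IDs):

```python
import pandas as pd

# Hypothetical 'categoryid' rows as they appear in the item properties files
cats = pd.DataFrame({'itemid': [0, 1, 1], 'categoryid': [1338, 1338, 205]})

n_words = 500000  # assumed upper bound on word IDs, to avoid collisions
cat_counts = pd.DataFrame({
    'ItemId': cats['itemid'],
    'WordId': cats['categoryid'] + n_words,  # categories become pseudo-words
    'Count': 1                               # +1 per category the item belongs to
})
# then: bag_of_words = pd.concat([bag_of_words, cat_counts], ignore_index=True)
print(cat_counts)
```

Weighting the category pseudo-words differently (e.g. a count larger than 1) would be another knob worth experimenting with.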
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import coo_matrix
import re
def concat_fields(x):
    ## join the title and description fields into one text
    return " ".join(x)

class NonNumberTokenizer(object):
    def __init__(self):
        pass
    def __call__(self, txt):
        ## keep tokens starting with a digit (hashed word IDs),
        ## discarding the "n"-prefixed numeric values
        return [i for i in txt.split(" ") if re.search(r"^\d", i)]
iteminfo = iteminfo.groupby('itemid')['value'].agg(lambda x: concat_fields(x))
t = CountVectorizer(tokenizer=NonNumberTokenizer(), stop_words=None,
dtype=np.int32, strip_accents=None, lowercase=False)
bag_of_words = t.fit_transform(iteminfo)
bag_of_words = coo_matrix(bag_of_words)
bag_of_words = pd.DataFrame({
'ItemId' : iteminfo.index[bag_of_words.row],
'WordId' : bag_of_words.col,
'Count' : bag_of_words.data
})
del iteminfo
bag_of_words.head()
| | ItemId | WordId | Count |
|---|---|---|---|
| 0 | 0 | 194496 | 2 |
| 1 | 0 | 164052 | 1 |
| 2 | 0 | 245598 | 1 |
| 3 | 1 | 44188 | 1 |
| 4 | 1 | 286671 | 1 |
In this case, I will not filter it down to only the items that were in the training set, as the other items can still be used to obtain better latent factors.
Fitting the model - note that I'm using some enhancements over the original version in the paper, passed as arguments to the class constructor: a decreasing step size, standardized item word counts, and initialization of $\Theta$ and $\beta$ through HPF.

I'll also be fitting two slightly different models: one that takes (and can make recommendations for) all the items for which there are either descriptions or user clicks, and another that uses all the items with descriptions to initialize the item-related parameters but discards the ones without clicks (it can only make recommendations for items that users have clicked).

For more information about the parameters and what they do, see the online documentation.
print(events_train.shape)
print(events_test.shape)
print(bag_of_words.shape)
(381963, 3)
(68490, 3)
(7676561, 3)
%%time
from ctpfrec import CTPF
recommender_all_items = CTPF(k=70, step_size=lambda x: 1-1/np.sqrt(x+1),
standardize_items=True, initialize_hpf=True, reindex=True,
missing_items='include', allow_inconsistent_math=True, random_seed=1)
recommender_all_items.fit(counts_df=events_train.copy(), words_df=bag_of_words.copy())
*****************************************
Collaborative Topic Poisson Factorization
*****************************************

Number of users: 65913
Number of items: 418301
Number of words: 342260
Latent factors to use: 70

Initializing parameters...
Initializing Theta and Beta through HPF...

**********************************
Hierarchical Poisson Factorization
**********************************

Number of users: 417053
Number of items: 342260
Latent factors to use: 70

Initializing parameters...
Allocating Phi matrix...
Initializing optimization procedure...
Iteration 10 | Norm(Theta_{10} - Theta_{0}): 3373.40234
Iteration 20 | Norm(Theta_{20} - Theta_{10}): 13.27755
Iteration 30 | Norm(Theta_{30} - Theta_{20}): 11.13662
Iteration 40 | Norm(Theta_{40} - Theta_{30}): 5.30947
Iteration 50 | Norm(Theta_{50} - Theta_{40}): 3.23760
Iteration 60 | Norm(Theta_{60} - Theta_{50}): 2.57951
Iteration 70 | Norm(Theta_{70} - Theta_{60}): 1.99546
Iteration 80 | Norm(Theta_{80} - Theta_{70}): 1.91506
Iteration 90 | Norm(Theta_{90} - Theta_{80}): 1.49374
Iteration 100 | Norm(Theta_{100} - Theta_{90}): 1.17536
Optimization finished
Final log-likelihood: -54256333
Final RMSE: 2.4187
Minutes taken (optimization part): 23.7
**********************************

Allocating intermediate matrices...
Initializing optimization procedure...
Iteration 10 | train llk: -6305341 | train rmse: 2.8694
Iteration 20 | train llk: -6248204 | train rmse: 2.8681
Iteration 30 | train llk: -6228858 | train rmse: 2.8675
Iteration 40 | train llk: -6220805 | train rmse: 2.8672
Iteration 50 | train llk: -6212324 | train rmse: 2.8670
Iteration 60 | train llk: -6212101 | train rmse: 2.8670
Optimization finished
Final log-likelihood: -6212101
Final RMSE: 2.8670
Minutes taken (optimization part): 15.4

Producing Python dictionaries...
CPU times: user 5h 10min 38s, sys: 1min 39s, total: 5h 12min 18s
Wall time: 39min 46s
%%time
recommender_clicked_items_only = CTPF(k=70, step_size=lambda x: 1-1/np.sqrt(x+1),
standardize_items=True, initialize_hpf=True, reindex=True,
missing_items='exclude', allow_inconsistent_math=True, random_seed=1)
recommender_clicked_items_only.fit(counts_df=events_train.copy(), words_df=bag_of_words.copy())
*****************************************
Collaborative Topic Poisson Factorization
*****************************************
/home/david_cortes_rivera/ctpfrec/ctpfrec.py:463: UserWarning: Some words are associated only with items that are in 'words_df' but not in 'counts_df'. These will be used to initialize Beta but will be excluded from the final model. If you still wish to include them in the model, use 'missing_items='include''. For information about which words are used by the model, see the attribute 'word_mapping_'. warnings.warn(msg)
Number of users: 65913
Number of items: 39578
Number of words: 67980
Latent factors to use: 70

Initializing parameters...
Initializing Theta and Beta through HPF...

**********************************
Hierarchical Poisson Factorization
**********************************

Number of users: 417053
Number of items: 342260
Latent factors to use: 70

Initializing parameters...
Allocating Phi matrix...
Initializing optimization procedure...
Iteration 10 | Norm(Theta_{10} - Theta_{0}): 3373.40234
Iteration 20 | Norm(Theta_{20} - Theta_{10}): 13.27888
Iteration 30 | Norm(Theta_{30} - Theta_{20}): 11.13438
Iteration 40 | Norm(Theta_{40} - Theta_{30}): 5.31399
Iteration 50 | Norm(Theta_{50} - Theta_{40}): 3.23850
Iteration 60 | Norm(Theta_{60} - Theta_{50}): 2.54416
Iteration 70 | Norm(Theta_{70} - Theta_{60}): 1.98683
Iteration 80 | Norm(Theta_{80} - Theta_{70}): 1.91646
Iteration 90 | Norm(Theta_{90} - Theta_{80}): 1.50169
Iteration 100 | Norm(Theta_{100} - Theta_{90}): 1.18380
Optimization finished
Final log-likelihood: -54259927
Final RMSE: 2.4187
Minutes taken (optimization part): 23.7
**********************************

Allocating intermediate matrices...
Initializing optimization procedure...
Iteration 10 | train llk: -5006436 | train rmse: 2.8536
Iteration 20 | train llk: -4944714 | train rmse: 2.8482
Iteration 30 | train llk: -4924733 | train rmse: 2.8460
Iteration 40 | train llk: -4926419 | train rmse: 2.8454
Optimization finished
Final log-likelihood: -4926419
Final RMSE: 2.8454
Minutes taken (optimization part): 1.9

Producing Python dictionaries...
CPU times: user 3h 24min 4s, sys: 38.5 s, total: 3h 24min 42s
Wall time: 25min 59s
Most of the time here was spent fitting the model to items that no user in the training set had clicked. With a random initialization instead, this model would have taken a lot less time to fit (there would be only a fraction of the items - compare the times spent in each procedure above), but the results are slightly worse.
Disclaimer: this notebook was run on a Google Cloud server with a Skylake CPU using 8 cores, and memory usage tops out at around 6GB of RAM for the first model (including all the objects loaded before). On a desktop computer it would take a bit longer to fit.
There are many different metrics to evaluate recommendation quality in implicit datasets, but all of them have their drawbacks. The idea of this notebook is to illustrate the package usage and not to introduce and compare evaluation metrics, so I will only perform some common sense checks on the test data.
For implementations of evaluation metrics for implicit recommendations see other packages such as lightFM.
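As an aside, such ranking metrics can be sketched in a few lines - e.g. precision@k over hypothetical recommended and test-set item IDs (this is not one of the checks performed below, just an illustration):

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that appear in the test set."""
    topk = recommended[:k]
    return len(set(topk) & set(relevant)) / k

# Hypothetical item IDs: two of the top 5 recommendations were in the test set
recs = [10, 4, 7, 99, 3, 25]
test_items = {4, 3, 50}
print(precision_at_k(recs, test_items, k=5))  # prints: 0.4
```

The drawback alluded to above applies here too: an unclicked recommendation is not necessarily a bad one, it may simply never have been shown to the user.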
As some common sense checks, the predictions should:

* Be higher on average for the (user, item) combinations observed in the test set than for random (user, item) combinations.
* Discriminate between observed and random combinations (AUC above 0.5).
* Correlate positively with the interaction counts in the test set.
* Follow a reasonable right-skewed distribution.

Here I'll check these four conditions:
events_test['Predicted'] = recommender_all_items.predict(user=events_test.UserId, item=events_test.ItemId)
events_test['RandomItem'] = np.random.choice(events_train.ItemId.unique(), size=events_test.shape[0])
events_test['PredictedRandom'] = recommender_all_items.predict(user=events_test.UserId,
item=events_test.RandomItem)
print("Average prediction for combinations in test set: ", events_test.Predicted.mean())
print("Average prediction for random combinations: ", events_test.PredictedRandom.mean())
Average prediction for combinations in test set:  0.017780766
Average prediction for random combinations:  0.0047758827
from sklearn.metrics import roc_auc_score
was_clicked = np.r_[np.ones(events_test.shape[0]), np.zeros(events_test.shape[0])]
score_model = np.r_[events_test.Predicted.values, events_test.PredictedRandom.values]
roc_auc_score(was_clicked[~np.isnan(score_model)], score_model[~np.isnan(score_model)])
0.7079527323667897
np.corrcoef(events_test.Count[~events_test.Predicted.isnull()], events_test.Predicted[~events_test.Predicted.isnull()])[0,1]
0.12031331307638801
import matplotlib.pyplot as plt
%matplotlib inline
_ = plt.hist(events_test.Predicted, bins=200)
plt.xlim(0,5)
plt.show()
events_test['Predicted'] = recommender_clicked_items_only.predict(user=events_test.UserId, item=events_test.ItemId)
events_test['PredictedRandom'] = recommender_clicked_items_only.predict(user=events_test.UserId,
item=events_test.RandomItem)
print("Average prediction for combinations in test set: ", events_test.Predicted.mean())
print("Average prediction for random combinations: ", events_test.PredictedRandom.mean())
Average prediction for combinations in test set:  0.025673132
Average prediction for random combinations:  0.008127485
was_clicked = np.r_[np.ones(events_test.shape[0]), np.zeros(events_test.shape[0])]
score_model = np.r_[events_test.Predicted.values, events_test.PredictedRandom.values]
roc_auc_score(was_clicked, score_model)
0.6907211476157746
np.corrcoef(events_test.Count, events_test.Predicted)[0,1]
0.06974015808183695
_ = plt.hist(events_test.Predicted, bins=200)
plt.xlim(0,5)
plt.show()
A natural benchmark against which to compare this model is a Poisson factorization model without any item side information - here I'll do the comparison with a Hierarchical Poisson Factorization model, using the same metrics as above:
%%time
from hpfrec import HPF
recommender_no_sideinfo = HPF(k=70)
recommender_no_sideinfo.fit(events_train.copy())
**********************************
Hierarchical Poisson Factorization
**********************************

Number of users: 65913
Number of items: 39578
Latent factors to use: 70

Initializing parameters...
Allocating Phi matrix...
Initializing optimization procedure...
Iteration 10 | train llk: -4635584 | train rmse: 2.8502
Iteration 20 | train llk: -4548912 | train rmse: 2.8397
Iteration 30 | train llk: -4512693 | train rmse: 2.8336
Iteration 40 | train llk: -4492286 | train rmse: 2.8297
Iteration 50 | train llk: -4476969 | train rmse: 2.8287
Iteration 60 | train llk: -4464443 | train rmse: 2.8282
Iteration 70 | train llk: -4454397 | train rmse: 2.8282
Iteration 80 | train llk: -4448200 | train rmse: 2.8280
Iteration 90 | train llk: -4442528 | train rmse: 2.8275
Iteration 100 | train llk: -4437068 | train rmse: 2.8272
Optimization finished
Final log-likelihood: -4437068
Final RMSE: 2.8272
Minutes taken (optimization part): 1.5

CPU times: user 12min 22s, sys: 2.19 s, total: 12min 24s
Wall time: 1min 34s
events_test_comp = events_test.copy()
events_test_comp['Predicted'] = recommender_no_sideinfo.predict(user=events_test_comp.UserId, item=events_test_comp.ItemId)
events_test_comp['PredictedRandom'] = recommender_no_sideinfo.predict(user=events_test_comp.UserId,
item=events_test_comp.RandomItem)
print("Average prediction for combinations in test set: ", events_test_comp.Predicted.mean())
print("Average prediction for random combinations: ", events_test_comp.PredictedRandom.mean())
Average prediction for combinations in test set:  0.023392139
Average prediction for random combinations:  0.0063642794
was_clicked = np.r_[np.ones(events_test_comp.shape[0]), np.zeros(events_test_comp.shape[0])]
score_model = np.r_[events_test_comp.Predicted.values, events_test_comp.PredictedRandom.values]
roc_auc_score(was_clicked, score_model)
0.6910112931686316
np.corrcoef(events_test_comp.Count, events_test_comp.Predicted)[0,1]
0.1007423756772694
As can be seen, adding the side information and widening the catalog to include more items using only their text descriptions (no clicks) results in an improvement in all 3 metrics, especially in the correlation with the number of clicks.
More important than that, however, is the ability to make recommendations from a far wider catalog of items, which in practice can make a much larger difference in recommendation quality than an improvement in typical offline metrics.
The package provides a simple API for making predictions and top-N recommendation lists. These top-N lists can be ranked among all items or across some user-provided subset only, and it's possible to discard items with which the user had already interacted in the training set.
Here I will:

* Pick at random a user who had a large number of interactions in the training set.
* Produce top-20 recommended lists for that user with each of the three models fit above.
Unfortunately, since all the data is anonymized, it's not possible to make a qualitative evaluation of the results by looking at the recommended lists, as can be done with other datasets.
users_many_events = events_train.groupby('UserId')['ItemId'].size()
users_many_events = np.array(users_many_events.index[users_many_events > 20])
np.random.seed(1)
chosen_user = np.random.choice(users_many_events)
chosen_user
1362222
%%time
recommender_all_items.topN(chosen_user, n=20)
CPU times: user 44 ms, sys: 0 ns, total: 44 ms
Wall time: 52 ms
array([ 9877, 119736, 312728, 241555, 257040, 325310, 320130, 445351, 409804, 384302, 219512, 38965, 234255, 303828, 37029, 309778, 248455, 190000, 290999, 213834])
(These numbers represent the IDs of the items being recommended, as they appeared in the `events_train` data frame)
%%time
recommender_clicked_items_only.topN(chosen_user, n=20)
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 1.65 ms
array([119736, 441852, 372188, 344723, 116624, 439963, 345279, 4001, 183511, 33912, 354585, 456056, 29940, 272324, 89323, 186702, 190000, 227790, 92361, 78729])
%%time
recommender_no_sideinfo.topN(chosen_user, n=20)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.48 ms
array([ 9877, 241555, 325310, 38965, 283115, 272455, 37115, 412622, 252319, 314789, 108486, 265571, 20740, 212917, 210087, 198784, 381941, 82377, 178274, 122219])