Collaborative filtering for 'implicit feedback' data


This projects consists in building a recommender system to recommend songs according to the history of tracks played by a given user from subset of the MillionSong TasteProfile data that was made into a Kaggle competition, using Hierarchical Poisson Factorization. Unlike recommendations based on movie ratings, for example, recommending based on listening activity is harder as it is only an indirect measure of user preference, and doesn’t usually signal user dislikes.

HPF (Hierarchical Poisson Factorization) is a probabilistic model that tries to factorize the user-item interaction count matrix as the product of two lower dimensional matrices, just like regular factorization methods used with explicit feedback data, but taking this product as the parameter of a Poisson random variable (thus the model likelihood optimization is different than minimizing the sum of least squares). Additionally, the user-attribute and item-attribute matrices are given a Bayesian Gamma prior, making them non-negative and adjusting for the user activity level and song popularity.

Unlike other methods such as BPR (Bayesian Personalized Ranking) or weighted-implicit ALS, it only requires iterating over the data for which an interaction was observed and not over data for which no interaction was observed (i.e. it doesn’t iterate over songs not played by the user), thus being more scalable, and at the same time producing better results.

The implementation here is based on the paper Gopalan, P., Hofman, J. M., & Blei, D. M. (2013). Scalable recommendation with poisson factorization. arXiv preprint arXiv:1311.1704., and is implemented using PyMC3 without relying on variational inference, so speed is perhaps not as great as the authors' original coordinate ascent optimization algorithm.

It’s possible to extend this model to make use of song side information, such as artist, tags, genre, and others, but that increases the model complexity and computational time a lot without bringing too much of an improvement (as seen in the paper Gopalan, P. K., Charlin, L., & Blei, D. (2014). Content-based recommendations with poisson factorization. In Advances in Neural Information Processing Systems (pp. 3176-3184).)


Sections

1. Loading the data

2. Implementing the model

3. Checking some recommendations


1. Loading the data

Reading and reindexing the data:

In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, pymc3 as pm, theano

playcounts=list()
user_id_to_int=dict()
user_int_to_id=dict()
song_id_to_int=dict()
song_int_to_id=dict()
cnt_users=0
cnt_songs=0

def parse_line(line):
    global playcounts, cnt_users, cnt_songs
    user,song,playcount = line.decode('utf-8').split('\t')
    if user not in user_id_to_int:
        user_id_to_int[user]=cnt_users
        user_int_to_id[cnt_users]=user
        cnt_users+=1
    if song not in song_int_to_id:
        song_int_to_id[song]=cnt_songs
        song_int_to_id[cnt_songs]=song
        cnt_songs+=1
    
    user=user_id_to_int[user]
    song=song_int_to_id[song]
    playcount=int(playcount.strip())
    
    playcounts.append((user, song, playcount))

with open('D:\\Downloads\\millionsong\\kaggle_visible_evaluation_triplets.txt','rb') as f:
    for line in f:
        parse_line(line)
              
print(cnt_users)
print(cnt_songs)
print(len(playcounts))
110000
163206
1450933

Note that an algorithm like implicit-ALS would require constructing a matrix with more than 10^10 entries with this dataset, thus not feasible on a cheap laptop.

In [2]:
del user_id_to_int
del user_int_to_id
del song_id_to_int

The dataset doesn't contain timestamps so I'll make a random train-test split:

In [3]:
from sklearn.model_selection import train_test_split

playcounts=pd.DataFrame(playcounts, columns=['UserId', 'SongId', 'Playcount'])

train, test = train_test_split(playcounts, test_size=.2, random_state=1)

users_train=set(train.UserId)
items_train=set(train.SongId)
test=test.loc[test.UserId.isin(users_train)]
test=test.loc[test.UserId.isin(items_train)]
test=test.loc[test.Playcount>1]
del users_train
del items_train

print(train.shape)
print(test.shape)
train.head()
(1160746, 3)
(115511, 3)
Out[3]:
UserId SongId Playcount
745579 56512 49759 1
688265 52114 434 3
1425112 108044 112662 1
231922 17542 325 1
1159262 87716 2623 6

2. Implementing the model

PyMC3 Implementation. The hyperparameters are the ones suggested in the paper:

In [4]:
# hyperparameters
a=.3
a_prime=.3
b_prime=1.0

c=.3
c_prime=.3
d_prime=1.0

# number of factors
k = 30
In [5]:
np.random.seed(1)
theta_init=np.random.gamma(.3, .3, size=(cnt_users,k))
beta_init=np.random.gamma(.3, .3, size=(cnt_songs,k))

with pm.Model():
    user_activity=pm.Gamma('user_activity', a_prime, a_prime/b_prime, shape = (cnt_users,1) )
    user_prior=theano.tensor.tile(user_activity, k, ndim=2)
    theta=pm.Gamma('theta', a, user_prior, shape=(cnt_users,k), testval=theta_init)
    
    item_pop=pm.Gamma('item_pop', c_prime, c_prime/d_prime, shape = (cnt_songs,1) )
    item_prior=theano.tensor.tile(item_pop, k, ndim=2)
    beta=pm.Gamma('beta', c, item_prior, shape=(cnt_songs,k), testval=beta_init)
    
    xhat=theano.tensor.sum(theta[train.UserId] * beta[train.SongId], axis=1)
    R=pm.Poisson('R',mu=xhat, observed=train.Playcount.astype('float32'))
    HPF=pm.find_MAP()
logp = 1.3104e+07, ||grad|| = 11,095: 100%|████| 69/69 [14:10<00:00,  9.11s/it]  

Now extracting the results:

In [6]:
theta_star=HPF['theta']
beta_star=HPF['beta'].T

3. Checking some recommendations

Now examining the Top-20 recommended songs for some users. The list of song names, artist, and other side information is also made available:

In [7]:
song_info=pd.read_table('D:\\Downloads\\millionsong\\unique_tracks.txt', sep='<SEP>', engine='python',
              header=None, names=['TrackId', 'SongId', 'Artist', 'Title'])
del song_info['TrackId']
song_info.head()
Out[7]:
SongId Artist Title
0 SOQMMHC12AB0180CB8 Faster Pussy cat Silent Night
1 SOVFVAK12A8C1350D9 Karkkiautomaatti Tanssi vaan
2 SOGTUKN12AB017F4F1 Hudson Mohawke No One Could Ever
3 SOBNYVR12A8C13558C Yerba Brava Si Vos Querés
4 SOHSBXH12A8C13B0DF Der Mystic Tangle Of Aspens
In [8]:
def top_n(user_id, n=20):
    global theta_star, beta_star, song_info
    recommended_list = np.argsort(theta_star[user_id].dot(beta_star))
    recommended_list = [song_int_to_id[song] for song in recommended_list[:n]]
    recommended_list = pd.DataFrame(pd.Series(recommended_list), columns=['SongId'])
    return pd.merge(recommended_list, song_info, on='SongId', how='left')

top_n(user_id=0)
Out[8]:
SongId Artist Title
0 SODQIKZ12AB017DA1C LOFOFORA Accélère
1 SOBXQTK12A6D4F8131 MxPx Move To Bremerton (Extended Version) (Let It H...
2 SOWORHE12A8C139631 Looptroop Fort Europa
3 SOHTYVY12A6D4F8AD8 Celtic Woman My Lagan Love (Live)
4 SOYVEHH12A6D4FA795 Bright Eyes The Biggest Lie
5 SOSZGBS12AB018A2F3 The Kovenant The Dark Conquest
6 SODQFMV12AC468AD20 Last Autumn's Dream Never Faraway
7 SOWZOFF12A6D4F7211 Spiller Groovejet (If This Ain't Love)
8 SOISVAU12A8AE47CF6 Cavalera Conspiracy Must Kill
9 SOVZKID12AB017CB70 Simon Wynberg Silent Night
10 SOFWMZU12A8C13DCAD Will Ackerman Anne's Song
11 SOKTEQD12A8C139A9A Laraaji Meditation No 1
12 SOSSBHT12AB0182EDB Brutal Knights Beyotch Island
13 SOBPTPL12AC468AB06 Kenny Rogers Endless Love
14 SOXVJAW12AF72A156E Mindi Abair Remember
15 SOXSLYW12A6BD5371B Dubstar Not So Manic Now
16 SOEVQSZ12AB0185217 Roscoe Holcomb Little Birdie
17 SOKIPDP12AB018358F Naked Raygun New Dreams
18 SOKURUK12A8C13316B Juiz Electric vs DJ Katakis African Beauty (Heatbeat Remix)
19 SOXXHNJ12AB0186924 Marcus Miller Higher Ground
In [9]:
top_n(user_id=123)
Out[9]:
SongId Artist Title
0 SONTNZU12A6D4F9D93 Ramones Somebody Put Something In My Drink (Remaster...
1 SOFHTWQ12AB0181730 Muse Sunburn [Live]
2 SOZHUBY12A8C13EFBA Ulrich Müller Doppler Affettuoso_ Sonata
3 SOTZZBD12AC4687BDE Mardi Gras.BB Delhi Morning Raga
4 SOHVEWJ12AB0182914 Lemonheads Don't Tell Yourself
5 SOCETNN12AB0189DE9 The Screamin' Cheetah Wheelies Jami (LP Version)
6 SOWKBJE12AB0180D25 Sixtoo Funny Sticks Reprise
7 SORNBAK12A8151D769 Beatsteaks She Was Great
8 SODOBTK12AB017D701 Solar Fields Discovering
9 SOOLUWB12A8C1398C4 Duoteque Adyra
10 SOJXURJ12A6D4FA82F Whitesnake Take me with you (live)
11 SOHZYMC12AB017EEED Craig David Where's Your Love [Feat. Tinchy Stryder]
12 SOKTBNK12A8C13FB20 Natasha Bedingfield Pirate Bones
13 SOXAKWS12A58A7C8AC Der Dritte Raum Swing Bop
14 SOWHXUR12A8C13F74E DON JOHNSON BIG BAND Running Man
15 SOALINL12AB018A936 Freddy Cannon Transistor Sister
16 SODCIMP12AB018279A Marjorie Estiano So Easy
17 SOKXYCV12A8C1349D4 Traband Adam a Luna (Adam And The Moon)
18 SODCYMP12A8C13D205 The Rolling Stones I Need You Baby (Mona)
19 SOJSMDF12A8C142656 Johnny & the Hurricanes Beatnik Fly
In [10]:
top_n(user_id=53000)
Out[10]:
SongId Artist Title
0 SOZAKJI12A6D4FB97A Panteón Rococó feat. Mimi Maura Cosas Del Ayer (Hoy y siempre)
1 SOFTSJI12A8C134C23 U.S. Bombs U.S. Bombs
2 SOTKMIT12A8C136BC5 John Michael Talbot May I Never Boast (Heart Of The Shepherd Album...
3 SOZWEWI12A67ADC82B Kate Ryan Nos regards Qui M'Enflamment
4 SOEJMIM12A8AE459C0 Stanley Clarke He Lives On (Story About The Last Journey Of A...
5 SOLPNII12A6701F64E Lucky Boys Confusion Commitment (Album Version)
6 SOKRLMR12A58A7AD4A Ruth Polaroïd/Roman/Photo
7 SOUQLSO12AB018B93D PlayRadioPlay! Selfish Introvert
8 SOWIBTM12AB01815D1 Chris Kenner I Like It Like That
9 SOCLJBD12AC95F06D2 Torch I Need Your Love
10 SOCTHMT12AF72A69DF Buju Banton Rampage
11 SOKNQIW12A8C13360E The Cult Butterflies
12 SOBIOGV12AB018E71A Flunk Personal Stereo
13 SOBVKSX12A8C142227 Bløf De Mooiste Verliezers (UMOJA Live)
14 SONZRFG12AB018556F Ryan Star This Could Be The Year (Album Version)
15 SOBCULV12AB01836DB Nena Weisses Schiff
16 SOCJICX12A8C13BBBB Vector Lovers Hush Now
17 SOQMVIO12A6D4F49B0 The Great Redneck Hope Let's Fall In Love Over AIM So We Can Fuck Whe...
18 SOFDBXG12B0B808312 Howie Day Disco
19 SOCHFSS12AB0184463 Black Label Society Death March
In [11]:
top_n(user_id=100000)
Out[11]:
SongId Artist Title
0 SOEOCAN12A58A7C48C Prodigy The Life
1 SOZWEWI12A67ADC82B Kate Ryan Nos regards Qui M'Enflamment
2 SODDIDB12A8AE463BD Kitaro Quasar
3 SOHBZED12AB017DCA9 Thompson Twins Love Is The Law
4 SOVOPLZ12AAF3B43FB Black Lips Old Man
5 SOJDLVA12A8C144644 Ben Folds Kylie From Connecticut
6 SOCQJAB12A67ADA035 Syreeta Black Maybe
7 SOXAKWS12A58A7C8AC Der Dritte Raum Swing Bop
8 SOACHZA12A8C13B6DF Olle Adolphson Goggles (remaster '03)
9 SOXOPTU12AAF3B21C7 Baroness Ogeechee Hymnal
10 SOMFMBN12AB017C87E SSS The Bastard Stench
11 SOVWTNX12A8C137A40 Michael Learns To Rock My Blue Angel
12 SODXWEF12A6D4FA1A1 The Clark Sisters Simply Yes (Miracle Album Version)
13 SOFLVOE12A8C13862C Slipknot All Hope Is Gone (Album Version)
14 SOUYDMF12AF729F9BF George Lopez Orientation
15 SOQSOFR12A8C142F27 Digitonal Emberkreiss
16 SOLIHTW12A6701D922 Young Gunz / Denim Life We Chose
17 SOZDRDR12AB018CE30 Alicia Keys That's How Strong My Love Is
18 SOBDSHS12A6D4F98D5 Eva Cassidy Fine And Mellow
19 SOMRYGW12AB01853BD The Midway State No Crying

Unfortunately, for implicit data such as playcounts, there is no intuitive metric such as the root mean squared error or average rating of Top-N recommendations to compare with. There are some metrics such as [email protected] and [email protected], which try to see how much do the Top-K recommended lists for each user include the items in the hold-out set, but these are very slow to calculate and also not entirely good metrics.

Nevertheless, it's possible to see how well would it rank the songs for each user in the hold-out set and compare that to the number of playcounts observed.

It's also possible to do some common sense checks such as checking the mean prediction for songs that were in the test set and compare it to randomly selected songs:

In [12]:
test['Predicted']=test.apply(lambda x: theta_star[int(x['UserId'])].dot(beta_star[:,int(x['SongId'])]), axis=1)

np.random.seed(1)
test['RandomSongId']=np.random.randint(cnt_songs, size=test.shape[0])
test['PredictedRandom']=test.apply(lambda x: theta_star[int(x['UserId'])].dot(beta_star[:,int(x['RandomSongId'])]), axis=1)

print('Average Score for songs in hold-out set: ',test.Predicted.mean())
print('Average Score for random songs: ',test.PredictedRandom.mean())
test.head(10)
Average Score for songs in hold-out set:  2.884634063615049
Average Score for random songs:  0.6744080090711221
Out[12]:
UserId SongId Playcount Predicted RandomSongId PredictedRandom
689569 52215 3667 5 12.976068 128037 0.809523
2399 184 2121 3 0.232450 5192 0.720546
948969 71943 2666 11 2.041758 50057 0.187526
1214563 91846 2605 2 0.488312 109259 0.383671
1448562 109821 51636 2 0.130306 73349 0.229258
341671 25877 13119 2 1.332852 117583 0.221988
868962 65857 39934 4 0.476213 21440 0.220367
1148222 86909 148338 2 1.131511 151681 0.983645
1130271 85564 3903 4 5.682648 138823 0.070143
215670 16335 23180 2 0.124622 31228 0.101731
In [13]:
np.corrcoef(test.Predicted, test.Playcount)[0,1]
Out[13]:
0.10238816159040297
In [14]:
users_more_2_songs=test.groupby('UserId')['SongId'].agg(lambda x: len(tuple(x))>2)
users_more_2_songs=users_more_2_songs.loc[users_more_2_songs]
users_more_2_songs=set(users_more_2_songs.index)

test_eval=test.loc[test.UserId.isin(users_more_2_songs)]
test_eval=test_eval.groupby('UserId')[['Playcount','Predicted']].corr(method='kendall').ix[0::2,'Predicted']
np.nanmean(test_eval.to_frame().values)
C:\Users\david\Anaconda3\lib\site-packages\ipykernel\__main__.py:6: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Out[14]:
0.05039925238011899
In [15]:
test_eval.head(15)
Out[15]:
UserId           
15      Playcount   -0.707107
17      Playcount    0.501280
30      Playcount    0.333333
34      Playcount   -0.414039
43      Playcount    0.333333
52      Playcount   -0.077850
54      Playcount    0.000000
65      Playcount   -0.316228
69      Playcount    0.358569
74      Playcount    0.182574
75      Playcount    1.000000
80      Playcount    0.235702
97      Playcount    0.527046
102     Playcount   -0.223607
106     Playcount    0.816497
Name: Predicted, dtype: float64