This projects consists in building a recommender system to recommend songs according to the history of tracks played by a given user from subset of the MillionSong TasteProfile data that was made into a Kaggle competition, using Hierarchical Poisson Factorization. Unlike recommendations based on movie ratings, for example, recommending based on listening activity is harder as it is only an indirect measure of user preference, and doesn’t usually signal user dislikes.
HPF (Hierarchical Poisson Factorization) is a probabilistic model that tries to factorize the user-item interaction count matrix as the product of two lower dimensional matrices, just like regular factorization methods used with explicit feedback data, but taking this product as the parameter of a Poisson random variable (thus the model likelihood optimization is different than minimizing the sum of least squares). Additionally, the user-attribute and item-attribute matrices are given a Bayesian Gamma prior, making them non-negative and adjusting for the user activity level and song popularity.
Unlike other methods such as BPR (Bayesian Personalized Ranking) or weighted-implicit ALS, it only requires iterating over the data for which an interaction was observed and not over data for which no interaction was observed (i.e. it doesn’t iterate over songs not played by the user), thus being more scalable, and at the same time producing better results.
The implementation here is based on the paper Gopalan, P., Hofman, J. M., & Blei, D. M. (2013). Scalable recommendation with poisson factorization. arXiv preprint arXiv:1311.1704., and is implemented using PyMC3 without relying on variational inference, so speed is perhaps not as great as the authors' original coordinate ascent optimization algorithm.
It’s possible to extend this model to make use of song side information, such as artist, tags, genre, and others, but that increases the model complexity and computational time a lot without bringing too much of an improvement (as seen in the paper Gopalan, P. K., Charlin, L., & Blei, D. (2014). Content-based recommendations with poisson factorization. In Advances in Neural Information Processing Systems (pp. 3176-3184).)
3. Checking some recommendations
Reading and reindexing the data:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, pymc3 as pm, theano
playcounts=list()
user_id_to_int=dict()
user_int_to_id=dict()
song_id_to_int=dict()
song_int_to_id=dict()
cnt_users=0
cnt_songs=0
def parse_line(line):
global playcounts, cnt_users, cnt_songs
user,song,playcount = line.decode('utf-8').split('\t')
if user not in user_id_to_int:
user_id_to_int[user]=cnt_users
user_int_to_id[cnt_users]=user
cnt_users+=1
if song not in song_int_to_id:
song_int_to_id[song]=cnt_songs
song_int_to_id[cnt_songs]=song
cnt_songs+=1
user=user_id_to_int[user]
song=song_int_to_id[song]
playcount=int(playcount.strip())
playcounts.append((user, song, playcount))
with open('D:\\Downloads\\millionsong\\kaggle_visible_evaluation_triplets.txt','rb') as f:
for line in f:
parse_line(line)
print(cnt_users)
print(cnt_songs)
print(len(playcounts))
110000 163206 1450933
Note that an algorithm like implicit-ALS would require constructing a matrix with more than 10^10 entries with this dataset, thus not feasible on a cheap laptop.
del user_id_to_int
del user_int_to_id
del song_id_to_int
The dataset doesn't contain timestamps so I'll make a random train-test split:
from sklearn.model_selection import train_test_split
playcounts=pd.DataFrame(playcounts, columns=['UserId', 'SongId', 'Playcount'])
train, test = train_test_split(playcounts, test_size=.2, random_state=1)
users_train=set(train.UserId)
items_train=set(train.SongId)
test=test.loc[test.UserId.isin(users_train)]
test=test.loc[test.UserId.isin(items_train)]
test=test.loc[test.Playcount>1]
del users_train
del items_train
print(train.shape)
print(test.shape)
train.head()
(1160746, 3) (115511, 3)
UserId | SongId | Playcount | |
---|---|---|---|
745579 | 56512 | 49759 | 1 |
688265 | 52114 | 434 | 3 |
1425112 | 108044 | 112662 | 1 |
231922 | 17542 | 325 | 1 |
1159262 | 87716 | 2623 | 6 |
PyMC3 Implementation. The hyperparameters are the ones suggested in the paper:
# hyperparameters
a=.3
a_prime=.3
b_prime=1.0
c=.3
c_prime=.3
d_prime=1.0
# number of factors
k = 30
np.random.seed(1)
theta_init=np.random.gamma(.3, .3, size=(cnt_users,k))
beta_init=np.random.gamma(.3, .3, size=(cnt_songs,k))
with pm.Model():
user_activity=pm.Gamma('user_activity', a_prime, a_prime/b_prime, shape = (cnt_users,1) )
user_prior=theano.tensor.tile(user_activity, k, ndim=2)
theta=pm.Gamma('theta', a, user_prior, shape=(cnt_users,k), testval=theta_init)
item_pop=pm.Gamma('item_pop', c_prime, c_prime/d_prime, shape = (cnt_songs,1) )
item_prior=theano.tensor.tile(item_pop, k, ndim=2)
beta=pm.Gamma('beta', c, item_prior, shape=(cnt_songs,k), testval=beta_init)
xhat=theano.tensor.sum(theta[train.UserId] * beta[train.SongId], axis=1)
R=pm.Poisson('R',mu=xhat, observed=train.Playcount.astype('float32'))
HPF=pm.find_MAP()
logp = 1.3104e+07, ||grad|| = 11,095: 100%|████| 69/69 [14:10<00:00, 9.11s/it]
Now extracting the results:
theta_star=HPF['theta']
beta_star=HPF['beta'].T
Now examining the Top-20 recommended songs for some users. The list of song names, artist, and other side information is also made available:
song_info=pd.read_table('D:\\Downloads\\millionsong\\unique_tracks.txt', sep='<SEP>', engine='python',
header=None, names=['TrackId', 'SongId', 'Artist', 'Title'])
del song_info['TrackId']
song_info.head()
SongId | Artist | Title | |
---|---|---|---|
0 | SOQMMHC12AB0180CB8 | Faster Pussy cat | Silent Night |
1 | SOVFVAK12A8C1350D9 | Karkkiautomaatti | Tanssi vaan |
2 | SOGTUKN12AB017F4F1 | Hudson Mohawke | No One Could Ever |
3 | SOBNYVR12A8C13558C | Yerba Brava | Si Vos Querés |
4 | SOHSBXH12A8C13B0DF | Der Mystic | Tangle Of Aspens |
def top_n(user_id, n=20):
global theta_star, beta_star, song_info
recommended_list = np.argsort(theta_star[user_id].dot(beta_star))
recommended_list = [song_int_to_id[song] for song in recommended_list[:n]]
recommended_list = pd.DataFrame(pd.Series(recommended_list), columns=['SongId'])
return pd.merge(recommended_list, song_info, on='SongId', how='left')
top_n(user_id=0)
SongId | Artist | Title | |
---|---|---|---|
0 | SODQIKZ12AB017DA1C | LOFOFORA | Accélère |
1 | SOBXQTK12A6D4F8131 | MxPx | Move To Bremerton (Extended Version) (Let It H... |
2 | SOWORHE12A8C139631 | Looptroop | Fort Europa |
3 | SOHTYVY12A6D4F8AD8 | Celtic Woman | My Lagan Love (Live) |
4 | SOYVEHH12A6D4FA795 | Bright Eyes | The Biggest Lie |
5 | SOSZGBS12AB018A2F3 | The Kovenant | The Dark Conquest |
6 | SODQFMV12AC468AD20 | Last Autumn's Dream | Never Faraway |
7 | SOWZOFF12A6D4F7211 | Spiller | Groovejet (If This Ain't Love) |
8 | SOISVAU12A8AE47CF6 | Cavalera Conspiracy | Must Kill |
9 | SOVZKID12AB017CB70 | Simon Wynberg | Silent Night |
10 | SOFWMZU12A8C13DCAD | Will Ackerman | Anne's Song |
11 | SOKTEQD12A8C139A9A | Laraaji | Meditation No 1 |
12 | SOSSBHT12AB0182EDB | Brutal Knights | Beyotch Island |
13 | SOBPTPL12AC468AB06 | Kenny Rogers | Endless Love |
14 | SOXVJAW12AF72A156E | Mindi Abair | Remember |
15 | SOXSLYW12A6BD5371B | Dubstar | Not So Manic Now |
16 | SOEVQSZ12AB0185217 | Roscoe Holcomb | Little Birdie |
17 | SOKIPDP12AB018358F | Naked Raygun | New Dreams |
18 | SOKURUK12A8C13316B | Juiz Electric vs DJ Katakis | African Beauty (Heatbeat Remix) |
19 | SOXXHNJ12AB0186924 | Marcus Miller | Higher Ground |
top_n(user_id=123)
SongId | Artist | Title | |
---|---|---|---|
0 | SONTNZU12A6D4F9D93 | Ramones | Somebody Put Something In My Drink (Remaster... |
1 | SOFHTWQ12AB0181730 | Muse | Sunburn [Live] |
2 | SOZHUBY12A8C13EFBA | Ulrich Müller Doppler | Affettuoso_ Sonata |
3 | SOTZZBD12AC4687BDE | Mardi Gras.BB | Delhi Morning Raga |
4 | SOHVEWJ12AB0182914 | Lemonheads | Don't Tell Yourself |
5 | SOCETNN12AB0189DE9 | The Screamin' Cheetah Wheelies | Jami (LP Version) |
6 | SOWKBJE12AB0180D25 | Sixtoo | Funny Sticks Reprise |
7 | SORNBAK12A8151D769 | Beatsteaks | She Was Great |
8 | SODOBTK12AB017D701 | Solar Fields | Discovering |
9 | SOOLUWB12A8C1398C4 | Duoteque | Adyra |
10 | SOJXURJ12A6D4FA82F | Whitesnake | Take me with you (live) |
11 | SOHZYMC12AB017EEED | Craig David | Where's Your Love [Feat. Tinchy Stryder] |
12 | SOKTBNK12A8C13FB20 | Natasha Bedingfield | Pirate Bones |
13 | SOXAKWS12A58A7C8AC | Der Dritte Raum | Swing Bop |
14 | SOWHXUR12A8C13F74E | DON JOHNSON BIG BAND | Running Man |
15 | SOALINL12AB018A936 | Freddy Cannon | Transistor Sister |
16 | SODCIMP12AB018279A | Marjorie Estiano | So Easy |
17 | SOKXYCV12A8C1349D4 | Traband | Adam a Luna (Adam And The Moon) |
18 | SODCYMP12A8C13D205 | The Rolling Stones | I Need You Baby (Mona) |
19 | SOJSMDF12A8C142656 | Johnny & the Hurricanes | Beatnik Fly |
top_n(user_id=53000)
SongId | Artist | Title | |
---|---|---|---|
0 | SOZAKJI12A6D4FB97A | Panteón Rococó feat. Mimi Maura | Cosas Del Ayer (Hoy y siempre) |
1 | SOFTSJI12A8C134C23 | U.S. Bombs | U.S. Bombs |
2 | SOTKMIT12A8C136BC5 | John Michael Talbot | May I Never Boast (Heart Of The Shepherd Album... |
3 | SOZWEWI12A67ADC82B | Kate Ryan | Nos regards Qui M'Enflamment |
4 | SOEJMIM12A8AE459C0 | Stanley Clarke | He Lives On (Story About The Last Journey Of A... |
5 | SOLPNII12A6701F64E | Lucky Boys Confusion | Commitment (Album Version) |
6 | SOKRLMR12A58A7AD4A | Ruth | Polaroïd/Roman/Photo |
7 | SOUQLSO12AB018B93D | PlayRadioPlay! | Selfish Introvert |
8 | SOWIBTM12AB01815D1 | Chris Kenner | I Like It Like That |
9 | SOCLJBD12AC95F06D2 | Torch | I Need Your Love |
10 | SOCTHMT12AF72A69DF | Buju Banton | Rampage |
11 | SOKNQIW12A8C13360E | The Cult | Butterflies |
12 | SOBIOGV12AB018E71A | Flunk | Personal Stereo |
13 | SOBVKSX12A8C142227 | Bløf | De Mooiste Verliezers (UMOJA Live) |
14 | SONZRFG12AB018556F | Ryan Star | This Could Be The Year (Album Version) |
15 | SOBCULV12AB01836DB | Nena | Weisses Schiff |
16 | SOCJICX12A8C13BBBB | Vector Lovers | Hush Now |
17 | SOQMVIO12A6D4F49B0 | The Great Redneck Hope | Let's Fall In Love Over AIM So We Can Fuck Whe... |
18 | SOFDBXG12B0B808312 | Howie Day | Disco |
19 | SOCHFSS12AB0184463 | Black Label Society | Death March |
top_n(user_id=100000)
SongId | Artist | Title | |
---|---|---|---|
0 | SOEOCAN12A58A7C48C | Prodigy | The Life |
1 | SOZWEWI12A67ADC82B | Kate Ryan | Nos regards Qui M'Enflamment |
2 | SODDIDB12A8AE463BD | Kitaro | Quasar |
3 | SOHBZED12AB017DCA9 | Thompson Twins | Love Is The Law |
4 | SOVOPLZ12AAF3B43FB | Black Lips | Old Man |
5 | SOJDLVA12A8C144644 | Ben Folds | Kylie From Connecticut |
6 | SOCQJAB12A67ADA035 | Syreeta | Black Maybe |
7 | SOXAKWS12A58A7C8AC | Der Dritte Raum | Swing Bop |
8 | SOACHZA12A8C13B6DF | Olle Adolphson | Goggles (remaster '03) |
9 | SOXOPTU12AAF3B21C7 | Baroness | Ogeechee Hymnal |
10 | SOMFMBN12AB017C87E | SSS | The Bastard Stench |
11 | SOVWTNX12A8C137A40 | Michael Learns To Rock | My Blue Angel |
12 | SODXWEF12A6D4FA1A1 | The Clark Sisters | Simply Yes (Miracle Album Version) |
13 | SOFLVOE12A8C13862C | Slipknot | All Hope Is Gone (Album Version) |
14 | SOUYDMF12AF729F9BF | George Lopez | Orientation |
15 | SOQSOFR12A8C142F27 | Digitonal | Emberkreiss |
16 | SOLIHTW12A6701D922 | Young Gunz / Denim | Life We Chose |
17 | SOZDRDR12AB018CE30 | Alicia Keys | That's How Strong My Love Is |
18 | SOBDSHS12A6D4F98D5 | Eva Cassidy | Fine And Mellow |
19 | SOMRYGW12AB01853BD | The Midway State | No Crying |
Unfortunately, for implicit data such as playcounts, there is no intuitive metric such as the root mean squared error or average rating of Top-N recommendations to compare with. There are some metrics such as Precision@K and Recall@K, which try to see how much do the Top-K recommended lists for each user include the items in the hold-out set, but these are very slow to calculate and also not entirely good metrics.
Nevertheless, it's possible to see how well would it rank the songs for each user in the hold-out set and compare that to the number of playcounts observed.
It's also possible to do some common sense checks such as checking the mean prediction for songs that were in the test set and compare it to randomly selected songs:
test['Predicted']=test.apply(lambda x: theta_star[int(x['UserId'])].dot(beta_star[:,int(x['SongId'])]), axis=1)
np.random.seed(1)
test['RandomSongId']=np.random.randint(cnt_songs, size=test.shape[0])
test['PredictedRandom']=test.apply(lambda x: theta_star[int(x['UserId'])].dot(beta_star[:,int(x['RandomSongId'])]), axis=1)
print('Average Score for songs in hold-out set: ',test.Predicted.mean())
print('Average Score for random songs: ',test.PredictedRandom.mean())
test.head(10)
Average Score for songs in hold-out set: 2.884634063615049 Average Score for random songs: 0.6744080090711221
UserId | SongId | Playcount | Predicted | RandomSongId | PredictedRandom | |
---|---|---|---|---|---|---|
689569 | 52215 | 3667 | 5 | 12.976068 | 128037 | 0.809523 |
2399 | 184 | 2121 | 3 | 0.232450 | 5192 | 0.720546 |
948969 | 71943 | 2666 | 11 | 2.041758 | 50057 | 0.187526 |
1214563 | 91846 | 2605 | 2 | 0.488312 | 109259 | 0.383671 |
1448562 | 109821 | 51636 | 2 | 0.130306 | 73349 | 0.229258 |
341671 | 25877 | 13119 | 2 | 1.332852 | 117583 | 0.221988 |
868962 | 65857 | 39934 | 4 | 0.476213 | 21440 | 0.220367 |
1148222 | 86909 | 148338 | 2 | 1.131511 | 151681 | 0.983645 |
1130271 | 85564 | 3903 | 4 | 5.682648 | 138823 | 0.070143 |
215670 | 16335 | 23180 | 2 | 0.124622 | 31228 | 0.101731 |
np.corrcoef(test.Predicted, test.Playcount)[0,1]
0.10238816159040297
users_more_2_songs=test.groupby('UserId')['SongId'].agg(lambda x: len(tuple(x))>2)
users_more_2_songs=users_more_2_songs.loc[users_more_2_songs]
users_more_2_songs=set(users_more_2_songs.index)
test_eval=test.loc[test.UserId.isin(users_more_2_songs)]
test_eval=test_eval.groupby('UserId')[['Playcount','Predicted']].corr(method='kendall').ix[0::2,'Predicted']
np.nanmean(test_eval.to_frame().values)
C:\Users\david\Anaconda3\lib\site-packages\ipykernel\__main__.py:6: DeprecationWarning: .ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
0.05039925238011899
test_eval.head(15)
UserId 15 Playcount -0.707107 17 Playcount 0.501280 30 Playcount 0.333333 34 Playcount -0.414039 43 Playcount 0.333333 52 Playcount -0.077850 54 Playcount 0.000000 65 Playcount -0.316228 69 Playcount 0.358569 74 Playcount 0.182574 75 Playcount 1.000000 80 Playcount 0.235702 97 Playcount 0.527046 102 Playcount -0.223607 106 Playcount 0.816497 Name: Predicted, dtype: float64