Clustering Musical Artists

Note: if you are viewing this notebook directly on GitHub, some mathematical symbols might display incorrectly or not at all. The same notebook can be rendered with nbviewer by following this link.

This project consists of clustering musical artists using a dataset containing the top 50 most-played artists for each user in a random sample of ~360,000 Last.fm users, which can be found here. The underlying idea is that artists who are preferred by the same users tend to be similar.

Part 1 - Loading and cleaning the data

1.1 - Downloading the dataset and generating a sample

In [1]:
import os, sys

def download_data():
    import tarfile
    try:
        from urllib.request import urlretrieve  #Python 3
    except ImportError:
        from urllib import urlretrieve  #Python 2
    urlretrieve("http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz", "lastfm-dataset-360K.tar.gz")
    data_file = tarfile.open("lastfm-dataset-360K.tar.gz", 'r:gz')
    data_file.extractall()
    
def generate_sample(file_path=os.getcwd()+'/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv',n=10000):
    #writes the first n lines of the full dataset to a smaller file, useful for quick experimentation
    with open('small_sample.tsv','w') as out:
        with open(file_path) as f:
            for i in range(n):
                out.write(f.readline())
    
#download_data()
# generate_sample()

1.2 - Loading and formatting the data

In [8]:
import os, json

#Here I'm assuming there's a SparkContext already set up under the name 'sc'
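#If no SparkContext is available (e.g. when running this outside a pre-configured
#notebook or shell), one could be created like this (hypothetical app name):
#    from pyspark import SparkContext
#    sc = SparkContext(appName='artist_clustering')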

# dataset=sc.textFile(os.getcwd()+'/small_sample.tsv')
dataset=sc.textFile(os.getcwd()+'/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv')

dataset=dataset.map(lambda x: x.split('\t')).filter(lambda x: len(x[1])>30) #removing invalid artists (e.g. 'bbc radio')
dataset=dataset.filter(lambda x: x[1]!='89ad4ac3-39f7-470e-963a-56509c546377') #removing 'various artists'

#converting hash values to integers to speed up the analysis
u_list=dataset.map(lambda x: x[0]).distinct().collect()
users=dict()
for v,k in enumerate(u_list):
    users[k]=v
a_list=dataset.map(lambda x: x[1]).distinct().collect()
artists=dict()
for v,k in enumerate(a_list):
    artists[k]=v
n_users=len(u_list)
n_artists=len(a_list)

dataset=dataset.map(lambda x: (users[x[0]],artists[x[1]],x[2])) #the number of plays is not relevant here
dataset.cache()
del u_list, a_list, users, artists


print("There are ", n_users, " different users")
print("There are ", n_artists, " valid artists")


#Generating some useful files for later

#Artists in this dataset can appear under more than one name
def build_arts_dict(dataset):
    #keep a single name (the first one seen) per artist id
    return dict(dataset.map(lambda x: (x[1],x[2])).groupByKey().mapValues(lambda names: list(names)[0]).collect())

arts_dic=build_arts_dict(dataset)
with open('artists.json', 'w') as outfile:
    json.dump(arts_dic, outfile)
del arts_dic
dataset.cache()
dataset.map(lambda x: (x[0],x[1])).saveAsTextFile('processed_data')

users_per_artist=dict(dataset.map(lambda x: (x[1],x[0])).groupByKey().mapValues(len).map(list).collect())
with open('users_per_artist.json', 'w') as outfile:
    json.dump(users_per_artist, outfile)
del users_per_artist
dataset.unpersist()
There are  359339  different users
There are  160162  valid artists
Out[8]:
PythonRDD[202] at RDD at PythonRDD.scala:43

Part 2 - Establishing Artists' pairwise similarities

Calculating cosine similarities among artists (this penalizes according to the frequencies of artists across users, and penalizes asymmetric frequencies less than Dice similarity would, for example), considering only whether an artist appears in a user's top-played list, regardless of the number of plays, and keeping only pairs with a high-enough similarity.
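Concretely, writing $U_a$ for the set of users that have artist $a$ in their top-played list, the similarity computed in section 2.2 is:

$$sim(a,b)=\frac{|U_a \cap U_b|}{\sqrt{|U_a|}\sqrt{|U_b|}}$$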

If we assume that each user has roughly 50 top-played artists, and that each artist has an equal chance of appearing in any given user's list, then the expected cosine similarity between two artists, if everything happened by chance, can be approximated as:

$$expected\ sim=\frac{\left(\frac{50}{n_{artists}}\right)^2 \times n_{users}}{\left(\sqrt{50}\right)^2} \approx 0.0007$$
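A quick numeric check of this approximation, together with the 4x threshold used later (plain Python, variable names just for illustration):

from math import sqrt

n_artists = 160162
n_users = 359339

expected_sim = ((50.0 / n_artists) ** 2) * n_users / (sqrt(50.0) ** 2)  #chance-level similarity
threshold = 4 * expected_sim  #the cutoff applied in section 2.2

print(expected_sim)  # ~0.0007
print(threshold)     # ~0.0028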

However, since pairs of artists with such a low similarity would likely not end up in the same cluster anyway, it could be a better idea to set an arbitrary threshold instead, so as to decrease the number of pairs. As a reference point, a cosine similarity of 0.1 would correspond to two artists each appearing in the playlists of 100 users and having 10 users in common; or, in a different case, to one artist appearing in 100 users' playlists and another in 50, with 7 users in common.
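To make the arithmetic behind those two reference examples explicit:

$$\frac{10}{\sqrt{100}\sqrt{100}}=0.1 \qquad\qquad \frac{7}{\sqrt{100}\sqrt{50}}\approx 0.099$$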

In order to decrease the number of artist pairs to evaluate in the clustering part and make it manageable on a single machine, it's better to set a minimum requirement on the number of common users: here I set it to 7, so a pair of artists needs at least 7 users in common to be assigned a non-zero similarity (otherwise artists could end up in the same cluster only because a single user listened to both). In addition, a pair is only kept if its cosine similarity exceeds a threshold, which I set at 4 times the expected value under the random model above.

2.1 Generating candidate pairs of artists to compare

This dataset encompasses ~360,000 users, each with a list of 50 top-played artists, covering ~160,000 different artists in total. Trying to establish similarities between all artists would imply looping through $160,000 \times (160,000-1)/2 \approx 13,000,000,000$ pairs, so it's better to first see which artists have users in common: of the 13 billion possible pairs, very few share even one user. Generating pairs per user instead implies looping over only $360,000 \times 50 \times (50-1)/2 \approx 441,000,000$ pairs, and in the process it's possible to count how many users each pair of artists has in common, which eases the later similarity calculations.
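As a toy illustration of this counting idea (plain Python on made-up data; the Spark code below does the same with RDDs):

import itertools
from collections import Counter

# each entry is one user's list of top-played artist ids (made-up)
users_top_artists = [[1, 2, 3], [2, 3, 4], [1, 3]]

pair_counts = Counter()
for artists in users_top_artists:
    # every unordered pair of artists in this user's list gains one common user
    for pair in itertools.combinations(sorted(artists), 2):
        pair_counts[pair] += 1

print(pair_counts)  # e.g. the pair (2, 3) has 2 users in common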

In [2]:
import itertools
from operator import add

def ordered(pair):
    #return the pair with the smaller id first, so (a,b) and (b,a) are counted as the same pair
    n1,n2=pair
    if n2>n1:
        return n1,n2
    else:
        return n2,n1

#group the (user, artist) pairs by user, keeping only each user's list of artists
dataset=sc.textFile('processed_data').map(eval).groupByKey().mapValues(list).map(lambda x: x[1])
#emit every pair of artists sharing a user, count the common users per pair, and keep pairs with at least 7 common users
dataset=dataset.flatMap(lambda x: [(ordered(i),1) for i in itertools.combinations(x,2)]).reduceByKey(add).map(lambda x: (x[0][0],x[0][1],x[1])).filter(lambda x: x[2]>6)
dataset.cache()
Out[2]:
PythonRDD[10] at RDD at PythonRDD.scala:43

2.2 Converting common-user counts to cosine similarities

In [3]:
from math import sqrt
import json

#Taken from the previous results
n_artists=160162
n_users=359339
threshold=4* (  ((50.0/n_artists)**2)*n_users/(sqrt(50.0)**2)  )

with open('users_per_artist.json') as file:
    users_per_artist=json.load(file)
users_per_artist={int(k):v for k,v in users_per_artist.items()}
bc_dic=sc.broadcast(users_per_artist)
del users_per_artist

#cosine similarity = common users / (sqrt(users of artist 1) * sqrt(users of artist 2)); keep only pairs above the threshold
dataset=dataset.map(lambda x: (x[0],x[1],x[2]*1.0/(sqrt(bc_dic.value[x[0]])*sqrt(bc_dic.value[x[1]])))).filter(lambda x: x[2]>threshold)
dataset.cache()
dataset.saveAsTextFile('sims')
print('There are ',dataset.count(),' non-zero pairwise similarities')
There are  7607557  non-zero pairwise similarities

Part 3 - Clustering Artists

Here I'll produce different clusterings, with 100, 200, 500, 700 and 1000 clusters, using power iteration clustering (PIC), which provides results similar to (though usually slightly worse than) spectral clustering but runs faster and scales better.
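As a side note, PowerIterationClustering.train from pyspark.mllib expects an RDD of (src, dst, similarity) tuples, which is exactly the shape of the 'sims' RDD built in Part 2. A minimal toy example (made-up ids and similarities):

from pyspark.mllib.clustering import PowerIterationClustering as pic

toy_sims = sc.parallelize([(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.7), (4, 5, 0.6)])  # tiny affinity graph
toy_model = pic.train(toy_sims, 2)        # ask for 2 clusters
print(toy_model.assignments().collect())  # a list of Assignment(id=..., cluster=...)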

In [3]:
from pyspark.mllib.clustering import PowerIterationClustering as pic
import pandas as pd
import json

# dataset=sc.textFile('sims').map(eval)
# dataset.cache()

#train one clustering per number of clusters and save the assignments as 'clusts<k>'
for n_clusters in [100, 200, 500, 700, 1000]:
    clusters=pic.train(dataset,n_clusters)
    clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts'+str(n_clusters))
    del clusters


dataset=pd.read_csv('clusts100/part-00000',header=None)
dataset.columns=['artist_id','cluster100']
dataset200=pd.read_csv('clusts200/part-00000',header=None)
dataset200.columns=['artist_id','cluster200']
dataset500=pd.read_csv('clusts500/part-00000',header=None)
dataset500.columns=['artist_id','cluster500']
dataset700=pd.read_csv('clusts700/part-00000',header=None)
dataset700.columns=['artist_id','cluster700']
dataset1000=pd.read_csv('clusts1000/part-00000',header=None)
dataset1000.columns=['artist_id','cluster1000']


dataset=dataset.merge(dataset200,how='outer',on='artist_id').merge(dataset500,how='outer',on='artist_id').merge(dataset700,how='outer',on='artist_id').merge(dataset1000,how='outer',on='artist_id')

with open('artists.json') as art:
    artists_dict=json.load(art)
artists_dict={int(k):v for k,v in artists_dict.items()}
dataset['artist_name']=[artists_dict[art] for art in dataset['artist_id']]
dataset.to_csv('results_all.csv',index=False)

del dataset200,dataset500,dataset700,dataset1000
print(dataset.shape[0],' artists were clustered')
dataset.head()
55637  artists were clustered
Out[3]:
artist_id cluster100 cluster200 cluster500 cluster700 cluster1000 artist_name
0 43120 54 56 328 29 976 sy smith
1 90797 74 154 103 466 829 phoebe killdeer and the short straws
2 10290 56 147 273 30 928 judge jules
3 17934 85 78 275 683 853 strip steve
4 30429 49 38 299 672 352 marja mattlar

Since most artists appear in only one or two users' playlists, it's unreliable (and computationally expensive) to cluster them with so little data. That's why only a fraction of the artists (55,637 of the 160,162, around one third) ended up being clustered.

Additional: clustering with scikit-learn (DBSCAN) and igraph (Louvain modularity); both run non-parallel, on a single machine. I chose these algorithms and their parameters after some manual experimentation, looking for a reasonable spread of artists across clusters. Both algorithms have the nice property of automatically determining the number of clusters.

In [6]:
#Rearranging the data format: strip parentheses and whitespace from the saved tuples to get a plain csv
import re

dataset=sc.textFile('sims')
dataset=dataset.map(lambda x: re.sub(r'[()\s]','',x))
dataset.repartition(1).saveAsTextFile('sims_csv')
dataset.unpersist()
del dataset
In [4]:
import sklearn, igraph, scipy, re
import pandas as pd
import sklearn.cluster

dataset=pd.read_csv('sims_csv/part-00000',header=None)
dataset.columns=['art1','art2','sim']
dataset['dist']=[1-i for i in dataset['sim']] #DBSCAN needs distances, so convert the similarities
#re-number the artist ids to a contiguous 0..n-1 range, as needed to build the sparse distance matrix
present_artists=set(dataset['art1'].append(dataset['art2']).values.tolist())
new_numer_art_to_int=dict()
new_numer_int_to_art=dict()
count=0
for art in present_artists:
    new_numer_art_to_int[art]=count
    new_numer_int_to_art[count]=art
    count+=1
del present_artists, count
dataset['art1']=[new_numer_art_to_int[i] for i in dataset['art1']]
dataset['art2']=[new_numer_art_to_int[i] for i in dataset['art2']]

#build a symmetric sparse distance matrix (each pair is inserted in both directions)
I=dataset['art1'].append(dataset['art2'])
J=dataset['art2'].append(dataset['art1'])
V=dataset['dist'].append(dataset['dist'])

dataset_matrix=scipy.sparse.csr_matrix((V,(I,J)))
del I,J,V
dataset_matrix

#DBSCAN on the precomputed sparse distance matrix; a label of -1 means 'noise' (no cluster)
dbsc=sklearn.cluster.DBSCAN(eps=0.775,metric='precomputed').fit_predict(dataset_matrix)
new_res=pd.Series(range(dataset_matrix.shape[0])).to_frame()
new_res.columns=['artist_id']
new_res['dbsc']=dbsc
del dbsc, dataset_matrix

#build a weighted, undirected artist graph and run Louvain community detection (igraph's community_multilevel)
g=igraph.Graph(edges=dataset[['art1','art2']].values.tolist(),directed=False)
g.es['weight']=dataset['sim'].values.tolist()
del dataset
louvain_weighted=g.community_multilevel(weights=g.es['weight'])
new_res['louvain']=louvain_weighted.membership
new_res['artist_id']=[new_numer_int_to_art[i] for i in new_res['artist_id']] #map back to the original artist ids

results=pd.read_csv('results_all.csv',engine='python')
results=results.merge(new_res,how='left',on='artist_id')
new_res=new_res.merge(results[['artist_id','cluster100','cluster200','cluster500','cluster700','cluster1000']],how='left',on='artist_id')
cols=results.columns.tolist()
cols=cols[6:7]+cols[1:6]+cols[7:9] #put artist_name first and drop the artist_id column
results=results[cols]
results.columns=[re.sub('cluster','pic',i) for i in results.columns]
new_res.columns=[re.sub('cluster','pic',i) for i in new_res.columns]

results.to_csv('results_all.csv',index=False)
results.head()
Out[4]:
artist_name pic100 pic200 pic500 pic700 pic1000 dbsc louvain
0 sy smith 54 56 328 29 976 -1 55
1 phoebe killdeer and the short straws 74 154 103 466 829 -1 62
2 judge jules 56 147 273 30 928 -1 55
3 strip steve 85 78 275 683 853 -1 62
4 marja mattlar 49 38 299 672 352 -1 38

Note: a cluster assignment of -1 means that the row was not assigned to any cluster. With DBSCAN most of the artists end up unassigned, so the clusters it does produce should be of better quality.

Part 4 - Checking cluster sizes and calculating cluster quality metrics

4.1 Checking the sizes of the largest clusters for the different algorithms

In [5]:
#cluster sizes per algorithm, largest first; for dbsc (column index 5) the first entry is the -1 'noise' group, so drop it
sizes=[pd.Series(results[i].value_counts()) for i in results.columns[1:]]
sizes[5]=sizes[5][1:]
for i in range(len(sizes)):
    sizes[i].index=range(len(sizes[i]))
sizes=pd.DataFrame(sizes).transpose()
sizes.columns=results.columns[1:]
sizes.fillna('')
Out[5]:
pic100 pic200 pic500 pic700 pic1000 dbsc louvain
0 1817 1013 414 287 287 2065 13605
1 1779 1005 361 287 196 724 6314
2 1752 908 360 271 188 289 6001
3 1664 875 350 251 187 261 4418
4 1547 762 349 246 183 217 2518
5 1507 757 343 244 179 187 2514
6 1442 750 337 244 176 177 2470
7 1329 749 332 241 174 176 2224
8 1324 746 332 237 170 176 1893
9 1319 730 332 235 168 169 1633
10 1292 724 329 234 168 162 1552
11 1204 722 327 229 165 159 1097
12 1188 716 326 228 165 158 960
13 1185 696 324 223 163 147 830
14 1184 688 324 221 160 144 828
15 1170 683 321 220 158 144 744
16 1146 678 319 220 158 137 684
17 1144 676 316 218 157 133 641
18 1120 663 314 218 157 120 641
19 1112 656 312 215 157 111 508
20 1084 653 311 215 156 101 496
21 972 652 310 214 156 98 367
22 950 645 305 214 155 85 350
23 907 622 304 211 154 84 328
24 890 622 301 209 153 82 302
25 888 615 300 209 149 80 237
26 880 606 300 208 149 80 215
27 788 594 299 208 148 79 208
28 753 592 299 207 148 75 157
29 724 591 292 207 147 72 102
... ... ... ... ... ... ... ...
970 2
971 2
972 2
973 2
974 2
975 2
976 2
977 2
978 2
979 2
980 2
981 2
982 2
983 1
984 1
985 1
986 1
987 1
988 1
989 1
990 1
991 1
992 1
993 1
994 1
995 1
996 1
997 1
998 1
999 1

1000 rows × 7 columns

From these results, it can be seen that 1000 clusters was definitely too many, since many artists ended up in singleton clusters.

4.2 Calculating cluster quality metrics

Given the size of this dataset, it's not feasible to calculate typical clustering quality metrics such as the silhouette coefficient or the Dunn index, but some graph-cut metrics can be used. In this case, I'll use modularity, which can be calculated very efficiently for this dataset. This metric is, however, very sensitive to singleton clusters (clusters of size 1) and favors larger clusters, so it might not be the best criterion for deciding which algorithm did better, but it still gives some indication. Possible values for modularity range from -0.5 to 1, with higher being better. For DBSCAN, however, this metric isn't comparable to the other algorithms, since most artists are not assigned to any cluster.
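For reference, the (weighted) modularity of a partition is defined as:

$$Q=\frac{1}{2m}\sum_{i,j}\left[A_{ij}-\frac{k_i k_j}{2m}\right]\delta(c_i,c_j)$$

where $A_{ij}$ is the weight of the edge between artists $i$ and $j$, $k_i=\sum_j A_{ij}$ is the weighted degree of artist $i$, $m$ is the total edge weight, and $\delta(c_i,c_j)$ equals 1 when both artists are in the same cluster and 0 otherwise.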

In [6]:
import numpy as np

print('Modularity for Power Iteration Clustering with 100 clusters :',g.modularity(membership=new_res['pic100'],weights=g.es['weight']))
print('Modularity for Power Iteration Clustering with 200 clusters :',g.modularity(membership=new_res['pic200'],weights=g.es['weight']))
print('Modularity for Power Iteration Clustering with 500 clusters :',g.modularity(membership=new_res['pic500'],weights=g.es['weight']))
print('Modularity for Power Iteration Clustering with 700 clusters :',g.modularity(membership=new_res['pic700'],weights=g.es['weight']))
print('Modularity for Power Iteration Clustering with 1000 clusters :',g.modularity(membership=new_res['pic1000'],weights=g.es['weight']))
print('Modularity for Louvain Modularity (',len(set(louvain_weighted.membership)),'clusters) :',louvain_weighted.modularity)
print()
print("Results for DBSCAN:")
print("Number of clusters: ",len(set(results['dbsc']))-1)
print("Number of artists belonging to a cluster: ",len(results['dbsc'].loc[results['dbsc']!=-1]))
del g, louvain_weighted, new_res, new_numer_art_to_int, new_numer_int_to_art
Modularity for Power Iteration Clustering with 100 clusters : 0.037718345357822605
Modularity for Power Iteration Clustering with 200 clusters : 0.019087964635427532
Modularity for Power Iteration Clustering with 500 clusters : 0.0073897684741175
Modularity for Power Iteration Clustering with 700 clusters : 0.005237515826273008
Modularity for Power Iteration Clustering with 1000 clusters : 0.003573255160752189
Modularity for Louvain Modularity ( 76 clusters) : 0.6073314825719774

Results for DBSCAN:
Number of clusters:  343
Number of artists belonging to a cluster:  11867

Graph modularity seems to suggest that, in this case, power iteration clustering did terribly compared to Louvain modularity, but as mentioned before, this metric might not be the most adequate here, and a low score doesn't necessarily mean that the results are invalid; good knowledge of the music industry and manual examination would be needed to judge that. Moreover, the clusterings obtained from power iteration clustering also come at different levels of granularity (according to the number of clusters).

Part 5 - Checking a sample of the results

Here I'll check some of the medium-sized clusters obtained from DBSCAN, which should be of better quality than the ones obtained from the other algorithms, since it only assigned a fraction of the artists to any cluster.

In [7]:
clusters=results['dbsc'].value_counts()[10:15].index #the 11th to 15th largest DBSCAN clusters
clusters=[pd.DataFrame(results[['artist_name']].loc[results['dbsc']==i]) for i in clusters]
for i in range(len(clusters)):
    clusters[i].index=range(len(clusters[i]))
clusters=clusters[0].merge(clusters[1],left_index=True, right_index=True).merge(clusters[2],left_index=True, right_index=True).merge(clusters[3],left_index=True, right_index=True).merge(clusters[4],left_index=True, right_index=True)
clusters.columns=['cluster'+str(i) for i in range(1,len(clusters.columns)+1)]
clusters
Out[7]:
cluster1 cluster2 cluster3 cluster4 cluster5
0 edip akbayram shreya ghosal hans albers ten typ mes david houston
1 silahs?z kuvvet juggy d jimmy makulis bora vince gill
2 muharrem ertas himesh reshammiya heino pijani powietrzem kitty wells
3 ali akbar moradi r.d.burman michael heck koniec ?wiata buck owens
4 rojin b21 petra frey pokahontaz trace adkins
5 cem karaca suzanne maria & margot hellwig lumpex75 eric church
6 emre ayd?n call hansi hinterseer coalition janie fricke
7 mozole mirach jagjit singh lale andersen fu brooks & dunn
8 fuat saka noor jehan vikinger s?owa we krwi montgomery gentry
9 sabahat akkiraz dr. zeus nino de angelo pogotowie seksualne toby keith
10 emrah vishal bhardwaj das stoakogler trio 2cztery7 lee ann womack
11 nazan Öncel amrinder gill manuela strachy na lachy johnny horton
12 agire jiyan farida khanum lys assia hey connie smith
13 ferhat göçer kamal heer g. g. anderson kaliber 44 mark chesnutt
14 athena spb uwe busse kazik staszewski donna fargo
15 grup yorum sujatha mara kayser pezet pam tillis
16 mt bikram singh bernd stelter dezerter collin raye
17 haluk levent ghulam ali wencke myhre pezet-noon eddy arnold
18 mazhar alanson unni menon markus becker all wheel drive clint black
19 sümer ezgü fuzon die stoakogler zeus jessica andrews
20 ajda pekkan shantanu moitra klostertaler kombajn do zbierania kur po wioskach keith whitley
21 fuat kuldip manak ireen sheer muchy patty loveless
22 tolga çandar chitra rex gildo marek grechuta terri clark
23 aynur do?an udit narayan & alka yagnik rudi schuricke cool kids of death bill anderson
24 esengül srinivas uta bresan infekcja mindy mccready
25 nilgül salim-sulaiman oliver thomas konkwista 88 charley pride
26 bulutsuzluk Özlemi babbu maan roberto blanco homomilitia dwight yoakam
27 direc-t mohit chauhan freddy quinn akurat rascal flatts
28 muazzez ersoy bombay jayashree alpenrebellen lady pank vern gosdin
29 erol parlak ismail darbar katja ebstein east west rockers george jones
... ... ... ... ... ...
117 aydilge a.r. rahman jonny hill farben lehre porter wagoner
118 bülent ortaçgil mithoon nik p. awantura kellie pickler
119 Üç nokta bir gurdas maan andrea jürgens indios bravos charlie rich
120 a?k?n nur yengi shamur andreas martin pid?ama porno webb pierce
121 hümeyra shehzad roy maxi arland daab hank thompson
122 gece yolcular? rajesh roshan axel becker izrael jason aldean
123 y?ld?z tilbe ali haider kastelruther spatzen tworzywo sztuczne doug stone
124 mor ve Ötesi jasbir jassi die flippers waglewski fisz emade faron young
125 yeni türkü wadali brothers bata illic grzegorz turnau sugarland
126 serdar ortaç sunidhi chauhan matthias reim peja jim ed brown
127 ?smail hakk? demircio?lu bally jagpal peter maffay happysad tracy byrd
128 nefret karthik brunner & brunner p?omie? 81 jason michael carroll
129 demir demirkan javed ali slavko avsenik und seine original oberkrainer rahim sonny james
130 leman sam shaan andrea berg cymeon x chris cagle
131 hayko cepkin abida parveen die zillertaler april craig morgan
132 yal?n asha bhosle & kishore kumar marianne & michael t.love bobby bare
133 aylin asl?m stereo nation jürgen wwo billy currington
134 fuchs rishi rich jürgen drews btm rodney atkins
135 berdan mardini sukhwinder singh margot eskens bia?a gor?czka claude king
136 yüksek sadakat vishal - shekhar ernst mosch und seine original egerländer musi... sound pollution jimmy wayne
137 pinhani rdb die jungen zillertaler guernica y luno gene watson
138 Çelik shail hada angela wiedl ca?a góra barwinków keith anderson
139 hande yener amitabh bachchan original naabtal duo lilu reba mcentire
140 soner arica talat mahmood vico torriani maleo reggae rockers johnny lee
141 emre altu? aziz mian hanne haller maria peszek jim reeves
142 abluka alarm neeraj shridhar & tulsi kumar gus backus defekt muzgó jimmy dean
143 cengiz Özkan ali zafar mickie krause fisz randy travis
144 bülent ersoy chitra singh juliane werding pih billie jo spears
145 ali ekber cicek bally sagoo wenche myhre dixon37 dan seals
146 betül demir achanak truck stop molesta lonestar

147 rows × 5 columns

Trying to interpret the clusters:

  • The first cluster seems to contain mostly Turkish musicians of popular and folk music.
  • The second cluster seems to contain mostly Indian-Punjabi musicians, also of popular and folk music.
  • The third cluster seems to contain mostly German-speaking singers of pop and movie-derived songs.
  • The fourth cluster seems to contain mostly smaller Eastern European artists of alternative rock and indie music.
  • The fifth cluster seems to contain mostly country artists from the US.