From the official Spotipy docs:
"Spotipy is a lightweight Python library for the Spotify Web API. With Spotipy you get full access to all of the music data provided by the Spotify platform."
Spotify offers a number of API endpoints to access its data. In this notebook I used the search endpoint (to collect track IDs) and the audio features endpoint.
The data was collected on several days during the months of April, May and August 2018.
The goal is to show how to collect audio features data for tracks from the official Spotify Web API in order to use it for further analysis/machine learning, which will be part of another notebook.
The below code is sufficient to set up Spotipy for querying the API endpoint. A more detailed explanation of the whole procedure is available in the official docs.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

cid = "xx"
secret = "xx"

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
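As a side note, hardcoding credentials in a notebook is risky. A minimal sketch of reading them from environment variables instead (the variable names follow Spotipy's own convention, and recent Spotipy versions can also pick these up automatically when no credentials are passed):

```python
import os

# Sketch: load the Spotify API credentials from environment variables instead
# of hardcoding them; falls back to the "xx" placeholders if they are not set.
cid = os.environ.get('SPOTIPY_CLIENT_ID', 'xx')
secret = os.environ.get('SPOTIPY_CLIENT_SECRET', 'xx')
```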
The data collection is divided into two parts: the track IDs and the audio features. In this step, I'm going to collect 10,000 track IDs from the Spotify API.
The search endpoint used in this step had a few limitations:
Only 50 results can be returned per query, and Spotify cut the maximum offset down to 10,000 (apparently as of May 2018); I was lucky enough to do my first collection attempt while it was still 100,000.
My solution: using a nested for loop, I increased the offset by 50 in the outer loop until the maximum offset was reached. The inner for loop did the actual querying and appended the returned results to the appropriate lists, which I then used to create my dataframe.
# timeit library to measure the time needed to run this code
import timeit

start = timeit.default_timer()

# create empty lists where the results are going to be stored
artist_name = []
track_name = []
popularity = []
track_id = []

for i in range(0, 10000, 50):
    track_results = sp.search(q='year:2018', type='track', limit=50, offset=i)
    # no loop index is needed here, and reusing i would shadow the offset variable
    for t in track_results['tracks']['items']:
        artist_name.append(t['artists'][0]['name'])
        track_name.append(t['name'])
        track_id.append(t['id'])
        popularity.append(t['popularity'])

stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)
Time to run this code (in seconds): 242.2539935503155
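The nested-loop pagination pattern can be exercised without hitting the API by mocking the endpoint. In this sketch, `fetch_page` is a hypothetical stand-in for `sp.search`:

```python
def fetch_page(catalogue, offset, limit):
    """Hypothetical stand-in for sp.search: return one page of results."""
    return catalogue[offset:offset + limit]

def collect_ids(catalogue, limit=50, max_offset=200):
    collected = []
    for offset in range(0, max_offset, limit):             # outer loop: advance the offset
        for item in fetch_page(catalogue, offset, limit):  # inner loop: store the results
            collected.append(item)
    return collected

catalogue = ['track_%d' % n for n in range(200)]
assert collect_ids(catalogue) == catalogue  # every page was collected exactly once
```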
In the next few cells, I'm going to do some exploratory data analysis as well as data preparation of the newly gained data.
A quick check for the track_id list:
print('number of elements in the track_id list:', len(track_id))
number of elements in the track_id list: 10000
Looks good. Now loading the lists into a dataframe.
import pandas as pd
df_tracks = pd.DataFrame({'artist_name': artist_name,
                          'track_name': track_name,
                          'track_id': track_id,
                          'popularity': popularity})
print(df_tracks.shape)
df_tracks.head()
(10000, 4)
| | artist_name | popularity | track_id | track_name |
|---|---|---|---|---|
| 0 | Drake | 100 | 2G7V7zsVDxg1yRsu7Ew9RJ | In My Feelings |
| 1 | XXXTENTACION | 97 | 3ee8Jmje8o58CHK66QrVC2 | SAD! |
| 2 | Tyga | 96 | 5IaHrVsrferBYDm0bDyABy | Taste (feat. Offset) |
| 3 | Cardi B | 97 | 58q2HKrzhC3ozto2nDdN4z | I Like It |
| 4 | XXXTENTACION | 95 | 0JP9xo3adEtGSdUEISiszL | Moonlight |
df_tracks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
artist_name    10000 non-null object
popularity     10000 non-null int64
track_id       10000 non-null object
track_name     10000 non-null object
dtypes: int64(1), object(3)
memory usage: 312.6+ KB
Sometimes, the same track is returned under different track IDs (single, as part of an album etc.).
This needs to be checked for and corrected if needed.
# group the entries by artist_name and track_name and check for duplicates
grouped = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
grouped[grouped > 1].count()
524
524 artist/track combinations appear more than once. The duplicate rows will be dropped in the next cell:
df_tracks.drop_duplicates(subset=['artist_name','track_name'], inplace=True)
# doing the same grouping as before to verify the solution
grouped_after_dropping = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
grouped_after_dropping[grouped_after_dropping > 1].count()
0
This time the results are empty. Another way of checking this:
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()
artist_name    0
popularity     0
track_id       0
track_name     0
dtype: int64
Checking how many tracks are left now:
df_tracks.shape
(9460, 4)
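For reference, `drop_duplicates(subset=..., keep='first')` keeps the first row seen for each (artist, track) pair. A stdlib sketch of the same logic on toy data (the IDs here are made up, not from the collection above):

```python
rows = [('Drake', 'In My Feelings', 'id1'),
        ('Drake', 'In My Feelings', 'id2'),   # same track under a second ID
        ('Cardi B', 'I Like It', 'id3')]

seen, deduped = set(), []
for artist, track, tid in rows:
    key = (artist, track)
    if key not in seen:          # keep only the first occurrence of each pair
        seen.add(key)
        deduped.append((artist, track, tid))

assert [r[2] for r in deduped] == ['id1', 'id3']
```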
With the audio features endpoint I will now get the audio features data for my 9460 track IDs.
The limitation for this endpoint is that a maximum of 100 track IDs can be submitted per query.
Again, I used a nested for loop. This time the outer loop pulled the track IDs in batches of size 100 and queried the endpoint, while the inner for loop appended the returned results to the rows list.
Additionally, I had to implement a check for track IDs that didn't return any audio features (i.e. None was returned), as this was causing issues.
# again measuring the time
start = timeit.default_timer()

# empty results list, batch size and a counter for None results
rows = []
batchsize = 100
None_counter = 0

for i in range(0, len(df_tracks['track_id']), batchsize):
    batch = df_tracks['track_id'][i:i+batchsize]
    feature_results = sp.audio_features(batch)
    # some track IDs return no audio features at all
    for t in feature_results:
        if t is None:
            None_counter += 1
        else:
            rows.append(t)

print('Number of tracks where no audio features were available:', None_counter)
stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)
Number of tracks where no audio features were available: 86
Time to run this code (in seconds): 11.267732854001224
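The batch slicing in the loop above can be factored into a small generic helper; a sketch (not Spotipy-specific):

```python
def batches(seq, size):
    """Yield successive slices of seq with at most `size` elements each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# 250 IDs split into batches of 100 gives batch sizes 100, 100 and 50
ids = list(range(250))
assert [len(b) for b in batches(ids, 100)] == [100, 100, 50]
```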
Same as with the first dataset, checking what the rows list looks like:
print('number of elements in the rows list:', len(rows))
number of elements in the rows list: 9374
Finally, I will load the audio features in a dataframe.
df_audio_features = pd.DataFrame.from_dict(rows, orient='columns')
print("Shape of the dataset:", df_audio_features.shape)
df_audio_features.head()
Shape of the dataset: (9374, 18)
| | acousticness | analysis_url | danceability | duration_ms | energy | id | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00669 | https://api.spotify.com/v1/audio-analysis/2G7V... | 0.738 | 217933 | 0.466 | 2G7V7zsVDxg1yRsu7Ew9RJ | 0.01020 | 8 | 0.449 | -9.433 | 1 | 0.1370 | 181.992 | 4 | https://api.spotify.com/v1/tracks/2G7V7zsVDxg1... | audio_features | spotify:track:2G7V7zsVDxg1yRsu7Ew9RJ | 0.401 |
| 1 | 0.25800 | https://api.spotify.com/v1/audio-analysis/3ee8... | 0.740 | 166606 | 0.613 | 3ee8Jmje8o58CHK66QrVC2 | 0.00372 | 8 | 0.123 | -4.880 | 1 | 0.1450 | 75.023 | 4 | https://api.spotify.com/v1/tracks/3ee8Jmje8o58... | audio_features | spotify:track:3ee8Jmje8o58CHK66QrVC2 | 0.473 |
| 2 | 0.02360 | https://api.spotify.com/v1/audio-analysis/5IaH... | 0.884 | 232959 | 0.559 | 5IaHrVsrferBYDm0bDyABy | 0.00000 | 0 | 0.101 | -7.442 | 1 | 0.1200 | 97.994 | 4 | https://api.spotify.com/v1/tracks/5IaHrVsrferB... | audio_features | spotify:track:5IaHrVsrferBYDm0bDyABy | 0.342 |
| 3 | 0.09900 | https://api.spotify.com/v1/audio-analysis/58q2... | 0.816 | 253390 | 0.726 | 58q2HKrzhC3ozto2nDdN4z | 0.00000 | 5 | 0.372 | -3.998 | 0 | 0.1290 | 136.048 | 4 | https://api.spotify.com/v1/tracks/58q2HKrzhC3o... | audio_features | spotify:track:58q2HKrzhC3ozto2nDdN4z | 0.650 |
| 4 | 0.55600 | https://api.spotify.com/v1/audio-analysis/0JP9... | 0.921 | 135090 | 0.537 | 0JP9xo3adEtGSdUEISiszL | 0.00404 | 9 | 0.102 | -5.723 | 0 | 0.0804 | 128.009 | 4 | https://api.spotify.com/v1/tracks/0JP9xo3adEtG... | audio_features | spotify:track:0JP9xo3adEtGSdUEISiszL | 0.711 |
df_audio_features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9374 entries, 0 to 9373
Data columns (total 18 columns):
acousticness        9374 non-null float64
analysis_url        9374 non-null object
danceability        9374 non-null float64
duration_ms         9374 non-null int64
energy              9374 non-null float64
id                  9374 non-null object
instrumentalness    9374 non-null float64
key                 9374 non-null int64
liveness            9374 non-null float64
loudness            9374 non-null float64
mode                9374 non-null int64
speechiness         9374 non-null float64
tempo               9374 non-null float64
time_signature      9374 non-null int64
track_href          9374 non-null object
type                9374 non-null object
uri                 9374 non-null object
valence             9374 non-null float64
dtypes: float64(9), int64(4), object(5)
memory usage: 1.3+ MB
Some columns are not needed for the analysis so I will drop them.
Also the ID column will be renamed to track_id so that it matches the column name from the first dataframe.
columns_to_drop = ['analysis_url','track_href','type','uri']
df_audio_features.drop(columns_to_drop, axis=1,inplace=True)
df_audio_features.rename(columns={'id': 'track_id'}, inplace=True)
df_audio_features.shape
(9374, 14)
# merge both dataframes
# the 'inner' method will make sure that we only keep track IDs present in both datasets
df = pd.merge(df_tracks,df_audio_features,on='track_id',how='inner')
print("Shape of the dataset:", df.shape)
df.head()
Shape of the dataset: (9374, 17)
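To see why the 'inner' method matters, here is a toy illustration of an inner join on track_id using plain dictionaries (hypothetical IDs; `pd.merge` performs the same keyed matching on the real data):

```python
tracks = {'id1': 'Song A', 'id2': 'Song B', 'id3': 'Song C'}
features = {'id1': 0.73, 'id3': 0.41}   # 'id2' returned no audio features

# an inner join keeps only the keys present on both sides
merged = {k: (tracks[k], features[k]) for k in tracks if k in features}
assert merged == {'id1': ('Song A', 0.73), 'id3': ('Song C', 0.41)}
```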
| | artist_name | popularity | track_id | track_name | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Drake | 100 | 2G7V7zsVDxg1yRsu7Ew9RJ | In My Feelings | 0.00669 | 0.738 | 217933 | 0.466 | 0.01020 | 8 | 0.449 | -9.433 | 1 | 0.1370 | 181.992 | 4 | 0.401 |
| 1 | XXXTENTACION | 97 | 3ee8Jmje8o58CHK66QrVC2 | SAD! | 0.25800 | 0.740 | 166606 | 0.613 | 0.00372 | 8 | 0.123 | -4.880 | 1 | 0.1450 | 75.023 | 4 | 0.473 |
| 2 | Tyga | 96 | 5IaHrVsrferBYDm0bDyABy | Taste (feat. Offset) | 0.02360 | 0.884 | 232959 | 0.559 | 0.00000 | 0 | 0.101 | -7.442 | 1 | 0.1200 | 97.994 | 4 | 0.342 |
| 3 | Cardi B | 97 | 58q2HKrzhC3ozto2nDdN4z | I Like It | 0.09900 | 0.816 | 253390 | 0.726 | 0.00000 | 5 | 0.372 | -3.998 | 0 | 0.1290 | 136.048 | 4 | 0.650 |
| 4 | XXXTENTACION | 95 | 0JP9xo3adEtGSdUEISiszL | Moonlight | 0.55600 | 0.921 | 135090 | 0.537 | 0.00404 | 9 | 0.102 | -5.723 | 0 | 0.0804 | 128.009 | 4 | 0.711 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9374 entries, 0 to 9373
Data columns (total 17 columns):
artist_name         9374 non-null object
popularity          9374 non-null int64
track_id            9374 non-null object
track_name          9374 non-null object
acousticness        9374 non-null float64
danceability        9374 non-null float64
duration_ms         9374 non-null int64
energy              9374 non-null float64
instrumentalness    9374 non-null float64
key                 9374 non-null int64
liveness            9374 non-null float64
loudness            9374 non-null float64
mode                9374 non-null int64
speechiness         9374 non-null float64
tempo               9374 non-null float64
time_signature      9374 non-null int64
valence             9374 non-null float64
dtypes: float64(9), int64(5), object(3)
memory usage: 1.3+ MB
Just in case, checking for any duplicate tracks:
df[df.duplicated(subset=['artist_name','track_name'],keep=False)]
| artist_name | popularity | track_id | track_name | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Everything seems to be fine so I will save the dataframe as a .csv file.
df.to_csv('SpotifyAudioFeatures08082018.csv')