FMA: A Dataset For Music Analysis

Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.

Creation

From raw_*.csv, this notebook generates:

  • tracks.csv: per-track / album / artist metadata.
  • genres.csv: genre hierarchy.
  • echonest.csv: cleaned Echonest features.

A companion script, creation.py, is used to:

  1. Query the API and store metadata in raw_tracks.csv, raw_albums.csv, raw_artists.csv and raw_genres.csv.
  2. Download the audio for each track.
  3. Trim the audio to 30s clips.
  4. Normalize the permissions and modification / access times.
  5. Create the .zip archives.
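
Step 4 can be illustrated with the standard library alone. The snippet below is a minimal sketch and not the code of creation.py: the helper name `normalize_tree`, the modes 0o644 / 0o755 and the fixed timestamp are assumptions chosen for the example.

```python
import os
import tempfile

def normalize_tree(root, file_mode=0o644, dir_mode=0o755, timestamp=1000000000):
    """Set uniform permissions and access / modification times below root.

    Hypothetical helper: the modes and the timestamp are illustrative assumptions.
    """
    for dirpath, _, filenames in os.walk(root, topdown=False):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            os.chmod(path, file_mode)
            os.utime(path, (timestamp, timestamp))  # (atime, mtime)
        os.chmod(dirpath, dir_mode)
        os.utime(dirpath, (timestamp, timestamp))

# Try it on a throwaway tree that mimics one audio file in one folder.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, '000'))
    clip = os.path.join(root, '000', '000002.mp3')
    open(clip, 'wb').close()
    normalize_tree(root)
    clip_mtime = os.stat(clip).st_mtime
```

Uniform metadata of this kind presumably keeps the archives created in step 5 independent of when and how each file was downloaded.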
In [1]:
import os
import ast
import pickle

import IPython.display as ipd
import numpy as np
import pandas as pd

import utils
import creation
In [2]:
AUDIO_DIR = os.environ.get('AUDIO_DIR')
BASE_DIR = os.path.abspath(os.path.dirname(AUDIO_DIR))
FMA_FULL = os.path.join(BASE_DIR, 'fma_full')
FMA_LARGE = os.path.join(BASE_DIR, 'fma_large')

1 Retrieve metadata and audio from FMA

  1. Crawl the tracks, albums and artists metadata through their API.
  2. Download original .mp3 by HTTPS for each track id (only if we don't have it already).

Todo:

  • Scrape curators.
  • Download images (track_image_file, album_image_file, artist_image_file). Beware of the quality.
  • Verify checksum for some random tracks.

Dataset update:

  • To add new tracks: iterate from largest known track id to the most recent only.
  • To update user data: we need to get all tracks again.
In [3]:
# ./creation.py metadata
# ./creation.py data /path/to/fma/fma_full
# ./creation.py clips /path/to/fma

#!cat creation.py
In [4]:
# converters={'genres': ast.literal_eval}
tracks = pd.read_csv('raw_tracks.csv', index_col=0)
albums = pd.read_csv('raw_albums.csv', index_col=0)
artists = pd.read_csv('raw_artists.csv', index_col=0)
genres = pd.read_csv('raw_genres.csv', index_col=0)

with open('not_found.pickle', 'rb') as f:
    not_found = pickle.load(f)
In [5]:
def get_fs_tids(audio_dir):
    tids = []
    for _, dirnames, files in os.walk(audio_dir):
        if dirnames == []:
            tids.extend(int(file[:-4]) for file in files)
    return tids

audio_tids = get_fs_tids(FMA_FULL)
clips_tids = get_fs_tids(FMA_LARGE)
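
`get_fs_tids` relies on two conventions: audio files live in leaf directories only, and each file is named after its track id followed by a four-character extension such as `.mp3`. The sketch below replays the same logic on a temporary tree. The layout `000/000002.mp3` is an assumption made for the example.

```python
import os
import tempfile

def get_fs_tids(audio_dir):
    tids = []
    for _, dirnames, files in os.walk(audio_dir):
        if dirnames == []:  # leaf directory: it contains files only
            # Strip the 4-character extension ('.mp3') and parse the id.
            tids.extend(int(file[:-4]) for file in files)
    return tids

with tempfile.TemporaryDirectory() as audio_dir:
    for name in ['000/000002.mp3', '000/000005.mp3', '155/155320.mp3']:
        path = os.path.join(audio_dir, *name.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, 'wb').close()
    example_tids = sorted(get_fs_tids(audio_dir))
```

Leading zeros disappear in the conversion to `int`, so `000002.mp3` maps to track id 2. A file with a longer extension or a non-numeric name would raise a `ValueError`.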
In [6]:
print('tracks: {} collected ({} not found, {} max id)'.format(
    len(tracks), len(not_found['tracks']), tracks.index.max()))
print('albums: {} collected ({} not found, {} in tracks)'.format(
    len(albums), len(not_found['albums']), len(tracks['album_id'].unique())))
print('artists: {} collected ({} not found, {} in tracks)'.format(
    len(artists), len(not_found['artists']), len(tracks['artist_id'].unique())))
print('genres: {} collected'.format(len(genres)))
print('audio: {} collected ({} not found, {} not in tracks)'.format(
    len(audio_tids), len(not_found['audio']), len(set(audio_tids).difference(tracks.index))))
print('clips: {} collected ({} not found, {} not in tracks)'.format(
    len(clips_tids), len(not_found['clips']), len(set(clips_tids).difference(tracks.index))))
assert sum(tracks.index.isin(audio_tids)) + len(not_found['audio']) == len(tracks)
assert sum(tracks.index.isin(clips_tids)) + len(not_found['clips']) == sum(tracks.index.isin(audio_tids))
assert len(clips_tids) + len(not_found['clips']) + len(not_found['audio']) == len(tracks)
tracks: 109727 collected (45594 not found, 155320 max id)
albums: 15234 collected (480 not found, 15714 in tracks)
artists: 16916 collected (250 not found, 17166 in tracks)
genres: 164 collected
audio: 110668 collected (180 not found, 1121 not in tracks)
clips: 109261 collected (286 not found, 0 not in tracks)
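
The three assertions encode a bookkeeping identity: every collected track either has a clip, lost its clip during trimming, or lost its audio during download. The identity can be checked by hand with the counts printed above.

```python
# Counts copied from the output of the cell above.
n_tracks = 109727
audio_on_disk, audio_not_in_tracks, audio_not_found = 110668, 1121, 180
clips_on_disk, clips_not_found = 109261, 286

# Audio files that belong to a collected track.
tracks_with_audio = audio_on_disk - audio_not_in_tracks

assert tracks_with_audio + audio_not_found == n_tracks
assert clips_on_disk + clips_not_found == tracks_with_audio
assert clips_on_disk + clips_not_found + audio_not_found == n_tracks
```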
In [7]:
N = 5
ipd.display(tracks.head(N))
ipd.display(albums.head(N))
ipd.display(artists.head(N))
ipd.display(genres.head(N))
album_id album_title album_url artist_id artist_name artist_url artist_website license_image_file license_image_file_large license_parent_id ... track_information track_instrumental track_interest track_language_code track_listens track_lyricist track_number track_publisher track_title track_url
track_id
2 1.0 AWOL - A Way Of Life http://freemusicarchive.org/music/AWOL/AWOL_-_... 1 AWOL http://freemusicarchive.org/music/AWOL/ http://www.AzillionRecords.blogspot.com http://i.creativecommons.org/l/by-nc-sa/3.0/us... http://fma-files.s3.amazonaws.com/resources/im... 5.0 ... NaN 0 4656 en 1293 NaN 3 NaN Food http://freemusicarchive.org/music/AWOL/AWOL_-_...
3 1.0 AWOL - A Way Of Life http://freemusicarchive.org/music/AWOL/AWOL_-_... 1 AWOL http://freemusicarchive.org/music/AWOL/ http://www.AzillionRecords.blogspot.com http://i.creativecommons.org/l/by-nc-sa/3.0/us... http://fma-files.s3.amazonaws.com/resources/im... 5.0 ... NaN 0 1470 en 514 NaN 4 NaN Electric Ave http://freemusicarchive.org/music/AWOL/AWOL_-_...
5 1.0 AWOL - A Way Of Life http://freemusicarchive.org/music/AWOL/AWOL_-_... 1 AWOL http://freemusicarchive.org/music/AWOL/ http://www.AzillionRecords.blogspot.com http://i.creativecommons.org/l/by-nc-sa/3.0/us... http://fma-files.s3.amazonaws.com/resources/im... 5.0 ... NaN 0 1933 en 1151 NaN 6 NaN This World http://freemusicarchive.org/music/AWOL/AWOL_-_...
10 6.0 Constant Hitmaker http://freemusicarchive.org/music/Kurt_Vile/Co... 6 Kurt Vile http://freemusicarchive.org/music/Kurt_Vile/ http://kurtvile.com http://i.creativecommons.org/l/by-nc-nd/3.0/88... http://fma-files.s3.amazonaws.com/resources/im... NaN ... NaN 0 54881 en 50135 NaN 1 NaN Freeway http://freemusicarchive.org/music/Kurt_Vile/Co...
20 4.0 Niris http://freemusicarchive.org/music/Chris_and_Ni... 4 Nicky Cook http://freemusicarchive.org/music/Chris_and_Ni... NaN http://i.creativecommons.org/l/by-nc-nd/3.0/88... http://fma-files.s3.amazonaws.com/resources/im... NaN ... NaN 0 978 en 361 NaN 3 NaN Spiritual Level http://freemusicarchive.org/music/Chris_and_Ni...

5 rows × 38 columns

album_comments album_date_created album_date_released album_engineer album_favorites album_handle album_image_file album_images album_information album_listens album_producer album_title album_tracks album_type album_url artist_name artist_url tags
album_id
1 0 11/26/2008 01:44:45 AM 1/05/2009 NaN 4 AWOL_-_A_Way_Of_Life https://freemusicarchive.org/file/images/album... [{'image_id': '1955', 'image_file': 'https://f... <p></p> 6073 NaN AWOL - A Way Of Life 7 Album http://freemusicarchive.org/music/AWOL/AWOL_-_... AWOL http://freemusicarchive.org/music/AWOL/ []
100 0 11/26/2008 01:55:44 AM 1/09/2009 NaN 0 On_Opaque_Things https://freemusicarchive.org/file/images/album... [{'image_id': '4403', 'image_file': 'https://f... NaN 5613 NaN On Opaque Things 4 Album http://freemusicarchive.org/music/Bird_Names/O... Bird Names http://freemusicarchive.org/music/Bird_Names/ []
1000 0 12/04/2008 09:28:49 AM 10/26/2008 NaN 0 DMBQ_Live_at_2008_Record_Fair_on_WFMU_Record_F... https://freemusicarchive.org/file/images/album... [{'image_id': '31997', 'image_file': 'https://... <p>http://blog.wfmu.org/freeform/2008/10/what-... 1092 NaN DMBQ Live at 2008 Record Fair on WFMU Record F... 4 Live Performance http://freemusicarchive.org/music/DMBQ/DMBQ_Li... DMBQ http://freemusicarchive.org/music/DMBQ/ []
10000 0 9/05/2011 04:42:57 PM NaN NaN 0 Live_at_CKUT_on_Montreal_Sessions_1434 https://freemusicarchive.org/file/images/album... [{'image_id': '12266', 'image_file': 'https://... <p>Live Set on the Montreal Session February 2... 1001 NaN Live at CKUT on Montreal Sessions 1 Radio Program http://freemusicarchive.org/music/Sundrips/Liv... Sundrips http://freemusicarchive.org/music/Sundrips/ []
10001 0 9/06/2011 12:02:58 AM 1/01/2006 NaN 0 Grounds_Dream_Cosmic_Love https://freemusicarchive.org/file/images/album... [{'image_id': '24091', 'image_file': 'https://... <p>Recorded in Linnavuori, Finland, 2005 (with... 504 NaN Ground's Dream Cosmic Love 1 Album http://freemusicarchive.org/music/Uton/Grounds... Uton http://freemusicarchive.org/music/Uton/ []
artist_active_year_begin artist_active_year_end artist_associated_labels artist_bio artist_comments artist_contact artist_date_created artist_donation_url artist_favorites artist_flattr_name ... artist_location artist_longitude artist_members artist_name artist_paypal_name artist_related_projects artist_url artist_website artist_wikipedia_page tags
artist_id
1 2006.0 NaN NaN <p>A Way Of Life, A Collective of Hip-Hop from... 0 Brown Bum aka Choke 11/26/2008 01:42:32 AM NaN 9 NaN ... New Jersey -74.405661 Sajje Morocco,Brownbum,ZawidaGod,Custodian of ... AWOL NaN The list of past projects is 2 long but every1... http://freemusicarchive.org/music/AWOL/ http://www.AzillionRecords.blogspot.com NaN ['awol']
10 NaN NaN Mistletone, Marriage Records <p>"Lucky Dragons" means any recorded or perfo... 3 Lukey Dargons 11/26/2008 01:43:35 AM http://glaciersofnice.com/shop/ 111 NaN ... Los Angeles, CA -118.243685 Luke Fischbeck\nSarah Rara Lucky Dragons NaN NaN http://freemusicarchive.org/music/Lucky_Dragons/ http://hawksandsparrows.org/ NaN ['lucky dragons']
100 2004.0 NaN Captcha Records (HBSP-2X), Pickled Egg (Europe) <p><span style="font-family:Verdana, Geneva, A... 1 Chris Kalis 11/26/2008 02:05:22 AM NaN 8 NaN ... Chicago, IL -87.629798 Chris Kalis, Harry Brenner, Scott McGaughey, B... Chandeliers NaN Killer Whales, \nMichael Columbia\nMandate\nMr... http://freemusicarchive.org/music/Chandeliers/ thechandeliers.com NaN ['chandeliers']
1000 NaN NaN NaN <p><a href="http://marzipanmarzipan.com">Marzi... 0 NaN 12/04/2008 09:24:35 AM NaN 0 NaN ... NaN 12.567380 NaN Marzipan Marzipan NaN NaN http://freemusicarchive.org/music/Marzipan_Mar... https://soundcloud.com/marzipanmarzipan NaN []
10000 NaN NaN NaN <p><span style="font-family:'Times New Roman',... 0 NaN 1/21/2011 02:11:31 PM NaN 1 NaN ... NaN NaN Jack Hertz\nPHOBoS\nBlue Hell Jack Hertz, PHOBoS, Blue Hell NaN NaN http://freemusicarchive.org/music/Jack_Hertz_P... http://surrism.phonoethics.com/surrism-phonoet... NaN ['jack hertz phobos blue hell']

5 rows × 24 columns

genre_id  genre_color  genre_handle   genre_parent_id  genre_title
1         #006666      Avant-Garde    38.0             Avant-Garde
2         #CC3300      International  NaN              International
3         #000099      Blues          NaN              Blues
4         #990099      Jazz           NaN              Jazz
5         #8A8A65      Classical      NaN              Classical

2 Format metadata

Todo:

  • Sanitize values, e.g. list of words for tags, valid links in artist_wikipedia_page, remove html markup in free-form text.
    • Clean tags. E.g. some tags are just artist names.
  • Fill metadata about encoding: length, number of samples, sample rate, bit rate, channels (mono / stereo), bit depth (16 bits?).
  • Update duration from audio
    • 2624 is marked as 05:05:50 (18350s) although it is reported as 00:21:15.15 by ffmpeg.
    • 112067: 3714s --> 01:59:55.06, 112808: 3718s --> 01:59:59.56
    • ffmpeg: Estimating duration from bitrate, this may be inaccurate
    • Solution, decode the complete mp3: ffmpeg -i input.mp3 -f null -
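
To compare a stored duration with the one reported by ffmpeg, both need to be expressed in seconds. The helper below is a minimal sketch; `hms_to_seconds` is a hypothetical name and is not the `creation.convert_duration` function used in this notebook.

```python
def hms_to_seconds(duration):
    """Convert 'HH:MM:SS[.ss]', 'MM:SS' or 'SS' to seconds (hypothetical helper)."""
    seconds = 0.0
    for part in duration.split(':'):
        seconds = 60 * seconds + float(part)
    return seconds

metadata = hms_to_seconds('05:05:50')    # duration stored for track 2624
decoded = hms_to_seconds('00:21:15.15')  # duration reported by ffmpeg
mismatch = metadata - decoded            # difference in seconds
```

For track 2624 the stored value is 18350 s while the decoded one is about 1275 s, which is the kind of discrepancy the todo item is about.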
In [8]:
df, column = tracks, 'tags'
null = sum(df[column].isnull())
print('{} null, {} non-null'.format(null, df.shape[0] - null))
df[column].value_counts().head(10)
0 null, 109727 non-null
Out[8]:
[]                                                                                                                                                                             85881
['interiors c1964', 'existential', 'hardcore-punk', 'pop-punk', 'punk-rock', 'internet boyfriend', 'rew starr', 'public domain', 'creative commons', 'microsong challenge']      314
['classwar karaoke']                                                                                                                                                             239
['all styles experimental']                                                                                                                                                      215
['improvisation', 'not normal music', 'all styles experimental']                                                                                                                 195
['era 1']                                                                                                                                                                        176
['all styles experimental', 'harsh noise', 'not normal music']                                                                                                                   150
['music is a belief', 'chary', 'nishad', 'uju', 'ibiene', 'nazeem', 'deepu', 'maneet', 'azedine', 'mohammad']                                                                    140
['new zealand']                                                                                                                                                                  140
['improvisation', 'all styles experimental', 'not normal music']                                                                                                                 128
Name: tags, dtype: int64

2.1 Tracks

In [9]:
drop = [
    'license_image_file', 'license_image_file_large', 'license_parent_id', 'license_url',  # keep title only
    'track_file', 'track_image_file',  # used to download only
    'track_url', 'album_url', 'artist_url',  # only relevant on website
    'track_copyright_c', 'track_copyright_p',  # present for ~1000 tracks only
    # 'track_composer', 'track_lyricist', 'track_publisher',  # present for ~4000, <1000 and <2000 tracks
    'track_disc_number',  # different from 1 for <1000 tracks
    'track_explicit', 'track_explicit_notes',  # present for <4000 tracks
    'track_instrumental'  # ~6000 tracks have a 1, there is an instrumental genre
]
tracks.drop(drop, axis=1, inplace=True)
tracks.rename(columns={'license_title': 'track_license', 'tags': 'track_tags'}, inplace=True)
In [10]:
tracks['track_duration'] = tracks['track_duration'].map(creation.convert_duration)
In [11]:
def convert_datetime(df, column, format=None):
    df[column] = pd.to_datetime(df[column], infer_datetime_format=True, format=format)
convert_datetime(tracks, 'track_date_created')
convert_datetime(tracks, 'track_date_recorded')
In [12]:
tracks['album_id'].fillna(-1, inplace=True)
tracks['track_bit_rate'].fillna(-1, inplace=True)
tracks = tracks.astype({'album_id': int, 'track_bit_rate': int})
In [13]:
def convert_genres(genres):
    genres = ast.literal_eval(genres)
    return [int(genre['genre_id']) for genre in genres]

tracks['track_genres'].fillna('[]', inplace=True)
tracks['track_genres'] = tracks['track_genres'].map(convert_genres)
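
The raw `track_genres` column stores each list of genres as a string, which is why `ast.literal_eval` is needed before the ids can be extracted. The example below shows the conversion on a hypothetical cell value. Only the `genre_id` key is required by the function, and the other key is an assumption.

```python
import ast

def convert_genres(genres):
    genres = ast.literal_eval(genres)  # string -> list of dicts
    return [int(genre['genre_id']) for genre in genres]

# Hypothetical raw cell value; the exact set of keys is an assumption.
raw = "[{'genre_id': '21', 'genre_title': 'Hip-Hop'}, {'genre_id': '806', 'genre_title': 'hiphop'}]"
converted = convert_genres(raw)
empty = convert_genres('[]')  # what missing values become after fillna('[]')
```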
In [14]:
tracks.columns
Out[14]:
Index(['album_id', 'album_title', 'artist_id', 'artist_name', 'artist_website',
       'track_license', 'track_tags', 'track_bit_rate', 'track_comments',
       'track_composer', 'track_date_created', 'track_date_recorded',
       'track_duration', 'track_favorites', 'track_genres',
       'track_information', 'track_interest', 'track_language_code',
       'track_listens', 'track_lyricist', 'track_number', 'track_publisher',
       'track_title'],
      dtype='object')

2.2 Albums

In [15]:
drop = [
    'artist_name', 'album_url', 'artist_url',  # in tracks already (though it can be different)
    'album_handle',
    'album_image_file', 'album_images',  # todo: shall be downloaded
    #'album_producer', 'album_engineer',  # present for ~2400 albums only
]
albums.drop(drop, axis=1, inplace=True)
albums.rename(columns={'tags': 'album_tags'}, inplace=True)
In [16]:
convert_datetime(albums, 'album_date_created')
convert_datetime(albums, 'album_date_released')
In [17]:
albums.columns
Out[17]:
Index(['album_comments', 'album_date_created', 'album_date_released',
       'album_engineer', 'album_favorites', 'album_information',
       'album_listens', 'album_producer', 'album_title', 'album_tracks',
       'album_type', 'album_tags'],
      dtype='object')

2.3 Artists

In [18]:
drop = [
    'artist_website', 'artist_url',  # in tracks already (though it can be different)
    'artist_handle',
    'artist_image_file', 'artist_images',  # todo: shall be downloaded
    'artist_donation_url', 'artist_paypal_name', 'artist_flattr_name',  # ~1600 & ~400 & ~70, not relevant
    'artist_contact',  # ~1500, not very useful data
    # 'artist_active_year_begin', 'artist_active_year_end',  # ~1400, ~500 only
    # 'artist_associated_labels',  # ~1000
    # 'artist_related_projects',  # only ~800, but can be combined with bio
]
artists.drop(drop, axis=1, inplace=True)
artists.rename(columns={'tags': 'artist_tags'}, inplace=True)
In [19]:
convert_datetime(artists, 'artist_date_created')
for column in ['artist_active_year_begin', 'artist_active_year_end']:
    artists[column].replace(0.0, np.nan, inplace=True)
    convert_datetime(artists, column, format='%Y.0')
In [20]:
artists.columns
Out[20]:
Index(['artist_active_year_begin', 'artist_active_year_end',
       'artist_associated_labels', 'artist_bio', 'artist_comments',
       'artist_date_created', 'artist_favorites', 'artist_latitude',
       'artist_location', 'artist_longitude', 'artist_members', 'artist_name',
       'artist_related_projects', 'artist_wikipedia_page', 'artist_tags'],
      dtype='object')

2.4 Merge DataFrames

In [21]:
not_found['albums'].remove(None)
not_found['albums'].append(-1)
not_found['albums'] = [int(i) for i in not_found['albums']]
not_found['artists'] = [int(i) for i in not_found['artists']]
In [22]:
tracks = tracks.merge(albums, left_on='album_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))

n = sum(tracks['album_title_dup'].isnull())
print('{} tracks without extended album information ({} tracks without album_id)'.format(
    n, sum(tracks['album_id'] == -1)))
assert sum(tracks['album_id'].isin(not_found['albums'])) == n
assert sum(tracks['album_title'] != tracks['album_title_dup']) == n

tracks.drop('album_title_dup', axis=1, inplace=True)
assert not any('dup' in col for col in tracks.columns)
3674 tracks without extended album information (1041 tracks without album_id)
In [23]:
# Album artist can be different than track artist. Keep track artist.
#tracks[tracks['artist_name'] != tracks['artist_name_dup']].select(lambda x: 'artist_name' in x, axis=1)
In [24]:
tracks = tracks.merge(artists, left_on='artist_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))

n = sum(tracks['artist_name_dup'].isnull())
print('{} tracks without extended artist information'.format(n))
assert sum(tracks['artist_id'].isin(not_found['artists'])) == n
assert sum(tracks['artist_name'] != tracks[('artist_name_dup')]) == n

tracks.drop('artist_name_dup', axis=1, inplace=True)
assert not any('dup' in col for col in tracks.columns)
974 tracks without extended artist information
In [25]:
columns = []
for name in tracks.columns:
    names = name.split('_')
    columns.append((names[0], '_'.join(names[1:])))
tracks.columns = pd.MultiIndex.from_tuples(columns)
assert all(label in ['track', 'album', 'artist'] for label in tracks.columns.get_level_values(0))
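
The two-level column index is derived from the names alone: the text before the first underscore becomes the first level and the rest becomes the second. The following sketch applies the same rule to a few column names from this notebook.

```python
def split_column(name):
    """Split 'artist_date_created' into ('artist', 'date_created')."""
    names = name.split('_')
    return names[0], '_'.join(names[1:])

pairs = [split_column(name)
         for name in ['album_id', 'track_date_created', 'artist_tags']]
```

This only works because the earlier renames (`license_title` to `track_license`, `tags` to `track_tags`, `album_tags` and `artist_tags`) give every column one of the three prefixes, which is what the assertion verifies.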
In [26]:
# Todo: fill other columns ?
tracks['album', 'tags'].fillna('[]', inplace=True)
tracks['artist', 'tags'].fillna('[]', inplace=True)

columns = [('album', 'favorites'), ('album', 'comments'), ('album', 'listens'), ('album', 'tracks'),
           ('artist', 'favorites'), ('artist', 'comments')]
for column in columns:
    tracks[column].fillna(-1, inplace=True)
columns = {column: int for column in columns}
tracks = tracks.astype(columns)

3 Data cleaning

Todo: duplicates (metadata and audio)

In [27]:
def keep(index, df):
    old = len(df)
    df = df.loc[index]
    new = len(df)
    print('{} lost, {} left'.format(old - new, new))
    return df

tracks = keep(tracks.index, tracks)
0 lost, 109727 left
In [28]:
# Audio not found or could not be trimmed.
tracks = keep(tracks.index.difference(not_found['audio']), tracks)
tracks = keep(tracks.index.difference(not_found['clips']), tracks)
180 lost, 109547 left
286 lost, 109261 left

Errors from the features.py script.

  • IndexError('index 0 is out of bounds for axis 0 with size 0',)
    • ffmpeg: Header missing
    • ffmpeg: Could not find codec parameters for stream 0 (Audio: mp3, 0 channels, s16p): unspecified frame size. Consider increasing the value for the 'analyzeduration' and 'probesize' options
    • tids: 117759
  • NoBackendError()
    • ffmpeg: Format mp3 detected only with low score of 1, misdetection possible!
    • tids: 80015, 115235
  • UserWarning('Trying to estimate tuning from empty frequency set.',)
    • librosa error
    • tids: 1440, 26436, 38903, 57603, 62095, 62954, 62956, 62957, 62959, 62971, 86079, 96426, 104623, 106719, 109714, 114501, 114528, 118003, 118004, 127827, 130298, 130296, 131076, 135804, 154923
  • ParameterError('Filter pass-band lies beyond Nyquist',)
    • librosa error
    • tids: 152204, 28106, 29166, 29167, 29169, 29168, 29170, 29171, 29172, 29173, 29179, 43903, 56757, 59361, 75461, 92346, 92345, 92347, 92349, 92350, 92351, 92353, 92348, 92352, 92354, 92355, 92356, 92358, 92359, 92361, 92360, 114448, 136486, 144769, 144770, 144771, 144773, 144774, 144775, 144778, 144776, 144777
In [29]:
# Feature extraction failed.
FAILED = [1440, 26436, 28106, 29166, 29167, 29168, 29169, 29170, 29171, 29172,
          29173, 29179, 38903, 43903, 56757, 57603, 59361, 62095, 62954, 62956,
          62957, 62959, 62971, 75461, 80015, 86079, 92345, 92346, 92347, 92348,
          92349, 92350, 92351, 92352, 92353, 92354, 92355, 92356, 92357, 92358,
          92359, 92360, 92361, 96426, 104623, 106719, 109714, 114448, 114501, 114528,
          115235, 117759, 118003, 118004, 127827, 130296, 130298, 131076, 135804, 136486,
          144769, 144770, 144771, 144773, 144774, 144775, 144776, 144777, 144778, 152204,
          154923]
tracks = keep(tracks.index.difference(FAILED), tracks)
71 lost, 109190 left
In [30]:
# License forbids redistribution.
tracks = keep(tracks['track', 'license'] != 'FMA-Limited: Download Only', tracks)
print('{} licenses'.format(len(tracks[('track', 'license')].unique())))
2616 lost, 106574 left
114 licenses
In [31]:
#sum(tracks['track', 'title'].duplicated())

4 Genres

In [32]:
genres.drop(['genre_handle', 'genre_color'], axis=1, inplace=True)
genres.rename(columns={'genre_parent_id': 'parent', 'genre_title': 'title'}, inplace=True)
In [33]:
genres['parent'].fillna(0, inplace=True)
genres = genres.astype({'parent': int})
In [34]:
# 13 (Easy Listening) has parent 126 which is missing
# --> a root genre on the website, although not in the genre menu
genres.at[13, 'parent'] = 0

# 580 (Abstract Hip-Hop) has parent 1172 which is missing
# --> listed as child of Hip-Hop on the website
genres.at[580, 'parent'] = 21

# 810 (Nu-Jazz) has parent 51 which is missing
# --> listed as child of Easy Listening on website
genres.at[810, 'parent'] = 13

# 763 (Holiday) has parent 763 which is itself
# --> listed as child of Sound Effects on website
genres.at[763, 'parent'] = 16

# Todo: should novelty be under Experimental? It is alone on website.
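
The notebook does not show how these inconsistencies were found. One way to detect them, sketched below under that caveat, is to look for parents that are either absent from the table or equal to the genre itself. `find_broken_parents` is a hypothetical helper and the dictionary is a toy excerpt built from the ids mentioned above.

```python
def find_broken_parents(parents):
    """Return {genre_id: parent_id} for parents that are missing or self-referential.

    `parents` maps genre_id -> parent_id, with 0 marking a root genre.
    """
    return {genre: parent for genre, parent in parents.items()
            if parent != 0 and (parent == genre or parent not in parents)}

# Toy excerpt: ids and broken parents are those listed in the comments above.
parents = {13: 126, 21: 0, 580: 1172, 763: 763, 810: 51}
broken = find_broken_parents(parents)
```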
In [35]:
# Genre 806 (hiphop) should not exist. Replace it by 21 (Hip-Hop).
print('{} tracks have genre 806'.format(
    sum(tracks['track', 'genres'].map(lambda genres: 806 in genres))))
def change_genre(genres):
    return [genre if genre != 806 else 21 for genre in genres]
tracks['track', 'genres'] = tracks['track', 'genres'].map(change_genre)
genres.drop(806, inplace=True)
34 tracks have genre 806
In [36]:
def get_parent(genre, track_all_genres=None):
    parent = genres.at[genre, 'parent']
    if track_all_genres is not None:
        track_all_genres.append(genre)
    return genre if parent == 0 else get_parent(parent, track_all_genres)

# Get all genres, i.e. all genres encountered when walking from leaves to roots.
def get_all_genres(track_genres):
    track_all_genres = list()
    for genre in track_genres:
        get_parent(genre, track_all_genres)
    return list(set(track_all_genres))

tracks['track', 'genres_all'] = tracks['track', 'genres'].map(get_all_genres)
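
The walk from a genre to its root can also be written iteratively. The sketch below is an equivalent formulation on a plain dictionary rather than on the `genres` DataFrame. The function names are hypothetical and the parents are taken from the genre table shown later (1 and 6 under 38, 7 under 20).

```python
def ancestors(genre, parents):
    """Return genre followed by all its ancestors, up to the root (parent 0)."""
    path = [genre]
    while parents[genre] != 0:
        genre = parents[genre]
        path.append(genre)
    return path

def all_genres(track_genres, parents):
    """Union of every genre met while walking each track genre up to its root."""
    found = set()
    for genre in track_genres:
        found.update(ancestors(genre, parents))
    return sorted(found)

parents = {1: 38, 6: 38, 7: 20, 20: 0, 38: 0}
path = ancestors(1, parents)
genres_all = all_genres([1, 7], parents)
```

Both this loop and the recursive `get_parent` would never terminate on a genre that is its own parent, which is one reason the fix for genre 763 has to come first.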
In [37]:
# Number of tracks per genre.
def count_genres(subset=tracks.index):
    count = pd.Series(0, index=genres.index)
    for _, track_all_genres in tracks.loc[subset, ('track', 'genres_all')].items():
        for genre in track_all_genres:
            count[genre] += 1
    return count

genres['#tracks'] = count_genres()
genres[genres['#tracks'] == 0]
Out[37]:
genre_id  parent  title      #tracks
175       86      Bollywood  0
178       4       Be-Bop     0
In [38]:
def get_top_genre(track_genres):
    top_genres = set(genres.at[genres.at[genre, 'top_level'], 'title'] for genre in track_genres)
    return top_genres.pop() if len(top_genres) == 1 else np.nan

# Top-level genre.
genres['top_level'] = genres.index.map(get_parent)
tracks['track', 'genre_top'] = tracks['track', 'genres'].map(get_top_genre)
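
A track gets a top-level genre only when all of its genres share the same root. The sketch below restates that rule on plain dictionaries. `top_genre` is a hypothetical name, and it returns `None` where the notebook uses `np.nan`.

```python
def top_genre(track_genres, top_level, titles):
    """Return the title of the unique root genre, or None if there is no unique root."""
    roots = {top_level[genre] for genre in track_genres}
    return titles[roots.pop()] if len(roots) == 1 else None

# Roots as listed in the genre table: 1 and 6 under 38, 7 under 20.
top_level = {1: 38, 6: 38, 7: 20, 20: 20, 38: 38}
titles = {20: 'Spoken', 38: 'Experimental'}

same_root = top_genre([1, 6], top_level, titles)
mixed_roots = top_genre([1, 7], top_level, titles)
no_genre = top_genre([], top_level, titles)
```

Tracks with mixed roots or without any genre therefore have no `genre_top`, and they are removed from the medium subset by the filter on that column.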
In [39]:
genres.head(10)
Out[39]:
genre_id  parent  title                #tracks  top_level
1         38      Avant-Garde          8693     38
2         0       International        5271     2
3         0       Blues                1752     3
4         0       Jazz                 4126     4
5         0       Classical            4106     5
6         38      Novelty              914      38
7         20      Comedy               217      20
8         0       Old-Time / Historic  868      8
9         0       Country              1987     9
10        0       Pop                  13845    10

5 Subsets: large, medium, small

5.1 Large

Main characteristic: the full set with clips trimmed to a manageable size.

5.2 Medium

Main characteristic: clean metadata (includes 1 top-level genre) and quality audio.

In [40]:
fma_medium = pd.DataFrame(tracks)
In [41]:
# Missing meta-information.

# Missing extended album and artist information.
fma_medium = keep(~fma_medium['album', 'id'].isin(not_found['albums']), fma_medium)
fma_medium = keep(~fma_medium['artist', 'id'].isin(not_found['artists']), fma_medium)

# Untitled track or album.
fma_medium = keep(~fma_medium['track', 'title'].isnull(), fma_medium)
fma_medium = keep(fma_medium['track', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)
fma_medium = keep(fma_medium['album', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)

# One tag is often just the artist name. Tags too scarce for tracks and albums.
#keep(fma_medium['artist', 'tags'].map(len) >= 2, fma_medium)

# Too scarce.
#fma_medium = keep(~fma_medium['album', 'information'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'bio'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'website'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'wikipedia_page'].isnull(), fma_medium)

# Too scarce.
#fma_medium = keep(~fma_medium['artist', 'location'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'latitude'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'longitude'].isnull(), fma_medium)
3529 lost, 103045 left
598 lost, 102447 left
1 lost, 102446 left
674 lost, 101772 left
65 lost, 101707 left
In [42]:
# Technical quality.
# Todo: sample rate
fma_medium = keep(fma_medium['track', 'bit_rate'] > 100000, fma_medium)

# Choosing standard bit rates discards all VBR.
#fma_medium = keep(fma_medium['track', 'bit_rate'].isin([320000, 256000, 192000, 160000, 128000]), fma_medium)
1326 lost, 100381 left
In [43]:
fma_medium = keep(fma_medium['track', 'duration'] >= 60, fma_medium)
fma_medium = keep(fma_medium['track', 'duration'] <= 600, fma_medium)

fma_medium = keep(fma_medium['album', 'tracks'] >= 1, fma_medium)
fma_medium = keep(fma_medium['album', 'tracks'] <= 50, fma_medium)
4736 lost, 95645 left
5399 lost, 90246 left
466 lost, 89780 left
5353 lost, 84427 left
In [44]:
# Lower popularity bound.
fma_medium = keep(fma_medium['track', 'listens'] >= 100, fma_medium)
fma_medium = keep(fma_medium['track', 'interest'] >= 200, fma_medium)
fma_medium = keep(fma_medium['album', 'listens'] >= 1000, fma_medium);

# Favorites and comments are very scarce.
#fma_medium = keep(fma_medium['artist', 'favorites'] >= 1, fma_medium)
4941 lost, 79486 left
1064 lost, 78422 left
1769 lost, 76653 left
In [45]:
# Targeted genre classification.
fma_medium = keep(~fma_medium['track', 'genre_top'].isnull(), fma_medium);
#keep(fma_medium['track', 'genres'].map(len) == 1, fma_medium);
42495 lost, 34158 left
In [46]:
# Adjust size with popularity measure. Should be of better quality.
N_TRACKS = 25000

# Observations
# * More albums killed than artists --> be sure not to kill diversity
# * Favors and penalizes genres differently --> do it per genre?
# Normalization
# * mean, median, std, max
# * tracks per album or artist
# Test
# * 4/5 of the same tracks were selected with various sets of measures
# * <5% diff with max and mean

popularity_measures = [('track', 'listens'), ('track', 'interest')]  # ('album', 'listens')
# ('track', 'favorites'), ('track', 'comments'),
# ('album', 'favorites'), ('album', 'comments'),
# ('artist', 'favorites'), ('artist', 'comments'),

normalization = {measure: fma_medium[measure].max() for measure in popularity_measures}
def popularity_measure(track):
    return sum(track[measure] / normalization[measure] for measure in popularity_measures)
fma_medium['popularity_measure'] = fma_medium.apply(popularity_measure, axis=1)
fma_medium = keep(fma_medium.sort_values('popularity_measure', ascending=False).index[:N_TRACKS], fma_medium)
9158 lost, 25000 left
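
The popularity measure is the sum of the selected measures, each divided by its maximum over the candidate tracks, so every measure contributes a value between 0 and 1. The sketch below applies the same ranking to the listens and interest of the four tracks with ids 2, 3, 5 and 10 shown in the preview of raw_tracks.csv. `select_most_popular` is a hypothetical helper.

```python
def select_most_popular(tracks, measures, n_tracks):
    """Rank tracks by the sum of max-normalized measures and keep the top n_tracks ids."""
    normalization = {m: max(t[m] for t in tracks.values()) for m in measures}

    def popularity(track):
        return sum(track[m] / normalization[m] for m in measures)

    ranked = sorted(tracks, key=lambda tid: popularity(tracks[tid]), reverse=True)
    return ranked[:n_tracks]

# Values copied from the first rows of the tracks table.
tracks = {
    2: {'listens': 1293, 'interest': 4656},
    3: {'listens': 514, 'interest': 1470},
    5: {'listens': 1151, 'interest': 1933},
    10: {'listens': 50135, 'interest': 54881},
}
selected = select_most_popular(tracks, ['listens', 'interest'], n_tracks=2)
```

Normalizing by the maximum lets a single very popular track compress the scores of all others, which may be why the comments in the cell also list mean, median and standard deviation as alternatives.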
In [47]:
tmp = genres[genres['parent'] == 0].reset_index().set_index('title')
tmp['#tracks_medium'] = fma_medium['track', 'genre_top'].value_counts()
tmp.sort_values('#tracks_medium', ascending=False)
Out[47]:
title                genre_id  parent  #tracks  top_level  #tracks_medium
Rock                 12        0       32923    12         7103
Electronic           15        0       34413    15         6314
Experimental         38        0       38154    38         2251
Hip-Hop              21        0       8389     21         2201
Folk                 17        0       12706    17         1519
Instrumental         1235      0       14938    1235       1350
Pop                  10        0       13845    10         1186
International        2         0       5271     2          1018
Classical            5         0       4106     5          619
Old-Time / Historic  8         0       868      8          510
Jazz                 4         0       4126     4          384
Country              9         0       1987     9          178
Soul-RnB             14        0       1499     14         154
Spoken               20        0       1876     20         118
Blues                3         0       1752     3          74
Easy Listening       13        0       730      13         21

5.3 Small

Main characteristic: genre balanced (and echonest features).

Choices:

  • 8 genres with 1000 tracks --> 8,000 tracks
  • 10 genres with 500 tracks --> 5,000 tracks

Todo:

  • Download more echonest features so that all tracks can have them. Otherwise the intersection of tracks with echonest features and one top-level genre is too small.
In [48]:
N_GENRES = 8
N_TRACKS = 1000

top_genres = tmp.sort_values('#tracks_medium', ascending=False)[:N_GENRES].index
fma_small = pd.DataFrame(fma_medium)
fma_small = keep(fma_small['track', 'genre_top'].isin(top_genres), fma_small)
2058 lost, 22942 left
In [49]:
to_keep = []
for genre in top_genres:
    subset = fma_small[fma_small['track', 'genre_top'] == genre]
    drop = subset.sort_values('popularity_measure').index[:-N_TRACKS]
    fma_small.drop(drop, inplace=True)
assert len(fma_small) == N_GENRES * N_TRACKS
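
The balancing step keeps, for each selected genre, only the tracks with the highest popularity measure. The sketch below reproduces the idea on toy data with plain Python. `balance`, the track ids and the popularity values are all hypothetical.

```python
from collections import defaultdict

def balance(tracks, n_per_genre):
    """Keep the n_per_genre most popular track ids of each genre.

    `tracks` maps track id -> (genre, popularity).
    """
    by_genre = defaultdict(list)
    for tid, (genre, popularity) in tracks.items():
        by_genre[genre].append((popularity, tid))
    kept = []
    for items in by_genre.values():
        kept.extend(tid for _, tid in sorted(items, reverse=True)[:n_per_genre])
    return sorted(kept)

tracks = {1: ('Rock', 0.9), 2: ('Rock', 0.2), 3: ('Rock', 0.5),
          4: ('Folk', 0.1), 5: ('Folk', 0.7), 6: ('Folk', 0.3)}
balanced = balance(tracks, n_per_genre=2)
```

The final assertion of the cell can only hold if every selected genre has at least N_TRACKS tracks. In the table above the eighth genre, International, has 1018 tracks in the medium subset, just above the 1000 required, and the eight largest genres sum to the 22942 tracks reported.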

5.4 Subset indication

In [50]:
SUBSETS = ('small', 'medium', 'large')
tracks['set', 'subset'] = pd.Series(dtype=pd.CategoricalDtype(categories=SUBSETS, ordered=True))
tracks.loc[tracks.index, ('set', 'subset')] = 'large'
tracks.loc[fma_medium.index, ('set', 'subset')] = 'medium'
tracks.loc[fma_small.index, ('set', 'subset')] = 'small'
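
The labels are assigned from the largest subset to the smallest, so each track ends up with the label of the smallest subset that contains it. Because the category is ordered (small < medium < large), a comparison such as `tracks['set', 'subset'] <= 'medium'` selects the nested subsets small and medium together. The sketch below mimics that comparison without pandas; `RANK`, `in_subset` and the labels are hypothetical.

```python
SUBSETS = ('small', 'medium', 'large')
RANK = {subset: rank for rank, subset in enumerate(SUBSETS)}

def in_subset(track_subset, subset):
    """True if a track labeled track_subset belongs to subset (small in medium in large)."""
    return RANK[track_subset] <= RANK[subset]

# Hypothetical labels for three track ids.
labels = {2: 'small', 3: 'medium', 5: 'large'}
medium_ids = [tid for tid, label in labels.items() if in_subset(label, 'medium')]
large_ids = [tid for tid, label in labels.items() if in_subset(label, 'large')]
```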

5.5 Echonest

In [51]:
echonest = pd.read_csv('raw_echonest.csv', index_col=0, header=[0, 1, 2])
echonest = keep(~echonest['echonest', 'temporal_features'].isnull().any(axis=1), echonest)
echonest = keep(~echonest['echonest', 'audio_features'].isnull().any(axis=1), echonest)
echonest = keep(~echonest['echonest', 'social_features'].isnull().any(axis=1), echonest)

echonest = keep(echonest.index.isin(tracks.index), echonest);
keep(echonest.index.isin(fma_medium.index), echonest);
keep(echonest.index.isin(fma_small.index), echonest);
0 lost, 14511 left
205 lost, 14306 left
239 lost, 14067 left
938 lost, 13129 left
7848 lost, 5281 left
11835 lost, 1294 left

6 Splits: training, validation, test

Take into account:

  • Each artist may appear in only one split.
  • Stratification: ideally, all characteristics (#tracks per artist, duration, sampling rate, information, bio) and targets (genres, tags) should be equally distributed.
In [52]:
for genre in genres.index:
    tracks['genre', genres.at[genre, 'title']] = tracks['track', 'genres_all'].map(lambda genres: genre in genres)

SPLITS = ('training', 'test', 'validation')
PERCENTAGES = (0.8, 0.1, 0.1)
tracks['set', 'split'] = pd.Series(dtype=pd.CategoricalDtype(categories=SPLITS))

for subset in SUBSETS:

    tracks_subset = tracks['set', 'subset'] <= subset

    # Consider only top-level genres for small and medium.
    genre_list = list(tracks.loc[tracks_subset, ('track', 'genre_top')].unique())
    if subset == 'large':
        genre_list = list(genres['title']) 

    while genre_list:

        # Choose the most constrained genre, i.e. the genre with the fewest unassigned artists.
        tracks_unsplit = tracks['set', 'split'].isnull()
        count = tracks[tracks_subset & tracks_unsplit].set_index(('artist', 'id'), append=True)['genre']
        count = count.groupby(level=1).sum().astype(bool).sum()
        genre = count[genre_list].idxmin()
        genre_list.remove(genre)
        
        # Given genre, select artists.
        tracks_genre = tracks['genre', genre] == 1
        artists = tracks.loc[tracks_genre & tracks_subset & tracks_unsplit, ('artist', 'id')].value_counts()
        #print('-->', genre, len(artists))

        current = {split: np.sum(tracks_genre & tracks_subset & (tracks['set', 'split'] == split)) for split in SPLITS}

        # Assign artists with most tracks first.
        for artist, count in artists.items():
            choice = np.argmin([current[split] / percentage for split, percentage in zip(SPLITS, PERCENTAGES)])
            current[SPLITS[choice]] += count
            #assert tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')].isnull().all()
            tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')] = SPLITS[choice]

# Tracks without genre can only serve as unlabeled data for training, e.g. for semi-supervised algorithms.
no_genres = tracks['track', 'genres_all'].map(lambda genres: len(genres) == 0)
no_split = tracks['set', 'split'].isnull()
assert not (no_split & ~no_genres).any()
tracks.loc[no_split, ('set', 'split')] = 'training'

# Not needed any more.
tracks.drop('genre', axis=1, level=0, inplace=True)
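
A quick sanity check of the result is to confirm that no artist spans several splits and to look at the realized proportions. The sketch below runs on a toy frame with the same two-level columns as `tracks`. The ids and labels are made up, and the check itself is a suggestion, not part of the original notebook.

```python
import pandas as pd

# Toy frame with the same two-level columns as `tracks`; values are invented.
toy = pd.DataFrame({
    ('artist', 'id'): [1, 1, 2, 3, 3, 3],
    ('set', 'split'): ['training', 'training', 'test',
                       'validation', 'validation', 'validation'],
})

# Number of distinct splits each artist appears in; it should be 1 everywhere.
splits_per_artist = toy['set', 'split'].groupby(toy['artist', 'id']).nunique()
assert (splits_per_artist == 1).all()

# Realized share of tracks per split.
print(toy['set', 'split'].value_counts(normalize=True))
```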

7 Store

In [53]:
for name, df in (('tracks', tracks), ('genres', genres), ('echonest', echonest)):
    df.sort_index(axis=0, inplace=True)
    df.sort_index(axis=1, inplace=True)
    # Write the Echonest features with a fixed number of decimals.
    params = dict(float_format='%.10f') if name == 'echonest' else dict()
    df.to_csv(name + '.csv', **params)
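
Files written this way carry one header row per column level, so reading them back with plain pandas requires the `header` argument. A minimal round-trip sketch on an in-memory buffer follows. The two-row toy frame is an assumption for illustration, and the notebook itself reloads the data with `utils.load`, which may do more than this.

```python
import io
import pandas as pd

columns = pd.MultiIndex.from_tuples([('set', 'split'), ('track', 'title')])
toy = pd.DataFrame([['training', 'Food'], ['test', 'Freeway']],
                   index=pd.Index([2, 10], name='track_id'), columns=columns)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Two header rows carry the two column levels; the first column is the index.
restored = pd.read_csv(buffer, index_col=0, header=[0, 1])
print(restored)
```

Note that dtypes such as categories and dates are not preserved by the CSV format and have to be restored after loading.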
In [54]:
# ./creation.py normalize /path/to/fma
# ./creation.py zips /path/to/fma

8 Description

In [55]:
tracks = utils.load('tracks.csv')
tracks.dtypes
Out[55]:
album   comments                      int64
        date_created         datetime64[ns]
        date_released        datetime64[ns]
        engineer                     object
        favorites                     int64
        id                            int64
        information                category
        listens                       int64
        producer                     object
        tags                         object
        title                        object
        tracks                        int64
        type                       category
artist  active_year_begin    datetime64[ns]
        active_year_end      datetime64[ns]
        associated_labels            object
        bio                        category
        comments                      int64
        date_created         datetime64[ns]
        favorites                     int64
        id                            int64
        latitude                    float64
        location                     object
        longitude                   float64
        members                      object
        name                         object
        related_projects             object
        tags                         object
        website                      object
        wikipedia_page               object
set     split                        object
        subset                     category
track   bit_rate                      int64
        comments                      int64
        composer                     object
        date_created         datetime64[ns]
        date_recorded        datetime64[ns]
        duration                      int64
        favorites                     int64
        genre_top                  category
        genres                       object
        genres_all                   object
        information                  object
        interest                      int64
        language_code                object
        license                    category
        listens                       int64
        lyricist                     object
        number                        int64
        publisher                    object
        tags                         object
        title                        object
dtype: object
In [56]:
N = 5
ipd.display(tracks['track'].head(N))
ipd.display(tracks['album'].head(N))
ipd.display(tracks['artist'].head(N))
track_id | bit_rate | comments | composer | date_created | date_recorded | duration | favorites | genre_top | genres | genres_all | information | interest | language_code | license | listens | lyricist | number | publisher | tags | title
2 | 256000 | 0 | NaN | 2008-11-26 01:48:12 | 2008-11-26 | 168 | 2 | Hip-Hop | [21] | [21] | NaN | 4656 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 1293 | NaN | 3 | NaN | [] | Food
3 | 256000 | 0 | NaN | 2008-11-26 01:48:14 | 2008-11-26 | 237 | 1 | Hip-Hop | [21] | [21] | NaN | 1470 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 514 | NaN | 4 | NaN | [] | Electric Ave
5 | 256000 | 0 | NaN | 2008-11-26 01:48:20 | 2008-11-26 | 206 | 6 | Hip-Hop | [21] | [21] | NaN | 1933 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 1151 | NaN | 6 | NaN | [] | This World
10 | 192000 | 0 | Kurt Vile | 2008-11-25 17:49:06 | 2008-11-26 | 161 | 178 | Pop | [10] | [10] | NaN | 54881 | en | Attribution-NonCommercial-NoDerivatives (aka M... | 50135 | NaN | 1 | NaN | [] | Freeway
20 | 256000 | 0 | NaN | 2008-11-26 01:48:56 | 2008-01-01 | 311 | 0 | NaN | [76, 103] | [17, 10, 76, 103] | NaN | 978 | en | Attribution-NonCommercial-NoDerivatives (aka M... | 361 | NaN | 3 | NaN | [] | Spiritual Level

track_id | comments | date_created | date_released | engineer | favorites | id | information | listens | producer | tags | title | tracks | type
2 | 0 | 2008-11-26 01:44:45 | 2009-01-05 | NaN | 4 | 1 | <p></p> | 6073 | NaN | [] | AWOL - A Way Of Life | 7 | Album
3 | 0 | 2008-11-26 01:44:45 | 2009-01-05 | NaN | 4 | 1 | <p></p> | 6073 | NaN | [] | AWOL - A Way Of Life | 7 | Album
5 | 0 | 2008-11-26 01:44:45 | 2009-01-05 | NaN | 4 | 1 | <p></p> | 6073 | NaN | [] | AWOL - A Way Of Life | 7 | Album
10 | 0 | 2008-11-26 01:45:08 | 2008-02-06 | NaN | 4 | 6 | NaN | 47632 | NaN | [] | Constant Hitmaker | 2 | Album
20 | 0 | 2008-11-26 01:45:05 | 2009-01-06 | NaN | 2 | 4 | <p> "spiritual songs" from Nicky Cook</p> | 2710 | NaN | [] | Niris | 13 | Album

track_id | active_year_begin | active_year_end | associated_labels | bio | comments | date_created | favorites | id | latitude | location | longitude | members | name | related_projects | tags | website | wikipedia_page
2 | 2006-01-01 | NaT | NaN | <p>A Way Of Life, A Collective of Hip-Hop from... | 0 | 2008-11-26 01:42:32 | 9 | 1 | 40.058324 | New Jersey | -74.405661 | Sajje Morocco,Brownbum,ZawidaGod,Custodian of ... | AWOL | The list of past projects is 2 long but every1... | [awol] | http://www.AzillionRecords.blogspot.com | NaN
3 | 2006-01-01 | NaT | NaN | <p>A Way Of Life, A Collective of Hip-Hop from... | 0 | 2008-11-26 01:42:32 | 9 | 1 | 40.058324 | New Jersey | -74.405661 | Sajje Morocco,Brownbum,ZawidaGod,Custodian of ... | AWOL | The list of past projects is 2 long but every1... | [awol] | http://www.AzillionRecords.blogspot.com | NaN
5 | 2006-01-01 | NaT | NaN | <p>A Way Of Life, A Collective of Hip-Hop from... | 0 | 2008-11-26 01:42:32 | 9 | 1 | 40.058324 | New Jersey | -74.405661 | Sajje Morocco,Brownbum,ZawidaGod,Custodian of ... | AWOL | The list of past projects is 2 long but every1... | [awol] | http://www.AzillionRecords.blogspot.com | NaN
10 | NaT | NaT | Mexican Summer, Richie Records, Woodsist, Skul... | <p><span style="font-family:Verdana, Geneva, A... | 3 | 2008-11-26 01:42:55 | 74 | 6 | NaN | NaN | NaN | Kurt Vile, the Violators | Kurt Vile | NaN | [philly, kurt vile] | http://kurtvile.com | NaN
20 | 1990-01-01 | 2011-01-01 | NaN | <p>Songs written by: Nicky Cook</p>\n<p>VOCALS... | 2 | 2008-11-26 01:42:52 | 10 | 4 | 51.895927 | Colchester England | 0.891874 | Nicky Cook\n | Nicky Cook | NaN | [instrumentals, experimental pop, post punk, e... | NaN | NaN