Lakh MIDI Dataset Tutorial

This IPython notebook demonstrates how to use the data in the Lakh MIDI Dataset. It shows how the dataset is organized and gives examples of how to use annotations extracted from LMD-aligned (the collection of MIDI files which have been matched and aligned to entries in the Million Song Dataset). We will use pretty_midi for parsing the MIDI files, mir_eval for sonification and visualization, and librosa for audio analysis.

In [1]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pretty_midi
import librosa
import mir_eval
import mir_eval.display
import tables
import IPython.display
import os
import json

# Local path constants
DATA_PATH = 'data'
RESULTS_PATH = 'results'
# Path to the file match_scores.json distributed with the LMD
SCORE_FILE = os.path.join(RESULTS_PATH, 'match_scores.json')

# Utility functions for retrieving paths
def msd_id_to_dirs(msd_id):
    """Given an MSD ID, generate the path prefix.
    E.g. TRABCD12345678 -> A/B/C/TRABCD12345678"""
    return os.path.join(msd_id[2], msd_id[3], msd_id[4], msd_id)

def msd_id_to_mp3(msd_id):
    """Given an MSD ID, return the path to the corresponding mp3"""
    return os.path.join(DATA_PATH, 'msd', 'mp3',
                        msd_id_to_dirs(msd_id) + '.mp3')

def msd_id_to_h5(msd_id):
    """Given an MSD ID, return the path to the corresponding h5"""
    return os.path.join(RESULTS_PATH, 'lmd_matched_h5',
                        msd_id_to_dirs(msd_id) + '.h5')

def get_midi_path(msd_id, midi_md5, kind):
    """Given an MSD ID and MIDI MD5, return path to a MIDI file.
    kind should be one of 'matched' or 'aligned'. """
    return os.path.join(RESULTS_PATH, 'lmd_{}'.format(kind),
                        msd_id_to_dirs(msd_id), midi_md5 + '.mid')

Data layout

The file match_scores.json is a good place to start. It holds a dictionary of dictionaries: the outer dictionary is keyed by Million Song Dataset IDs, and each entry is itself a dictionary mapping MD5 checksums of files in the Lakh MIDI Dataset to scores in the range [0.5, 1.0], which represent the confidence that a given MIDI file matches a given Million Song Dataset entry. The range starts at 0.5 because matches scoring below 0.5 are likely invalid and were discarded. Here's an example:

In [2]:
with open(SCORE_FILE) as f:
    scores = json.load(f)
# Grab a Million Song Dataset ID from the scores dictionary
msd_id = list(scores.keys())[1234]
print('Million Song Dataset ID {} has {} MIDI file matches:'.format(
    msd_id, len(scores[msd_id])))
for midi_md5, score in scores[msd_id].items():
    print('  {} with confidence score {}'.format(midi_md5, score))
Million Song Dataset ID TRNCZVX128F92F9018 has 5 MIDI file matches:
  a92af10c0349706ba12552011f7f77a8 with confidence score 0.782934859711
  335a5edca8882f4d2725683d7c530aac with confidence score 0.755013750955
  8e2bbe4485b113ba48762b1e3032795a with confidence score 0.768855325835
  df21ff6afbeab449e4d415167c54decf with confidence score 0.689582669694
  22a0142b14b393b1515062d8a006814e with confidence score 0.702707102354

This Million Song Dataset entry has 5 MIDI files matched to it, with scores between roughly 0.69 and 0.78. Multiple MIDI files are matched to this one Million Song Dataset entry because the Lakh MIDI Dataset contains multiple different MIDI transcriptions of this single piece of music.
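When several transcriptions match one entry, a common first step is to keep only the best-scoring one. A minimal sketch, using the confidence scores printed above hard-coded as a stand-in for `scores[msd_id]`:

```python
# Confidence scores for one MSD ID, as printed above
matches = {
    'a92af10c0349706ba12552011f7f77a8': 0.782934859711,
    '335a5edca8882f4d2725683d7c530aac': 0.755013750955,
    '8e2bbe4485b113ba48762b1e3032795a': 0.768855325835,
    'df21ff6afbeab449e4d415167c54decf': 0.689582669694,
    '22a0142b14b393b1515062d8a006814e': 0.702707102354,
}
# Pick the MD5 whose confidence score is highest
best_md5 = max(matches, key=matches.get)
print(best_md5)  # a92af10c0349706ba12552011f7f77a8
```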

MIDI files which have been matched to the Million Song Dataset are distributed in the Lakh MIDI Dataset in two formats: LMD-matched provides them in their raw form, as they were scraped from the internet, and LMD-aligned contains modified versions which have been aligned to the 7digital preview MP3s that accompany the Million Song Dataset. The directory structure of both of these packages follows the Million Song Dataset. For example, the MIDI files we just inspected above appear under N/C/Z/TRNCZVX128F92F9018/ in both lmd_matched and lmd_aligned.
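Under this layout, the full path to any matched or aligned MIDI follows from the MSD ID and MD5 alone. A standalone sketch, re-declaring the path utilities from the first cell so it runs on its own, applied to the first match inspected above:

```python
import os

# Same local constant and utilities as in the first cell
RESULTS_PATH = 'results'

def msd_id_to_dirs(msd_id):
    """E.g. TRABCD12345678 -> A/B/C/TRABCD12345678"""
    return os.path.join(msd_id[2], msd_id[3], msd_id[4], msd_id)

def get_midi_path(msd_id, midi_md5, kind):
    """kind should be one of 'matched' or 'aligned'."""
    return os.path.join(RESULTS_PATH, 'lmd_{}'.format(kind),
                        msd_id_to_dirs(msd_id), midi_md5 + '.mid')

path = get_midi_path('TRNCZVX128F92F9018',
                     'a92af10c0349706ba12552011f7f77a8', 'aligned')
print(path)
# results/lmd_aligned/N/C/Z/TRNCZVX128F92F9018/a92af10c0349706ba12552011f7f77a8.mid
```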


Utilizing aligned MIDI files

MIDI files provide a wide variety of useful information. Moreover, when they are matched and aligned to audio recordings, they can be used to derive annotations about the recording. Below is a demonstration of how to extract this information from the files in LMD-aligned. We'll start by grabbing a MIDI file which has all of the useful information we will demonstrate.

In [7]:
while True:
    # Grab an MSD ID and its dictionary of matches
    msd_id, matches = scores.popitem()
    # Grab a MIDI from the matches
    midi_md5, score = matches.popitem()
    # Construct the path to the aligned MIDI
    aligned_midi_path = get_midi_path(msd_id, midi_md5, 'aligned')
    # Load/parse the MIDI file with pretty_midi
    pm = pretty_midi.PrettyMIDI(aligned_midi_path)
    # Look for a MIDI file which has lyric and key signature change events
    # Look for a MIDI file which has lyric and key signature change events
    if len(pm.lyrics) > 5 and len(pm.key_signature_changes) > 0:
        break
In [8]:
# MIDI files in LMD-aligned are aligned to 7digital preview clips from the MSD
# Let's listen to this aligned MIDI along with its preview clip
# Load in the audio data
audio, fs = librosa.load(msd_id_to_mp3(msd_id))
# Synthesize the audio using fluidsynth
midi_audio = pm.fluidsynth(fs)
# Play audio in one channel, synthesized MIDI in the other
IPython.display.Audio([audio, midi_audio[:audio.shape[0]]], rate=fs)