Energy auto-encoder: audio pre-processing

Slice clips into frames, apply a constant-Q transform (CQT), then apply local contrast normalization (LCN). The processed audio is stored in HDF5 datasets.

Setup

In [ ]:
import os, time
import numpy as np
import librosa
import h5py

print('Software versions:')
for pkg in [np, librosa]:
    print('  {}: {}'.format(pkg.__name__, pkg.__version__))

# Much faster computation of the CQT if available.
# Provided by scikits.samplerate through libsamplerate (SRC).
print('librosa HAS_SAMPLERATE: {}'.format(librosa.core._HAS_SAMPLERATE))

Input data

The audio data from the GTZAN dataset was previously stored in the HDF5 format, which lets us read and write slices of the data without loading the whole dataset into memory: h5py only transfers the requested slices to and from disk.

In [ ]:
filename = os.path.join('data', 'gtzan.hdf5')
gtzan = h5py.File(filename, 'r')

# Display HDF5 attributes.
print('Attributes:')
for name, value in gtzan.attrs.items():
    print('  {} = {}'.format(name, value))

# List the stored datasets.
print('Datasets: {}'.format(', '.join(gtzan)))

Parameters

Frames

Dimensionality increase:

  • Each clip is divided into short frames of $n_a = 1024$ samples.
  • Consecutive frames overlap by 50%, which adds redundancy to the data.
In [ ]:
na = 1024

# Aligned.
N1 = int(np.floor(float(gtzan.attrs['Nsamples']) / na))
# Overlap (redundant).
N2 = int(np.floor(float(gtzan.attrs['Nsamples']) / na - 0.5))

Nframes = min(N1, N2)
Nclips = gtzan.attrs['Nclips']
Ngenres = gtzan.attrs['Ngenres']
del N1, N2

# Data dimensionality and size.
print('Dimensionality increase from {:,} samples '
      'to {} frames x 2 x {} samples = {:,} per clip'
      .format(gtzan.attrs['Nsamples'], Nframes, na, Nframes*2*na))
print('Data size N = {:,} frames of na = {} samples -> {:,} floats'
      .format(Ngenres*Nclips*Nframes*2, na, Ngenres*Nclips*Nframes*2*na))
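The framing scheme can be checked on a toy signal. The following is a minimal sketch, not part of the original pipeline: the helper `frame_clip` is a hypothetical name, and it assumes the overlapped frames are simply the aligned frames shifted by half a frame.

```python
import numpy as np

def frame_clip(y, na):
    """Split a 1-D signal into aligned and half-overlapped frames.

    Hypothetical helper. Returns an array of shape (Nframes, 2, na).
    """
    half = na // 2
    # Number of complete frames available in both the aligned and the
    # shifted signal (same count as min(N1, N2) above).
    nframes = min(len(y) // na, (len(y) - half) // na)
    aligned = y[:nframes * na].reshape(nframes, na)
    shifted = y[half:half + nframes * na].reshape(nframes, na)
    return np.stack([aligned, shifted], axis=1)

y = np.arange(20)        # Toy "clip" of 20 samples.
X = frame_clip(y, na=4)  # Frames of 4 samples.
print(X.shape)           # (4, 2, 4)
print(X[1, 0], X[1, 1])  # [4 5 6 7] [6 7 8 9]
```

Each overlapped frame starts in the middle of the aligned frame with the same index, so every sample (except those at the clip boundaries) appears twice.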

Constant-Q transform (CQT)

We use the CQT as a dimensionality reduction (from $n_a=1024$ to $n_s=96$) and a feature extraction tool:

  • Span $N_o = 4$ octaves from $C_2$ to $C_6$ where $C_4$ is the middle $C$ in the scientific pitch notation.
  • Western music uses 12TET (twelve-tone equal temperament): 7 notes and 12 semitones per octave. We use 24 bins per octave to achieve a quarter-tone resolution.
  • It gives us $n_s = 96$ filters, the dimensionality of the input data to the auto-encoder.
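As an illustration of the numbers above, the center frequencies of the filters can be computed directly with NumPy. This sketch assumes the standard tuning A4 = 440 Hz in scientific pitch notation (so that C2 lies 33 semitones below A4) and geometrically spaced bins. It does not call librosa.

```python
import numpy as np

No = 4                # Octaves, from C2 to C6.
bins_per_octave = 24  # Quarter-tone resolution.
ns = No * bins_per_octave

# Assumption: A4 = 440 Hz, and C2 is 33 semitones below A4.
fmin = 440.0 * 2.0 ** (-33 / 12)

# Bin k is centered k quarter tones above fmin.
k = np.arange(ns)
freqs = fmin * 2.0 ** (k / bins_per_octave)

print('ns = {} filters'.format(ns))
print('lowest bin:  {:.2f} Hz (C2)'.format(freqs[0]))
print('highest bin: {:.2f} Hz (a quarter tone below C6)'.format(freqs[-1]))
```

The frequency doubles every 24 bins, so the 96 bins cover exactly four octaves and the highest bin sits one quarter tone below $C_6$.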

Open questions:

  • Should we truncate the clip to a multiple of $n_a$?
  • How should we handle boundary conditions? Discard the boundary frames or keep them?
In [ ]:
# CQT filters.
ns = 96
No = 4
print('ns = {} filters spanning No = {} octaves'.format(ns, No))
print('  --> resolution of {} bins per octave'.format(ns // No))

# This MIDI implementation assigns middle C (note 60) to C5 not C4 !
# It may also be C3, there is no standardisation.
# It is not consistent with the scientific pitch notation.
assert librosa.note_to_midi('C5') == 60
# Tuning standard A4 = 440 Hz (A440) becomes A5 = 440Hz.
assert librosa.midi_to_hz(librosa.note_to_midi('A5')) == 440
assert librosa.midi_to_hz(69) == 440

# We should thus use C3 and C7 instead of C2 and C6...
nmin, nmax = 'C3', 'C7'
fmin = librosa.midi_to_hz(librosa.note_to_midi(nmin))
fmax = librosa.midi_to_hz(librosa.note_to_midi(nmax))
assert np.allclose(fmax / fmin, 2**No)  # By definition of an octave.
print('fmin = {:.2f} Hz ({}), fmax = {:.2f} Hz ({})'.format(
        fmin[0], nmin, fmax[0], nmax))

# librosa CQT parameters.
rosaparams = {'sr': gtzan.attrs['sr'], 'hop_length': na, 'fmin': fmin,
              'n_bins': ns, 'bins_per_octave': ns // No}  # Integer bin count.

# Data dimensionality and size.
print('Dimensionality decrease from {0} frames x 2 x {1} samples = {3:,} '
      'to {0} frames x 2 x {2} frequency bins = {4:,} per clip'
      .format(Nframes, na, ns, Nframes*2*na, Nframes*2*ns))
print('Data size N = {:,} frames of ns = {} samples -> {:,} floats'
      .format(Ngenres*Nclips*Nframes*2, ns, Ngenres*Nclips*Nframes*2*ns))

Output data

Five dimensions:

  1. Genre number in $[0,N_{genres}-1]$. The genres attribute can be indexed with this number to retrieve the name of the genre.
  2. Clip number in $[0,N_{clips}-1]$.
  3. Frame number in $[0,N_{frames}-1]$.
  4. Overlap in $\{0,1\}$: 0 for the aligned frames, 1 for the overlapped ones.
  5. Frame dimensionality in $[0,n-1]$.

Three 5-dimensional HDF5 datasets:

  1. Xa: raw audio of the frame, dimensionality $n=n_a$
  2. Xs: CQT spectrogram, dimensionality $n=n_s$
  3. Xn: LCN-normalized spectrogram, dimensionality $n=n_s$
In [ ]:
filename = os.path.join('data', 'audio.hdf5')

# Remove existing HDF5 file without warning if non-existent.
try:
    os.remove(filename)
except OSError:
    pass

# Create HDF5 file and datasets.
audio = h5py.File(filename, 'w')

# Metadata.
audio.attrs['sr'] = gtzan.attrs['sr']
genres = list(gtzan.keys())  # A list, so that genres can be indexed by number.
dtype = 'S{}'.format(max([len(genre) for genre in genres]))
audio.attrs['labels'] = np.array(genres, dtype=dtype)

# Data.
Xa = audio.create_dataset('Xa', (Ngenres, Nclips, Nframes, 2, na), dtype='float32')
Xs = audio.create_dataset('Xs', (Ngenres, Nclips, Nframes, 2, ns), dtype='float32')
#Xn = f.create_dataset('Xn', (ns, N), dtype='float32')

# Show datasets, their dimensionality and data type.
print('Datasets:')
for dname, dset in audio.items():
    print('  {:2}: {!s:16}, {}'.format(dname, dset.shape, dset.dtype))

# Display HDF5 attributes.
print('Attributes:')
for name, value in audio.attrs.items():
    print('  {} = {}'.format(name, value))
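The five-dimensional layout can be explored on a small in-memory array. This is a sketch with hypothetical toy sizes. It uses a plain NumPy array rather than an HDF5 dataset, although basic slicing works the same way on both.

```python
import numpy as np

# Toy sizes standing in for Ngenres, Nclips, Nframes and ns (hypothetical).
Ngenres, Nclips, Nframes, ns = 2, 3, 4, 5
X = np.arange(Ngenres * Nclips * Nframes * 2 * ns, dtype='float32')
X = X.reshape(Ngenres, Nclips, Nframes, 2, ns)

# All aligned frames (overlap index 0) of clip 2 in genre 1.
aligned = X[1, 2, :, 0, :]
print(aligned.shape)  # (4, 5)

# Flatten to a matrix with one frame per row, N = 2*3*4*2 = 48 rows.
frames = X.reshape(-1, ns)
print(frames.shape)   # (48, 5)

# Recover (genre, clip, frame, overlap) from a row number.
idx = np.unravel_index(17, X.shape[:-1])
print(tuple(int(i) for i in idx))  # (0, 2, 0, 1)
```

Flattening the first four dimensions gives the matrix of $N$ frames that a learning algorithm would typically consume, while the 5-dimensional view keeps the genre and clip of each frame recoverable.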

Load and process audio

  • Fill $X_s \in R^{N_{genres} \times N_{clips} \times N_{frames} \times 2 \times n_s}$ with the CQT spectrogram of all $N$ frames.
  • Store the raw audio in $X_a \in R^{N_{genres} \times N_{clips} \times N_{frames} \times 2 \times n_a}$.
  • The processing function loads the audio data, computes the CQT and stores the result in HDF5 datasets along with the raw audio.
  • It has low memory usage as it only keeps one song at a time in RAM. The rest is stored on disk via HDF5.
In [ ]:
def process(genre, clip):
    """Usage: process(1, 2)"""

    # Load audio.
    y1 = gtzan[genres[genre]][:,clip]  # Aligned frames.
    y2 = y1[na//2:]  # Overlapped frames, shifted by half a frame.

    # Store raw audio, one frame per row (C order).
    Xa[genre,clip,:,0,:] = y1[:na*Nframes].reshape(Nframes, na)
    Xa[genre,clip,:,1,:] = y2[:na*Nframes].reshape(Nframes, na)

    # Ensure that the signal is correctly reshaped (check a random frame).
    i = np.random.randint(Nframes)
    assert np.all(Xa[genre,clip,i,0,:] == y1[i*na:i*na+na])
    assert np.all(Xa[genre,clip,i,1,:] == y1[na//2+i*na:na//2+i*na+na])

    # Store spectrogram. Drop the last one which consists mostly
    # of padded data (and keep the same size as Xa).
    Xs[genre,clip,:,0,:] = librosa.cqt(y1, **rosaparams)[:,:-1].T
    Xs[genre,clip,:,1,:] = librosa.cqt(y2, **rosaparams)[:,:-1].T

Process a single clip:

In [ ]:
#process(1, 2)

Process the entire GTZAN dataset:

In [ ]:
#Ngenres, Nclips = 2, 100
tstart = time.time()
for genre in range(Ngenres):
    for clip in range(Nclips):
        process(genre, clip)
t = time.time() - tstart
print('Elapsed time: {:.0f} seconds ({:.1f} seconds per clip)'.format(
        t, t/Ngenres/Nclips))

Close HDF5 data stores

In [ ]:
gtzan.close()
audio.close()