Energy auto-encoder: classification

  • Technically akin to "transductive learning" because the sparse auto-encoder dictionary is learned over the whole dataset, not only the training data. The LeCun paper we compare with does the same. This could be solved by including the dictionary learning step in the classifier. Technical solutions:
    1. Compute the dictionary in our custom classifier.
    2. Create a scikit-learn Pipeline which includes the whole preprocessing and feature extraction steps. In either case the ability to import functions from other notebooks would help. This would be very slow due to the tremendous amount of time needed to train the auto-encoder.
  • We should use "grid search" to find the optimal hyper-parameters (auto-encoders, frames, feature vectors, SVM).
  • We may use a validation set to mitigate the leak of the testing set in the model as we tune the hyper-parameters.
  • Even if not stated in LeCun's paper we should rescale the data before SVM classification.

For the auto-encoder learning to converge we should rescale beforehand. As the transformation preserves energy, there is no need to rescale again.


  • scale: scaling (None, minmax, std)
  • Nvectors: number of feature vectors per song.
  • svm_type: C-SVM (C) or nu-SVM (nu).
  • kernel: C-SVM kernel (linear, rbf).
  • C: penalty parameter C of the error term.
  • nu: an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors.
  • majority_voting: When True, each of the 2Nvectors votes for one label and the accuracy is computed on the classification of the whole clips. When False, the accuracy is computed on the classification of the feature vectors.
  • test_size: proportion of testing data for cross-validation.
  • Ncv: number of cross-validation runs, in multiple of 10.
  • dataset_classification: the dataset to use for classification (X, Z). It allows to compare with the baseline, i.e. spectrograms.
  • Ngenres, Nclips, Nframes: a way to reduce the size of the dataset.
  • folder: relative path to HDF5 files.
  • filename_*: name of the HDF5 file.
In [ ]:
if 'p' in globals().keys():
    # Hyper-parameters passed by the experiment runner.
    for key, value in p.items():
        globals()[key] = value
    scale = 'minmax'
    Nvectors = 6
    svm_type = 'C'
    kernel = 'linear'
    C = 1
    nu = 0.5
    majority_voting = True
    test_size = 0.1
    Ncv = 20
    dataset_classification = 'Z'
    Ngenres, Nclips, Nframes = 10, 100, 644
    folder = 'data'
    filename_features = 'features.hdf5'


In [ ]:
import os, time
import numpy as np
import sklearn
from sklearn import svm
from sklearn import cross_validation
from sklearn import metrics
from sklearn import preprocessing
import h5py
import matplotlib.pyplot as plt
%matplotlib inline

print('Software versions:')
for pkg in [np, sklearn]:
    print('  {}: {}'.format(pkg.__name__, pkg.__version__))

toverall = time.time()

Input data

  1. Retrieve data from the HDF5 data store.
  2. Choose the data we want to work with:
    • raw audio $X_a$,
    • CQT spectrograms $X_s$,
    • normalized spectrograms $X_n$,
    • sparse codes $Z$.
  3. Eventually reduce the number $N_{genres} \cdot N_{clips}$ of samples for quicker analysis.
In [ ]:
def datinfo(X, name='Dataset'):
    r"""Print dataset size and dimensionality"""
          '  size: N={:,} x n={} -> {:,} floats\n'
          '  dim: {:,} features per clip\n'
          '  shape: {}'
          .format(name,[:-1]), X.shape[-1],
        ,[2:]), X.shape))
In [ ]:
filename = os.path.join(folder, filename_features)
with h5py.File(filename, 'r') as audio:

    # Display HDF5 attributes.
    for attr in audio.attrs:
        print('  {} = {}'.format(attr, audio.attrs[attr]))
    labels = audio.attrs['labels']

    # Show datasets, their dimensionality and data type.
    for dname, dset in audio.items():
        print('  {:2}: {:24}, {}'.format(dname, dset.shape, dset.dtype))

    # Choose dataset: Xa, Xs, Z.
    X = audio.get(dataset_classification)

    # Full dataset.
    n = X.shape[-1]
    datinfo(X, 'Full dataset')

    # Load data into memory as a standard NumPy array.
    X = X[:Ngenres,:Nclips,:Nframes,...]
    datinfo(X, 'Reduced dataset')

    # Resize in place without memory loading via hyperslab.
    # Require chunked datasets.
    #X.resize((Ngenres, Nclips, Nframes, 2, n))

Feature vectors through aggregation

Yet another (hopefully intelligent) dimensionality reduction:

  • Aggregation of features from various frames to make up $2N_{vectors} = 12$ feature vectors per clip. Each vector represents approximatly 5 seconds of audio which is way longer than single frames while shorter than the whole clip.
  • There is again a 50% overlap between those feature vectors.
  • Absolute value rectification to prevent components of different sign from canceling each other out. May be worth disabling when working with raw audio ($X_a$). It then makes sense to rescale with 'std' instead of 'minmax'.
  • Can be thought of as an histogram of used dictionary atoms (if using $Z$) or frequency bins (if using $X_s$) along the chosen time window.
In [ ]:
# Flatten consecutive frames in time.
X.resize((Ngenres, Nclips, 2*Nframes, n))
#assert np.all(X1[1,4,3,:] == X[1,4,1,1,:])
datinfo(X, 'Flattened frames')

# Parameters.
Nframes_per_vector = int(np.floor(2 * Nframes / (Nvectors+0.5)))

def aggregate(X, absrect=True):
    # Truncate.
    X = X[:,:,:Nvectors*Nframes_per_vector,:]
    # Group.
    X = X.reshape((Ngenres, Nclips, Nvectors, Nframes_per_vector, n))
    datinfo(X, 'Truncated and grouped')
    # Aggregate.
    if absrect:
        return np.sum(np.abs(X), axis=3)
        return np.sum(X, axis=3)

# Feature vectors.
Y = np.empty((Ngenres, Nclips, Nvectors, 2, n))
Y[:,:,:,0,:] = aggregate(X)  # Aligned.
Y[:,:,:,1,:] = aggregate(X[:,:,Nframes_per_vector/2:,:])  # Ovelapped.
datinfo(Y, 'Feature vectors')

# Free memory.

Feature vectors visualization

Visualize all feature vectors of a given clip.


  • Classical music seems to have a much denser spectrum than blues, which may explain why these two classes are easily identifiable using $X_s$.
  • Country seems to have strong low frequencies.
In [ ]:
genre, clip = 0, 7
fig = plt.figure(figsize=(8,5))
fig.suptitle('12 feature vectors each covering 5 seconds with 50% overlap')
for vector in range(Nvectors):
    for k in range(2):
        i = vector*2+k
        ax = fig.add_subplot(4, 3, i)
        ax.set_xlim((0, n))

Data preparation for classification

  1. Rearrange dataset as a 2D array: number of samples x dimensionality.
  2. Optionally scale the data.
  3. Generate labels.
  4. Optionally split in training and testing sets.
  5. Optionally randomize labels for testing.


  • Scaling is necessary for classification performance (both accuracy and speed). 'std' scaling is not well suited to our histogram-like feature vectors which are not at all Gaussian distributions. Prefer 'minmax', i.e. scale features in [0,1]. Moreover this scaling will preserve the sparsity when dealing with sparse codes $Z$.
In [ ]:
def prepdata(a, b, c, test_size=None, scale=None, rand=False):
    """Prepare data for classification."""
    # Squeeze dataset to a 2D array.
    data = Y.reshape((a*b), c)
    if c == n:
        assert np.all(data[31,:] == Y[0,2,3,1,:])
    elif c == Nvectors*2*n:
        assert np.all(data[Nclips+2,:] == Y[1,2,:,:,:].reshape(-1))

    # Independently rescale each feature.
    # To be put in an sklearn Pipeline to avoid transductive learning.
    if scale is 'std':
        # Features have zero norm and unit standard deviation.
        data = preprocessing.scale(data, axis=0)
    elif scale is 'minmax':
        # Features in [0,1].
        data -= np.min(data, axis=0)
        data /= np.max(data, axis=0)
    #print(np.min(data, axis=0))
    #print(np.max(data, axis=0))
    # Labels.
    target = np.empty((a, b), dtype=np.uint8)
    for genre in range(Ngenres):
        target[genre,:] = genre
    print('{} genres: {}'.format(Ngenres, ', '.join(labels[:Ngenres])))

    # Be sure that classification with random labels is no better than random.
    if rand:
        target = np.floor(np.random.uniform(0, Ngenres, target.shape))
        print('Balance: {} {}'.format(np.sum(target == 0), np.sum(target == 1)))

    # Training and testing sets.
    if test_size is not None:
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(
            data, target, test_size=test_size)  # random_state=1
        print('Training data: {}, {}'.format(X_train.shape, X_train.dtype))
        print('Testing data: {}, {}'.format(X_test.shape, X_test.dtype))
        print('Training labels: {}, {}'.format(y_train.shape, y_train.dtype))
        print('Testing labels: {}, {}'.format(y_test.shape, y_test.dtype))
        return X_train, X_test, y_train, y_test
        print('Data: {}, {}'.format(data.shape, data.dtype))
        print('Labels: {}, {}'.format(target.shape, target.dtype))
        return data, target

Linear SVM

  • Each feature vector gets a genre label.
  • Classification with linear Vector Support Machine (SVM).
  • Fast to train.
  • Scale well to large dataset.
  • Two implementations: liblinear (sklearn LinearSVC) and libsvm (sklearn SVC and NuSVC)
  • Multi-class: "one-vs-one" approach (Knerr et al., 1990) (sklearn SVC and NuSVC) and "one-vs-the-rest" (sklearn LinearSVC)


  • We can predict genre labels of individual frames with good accuracy using CQT spectrograms only.
  • SVC vs NuSVC vs LinearSVC:
    • 10-fold cross-validation with 10 classes (default $C=1$ and $\nu=0.5$):
      • SVC (0.56) yields better accuracy than LinearSVC (0.53) than NuSVC (0.51)
      • SVC (303s) and LinearSVC (296s) faster than NuSVC (501s)
    • SVC does often not converge if data is not scaled
    • LinearSVC may be more scalable (in the number of samples)
  • Hyper-parameters:
    • $C$ seems to have little impact.
    • $\nu$ has a great impact on speed: lower is slower

Open questions:

  • Which multi-class strategy to adopt: one-vs-all or one-vs-one ?
    • sklearn states that one-vs-all is the most common strategy
  • Determine $C$ or $\nu$.
In [ ]:
# Instantiate a classifier.

if svm_type is 'C':
    clf_svm = svm.SVC(kernel=kernel, C=C)
elif svm_type is 'nu':
    clf_svm = svm.NuSVC(kernel=kernel, nu=nu)
#clf_svm = svm.LinearSVC(C=1)
In [ ]:
# Try the single feature vector classifier (linear SVM).
if True:
    # Split data.
    X_train, X_test, y_train, y_test = prepdata(
        Ngenres, Nclips*Nvectors*2, n, test_size=0.4,
        scale=scale, rand=False)
    # Train., y_train)
    # Test.
    y_predict = clf_svm.predict(X_test)
    acc = metrics.accuracy_score(y_test, y_predict)
    print('Accuracy: {:.1f} %'.format(acc*100))

Majority voting

Final dimensionality reduction step:

  • Each of the 12 feature vectors of a clip gives a vote. We choose the genre with the highest number of votes.
  • Implemented as a custom classifier which embeds an SVM for individual feature vectors classification.
  • Alternative implementation: insert in a sklearn pipeline after SVC.


  • Accuracy on whole clips is indeed better than accuracy on individual feature vectors.
  • clf_svm_vote.confidence is useful to observe if a class is harder to differentiate.
In [ ]:
# Define and instantiate our custom classifier.
class svm_vote(sklearn.base.BaseEstimator):
    def __init__(self, svm):
        self.svm = svm
    def _vectors(self, X, y=None):
        """Rearrange data in feature vectors for SVM."""
        X = X.reshape(X.shape[0]*Nvectors*2, n)
        if y is not None:
            y = np.repeat(y, Nvectors*2, axis=0)
            assert y.shape[0] == X.shape[0]
            return (X, y)
            return (X,)
    def fit(self, X, y):
        """Fit the embedded SVC."""*self._vectors(X, y))
    def svm_score(self, X, y):
        """Return SVC accuracy on feature vectors."""
        return self.svm.score(*self._vectors(X, y))
    def svm_predict(self, X):
        """Return SVC predictions on feature vectors."""
        y = self.svm.predict(*self._vectors(X))
        y.resize(X.shape[0], Nvectors*2)
        return y
    def confidence(self, X):
        """Return the number of votes for each class."""
        def bincount(x):
            return np.bincount(x, minlength=Ngenres)
        y = np.apply_along_axis(bincount, 1, self.svm_predict(X))
        assert np.all(np.sum(y, axis=1) == Nvectors*2)
        return y
    def predict(self, X):
        """Return predictions on whole clips."""
        y = self.svm_predict(X)
        return np.apply_along_axis(lambda x: np.bincount(x).argmax(), 1, y)
        #return np.zeros(X.shape[0])  # Pretty bad prediction.
    def score(self, X, y):
        """Return the accuracy score. Used by sklearn cross-validation."""
        return metrics.accuracy_score(y, self.predict(X))

clf_svm_vote = svm_vote(clf_svm)
In [ ]:
# Try the whole clip classifier (linear SVM and majority voting).
if True:
    # Split data.
    X_train, X_test, y_train, y_test = prepdata(
        Ngenres, Nclips, Nvectors*2*n, test_size=0.4,
        scale=scale, rand=False)
    # Train., y_train)
    # Test on single vectors.
    acc = clf_svm_vote.svm_score(X_test, y_test)
    print('Feature vectors accuracy: {:.1f} %'.format(acc*100))
    # Observe individual votes.
    # Test on whole clips.
    y_predict = clf_svm_vote.predict(X_test)
    acc = metrics.accuracy_score(y_test, y_predict)
    assert acc == clf_svm_vote.score(X_test, y_test)
    print('Clips accuracy: {:.1f} %'.format(acc*100))


  • 10-fold cross-validation.
  • 100 randomly chosen clips per fold.
  • 9 folds (900 clips) for training, 1 fold (100 clips) for testing.
  • Determine a classification accuracy using testing set.
  • Repeat 10 times: mean and standard deviation.


  • Data should be shuffled as samples with the same label are contiguous, i.e. data ordering is not arbitrary.
  • ShuffleSplit, StratifiedShuffleSplit, KFold and StratifiedKFold yields similar results as long as data is shuffeld.
  • (Lots of variance between runs.)
  • Data should be rescaled for good performance (both accuracy and speed).


  • With $X_a$ (best observed)
    • Accuracy of 89 (+/- 5.0) for 2 genres (SVC, abs, minmax) (50s)
    • Accuracy of 60 (+/- 7.9) for 2 genres (SVC, noabs, minmax) (100s)
    • Accuracy of 64 (+/- 11.1) for 2 genres (SVC, noabs, std) (1000s)
  • With $X_s$ (best observed)
    • Accuracy of 96 (+/- 4.7) for 2 genres (SVC, minmax) (2s, CDK 1s)
    • Accuracy of 81 (+/- 4) for 4 genres (SVC, minmax) (14s)
    • Accuracy of 76 (+/- 3.1) for 5 genres (SVC, minmax) (CDK 12s)
    • Accuracy of 56 (+/- 5) for 10 genres (SVC, minmax) (300s)
    • Accuracy of 53 (+/- 3) for 10 genres (LinearSVC, minmax) (300s)
    • Accuracy of 51 (+/- 5) for 10 genres (NuSVC, minmax) (500s)
  • With $Z$ (best observed) (all with SVC, no normalization if not mentioned)
    • Accuracy of 96 (+/- 3.2) for 2 genres (ld=10, m=128, minmax, 10 outer) (CDK 1s)
    • Accuracy of 98 (+/- 2.4) for 2 genres (ld=10, m=128, 10 outer) (CDK 1s)
    • Accuracy of 98 (+/- 2.5) for 2 genres (ld=10, m=128, encoder, 10 outer) (CDK 1s)
    • Accuracy of 98 (+/- 2.5) for 2 genres (ld=10, m=128, 20 outer) (CDK 1s)
    • Accuracy of 98 (+/- 2.5) for 2 genres (ld=100, m=128, 15 outer) (CDK 1s)
    • Accuracy of 58 (+/- 10.5) for 2 genres (ld=1, m=128, 15 outer) (CDK 4s)
    • Accuracy of 99 (+/- 3.2) for 2 genres (ld=10, m=512, 15 outer) (CDK 2s)
    • Accuracy of 79 (+/- 2.7) for 5 genres (ld=10, m=512, 20 outer) (CDK 28s)
    • Accuracy of 65 (+/- 3.6) for 10 genres (ld=10, m=512, 15 outer) (CDK 167s)


  • Use the area under the receiver operating characteristing (ROC) curve (AUC). Not sure if applicable to multi-class.
In [ ]:
if majority_voting:
    clf = clf_svm_vote
    b = Nclips
    c = Nvectors*2*n
    clf = clf_svm
    b = Nclips*Nvectors*2
    c = n

data, target = prepdata(Ngenres, b, c, scale=scale)
print('Ratio: {} training, {} testing'.format(
        (1-test_size)*target.size, test_size*target.size))

tstart = time.time()
scores = np.empty(shape=(Ncv, 10))

# Cross-validation iterators.
cv = cross_validation.ShuffleSplit(target.size, n_iter=10, test_size=test_size)
#cv = cross_validation.StratifiedShuffleSplit(target.size, n_iter=10, test_size=test_size)
#cv = cross_validation.KFold(target.size, shuffle=True, n_folds=10)
#cv = cross_validation.StratifiedKFold(target, shuffle=True, n_folds=10)

for i in range(Ncv):
    scores[i,:] = cross_validation.cross_val_score(
                  clf, data, target, cv=cv, n_jobs=1)
    # Performance: accuracy.
    mean, std = scores[i,:].mean()*100, scores[i,:].std()*100
    print(' {:3.0f} (+/-{:4.1f}) <- {}'.format(mean, std, (scores[i,:]*100).astype(

accuracy, accuracy_std = scores.mean()*100, scores.std()*100
print('Accuracy: {:.1f} (+/- {:.2f})'.format(accuracy, accuracy_std))

meantime = (time.time() - tstart) / Ncv
print('Mean time ({} cv): {:.2f} seconds'.format(Ncv, meantime))
print('Overall time: {:.2f} seconds'.format(time.time() - toverall))