Genre recognition: experiment

Goal: Test if the cosine similarity (with zero-mean data) or the euclidean distance with a Gaussian kernel (without zero-mean data) provides a better graph that the cosine distance with a Gaussian kernel (without zero-mean data).

Conclusion: cosine_sim and euclidean metrics provide a 2% increase in accuracy with respect to cosine_dist.

Observations:

  • With Ncv=20 (thanks to decreased variance), 10 minutes per classification (instead of 20).
  • With lg=100, 10 minutes for feature extraction.
  • 10 minutes per graph. Should run for 1h30.
  • cosine_dist failed because it was miss-spelled. Hopefully all the last experiments used this metric. Accuracy was 64.01% in 13c_novoting and 64.03% in 13d_noise_level.
  • euclidean is 7% faster than cosine_sim while cosine_dist is the fastest with an increse of 47% to cosine_sim.

Hyper-parameters

Parameter under test

In [1]:
Pname = 'dm'
Pvalues = ['cosine_sim', 'cosine_dim', 'euclidean']

# Regenerate the graph or the features at each iteration.
regen_graph = True
regen_features = True

Model parameters

In [2]:
p = {}

# Preprocessing.

# Graph.
p['K'] = 10 + 1  # 5 to 10 + 1 for self-reference
p['dm'] = 'cosine'
p['Csigma'] = 1
p['diag'] = True
p['laplacian'] = 'normalized'

# Feature extraction.
p['m'] = 128  # 64, 128, 512
p['ls'] = 1
p['ld'] = 10
p['le'] = None
p['lg'] = 100

# Classification.
p['scale'] = None
p['Nvectors'] = 6
p['svm_type'] = 'C'
p['kernel'] = 'linear'
p['C'] = 1
p['nu'] = 0.5
p['majority_voting'] = False

Data parameters

In [3]:
# HDF5 data stores.
p['folder'] = 'data'
p['filename_gtzan'] = 'gtzan.hdf5'
p['filename_audio'] = 'audio.hdf5'
p['filename_graph'] = 'graph.hdf5'
p['filename_features'] = 'features.hdf5'

# Dataset (10,100,644 | 5,100,149 | 2,10,644).
p['Ngenres'] = 5
p['Nclips'] = 100
p['Nframes'] = 149

# Added white noise.
p['noise_std'] = 0.1

Numerical parameters

In [4]:
# Graph.
p['tol'] = 1e-5

# Feature extraction.
p['rtol'] = 1e-5  # 1e-3, 1e-5, 1e-7
p['N_inner'] = 500
p['N_outer'] = 50

# Classification.
p['Nfolds'] = 10
p['Ncv'] = 20
p['dataset_classification'] = 'Z'

Processing

In [5]:
import numpy as np
import time

texperiment = time.time()

# Result dictionary.
res = ['accuracy', 'accuracy_std']
res += ['sparsity', 'atoms']
res += ['objective_g', 'objective_h', 'objective_i', 'objective_j']
res += ['time_features', 'iterations_inner', 'iterations_outer']
res = dict.fromkeys(res)
for key in res.keys():
    res[key] = []

def separator(name, parameter=False):
    if parameter:
        name += ', {} = {}'.format(Pname, p[Pname])
    dashes = 20 * '-'
    print('\n {} {} {} \n'.format(dashes, name, dashes))
    # Fair comparison when tuning parameters.
    # Randomnesses: dictionary initialization, training and testing sets.
    np.random.seed(1)
In [6]:
#%run gtzan.ipynb
#%run audio_preprocessing.ipynb
if not regen_graph:
    separator('Graph')
    %run audio_graph.ipynb
if not regen_features:
    separator('Features')
    %run audio_features.ipynb

# Hyper-parameter under test.
for p[Pname] in Pvalues:

    if regen_graph:
        separator('Graph', True)
        %run audio_graph.ipynb
    if regen_features:
        separator('Features', True)
        p['filename_features'] = 'features_{}_{}.hdf5'.format(Pname, p[Pname])
        %run audio_features.ipynb
    separator('Classification', True)
    %run audio_classification.ipynb
    
    # Collect results.
    for key in res:
        res[key].append(globals()[key])

# Baseline, i.e. classification with spectrograms.
p['dataset_classification'] = 'X'
p['scale'] = 'minmax'  # Todo: should be done in pre-processing.
if not regen_graph and not regen_features:
    # Classifier parameters are being tested.
    for p[Pname] in Pvalues:
        separator('Baseline', True)
        %run audio_classification.ipynb
else:
    separator('Baseline')
    %run audio_classification.ipynb
res['baseline'] = len(Pvalues) * [accuracy]
res['baseline_std'] = accuracy_std
 -------------------- Graph, dm = cosine_sim -------------------- 

Data: (149000, 96), float32
Elapsed time: 1257.85 seconds
All self-referenced in the first column: True
dist in [0.560503005981, 1.0]
w in [0.560503005981, 1.0]
Ones on the diagonal: 149000 (over 149000)
assert: True
W in [0.0, 1.0]
Datasets:
  L_data    : (2925296,), float32
  L_indices : (2925296,), int32
  L_indptr  : (149001,) , int32
  L_shape   : (2,)      , int64
  W_data    : (2925296,), float32
  W_indices : (2925296,), int32
  W_indptr  : (149001,) , int32
  W_shape   : (2,)      , int64
Attributes:
  K = 11
  dm = cosine_sim
  Csigma = 1
  diag = True
  laplacian = normalized
Overall time: 1267.74 seconds

 -------------------- Features, dm = cosine_sim -------------------- 

Attributes:
  sr = 22050
  labels = ['blues' 'classical' 'country' 'disco' 'hiphop' 'jazz' 'metal' 'pop'
 'reggae' 'rock']
Datasets:
  Xa: (10, 100, 644, 2, 1024) , float32
  Xs: (10, 100, 644, 2, 96)   , float32
Full dataset:
  size: N=1,288,000 x n=96 -> 123,648,000 floats
  dim: 123,648 features per clip
  shape: (10, 100, 644, 2, 96)
<class 'h5py._hl.dataset.Dataset'>
Reduced dataset:
  size: N=149,000 x n=96 -> 14,304,000 floats
  dim: 28,608 features per clip
  shape: (5, 100, 149, 2, 96)
<type 'numpy.ndarray'>
Data: (149000, 96), float32
Attributes:
  K = 11
  dm = cosine_sim
  Csigma = 1
  diag = True
  laplacian = normalized
Datasets:
  L_data    : (2925296,), float32
  L_indices : (2925296,), int32
  L_indptr  : (149001,) , int32
  L_shape   : (2,)      , int64
  W_data    : (2925296,), float32
  W_indices : (2925296,), int32
  W_indptr  : (149001,) , int32
  W_shape   : (2,)      , int64
Size X: 13.6 M --> 54.6 MiB
Size Z: 18.2 M --> 72.8 MiB
Size D: 12.0 k --> 48.0 kiB
Size E: 12.0 k --> 48.0 kiB
Elapsed time: 982 seconds
Inner loop: 384 iterations
g(Z) = ||X-DZ||_2^2 = 7.365278e+05
rdiff: 0.000154546947268
i(Z) = ||Z||_1 = 7.460725e+04
j(Z) = tr(Z^TLZ) = 2.092982e+04
Global objective: 8.320649e+05
Outer loop: 5 iterations

Z in [-0.0605589039624, 0.252947568893]
Sparsity of Z: 8,314,786 non-zero entries out of 19,072,000 entries, i.e. 43.6%.
D in [-0.138522788882, 0.704355895519]
d in [0.999999701977, 1.00000023842]
Constraints on D: True
Datasets:
  D : (128, 96)             , float32
  X : (5, 100, 149, 2, 96)  , float32
  Z : (5, 100, 149, 2, 128) , float32
Attributes:
  sr = 22050
  labels = ['blues' 'classical' 'country' 'disco' 'hiphop' 'jazz' 'metal' 'pop'
 'reggae' 'rock']
Overall time: 990 seconds

 -------------------- Classification, dm = cosine_sim -------------------- 

Software versions:
  numpy: 1.8.2
  sklearn: 0.14.1
Attributes:
  sr = 22050
  labels = ['blues' 'classical' 'country' 'disco' 'hiphop' 'jazz' 'metal' 'pop'
 'reggae' 'rock']
Datasets:
  D : (128, 96)               , float32
  X : (5, 100, 149, 2, 96)    , float32
  Z : (5, 100, 149, 2, 128)   , float32
Full dataset:
  size: N=149,000 x n=128 -> 19,072,000 floats
  dim: 38,144 features per clip
  shape: (5, 100, 149, 2, 128)
<class 'h5py._hl.dataset.Dataset'>
Reduced dataset:
  size: N=149,000 x n=128 -> 19,072,000 floats
  dim: 38,144 features per clip
  shape: (5, 100, 149, 2, 128)
<type 'numpy.ndarray'>
Flattened frames:
  size: N=149,000 x n=128 -> 19,072,000 floats
  dim: 38,144 features per clip
  shape: (5, 100, 298, 128)
Truncated and grouped:
  size: N=135,000 x n=128 -> 17,280,000 floats
  dim: 34,560 features per clip
  shape: (5, 100, 6, 45, 128)
Truncated and grouped:
  size: N=135,000 x n=128 -> 17,280,000 floats
  dim: 34,560 features per clip
  shape: (5, 100, 6, 45, 128)
Feature vectors:
  size: N=6,000 x n=128 -> 768,000 floats
  dim: 1,536 features per clip
  shape: (5, 100, 6, 2, 128)
5 genres: blues, classical, country, disco, hiphop
Training data: (3600, 128), float64
Testing data: (2400, 128), float64
Training labels: (3600,), uint8
Testing labels: (2400,), uint8
Accuracy: 65.3 %
5 genres: blues, classical, country, disco, hiphop
Training data: (300, 1536), float64
Testing data: (200, 1536), float64
Training labels: (300,), uint8
Testing labels: (200,), uint8
Feature vectors accuracy: 60.0 %
Clips accuracy: 67.5 %
5 genres: blues, classical, country, disco, hiphop
Data: (6000, 128), float64
Labels: (6000,), uint8
  67 (+/- 1.5) <- [65 65 64 68 68 68 66 64 66 67]
  66 (+/- 3.0) <- [60 70 68 67 64 64 71 66 65 66]
  66 (+/- 1.7) <- [67 66 62 67 67 66 67 65 65 68]
  66 (+/- 1.1) <- [64 67 66 67 64 66 68 65 66 65]
  66 (+/- 2.1) <- [63 66 68 66 65 64 66 71 65 66]
  67 (+/- 2.1) <- [65 65 65 71 68 66 65 68 63 65]
  67 (+/- 1.6) <- [66 65 66 68 68 65 67 63 68 67]
  67 (+/- 1.4) <- [67 66 65 68 66 68 64 67 68 64]
  66 (+/- 2.1) <- [68 65 66 68 64 61 68 68 67 66]
  67 (+/- 1.8) <- [70 69 65 67 68 67 65 65 66 64]
  67 (+/- 1.6) <- [66 68 66 67 68 69 66 66 66 62]
  67 (+/- 1.9) <- [63 67 67 68 65 68 67 66 62 68]
  67 (+/- 2.3) <- [67 68 68 68 66 65 68 61 63 66]
  67 (+/- 1.9) <- [63 65 67 67 67 66 65 65 67 71]
  67 (+/- 2.7) <- [65 66 63 66 70 62 67 68 71 64]
  66 (+/- 1.4) <- [66 65 65 69 64 66 65 67 65 67]
  66 (+/- 1.7) <- [67 63 66 66 68 68 64 66 65 64]
  66 (+/- 2.4) <- [66 62 67 66 67 71 64 66 64 66]
  67 (+/- 1.6) <- [65 66 67 64 65 66 64 69 67 69]
  66 (+/- 1.7) <- [65 67 65 67 68 66 67 64 62 67]
Accuracy: 66.5 (+/- 1.94)
Mean time (20 cv): 21.83 seconds
Overall time: 441.52 seconds

 -------------------- Graph, dm = cosine_dim -------------------- 

Data: (149000, 96), float32
Elapsed time: 1294.85 seconds
All self-referenced in the first column: True
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-a8e755cefddd> in <module>()
      8     dist = 1 - dist / 4.
      9 else:
---> 10     raise ValueError
     11 
     12 print('dist in [{}, {}]'.format(dist.min(), dist.max()))

ValueError: 
 -------------------- Features, dm = cosine_dim -------------------- 

Attributes:
  sr = 22050
  labels = ['blues' 'classical' 'country' 'disco' 'hiphop' 'jazz' 'metal' 'pop'
 'reggae' 'rock']
Datasets:
  Xa: (10, 100, 644, 2, 1024) , float32
  Xs: (10, 100, 644, 2, 96)   , float32
Full dataset:
  size: N=1,288,000 x n=96 -> 123,648,000 floats
  dim: 123,648 features per clip
  shape: (10, 100, 644, 2, 96)
<class 'h5py._hl.dataset.Dataset'>
Reduced dataset:
  size: N=149,000 x n=96 -> 14,304,000 floats
  dim: 28,608 features per clip
  shape: (5, 100, 149, 2, 96)
<type 'numpy.ndarray'>
Data: (149000, 96), float32
Attributes:
  K = 11
  dm = cosine_sim
  Csigma = 1
  diag = True
  laplacian = normalized
Datasets:
  L_data    : (2925296,), float32
  L_indices : (2925296,), int32
  L_indptr  : (149001,) , int32
  L_shape   : (2,)      , int64
  W_data    : (2925296,), float32
  W_indices : (2925296,), int32
  W_indptr  : (149001,) , int32
  W_shape   : (2,)      , int64
Size X: 13.6 M --> 54.6 MiB
Size Z: 18.2 M --> 72.8 MiB
Size D: 12.0 k --> 48.0 kiB
Size E: 12.0 k --> 48.0 kiB
Elapsed time: 973 seconds
Inner loop: 384 iterations
g(Z) = ||X-DZ||_2^2 = 7.365278e+05
rdiff: 0.000154546947268
i(Z) = ||Z||_1 = 7.460725e+04
j(Z) = tr(Z^TLZ) = 2.092982e+04
Global objective: 8.320649e+05
Outer loop: 5 iterations

Z in [-0.0605589039624, 0.252947568893]
Sparsity of Z: 8,314,786 non-zero entries out of 19,072,000 entries, i.e. 43.6%.
D in [-0.138522788882, 0.704355895519]
d in [0.999999701977, 1.00000023842]
Constraints on D: True
Datasets:
  D : (128, 96)             , float32
  X : (5, 100, 149, 2, 96)  , float32
  Z : (5, 100, 149, 2, 128) , float32
Attributes:
  sr = 22050
  labels = ['blues' 'classical' 'country' 'disco' 'hiphop' 'jazz' 'metal' 'pop'
 'reggae' 'rock']
Overall time: 982 seconds

 -------------------- Classification, dm = cosine_dim -------------------- 

Software versions:
  numpy: 1.8.2
  sklearn: 0.14.1
Attributes:
  sr = 22050
  labels = ['blues' 'classical' 'country' 'disco' 'hiphop' 'jazz' 'metal' 'pop'
 'reggae' 'rock']
Datasets:
  D : (128, 96)               , float32
  X : (5, 100, 149, 2, 96)    , float32
  Z : (5, 100, 149, 2, 128)   , float32
Full dataset:
  size: N=149,000 x n=128 -> 19,072,000 floats
  dim: 38,144 features per clip
  shape: (5, 100, 149, 2, 128)
<class 'h5py._hl.dataset.Dataset'>
Reduced dataset:
  size: N=149,000 x n=128 -> 19,072,000 floats
  dim: 38,144 features per clip
  shape: (5, 100, 149, 2, 128)
<type 'numpy.ndarray'>
Flattened frames:
  size: N=149,000 x n=128 -> 19,072,000 floats
  dim: 38,144 features per clip
  shape: (5, 100, 298, 128)
Truncated and grouped:
  size: N=135,000 x n=128 -> 17,280,000 floats
  dim: 34,560 features per clip
  shape: (5, 100, 6, 45, 128)
Truncated and grouped:
  size: N=135,000 x n=128 -> 17,280,000 floats
  dim: 34,560 features per clip
  shape: (5, 100, 6, 45, 128)
Feature vectors:
  size: N=6,000 x n=128 -> 768,000 floats
  dim: 1,536 features per clip
  shape: (5, 100, 6, 2, 128)
5 genres: blues, classical, country, disco, hiphop
Training data: (3600, 128), float64
Testing data: (2400, 128), float64
Training labels: (3600,), uint8
Testing labels: (2400,), uint8
Accuracy: 65.3 %
5 genres: blues, classical, country, disco, hiphop
Training data: (300, 1536), float64
Testing data: (200, 1536), float64
Training labels: (300,), uint8
Testing labels: (200,), uint8
Feature vectors accuracy: 60.0 %
Clips accuracy: 67.5 %
5 genres: blues, classical, country, disco, hiphop
Data: (6000, 128), float64
Labels: (6000,), uint8
  67 (+/- 1.5) <- [65 65 64 68 68 68 66 64 66 67]
  66 (+/- 3.0) <- [60 70 68 67 64 64 71 66 65 66]
  66 (+/- 1.7) <- [67 66 62 67 67 66 67 65 65 68]
  66 (+/- 1.1) <- [64 67 66 67 64 66 68 65 66 65]
  66 (+/- 2.1) <- [63 66 68 66 65 64 66 71 65 66]
  67 (+/- 2.1) <- [65 65 65 71 68 66 65 68 63 65]
  67 (+/- 1.6) <- [66 65 66 68 68 65 67 63 68 67]
  67 (+/- 1.4) <- [67 66 65 68 66 68 64 67 68 64]
  66 (+/- 2.1) <- [68 65 66 68 64 61 68 68 67 66]
  67 (+/- 1.8) <- [70 69 65 67 68 67 65 65 66 64]
  67 (+/- 1.6) <- [66 68 66 67 68 69 66 66 66 62]
  67 (+/- 1.9) <- [63 67 67 68 65 68 67 66 62 68]
  67 (+/- 2.3) <- [67 68 68 68 66 65 68 61 63 66]
  67 (+/- 1.9) <- [63 65 67 67 67 66 65 65 67 71]
  67 (+/- 2.7) <- [65 66 63 66 70 62 67 68 71 64]
  66 (+/- 1.4) <- [66 65 65 69 64 66 65 67 65 67]
  66 (+/- 1.7) <- [67 63 66 66 68 68 64 66 65 64]
  66 (+/- 2.4) <- [66 62 67 66 67 71 64 66 64 66]
  67 (+/- 1.6) <- [65 66 67 64 65 66 64 69 67 69]
  66 (+/- 1.7) <- [65 67 65 67 68 66 67 64 62 67]
Accuracy: 66.5 (+/- 1.94)
Mean time (20 cv): 21.77 seconds
Overall time: 440.49 seconds

 -------------------- Graph, dm = euclidean -------------------- 

Data: (149000, 96), float32
Elapsed time: 1183.53 seconds
All self-referenced in the first column: True
dist in [0.0, 2.08629345894]
w in [0.198544532061, 1.0]
Ones on the diagonal: 149000 (over 149000)
assert: True
W in [0.0, 1.0]
Datasets:
  L_data    : (2919394,), float32
  L_indices : (2919394,), int32
  L_indptr  : (149001,) , int32
  L_shape   : (2,)      , int64
  W_data    : (2919394,), float32
  W_indices : (2919394,), int32
  W_indptr  : (149001,) , int32
  W_shape   : (2,)      , int64
Attributes:
  K = 11
  dm = euclidean
  Csigma = 1
  diag = True
  laplacian = normalized
Overall time: 1193.83 seconds

 -------------------- Features, dm = euclidean -------------------- 

Attributes:
  sr = 22050
  labels = ['blues' 'classical' 'country' 'disco' 'hiphop' 'jazz' 'metal' 'pop'
 'reggae' 'rock']
Datasets:
  Xa: (10, 100, 644, 2, 1024) , float32
  Xs: (10, 100, 644, 2, 96)   , float32
Full dataset:
  size: N=1,288,000 x n=96 -> 123,648,000 floats
  dim: 123,648 features per clip
  shape: (10, 100, 644, 2, 96)
<class 'h5py._hl.dataset.Dataset'>
Reduced dataset:
  size: N=149,000 x n=96 -> 14,304,000 floats
  dim: 28,608 features per clip
  shape: (5, 100, 149, 2, 96)
<type 'numpy.ndarray'>
Data: (149000, 96), float32
Attributes:
  K = 11
  dm = euclidean
  Csigma = 1
  diag = True
  laplacian = normalized
Datasets:
  L_data    : (2919394,), float32
  L_indices : (2919394,), int32
  L_indptr  : (149001,) , int32
  L_shape   : (2,)      , int64
  W_data    : (2919394,), float32
  W_indices : (2919394,), int32
  W_indptr  : (149001,) , int32
  W_shape   : (2,)      , int64
Size X: 13.6 M --> 54.6 MiB
Size Z: 18.2 M --> 72.8 MiB
Size D: 12.0 k --> 48.0 kiB
Size E: 12.0 k --> 48.0 kiB
Elapsed time: 905 seconds
Inner loop: 358 iterations
g(Z) = ||X-DZ||_2^2 = 7.395564e+05
rdiff: 0.000189513942163
i(Z) = ||Z||_1 = 7.337558e+04
j(Z) = tr(Z^TLZ) = 2.211262e+04
Global objective: 8.350446e+05
Outer loop: 5 iterations

Z in [-0.108043551445, 0.236873194575]
Sparsity of Z: 8,158,022 non-zero entries out of 19,072,000 entries, i.e. 42.8%.
D in [-0.135033249855, 0.678504288197]
d in [0.999999701977, 1.00000023842]
Constraints on D: True
Datasets:
  D : (128, 96)             , float32
  X : (5, 100, 149, 2, 96)  , float32
  Z : (5, 100, 149, 2, 128) , float32
Attributes:
  sr = 22050
  labels = ['blues' 'classical' 'country' 'disco' 'hiphop' 'jazz' 'metal' 'pop'
 'reggae' 'rock']
Overall time: 915 seconds

 -------------------- Classification, dm = euclidean -------------------- 

Software versions:
  numpy: 1.8.2
  sklearn: 0.14.1
Attributes:
  sr = 22050
  labels = ['blues' 'classical' 'country' 'disco' 'hiphop' 'jazz' 'metal' 'pop'
 'reggae' 'rock']
Datasets:
  D : (128, 96)               , float32
  X : (5, 100, 149, 2, 96)    , float32
  Z : (5, 100, 149, 2, 128)   , float32
Full dataset:
  size: N=149,000 x n=128 -> 19,072,000 floats
  dim: 38,144 features per clip
  shape: (5, 100, 149, 2, 128)
<class 'h5py._hl.dataset.Dataset'>
Reduced dataset:
  size: N=149,000 x n=128 -> 19,072,000 floats
  dim: 38,144 features per clip
  shape: (5, 100, 149, 2, 128)
<type 'numpy.ndarray'>
Flattened frames:
  size: N=149,000 x n=128 -> 19,072,000 floats
  dim: 38,144 features per clip
  shape: (5, 100, 298, 128)
Truncated and grouped:
  size: N=135,000 x n=128 -> 17,280,000 floats
  dim: 34,560 features per clip
  shape: (5, 100, 6, 45, 128)
Truncated and grouped:
  size: N=135,000 x n=128 -> 17,280,000 floats
  dim: 34,560 features per clip
  shape: (5, 100, 6, 45, 128)
Feature vectors:
  size: N=6,000 x n=128 -> 768,000 floats
  dim: 1,536 features per clip
  shape: (5, 100, 6, 2, 128)
5 genres: blues, classical, country, disco, hiphop
Training data: (3600, 128), float64
Testing data: (2400, 128), float64
Training labels: (3600,), uint8
Testing labels: (2400,), uint8
Accuracy: 65.3 %
5 genres: blues, classical, country, disco, hiphop
Training data: (300, 1536), float64
Testing data: (200, 1536), float64
Training labels: (300,), uint8
Testing labels: (200,), uint8
Feature vectors accuracy: 59.3 %
Clips accuracy: 66.5 %
5 genres: blues, classical, country, disco, hiphop
Data: (6000, 128), float64
Labels: (6000,), uint8
  67 (+/- 1.6) <- [65 69 63 68 68 68 66 66 68 67]
  67 (+/- 2.7) <- [62 70 69 67 63 67 69 64 64 68]
  67 (+/- 2.1) <- [67 66 62 68 69 66 68 66 64 69]
  67 (+/- 1.6) <- [66 67 65 68 64 67 70 65 66 66]
  67 (+/- 2.3) <- [63 65 65 66 64 66 68 72 66 67]
  67 (+/- 2.4) <- [66 65 66 71 68 65 65 71 64 64]
  67 (+/- 1.8) <- [67 63 67 68 67 67 66 64 66 69]
  67 (+/- 1.2) <- [66 65 67 66 65 66 66 68 69 65]
  67 (+/- 1.8) <- [68 65 66 67 65 63 68 67 66 69]
  67 (+/- 1.2) <- [69 67 66 67 66 67 66 65 67 65]
  67 (+/- 1.9) <- [68 66 67 64 66 70 66 65 67 64]
  67 (+/- 1.9) <- [64 68 68 68 65 68 65 67 64 68]
  67 (+/- 2.2) <- [68 69 68 67 67 65 66 61 64 68]
  67 (+/- 2.5) <- [65 64 66 69 66 66 65 63 67 72]
  67 (+/- 3.1) <- [65 68 63 65 70 60 68 69 70 64]
  66 (+/- 2.1) <- [68 65 63 70 66 67 64 68 65 64]
  66 (+/- 1.9) <- [66 62 66 66 70 67 64 67 66 65]
  66 (+/- 2.2) <- [65 62 67 67 67 69 65 67 62 67]
  67 (+/- 1.7) <- [64 66 66 64 67 66 65 67 68 70]
  67 (+/- 2.0) <- [66 67 65 68 68 67 67 64 62 69]
Accuracy: 66.8 (+/- 2.08)
Mean time (20 cv): 21.58 seconds
Overall time: 436.91 seconds

 -------------------- Baseline -------------------- 

Software versions:
  numpy: 1.8.2
  sklearn: 0.14.1
Attributes:
  sr = 22050
  labels = ['blues' 'classical' 'country' 'disco' 'hiphop' 'jazz' 'metal' 'pop'
 'reggae' 'rock']
Datasets:
  D : (128, 96)               , float32
  X : (5, 100, 149, 2, 96)    , float32
  Z : (5, 100, 149, 2, 128)   , float32
Full dataset:
  size: N=149,000 x n=96 -> 14,304,000 floats
  dim: 28,608 features per clip
  shape: (5, 100, 149, 2, 96)
<class 'h5py._hl.dataset.Dataset'>
Reduced dataset:
  size: N=149,000 x n=96 -> 14,304,000 floats
  dim: 28,608 features per clip
  shape: (5, 100, 149, 2, 96)
<type 'numpy.ndarray'>
Flattened frames:
  size: N=149,000 x n=96 -> 14,304,000 floats
  dim: 28,608 features per clip
  shape: (5, 100, 298, 96)
Truncated and grouped:
  size: N=135,000 x n=96 -> 12,960,000 floats
  dim: 25,920 features per clip
  shape: (5, 100, 6, 45, 96)
Truncated and grouped:
  size: N=135,000 x n=96 -> 12,960,000 floats
  dim: 25,920 features per clip
  shape: (5, 100, 6, 45, 96)
Feature vectors:
  size: N=6,000 x n=96 -> 576,000 floats
  dim: 1,152 features per clip
  shape: (5, 100, 6, 2, 96)
5 genres: blues, classical, country, disco, hiphop
Training data: (3600, 96), float64
Testing data: (2400, 96), float64
Training labels: (3600,), uint8
Testing labels: (2400,), uint8
Accuracy: 59.3 %
5 genres: blues, classical, country, disco, hiphop
Training data: (300, 1152), float64
Testing data: (200, 1152), float64
Training labels: (300,), uint8
Testing labels: (200,), uint8
Feature vectors accuracy: 54.8 %
Clips accuracy: 62.5 %
5 genres: blues, classical, country, disco, hiphop
Data: (6000, 96), float64
Labels: (6000,), uint8
  58 (+/- 1.5) <- [57 59 55 59 59 59 57 55 59 59]
  59 (+/- 2.7) <- [52 61 62 58 59 58 63 58 57 58]
  59 (+/- 2.4) <- [60 57 53 61 59 58 59 58 56 62]
  59 (+/- 1.8) <- [57 60 56 61 56 59 60 60 59 56]
  59 (+/- 2.2) <- [55 57 56 58 57 59 60 62 58 61]
  59 (+/- 1.7) <- [56 59 57 61 60 56 59 59 58 55]
  59 (+/- 1.0) <- [59 57 58 61 58 58 59 57 59 59]
  59 (+/- 2.1) <- [57 57 57 57 57 58 59 61 63 55]
  59 (+/- 2.3) <- [59 56 57 57 55 55 59 61 59 62]
  59 (+/- 1.9) <- [59 58 60 62 56 62 58 59 59 56]
  59 (+/- 2.5) <- [57 61 58 60 55 64 56 56 58 55]
  59 (+/- 1.5) <- [60 59 61 59 56 60 59 59 57 56]
  59 (+/- 2.8) <- [59 62 57 58 52 62 60 55 59 58]
  59 (+/- 2.0) <- [59 55 58 58 56 59 57 57 62 60]
  58 (+/- 2.6) <- [57 60 54 59 60 54 58 57 62 58]
  59 (+/- 2.1) <- [56 61 55 61 56 61 57 59 57 58]
  59 (+/- 1.8) <- [58 57 59 61 60 61 55 59 60 56]
  59 (+/- 1.4) <- [59 58 60 58 58 57 58 57 55 60]
  58 (+/- 1.3) <- [57 57 56 57 58 59 59 60 59 59]
  59 (+/- 2.4) <- [58 60 57 61 62 61 56 54 58 59]
Accuracy: 58.7 (+/- 2.07)
Mean time (20 cv): 19.52 seconds
Overall time: 395.24 seconds

Results

In [7]:
print('{} = {}'.format(Pname, Pvalues))
for key, value in res.items():
    if key is not 'atoms':
        print('res[\'{}\'] = {}'.format(key, value))

def plot(*args, **kwargs):
    plt.figure(figsize=(8,5))
    x = range(len(Pvalues))
    log = 'log' in kwargs and kwargs['log'] is True
    pltfunc = plt.semilogy if log else plt.plot
    params = {}
    params['linestyle'] = '-'
    params['marker'] = '.'
    params['markersize'] = 10
    for i, var in enumerate(args):
        if 'err' in kwargs:
            pltfunc = plt.errorbar
            params['yerr'] = res[kwargs['err'][i]]
            params['capsize'] = 5
        pltfunc(x, res[var], label=var, **params)
        for i,j in zip(x, res[var]):
            plt.annotate('{:.2f}'.format(j), xy=(i,j), xytext=(5,5), textcoords='offset points')
    margin = 0.25
    params['markersize'] = 10
    plt.xlim(-margin, len(Pvalues)-1+margin)
    if 'ylim' in kwargs:
        plt.ylim(kwargs['ylim'])
    plt.title('{} vs {}'.format(', '.join(args), Pname))
    plt.xlabel(Pname)
    plt.ylabel(' ,'.join(args))
    plt.xticks(x, Pvalues)
    plt.grid(True); plt.legend(loc='best'); plt.show()

def div(l):
    div = Pvalues if Pname is l else [p[l]]
    return np.array([1 if v is None else v for v in div])

# Classification results.
res['chance'] = len(Pvalues) * [100./p['Ngenres']]
res['chance_std'] = 0
err=['accuracy_std', 'baseline_std', 'chance_std']
plot('accuracy', 'baseline', 'chance', err=err, ylim=[0,100])

# Features extraction results.
if regen_features:
    plot('objective_g', 'objective_i', 'objective_j', log=True)
    # Unweighted objectives.
    print('g(Z) = ||X-DZ||_2^2, h(Z) = ||Z-EX||_2^2, i(Z) = ||Z||_1, j(Z) = tr(Z^TLZ)')
    res['objective_g_un'] = res['objective_g'] / div('ld')
    res['objective_i_un'] = res['objective_i'] / div('ls')
    res['objective_j_un'] = res['objective_j'] / div('lg')
    plot('objective_g_un', 'objective_i_un', 'objective_j_un', log=True)
    plot('sparsity', ylim=[0,100])
    plot('time_features')
    plot('iterations_inner')
    plot('iterations_outer')

    for i, fig in enumerate(res['atoms']):
        print('Dictionary atoms for {} = {}'.format(Pname, Pvalues[i]))
        fig.show()

print('Experiment time: {:.0f} seconds'.format(time.time() - texperiment))
dm = ['cosine_sim', 'cosine_dim', 'euclidean']
res['accuracy_std'] = [1.9390674663066498, 1.9390674663066498, 2.0771909450559005]
res['objective_j'] = [20929.82177734375, 20929.82177734375, 22112.623596191406]
res['objective_i'] = [74607.25, 74607.25, 73375.578125]
res['objective_h'] = [0, 0, 0]
res['objective_g'] = [736527.8125, 736527.8125, 739556.40625]
res['baseline'] = [58.740000000000023, 58.740000000000023, 58.740000000000023]
res['time_features'] = [981.9027650356293, 973.1077239513397, 905.0378549098969]
res['baseline_std'] = 2.07362002305
res['sparsity'] = [43.596822567114096, 43.596822567114096, 42.77486367449664]
res['iterations_inner'] = [384, 384, 358]
res['iterations_outer'] = [5, 5, 5]
res['accuracy'] = [66.520833333333357, 66.520833333333357, 66.76666666666668]
g(Z) = ||X-DZ||_2^2, h(Z) = ||Z-EX||_2^2, i(Z) = ||Z||_1, j(Z) = tr(Z^TLZ)
Dictionary atoms for dm = cosine_sim
Dictionary atoms for dm = cosine_dim
Dictionary atoms for dm = euclidean
Experiment time: 8369 seconds
/usr/lib/pymodules/python2.7/matplotlib/figure.py:371: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure
  "matplotlib is currently using a non-GUI backend, "