Applying Rocket to Raw Audio

Note: this notebook is extremely messy, the result of rapidly prototyping ROCKET for raw audio without following best practices. It also uses the beta version of fastai v2 for removing silence and preprocessing. If you are interested in experimenting yourself and can't make sense of something here, please reach out to me by PM, or in the Deep Learning with Audio or Time Series threads.

This notebook applies the findings of the recent ROCKET paper by Angus Dempster, François Petitjean, and Geoffrey I. Webb to 1D raw audio signals for the task of voice recognition. Some of this code is also adapted from Ignacio Oguiza and his Time Series module for fastai v1.

Initially the signals were too long and slow to train on at a sample rate of 16,000 (it was going to take ~30-40 minutes for 1s clips). For context, this is a ~3,800-clip, 10-class dataset: a small problem that trains to 99%+ accuracy in 2 minutes using the typical audio pipeline of spectrogram + CNN. To speed things up I added a stride to the kernels, which cut training time without a drop in accuracy, but still only reached 85% accuracy after 4 minutes of training. Removing silence and doubling the clip length to 2s gave great results (95% accuracy in 6s, 98.6% in 20 seconds, 99.2% in 1 min 20 sec).
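The stride isn't part of the original ROCKET implementation, so as a rough illustration, here is a minimal pure-NumPy sketch of a strided kernel application (the rocket module imported below is a numba-based version of this), producing ROCKET's two features per kernel: the proportion of positive values (PPV) and the max.

import numpy as np

def apply_kernel_strided(x, weights, length, bias, dilation, padding, stride):
    # pad the series, then slide the dilated kernel along it `stride` samples at a time
    if padding > 0:
        x = np.pad(x, padding)
    out_len = (len(x) - (length - 1) * dilation - 1) // stride + 1
    ppv, mx = 0, -np.inf
    for i in range(out_len):
        s = bias
        for j in range(length):
            s += weights[j] * x[i * stride + j * dilation]
        mx = max(mx, s)
        if s > 0:
            ppv += 1
    return ppv / out_len, mx  # ROCKET's two features: PPV and max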

Unfortunately, so far I have not been able to scale these results to harder problems, such as a 250-speaker dataset.

Summary of Findings

To spare you having to go through this whole notebook, here's a summary of the interesting results:

  • 95% accuracy in 6s, 98.6% in 20 seconds, 99.2% in 1 min 20 sec on a 10 class problem using raw audio and no augmentation
  • Having a stride of 5-7 seems to be an optimal balance between computational cost and accuracy
  • Bigger filter sizes only help up to about size 7 or 9; beyond that they decrease accuracy while increasing cost
  • A variety of filter sizes (7, 9, 11) doesn't seem to beat any one of those sizes used alone, but more testing is needed on a harder dataset
  • Testing tons of individual kernels on random subsets of the data and then selecting the best ones actually results in lower accuracy than the same number of random kernels (this is likely due to increased correlation between the "good" kernels, making them worse as an ensemble)
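For reference, every experiment below boils down to the same three-step pipeline: generate random kernels, transform the train/valid sets with them, and fit a ridge classifier on the resulting features. A condensed sketch (np_x_train etc. are prepared further down; the rocket module here is a locally modified version that takes a stride argument):

import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from rocket import generate_kernels, apply_kernels

# 1,000 random kernels of length 7/9/11, stride 5, over 2s clips at 16kHz
kernels = generate_kernels(32000, 1000, np.array((7, 9, 11)), 5)
x_train_tfm = apply_kernels(np_x_train, kernels)  # (n_samples, 2 * num_kernels) features
x_valid_tfm = apply_kernels(np_x_valid, kernels)
classifier = RidgeClassifierCV(alphas=np.logspace(-3, 3, 7), normalize=True)
classifier.fit(x_train_tfm, np_y_train)
classifier.score(x_valid_tfm, np_y_valid)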
In [ ]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
In [ ]:
from local.torch_basics import *
from local.test import *
from local.basics import *
from local.data.all import *
from local.vision.core import *
from local.notebook.showdoc import show_doc
from local.audio.core import *
from local.audio.augment import *
from local.vision.learner import *
from local.vision.models.xresnet import *
from local.metrics import *
from local.callback.schedule import *
import torchaudio
from fastprogress import progress_bar as pb
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifierCV
from rocket import generate_kernels, apply_kernel, apply_kernels
In [ ]:
p10speakers = Config()['data_path'] / 'ST-AEDS-20180100_1-OS'
untar_data(URLs.SPEAKERS10, fname=str(p10speakers)+'.tar', dest=p10speakers)
Out[ ]:
PosixPath('/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/ST-AEDS-20180100_1-OS')
In [ ]:
get_audio = AudioGetter("", recurse=True, folders=None)
files_10 = get_audio(p10speakers)
In [ ]:
files_10
Out[ ]:
(#3842) [/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0004_us_f0004_00446.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/m0002_us_m0002_00128.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0003_us_f0003_00279.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0001_us_f0001_00168.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0005_us_f0005_00286.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/m0005_us_m0005_00282.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0005_us_f0005_00432.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0005_us_f0005_00054.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/m0004_us_m0004_00110.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/m0003_us_m0003_00180.wav...]
In [ ]:
audio_opener = OpenAudio(files_10)
p10_labeler = lambda x: str(x).split('/')[-1][:5] # speaker id = first 5 chars of the filename, e.g. 'f0004'
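A quick check against a filename from the listing above:

assert p10_labeler('/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0004_us_f0004_00446.wav') == 'f0004'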
In [ ]:
CLIP_LENGTH = 2000 # in milliseconds; 2s at 16kHz -> 32,000 samples per clip
In [ ]:
sigs, labels = [],[]
cropper = CropSignal(CLIP_LENGTH, pad_mode='repeat')
remove_silence = RemoveSilence()
for i in pb(range(len(files_10))):
    sigs.append(cropper(remove_silence(audio_opener(i))).sig)
    labels.append(p10_labeler(files_10[i]))
100.00% [3842/3842 00:12<00:00]
In [ ]:
len(sigs), len(labels)
Out[ ]:
(3842, 3842)
In [ ]:
total_size = len(sigs)
train_size = int(total_size*.8)
train_idxs = torch.randperm(total_size)[:train_size]
train_idx_set = set(train_idxs.tolist()) # set for O(1) membership; `i in tensor` scans the whole tensor
valid_idxs = [i for i in range(total_size) if i not in train_idx_set]
In [ ]:
assert len(train_idxs) + len(valid_idxs) == len(sigs)
In [ ]:
x_train = [sigs[idx].squeeze(0).numpy() for idx in train_idxs]
y_train = [labels[idx] for idx in train_idxs]
x_valid = [sigs[idx].squeeze(0).numpy() for idx in valid_idxs]
y_valid = [labels[idx] for idx in valid_idxs]
In [ ]:
list(map(len, (x_train, y_train, x_valid, y_valid)))
Out[ ]:
[3073, 3073, 769, 769]
In [ ]:
np_x_train = np.stack(x_train).astype(np.float64)
np_x_valid = np.stack(x_valid).astype(np.float64)
np_x_train.shape, np_x_valid.shape
Out[ ]:
((3073, 32000), (769, 32000))
In [ ]:
o2i_f = lambda x: 5*(x[0]=='m') + int(x[-1]) - 1
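This maps the five female speakers f0001-f0005 to classes 0-4 and the five male speakers m0001-m0005 to classes 5-9:

assert o2i_f('f0001') == 0 and o2i_f('f0005') == 4
assert o2i_f('m0001') == 5 and o2i_f('m0005') == 9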
In [ ]:
np_y_train = np.array(list(map(o2i_f, y_train)))
np_y_valid = np.array(list(map(o2i_f, y_valid)))
In [ ]:
np_y_train
Out[ ]:
array([0, 7, 2, ..., 0, 7, 1])
In [ ]:
np_x_train.shape, np_y_train.shape, np_x_valid.shape, np_y_valid.shape
Out[ ]:
((3073, 32000), (3073,), (769, 32000), (769,))
In [ ]:
np_x_train.mean()
Out[ ]:
-4.49039649777175e-05

Normalize the training data

In [ ]:
np_x_train = (np_x_train - np_x_train.mean(axis = 1, keepdims = True)) / (np_x_train.std(axis = 1, keepdims = True) + 1e-8)
np_x_valid = (np_x_valid - np_x_valid.mean(axis = 1, keepdims = True)) / (np_x_valid.std(axis = 1, keepdims = True) + 1e-8)
In [ ]:
np_x_train.mean(), np_x_train.std()
Out[ ]:
(-8.10809639770585e-20, 0.9999995545301024)
In [ ]:
np_x_train.dtype
Out[ ]:
dtype('float64')

Start Here

In [ ]:
def timing_test(runs, candidate_lengths, stride, num_kernels, seq_length, show_progress=True):
    "Generate random kernels, transform train/valid data, fit a ridge classifier, and time it."
    times, scores = [], []
    for i in range(runs):
        kernels = generate_kernels(seq_length, num_kernels, candidate_lengths, stride)
        start = time.time()
        x_train_tfm = apply_kernels(np_x_train, kernels)
        x_valid_tfm = apply_kernels(np_x_valid, kernels)
        # note: RidgeClassifierCV's `normalize` argument has since been removed from
        # scikit-learn (1.2+); newer versions need the features scaled separately
        classifier = RidgeClassifierCV(alphas=np.logspace(-3, 3, 7), normalize=True)
        classifier.fit(x_train_tfm, np_y_train)
        score = classifier.score(x_valid_tfm, np_y_valid)
        t = time.time() - start
        scores.append(score)
        times.append(t)
        if show_progress: print("Finished Run", i+1, "Score:", round(score, 3), "Time:", round(t, 3))
    return times, scores

Initial attempts

In [ ]:
timing_test(5, np.array((7,9,11)), stride=5, num_kernels=200, seq_length=16000)
| Kernel Sizes | Stride | Accuracy | Time |
|---|---|---|---|
| {7,9,11} | 7 | .85 | 4:02 |
| {7,9,11} | 5 | .899 | 5:20 |
| {7,9,11} | 3 | .903 | 8:15 |
| {800,1000,1200} | 400 | .46 | 3:43 |

Silence removed, 10,000 kernels

| Kernel Sizes | Stride | Accuracy | Time |
|---|---|---|---|
| {7,9,11} | 5 | .979 | 5:08 |
| {7,9,11} | 5 | .976 | 5:10 |
| {7,9,11} | 5 | .980 | 5:26 |

Silence removed, 2,000 kernels

| Kernel Sizes | Stride | Accuracy | Time |
|---|---|---|---|
| {7,9,11} | 5 | .972 | 1:10 |
| {7,9,11} | 5 | .979 | 1:03 |
| {7,9,11} | 5 | .976 | 1:01 |

Silence removed, 1,000 kernels

| Kernel Sizes | Stride | Accuracy | Time |
|---|---|---|---|
| {7,9,11} | 5 | .974 | 0:31 |
| {7,9,11} | 5 | .974 | 0:31 |
| {7,9,11} | 5 | .966 | 0:31 |

Silence removed, 200 kernels

| Kernel Sizes | Stride | Accuracy | Time |
|---|---|---|---|
| {7,9,11} | 5 | .949 | 0:06 |
| {7,9,11} | 5 | .950 | 0:06 |
| {7,9,11} | 5 | .934 | 0:06 |

1s Audio, silence removed, testing strides

Silence removed, 200 kernels

| Kernel Sizes | Stride | Accuracy | Time |
|---|---|---|---|
| {7,9,11} | 1 | .942 | 0:30 |
| {7,9,11} | 1 | .954 | 0:28 |
| {7,9,11} | 1 | .950 | 0:28 |

Silence removed, 200 kernels

| Kernel Sizes | Stride | Accuracy | Time |
|---|---|---|---|
| {7,9,11} | 3 | .960 | 0:10 |
| {7,9,11} | 3 | .948 | 0:10 |
| {7,9,11} | 3 | .954 | 0:10 |

Up clip length to 2 seconds

Silence removed, 200 kernels

| Kernel Sizes | Stride | Accuracy | Time |
|---|---|---|---|
| {7,9,11} | 5 | .987 | 0:15 |
| {7,9,11} | 5 | .980 | 0:13 |
| {7,9,11} | 5 | .986 | 0:13 |

Silence removed, 200 kernels

| Kernel Sizes | Stride | Accuracy | Time |
|---|---|---|---|
| {7,9,11} | 3 | .986 | 0:20 |
| {7,9,11} | 3 | .986 | 0:20 |
| {7,9,11} | 3 | .990 | 0:20 |

Silence removed, 1,000 kernels

| Kernel Sizes | Stride | Accuracy | Time |
|---|---|---|---|
| {7,9,11} | 3 | .992 | 1:40 |
| {7,9,11} | 3 | .992 | 1:40 |
| {7,9,11} | 3 | .992 | 1:40 |

Check the impact of the kernel size options: do we really need to choose from 3 sizes?

In [ ]:
times_ks2, scores_ks2 = timing_test(10, np.array((2,)), stride=1, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ks3, scores_ks3 = timing_test(10, np.array((3,)), stride=1, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ks5, scores_ks5 = timing_test(10, np.array((5,)), stride=1, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ks7, scores_ks7 = timing_test(10, np.array((7,)), stride=1, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ks9, scores_ks9 = timing_test(10, np.array((9,)), stride=1, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ks11, scores_ks11 = timing_test(10, np.array((11,)), stride=1, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ksorig, scores_ksorig = timing_test(10, np.array((7,9,11,)), stride=1, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
def mn(x): return round(sum(x)/len(x), 3)
In [ ]:
all_scores = [scores_ks2, scores_ks3, scores_ks5, scores_ks7, scores_ks9, scores_ks11, scores_ksorig,]
all_times = [times_ks2, times_ks3, times_ks5, times_ks7, times_ks9, times_ks11, times_ksorig,]
In [ ]:
mean_times = list(map(mn, all_times))
mean_scores = list(map(mn, all_scores))
In [ ]:
plt.plot([2,3,5,7,9,11], mean_scores[:6])
Out[ ]:
[<matplotlib.lines.Line2D at 0x7fad81118310>]
In [ ]:
plt.plot([2,3,5,7,9,11], mean_times[:6])
Out[ ]:
[<matplotlib.lines.Line2D at 0x7fad819cd1d0>]
In [ ]:
mean_scores
Out[ ]:
[0.924, 0.956, 0.967, 0.968, 0.969, 0.962, 0.967]

Conclusion: it appears sizes 7, 9, and 11 all do well; it's hard to tell whether the variety helps without testing a larger number of kernels

Let's rerun the last experiment, this time with longer strides

In [ ]:
times_ks2, scores_ks2 = timing_test(10, np.array((2,)), stride=7, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ks3, scores_ks3 = timing_test(10, np.array((3,)), stride=7, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ks5, scores_ks5 = timing_test(10, np.array((5,)), stride=7, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ks7, scores_ks7 = timing_test(10, np.array((7,)), stride=7, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ks9, scores_ks9 = timing_test(10, np.array((9,)), stride=7, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ks11, scores_ks11 = timing_test(10, np.array((11,)), stride=7, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
times_ksorig, scores_ksorig = timing_test(10, np.array((7,9,11,)), stride=7, num_kernels=100, seq_length=32000, show_progress=False)
In [ ]:
all_scores_s7 = [scores_ks2, scores_ks3, scores_ks5, scores_ks7, scores_ks9, scores_ks11, scores_ksorig,]
all_times_s7 = [times_ks2, times_ks3, times_ks5, times_ks7, times_ks9, times_ks11, times_ksorig,]
In [ ]:
mean_times_s7 = list(map(mn, all_times_s7))
mean_scores_s7 = list(map(mn, all_scores_s7))
In [ ]:
mean_scores_s7
Out[ ]:
[0.905, 0.937, 0.949, 0.95, 0.958, 0.951, 0.957]
In [ ]:
plt.plot([2,3,5,7,9,11], mean_scores_s7[:6])
Out[ ]:
[<matplotlib.lines.Line2D at 0x7fad80fe3c90>]
In [ ]:
plt.plot([2,3,5,7,9,11], mean_times_s7[:6])
Out[ ]:
[<matplotlib.lines.Line2D at 0x7fad8a14bed0>]

Test 100 kernels at each of a range of larger kernel sizes

In [ ]:
scores_ks, times_ks = [],[]
for kernel_size in range(7,100,4):
    times, scores = timing_test(1, np.array((kernel_size,)), stride=7, num_kernels=100, seq_length=32000, show_progress=False)
    print(f"Kernel Size {kernel_size}: Score: {mn(scores)} Time: {mn(times)}s")
    scores_ks.append(mn(scores))
    times_ks.append(mn(times))
Kernel Size 7: Score: 0.952 Time: 5.322s
Kernel Size 11: Score: 0.962 Time: 6.52s
Kernel Size 15: Score: 0.956 Time: 7.862s
Kernel Size 19: Score: 0.965 Time: 8.23s
Kernel Size 23: Score: 0.943 Time: 9.699s
Kernel Size 27: Score: 0.953 Time: 11.522s
Kernel Size 31: Score: 0.941 Time: 12.626s
Kernel Size 35: Score: 0.948 Time: 12.208s
Kernel Size 39: Score: 0.947 Time: 11.937s
Kernel Size 43: Score: 0.931 Time: 12.709s
Kernel Size 47: Score: 0.936 Time: 14.581s
Kernel Size 51: Score: 0.926 Time: 15.842s
Kernel Size 55: Score: 0.932 Time: 17.233s
Kernel Size 59: Score: 0.926 Time: 18.059s
Kernel Size 63: Score: 0.932 Time: 20.527s
Kernel Size 67: Score: 0.925 Time: 20.773s
Kernel Size 71: Score: 0.935 Time: 18.156s
Kernel Size 75: Score: 0.915 Time: 19.595s
Kernel Size 79: Score: 0.912 Time: 21.77s
Kernel Size 83: Score: 0.926 Time: 23.429s
Kernel Size 87: Score: 0.923 Time: 25.025s
Kernel Size 91: Score: 0.923 Time: 26.44s
Kernel Size 95: Score: 0.902 Time: 26.552s
Kernel Size 99: Score: 0.928 Time: 29.478s
In [ ]:
plt.xlabel("Kernel Size")
plt.ylabel("Accuracy")
plt.plot(np.arange(7,100,4), scores_ks);
In [ ]:
plt.xlabel("Kernel Size")
plt.ylabel("Time for 100 kernels")
plt.plot(np.arange(7,100,4), times_ks);

A size-3 kernel is roughly twice as fast as a size-9 one, which means we can run twice as many kernels in the same time, but the results are still worse (note the cell below actually tests size 2)

In [ ]:
times_ks3, scores_ks3 = timing_test(10, np.array((2,)), stride=7, num_kernels=200, seq_length=32000, show_progress=False)
In [ ]:
mn(scores_ks3)
Out[ ]:
0.935
In [ ]:
times_7, scores_7 = timing_test(1, np.array((7,)), stride=1, num_kernels=2500, seq_length=32000, show_progress=False)
In [ ]:
times_9, scores_9 = timing_test(1, np.array((9,)), stride=1, num_kernels=2500, seq_length=32000, show_progress=False)
In [ ]:
times_11, scores_11 = timing_test(1, np.array((11,)), stride=1, num_kernels=2500, seq_length=32000, show_progress=False)
In [ ]:
times_orig, scores_orig = timing_test(1, np.array((7,9,11,)), stride=1, num_kernels=2500, seq_length=32000, show_progress=False)
In [ ]:
mn(scores_7), mn(scores_9), mn(scores_11), mn(scores_orig)
Out[ ]:
(0.996, 0.996, 1.0, 0.995)

Conclusion: it doesn't appear to help, but this needs to be retested on a tougher dataset

In [ ]:
times_7, times_9, times_11, times_orig
Out[ ]:
([600.959691286087],
 [723.5468919277191],
 [877.7471699714661],
 [831.6202943325043])

What is the predictive power of a single kernel?

In [ ]:
times, scores = timing_test(20, np.array((7,9,11)), stride=5, num_kernels=1, seq_length=32000)
Finished Run 1 Score: 0.243 Time: 0.101
Finished Run 2 Score: 0.212 Time: 0.141
Finished Run 3 Score: 0.226 Time: 0.161
Finished Run 4 Score: 0.224 Time: 0.194
Finished Run 5 Score: 0.211 Time: 0.107
Finished Run 6 Score: 0.182 Time: 0.122
Finished Run 7 Score: 0.257 Time: 0.189
Finished Run 8 Score: 0.28 Time: 0.167
Finished Run 9 Score: 0.19 Time: 0.146
Finished Run 10 Score: 0.164 Time: 0.122
Finished Run 11 Score: 0.186 Time: 0.172
Finished Run 12 Score: 0.289 Time: 0.225
Finished Run 13 Score: 0.251 Time: 0.226
Finished Run 14 Score: 0.139 Time: 0.057
Finished Run 15 Score: 0.274 Time: 0.157
Finished Run 16 Score: 0.211 Time: 0.157
Finished Run 17 Score: 0.213 Time: 0.21
Finished Run 18 Score: 0.231 Time: 0.197
Finished Run 19 Score: 0.226 Time: 0.108
Finished Run 20 Score: 0.226 Time: 0.233
In [ ]:
sum(scores)/len(scores)
Out[ ]:
0.22178153446033813

Answer: about 22% on a 10-class problem (where chance is 10%)

Can we apply kernels to a small subset of our data and still learn which ones will be predictive? If so, we could generate 100x as many random kernels for a large dataset, select the best 1%, and use those on the full data.

In [ ]:
def get_good_kernels(runs, candidate_lengths, stride, num_kernels, seq_length, thresh, subset_size=300, show_progress=True):
    "Score kernels on a random training subset, keeping those that beat `thresh` on the validation set."
    good_kernels, scores = [], []
    for i in range(runs):
        kernels = generate_kernels(seq_length, num_kernels, candidate_lengths, stride)
        # evaluate on a small random subset of the training data to keep each run cheap
        idxs = torch.randperm(len(np_x_train))[:subset_size]
        np_x_train_subset = np_x_train[idxs]
        np_y_train_subset = np_y_train[idxs]
        x_train_tfm = apply_kernels(np_x_train_subset, kernels)
        x_valid_tfm = apply_kernels(np_x_valid, kernels)
        classifier = RidgeClassifierCV(alphas=np.logspace(-3, 3, 7), normalize=True)
        classifier.fit(x_train_tfm, np_y_train_subset)
        score = classifier.score(x_valid_tfm, np_y_valid)
        if score > thresh:
            good_kernels.append(kernels)
        scores.append(score)
        if show_progress: print("Finished Run", i+1, "Score:", round(score, 3))
    return good_kernels, scores
In [ ]:
k, s = get_good_kernels(2000, np.array((7,)), stride=5, num_kernels=1,seq_length=32000, thresh=0.275, show_progress=False)
In [ ]:
def merge_kernels(k):
    "Stack a list of single-kernel tuples from generate_kernels back into one multi-kernel tuple."
    num_kernels = len(k)
    strides = np.zeros(num_kernels, dtype=np.int32)
    weights = np.zeros((num_kernels, 7))  # assumes every kernel has length 7
    lengths = np.zeros(num_kernels, dtype=np.int32)
    biases = np.zeros(num_kernels)
    dilations = np.zeros(num_kernels, dtype=np.int32)
    paddings = np.zeros(num_kernels, dtype=np.int32)
    for i in range(num_kernels):
        # each element of k is a (weights, lengths, biases, dilations, paddings, strides) tuple
        weights[i], lengths[i], biases[i], dilations[i], paddings[i], strides[i] = k[i]
    return weights, lengths, biases, dilations, paddings, strides
In [ ]:
len(k)
Out[ ]:
165
In [ ]:
kernels = merge_kernels(k)
In [ ]:
times, scores = timing_test(10, np.array((7,)), stride=5, num_kernels=168, seq_length=32000, show_progress=True)
Finished Run 1 Score: 0.974 Time: 9.498
Finished Run 2 Score: 0.978 Time: 9.682
Finished Run 3 Score: 0.974 Time: 10.002
Finished Run 4 Score: 0.979 Time: 9.684
Finished Run 5 Score: 0.98 Time: 9.885
Finished Run 6 Score: 0.979 Time: 9.818
Finished Run 7 Score: 0.979 Time: 9.717
Finished Run 8 Score: 0.978 Time: 9.658
Finished Run 9 Score: 0.977 Time: 9.609
Finished Run 10 Score: 0.978 Time: 9.866
In [ ]:
def score_kernels(k):
    x_train_tfm = apply_kernels(np_x_train, k)
    x_valid_tfm = apply_kernels(np_x_valid, k)
    classifier = RidgeClassifierCV(alphas=np.logspace(-3, 3, 7), normalize=True)
    classifier.fit(x_train_tfm, np_y_train)
    score = classifier.score(x_valid_tfm, np_y_valid)
    return(score)
In [ ]:
score_kernels(kernels)
Out[ ]:
0.9687906371911573

Result: the 165 "high accuracy" kernels selected from 2,000 random ones score 0.969, actually worse than just 168 purely random kernels (~0.978 above)! The correlation between the high-accuracy kernels must be higher than among random ones: they pick out the same features more frequently, which makes them weaker as an ensemble.
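If I revisit this, one way to test the correlation hypothesis would be to compare the mean absolute pairwise correlation of the features each kernel set produces. A sketch, assuming the reference layout where apply_kernels emits a (PPV, max) pair per kernel, so every other column is a PPV feature:

def mean_abs_corr(kernel_set):
    feats = apply_kernels(np_x_train, kernel_set)[:, ::2]  # PPV features only
    c = np.corrcoef(feats, rowvar=False)                   # correlation across kernels
    return np.abs(c[np.triu_indices_from(c, k=1)]).mean()

random_kernels = generate_kernels(32000, len(k), np.array((7,)), 5)
mean_abs_corr(kernels), mean_abs_corr(random_kernels)  # selected vs. random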

In [ ]:
min(s), max(s), mn(s)
Out[ ]:
(0.4967490247074122, 0.5357607282184655, 0.52)