(Largely based on rbracco's tutorial, big thanks to him for his work on getting this going for us!)
fastai's audio module has been in development for a while by active forum members. While it is possible to train on raw audio (we simply pass in a 1D tensor of the signal), the current approach is to convert the audio into what is called a spectrogram and train on that.
Installing the fastai_audio library: We'll be installing from their git repository, similar to how we did for the dev version of fastai:
!pip install git+https://github.com/fastaudio/fastaudio.git
Successfully installed IPython-7.16.1 appdirs-1.4.4 colorednoise-1.1.1 fastai-2.0.15 fastaudio-0.0.post0.dev126+g66b6230 fastcore-1.0.8 librosa-0.8.0 pooch-1.2.0 prompt-toolkit-3.0.7 soundfile-0.10.3.post1 torchaudio-0.6.0
We'll also need torchaudio, which the install above pulled in as a dependency.
The dataset is essentially the audio version of MNIST: it contains 2,000 recordings from 10 speakers saying each digit 5 times. First, we'll grab the data and use a custom extract function:
from fastai.vision.all import *
from fastaudio.core.all import *
from fastaudio.augment.all import *
tar_extract_at_filename simply extracts the archive at the file name (as the name suggests):
path_dig = untar_data(URLs.SPEAKERS10, extract_func=tar_extract_at_filename)
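As a rough sketch (not fastai's actual source), an extract function of this shape plausibly unpacks the archive into a folder named after the file:

```python
import tarfile
from pathlib import Path

# Hypothetical sketch of a tar_extract_at_filename-style helper: unpack the
# downloaded archive into a directory named after the archive file itself.
# This is an illustration, not fastai's actual implementation.
def tar_extract_at_filename(fname, dest):
    dest = Path(dest) / Path(fname).name.split('.')[0]
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fname, 'r:*') as tar:
        tar.extractall(dest)
```
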
Now we want to grab just the audio files.
audio_extensions[:5]
('.aif', '.aifc', '.aiff', '.au', '.m3u')
fnames = get_files(path_dig, extensions=audio_extensions)
fnames[:5]
(#5) [Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/f0003_us_f0003_00136.wav'),Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/m0001_us_m0001_00326.wav'),Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/f0002_us_f0002_00307.wav'),Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/f0001_us_f0001_00411.wav'),Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/m0005_us_m0005_00155.wav')]
We can convert any audio file to a tensor with AudioTensor. Let's try opening a file:
at = AudioTensor.create(fnames[0])
at, at.shape
(AudioTensor([[ 0.0000, 0.0000, 0.0000, ..., -0.0002, -0.0002, -0.0003]]), torch.Size([1, 58240]))
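The tensor is shaped (channels, samples), so dividing the sample count by the sample rate (assumed here to be 16 kHz, matching the Voice config we use later) recovers the clip's duration:

```python
# AudioTensor shape is (channels, samples); dividing samples by the sample
# rate (assumed 16 kHz, per the Voice config used below) gives the duration.
n_channels, n_samples = 1, 58240
sample_rate = 16_000
duration_s = n_samples / sample_rate
print(round(duration_s, 2))  # → 3.64
```

So this clip is roughly three and a half seconds long.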
at.show()
<matplotlib.axes._subplots.AxesSubplot at 0x7f91f3cd1ac8>
fastai_audio has an AudioConfig class which allows us to prepare different settings for our dataset. It currently ships with a few preset configurations.
We'll be using the Voice module today, as this dataset just contains human voices.
cfg = AudioConfig.Voice()
Our configuration will limit options like the frequency range and the sampling rate:
cfg.f_max, cfg.sample_rate
(8000.0, 16000)
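Notice that f_max is exactly half the sample rate: that is the Nyquist frequency, the highest frequency a signal sampled at a given rate can represent, so there's nothing useful above it.

```python
# The highest frequency a sampled signal can represent is half its sample
# rate (the Nyquist frequency), which is exactly where f_max sits here.
sample_rate = 16_000
nyquist = sample_rate / 2
print(nyquist)  # → 8000.0
```
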
We can then make a transform from this configuration to turn raw audio into a workable spectrogram per our settings:
aud2spec = AudioToSpec.from_cfg(cfg)
For our example, we'll crop the original audio file down to 1000 ms:
crop1s = ResizeSignal(1000)
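ResizeSignal takes its argument in milliseconds; at a 16 kHz sample rate (an assumption based on the config above), that works out as:

```python
# ResizeSignal is specified in milliseconds; converting to samples at an
# assumed 16 kHz rate shows how much raw signal a 1000 ms crop keeps.
duration_ms, sample_rate = 1000, 16_000
n_samples = sample_rate * duration_ms // 1000
print(n_samples)  # → 16000
```
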
Let's build a Pipeline to mirror how we'd expect our data to come in:
pipe = Pipeline([AudioTensor.create, crop1s, aud2spec])
And try visualizing what our newly made data becomes.
First, we'll remove that cropping:
pipe = Pipeline([AudioTensor.create, aud2spec])
for fn in fnames[:3]:
audio = AudioTensor.create(fn)
audio.show()
pipe(fn).show()
You can see that they're not all the same size here. Let's add that cropping back in:
pipe = Pipeline([AudioTensor.create, crop1s, aud2spec])
for fn in fnames[:3]:
audio = AudioTensor.create(fn)
audio.show()
pipe(fn).show()
And now everything is 128x63.
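Where do those dimensions come from? The 128 rows are the mel bands (n_mels), and the columns are STFT time frames; with torchaudio-style centered framing, the frame count is roughly 1 + n_samples // hop_length. The concrete numbers below are illustrative, not read from our config:

```python
# With centered STFT framing (torchaudio's default), a clip of n_samples
# produces about 1 + n_samples // hop_length time frames.
def n_frames(n_samples, hop_length):
    return 1 + n_samples // hop_length

# Illustrative only: 8000 samples with hop_length=128 gives 63 frames.
print(n_frames(8000, 128))  # → 63
```
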
Using the DataBlock API: rather than a manual Pipeline, we'll give the DataBlock our item transforms and a getter for the labels.
For our transforms, we'll want the same ones we used before:
item_tfms = [ResizeSignal(1000), aud2spec]
Our filenames are labelled by the number followed by the name of the individual:
4_theo_37.wav
2_nicolas_7.wav
get_y = lambda x: x.name[0]
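This getter simply grabs the first character of the file's name, which is the spoken digit:

```python
from pathlib import Path

# The label is the first character of the file name (the spoken digit).
get_y = lambda x: x.name[0]
print(get_y(Path('4_theo_37.wav')))  # → 4
```
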
aud_digit = DataBlock(blocks=(AudioBlock, CategoryBlock),
get_items=get_audio_files,
splitter=RandomSplitter(),
item_tfms = item_tfms,
get_y=get_y)
And now we can build our DataLoaders
dls = aud_digit.dataloaders(path_dig, bs=64)
Let's look at a batch
dls.show_batch(max_n=3)
Now that we have our DataLoaders, we need to make a model. We'll make a function that changes a Learner's first layer to accept a 1-channel input (similar to what we did for the Bengali.AI model):
def alter_learner(learn, n_channels=1):
    "Adjust a `Learner`'s model to accept `n_channels` inputs"
    layer = learn.model[0][0]
    layer.in_channels = n_channels
    # Keep the weights of a single input channel and re-wrap as a Parameter
    layer.weight = nn.Parameter(layer.weight[:,1,:,:].unsqueeze(1))
    learn.model[0][0] = layer
learn = Learner(dls, xresnet18(), CrossEntropyLossFlat(), metrics=accuracy)
Now we need to grab our number of channels:
n_c = dls.one_batch()[0].shape[1]; n_c
1
alter_learner(learn, n_c)
Now we can find our learning rate and fit!
learn.lr_find()
SuggestedLRs(lr_min=0.03630780577659607, lr_steep=0.0020892962347716093)
learn.fit_one_cycle(5, 1e-2)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.052515 | 0.323332 | 0.881510 | 00:11 |
1 | 0.401659 | 0.156771 | 0.924479 | 00:11 |
2 | 0.210763 | 0.696115 | 0.798177 | 00:11 |
3 | 0.119320 | 0.239732 | 0.923177 | 00:11 |
4 | 0.063962 | 0.029512 | 0.993490 | 00:11 |
learn.fit_one_cycle(5, 1e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.030099 | 0.034156 | 0.986979 | 00:11 |
1 | 0.031116 | 0.021291 | 0.996094 | 00:12 |
2 | 0.032417 | 0.017661 | 0.997396 | 00:11 |
3 | 0.025745 | 0.017490 | 0.994792 | 00:11 |
4 | 0.022410 | 0.016554 | 0.997396 | 00:11 |
Not bad for zero data augmentation! But let's see if augmentation can help us out here!
We can use the SpectrogramTransformer class to prepare some transforms for us:
DBMelSpec = SpectrogramTransformer(mel=True, to_db=True)
Let's take a look at our original settings:
aud2spec.settings
{'mel': 'True', 'to_db': 'False', 'sample_rate': 16000, 'n_fft': 1024, 'win_length': 1024, 'hop_length': 128, 'f_min': 50.0, 'f_max': 8000.0, 'pad': 0, 'n_mels': 128, 'window_fn': <function _VariableFunctionsClass.hann_window>, 'power': 2.0, 'normalized': False, 'wkwargs': None, 'stype': 'power', 'top_db': None, 'sr': 16000, 'nchannels': 1}
And we'll narrow this down a bit
aud2spec = DBMelSpec(n_mels=128, f_max=10000, n_fft=1024, hop_length=128, top_db=100)
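The to_db step converts the power spectrogram to decibels, and top_db clamps the dynamic range. Here is a rough sketch of the usual librosa/torchaudio behaviour (fastaudio delegates to those libraries, so this is only an illustration of the idea, not its exact code):

```python
import math

# Sketch of a power-to-decibel conversion with a top_db floor, in the style
# of librosa/torchaudio. Values are 10 * log10(power), and nothing is allowed
# to sit more than top_db below the loudest value.
def power_to_db(powers, top_db=100):
    db = [10 * math.log10(max(p, 1e-10)) for p in powers]
    floor = max(db) - top_db
    return [max(d, floor) for d in db]

print(power_to_db([1.0, 0.1, 1e-12]))
```
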
For our transforms, we'll use:

  * RemoveSilence, which trims out silence with a bit of padding, pad_ms (default is 20)
  * CropSignal, which crops the signal to a duration and adds padding if needed
  * aud2spec, our SpectrogramTransformer with its parameters
  * MaskTime and MaskFreq, which apply masks via einsum operations
Let's look a bit more at the padding CropSignal uses. There are three different types:

  * AudioPadType.Zeros: the default, random zeros before and after
  * AudioPadType.Repeat: repeat the signal until it reaches the proper length (great for acoustic scene classification and voice recognition, terrible for speech recognition)
  * AudioPadType.ZerosAfter: the default for many other libraries; just pad with zeros until you reach the specified length

Now let's rebuild our DataBlock:
item_tfms = [RemoveSilence(), ResizeSignal(1000), aud2spec, MaskTime(size=4), MaskFreq(size=10)]
aud_digit = DataBlock(blocks=(AudioBlock, CategoryBlock),
get_items=get_audio_files,
splitter=RandomSplitter(),
item_tfms = item_tfms,
get_y=get_y)
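The three padding behaviours described above can be sketched in plain Python (a simplified illustration on a list of samples, not fastaudio's tensor implementation):

```python
import random

# Simplified sketches of the three AudioPadType behaviours.
def pad_zeros(sig, length):          # random zeros before and after
    extra = length - len(sig)
    before = random.randint(0, extra)
    return [0] * before + sig + [0] * (extra - before)

def pad_repeat(sig, length):         # tile the signal until long enough
    out = sig * (length // len(sig) + 1)
    return out[:length]

def pad_zeros_after(sig, length):    # zeros only at the end
    return sig + [0] * (length - len(sig))

print(pad_repeat([1, 2, 3], 7))       # → [1, 2, 3, 1, 2, 3, 1]
print(pad_zeros_after([1, 2, 3], 5))  # → [1, 2, 3, 0, 0]
```

You can see why Repeat suits speaker recognition (the voice keeps going) but would garble speech recognition, where word order matters.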
dls = aud_digit.dataloaders(path_dig, bs=128)
Let's look at some augmented data:
dls.show_batch(max_n=3)
Let's try training again. Also, since we keep having to make the same adjustment to our model, let's make an audio_learner function similar to cnn_learner:
def audio_learner(dls, arch, loss_func, metrics):
"Prepares a `Learner` for audio processing"
learn = Learner(dls, arch, loss_func, metrics=metrics)
n_c = dls.one_batch()[0].shape[1]
if n_c == 1: alter_learner(learn)
return learn
learn = audio_learner(dls, xresnet18(), CrossEntropyLossFlat(), accuracy)
learn.lr_find()
SuggestedLRs(lr_min=0.04365158379077912, lr_steep=0.002511886414140463)
learn.fit_one_cycle(10, 3e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 4.650460 | 1.274344 | 0.776042 | 00:17 |
1 | 1.882244 | 0.165905 | 0.928385 | 00:17 |
2 | 0.958342 | 0.023563 | 0.993490 | 00:17 |
3 | 0.537469 | 0.020922 | 0.994792 | 00:17 |
4 | 0.317845 | 0.012210 | 0.996094 | 00:17 |
5 | 0.193368 | 0.030548 | 0.990885 | 00:17 |
6 | 0.118598 | 0.014498 | 0.996094 | 00:17 |
7 | 0.074519 | 0.002216 | 1.000000 | 00:17 |
8 | 0.047171 | 0.003482 | 1.000000 | 00:17 |
9 | 0.030317 | 0.002047 | 1.000000 | 00:17 |
learn.fit_one_cycle(10, 3e-4)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.003665 | 0.003438 | 1.000000 | 00:17 |
1 | 0.004991 | 0.003278 | 1.000000 | 00:17 |
2 | 0.005746 | 0.017678 | 0.997396 | 00:17 |
3 | 0.005073 | 0.002702 | 0.998698 | 00:17 |
4 | 0.003955 | 0.002549 | 1.000000 | 00:17 |
5 | 0.003551 | 0.001628 | 1.000000 | 00:17 |
6 | 0.003751 | 0.001688 | 1.000000 | 00:17 |
7 | 0.003312 | 0.003636 | 0.998698 | 00:17 |
8 | 0.003430 | 0.003046 | 1.000000 | 00:17 |
9 | 0.003002 | 0.002288 | 1.000000 | 00:17 |
With the help of some of our data augmentation, we were able to score a bit higher!
Now let's look at the MFCC option we mentioned earlier. MFCCs are a "linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency" (Wikipedia). But what does that mean?
Let's try it out!
aud2mfcc = AudioToMFCC(n_mfcc=40, melkwargs={'n_fft':2048, 'hop_length':256,
'n_mels':128})
item_tfms = [ResizeSignal(1000), aud2mfcc]
There's a shortcut for replacing the item transforms in a DataBlock:
aud_digit.item_tfms
(#8) [ToTensor: encodes: (PILMask,object) -> encodes (PILBase,object) -> encodes decodes: ,Resample: encodes: (AudioTensor,object) -> encodes decodes: ,DownmixMono: encodes: (AudioTensor,object) -> encodes decodes: ,RemoveSilence: encodes: (AudioTensor,object) -> encodes decodes: ,ResizeSignal: encodes: (AudioTensor,object) -> encodes decodes: ,AudioToSpec: encodes: (AudioTensor,object) -> encodes decodes: ,MaskTime: encodes: (AudioSpectrogram,object) -> encodes decodes: ,MaskFreq: encodes: (AudioSpectrogram,object) -> encodes decodes: ]
aud_digit.item_tfms = item_tfms
dls = aud_digit.dataloaders(path_dig, bs=128)
dls.show_batch(max_n=3)
Now let's build our learner and train again!
learn = audio_learner(dls, xresnet18(), CrossEntropyLossFlat(), accuracy)
learn.lr_find()
SuggestedLRs(lr_min=0.07585775852203369, lr_steep=0.0020892962347716093)
learn.fit_one_cycle(5, 1e-2)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.969720 | 1.324741 | 0.812500 | 00:09 |
1 | 0.805437 | 0.774530 | 0.738281 | 00:09 |
2 | 0.427395 | 0.071427 | 0.970052 | 00:09 |
3 | 0.251631 | 0.052916 | 0.980469 | 00:09 |
4 | 0.156765 | 0.026019 | 0.989583 | 00:09 |
Now we can begin to see why choosing your augmentation is important!
The last transform we'll discuss is the Delta transform: a "local estimate of the derivative of the input data along the selected axis." This allows multiple-channel inputs from one signal:
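As a rough sketch, a delta feature is just a local difference along time, stacked alongside the original as an extra channel (simplified: real implementations such as librosa.feature.delta fit a local regression over a window rather than a bare difference):

```python
# Simplified delta: the first difference along time, padded to keep length.
def delta(frames):
    return [b - a for a, b in zip(frames, frames[1:])] + [0]

base = [1, 3, 6, 10]
d1 = delta(base)   # first derivative estimate ("velocity")
d2 = delta(d1)     # delta-delta ("acceleration")
print(d1)  # → [2, 3, 4, 0]
print(d2)  # → [1, 1, -4, 0]
```

Stacking base, d1, and d2 as channels gives the model how the features change over time, not just their values.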
item_tfms = [ResizeSignal(1000), aud2mfcc, Delta()]
aud_digit.item_tfms = item_tfms
dls = aud_digit.dataloaders(path_dig, bs=128)
dls.show_batch(max_n=3)
Let's try training one more time:
learn = audio_learner(dls, xresnet18(), CrossEntropyLossFlat(), accuracy)
learn.lr_find()
SuggestedLRs(lr_min=0.15848932266235352, lr_steep=0.0014454397605732083)
learn.fit_one_cycle(5, 1e-2)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.581891 | 0.626583 | 0.882812 | 00:13 |
1 | 0.678377 | 0.197535 | 0.912760 | 00:13 |
2 | 0.367090 | 0.094837 | 0.962240 | 00:13 |
3 | 0.220476 | 0.022505 | 0.992188 | 00:13 |
4 | 0.141336 | 0.033642 | 0.988281 | 00:13 |
Let's try fitting for a few more epochs:
learn.fit_one_cycle(5, 1e-2/10)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.029490 | 0.022874 | 0.993490 | 00:13 |
1 | 0.027482 | 0.022706 | 0.992188 | 00:13 |
2 | 0.025360 | 0.015615 | 0.994792 | 00:13 |
3 | 0.026397 | 0.026350 | 0.990885 | 00:13 |
4 | 0.027179 | 0.022311 | 0.993490 | 00:13 |