(Largely based on rbracco's tutorial, big thanks to him for his work on getting this going for us!)
fastai's audio module has been in development for a while by active forum members. While it is possible to train on raw audio (we simply pass in a 1D tensor of the signal), the current approach is to convert the audio into what is called a spectrogram and train on that.
Installing the fastai_audio library: We'll be installing from their git repository, similar to how we did for the dev version of fastai:
!pip install git+https://github.com/fastaudio/fastaudio.git
Successfully installed IPython-7.16.1 appdirs-1.4.4 colorednoise-1.1.1 fastai-2.0.15 fastaudio-0.0.post0.dev126+g66b6230 fastcore-1.0.8 librosa-0.8.0 pooch-1.2.0 prompt-toolkit-3.0.7 soundfile-0.10.3.post1 torchaudio-0.6.0
We'll also need torchaudio, which the install above pulled in as a dependency.
The dataset is essentially the audio version of MNIST: it contains 2,000 recordings from 10 speakers saying each digit 5 times. First, we'll grab the data and use a custom extract function:
from fastai.vision.all import *
from fastaudio.core.all import *
from fastaudio.augment.all import *
tar_extract_at_filename simply extracts the archive at the file name (as the name suggests):
path_dig = untar_data(URLs.SPEAKERS10, extract_func=tar_extract_at_filename)
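As a rough sketch (not fastai's actual source), an extract function of this shape plausibly unpacks the archive into a folder named after the file:

```python
import tarfile
from pathlib import Path

# Hypothetical sketch of a tar_extract_at_filename-style helper: unpack the
# downloaded archive into a directory named after the archive file itself.
# This is an illustration, not fastai's actual implementation.
def tar_extract_at_filename(fname, dest):
    dest = Path(dest) / Path(fname).name.split('.')[0]
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fname, 'r:*') as tar:
        tar.extractall(dest)
```
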
Now we want to grab just the audio files.
audio_extensions[:5]
('.aif', '.aifc', '.aiff', '.au', '.m3u')
fnames = get_files(path_dig, extensions=audio_extensions)
fnames[:5]
(#5) [Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/f0003_us_f0003_00136.wav'),Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/m0001_us_m0001_00326.wav'),Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/f0002_us_f0002_00307.wav'),Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/f0001_us_f0001_00411.wav'),Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/m0005_us_m0005_00155.wav')]
We can convert any audio file to a tensor with AudioTensor. Let's try opening a file:
at = AudioTensor.create(fnames[0])
at, at.shape
(AudioTensor([[ 0.0000, 0.0000, 0.0000, ..., -0.0002, -0.0002, -0.0003]]), torch.Size([1, 58240]))
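The tensor is shaped (channels, samples), so dividing the sample count by the sample rate (assumed here to be 16 kHz, matching the Voice config we use later) recovers the clip's duration:

```python
# AudioTensor shape is (channels, samples); dividing samples by the sample
# rate (assumed 16 kHz, per the Voice config used below) gives the duration.
n_channels, n_samples = 1, 58240
sample_rate = 16_000
duration_s = n_samples / sample_rate
print(round(duration_s, 2))  # → 3.64
```

So this clip is roughly three and a half seconds long.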
at.show()
<matplotlib.axes._subplots.AxesSubplot at 0x7f91f3cd1ac8>
fastai_audio has an AudioConfig class which allows us to prepare different settings for our dataset. It currently ships with a few preset configurations.
We'll be using the Voice module today, as this dataset just contains human voices.
cfg = AudioConfig.Voice()
Our configuration will limit options like the frequency range and the sampling rate:
cfg.f_max, cfg.sample_rate
(8000.0, 16000)
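Notice that f_max is exactly half the sample rate: that is the Nyquist frequency, the highest frequency a signal sampled at a given rate can represent, so there's nothing useful above it.

```python
# The highest frequency a sampled signal can represent is half its sample
# rate (the Nyquist frequency), which is exactly where f_max sits here.
sample_rate = 16_000
nyquist = sample_rate / 2
print(nyquist)  # → 8000.0
```
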
We can then make a transform from this configuration to turn raw audio into a workable spectrogram per our settings:
aud2spec = AudioToSpec.from_cfg(cfg)
For our example, we'll crop the original audio file down to 1000 ms:
crop1s = ResizeSignal(1000)
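ResizeSignal takes its argument in milliseconds; at a 16 kHz sample rate (an assumption based on the config above), that works out as:

```python
# ResizeSignal is specified in milliseconds; converting to samples at an
# assumed 16 kHz rate shows how much raw signal a 1000 ms crop keeps.
duration_ms, sample_rate = 1000, 16_000
n_samples = sample_rate * duration_ms // 1000
print(n_samples)  # → 16000
```
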
Let's build a Pipeline to mirror how we'd expect our data to come in:
pipe = Pipeline([AudioTensor.create, crop1s, aud2spec])
And try visualizing what our newly made data becomes.
First, we'll remove that cropping:
pipe = Pipeline([AudioTensor.create, aud2spec])
for fn in fnames[:3]:
audio = AudioTensor.create(fn)
audio.show()
pipe(fn).show()
You can see that they're not all the same size here. Let's add that cropping back in:
pipe = Pipeline([AudioTensor.create, crop1s, aud2spec])
for fn in fnames[:3]:
audio = AudioTensor.create(fn)
audio.show()
pipe(fn).show()
And now everything is 128x63.
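Where do those dimensions come from? The 128 rows are the mel bands (n_mels), and the columns are STFT time frames; with torchaudio-style centered framing, the frame count is roughly 1 + n_samples // hop_length. The concrete numbers below are illustrative, not read from our config:

```python
# With centered STFT framing (torchaudio's default), a clip of n_samples
# produces about 1 + n_samples // hop_length time frames.
def n_frames(n_samples, hop_length):
    return 1 + n_samples // hop_length

# Illustrative only: 8000 samples with hop_length=128 gives 63 frames.
print(n_frames(8000, 128))  # → 63
```
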
Using the DataBlock API: rather than a manual Pipeline, we'll give the DataBlock our item transforms and a getter for the labels.
For our transforms, we'll want the same ones we used before:
item_tfms = [ResizeSignal(1000), aud2spec]
Our filenames are labelled by the number followed by the name of the individual:
4_theo_37.wav
2_nicolas_7.wav
get_y = lambda x: x.name[0]
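This getter simply grabs the first character of the file's name, which is the spoken digit:

```python
from pathlib import Path

# The label is the first character of the file name (the spoken digit).
get_y = lambda x: x.name[0]
print(get_y(Path('4_theo_37.wav')))  # → 4
```
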
aud_digit = DataBlock(blocks=(AudioBlock, CategoryBlock),
get_items=get_audio_files,
splitter=RandomSplitter(),
item_tfms = item_tfms,
get_y=get_y)
And now we can build our DataLoaders
dls = aud_digit.dataloaders(path_dig, bs=64)
Let's look at a batch
dls.show_batch(max_n=3)
Now that we have our DataLoaders, we need to make a model. We'll make a function that changes a Learner's first layer to accept a 1-channel input (similar to what we did for the Bengali.AI model):
def alter_learner(learn, n_channels=1):
    "Adjust a `Learner`'s model to accept `n_channels` inputs"
    layer = learn.model[0][0]
    layer.in_channels = n_channels
    # Keep the weights of a single input channel and re-wrap as a Parameter
    layer.weight = nn.Parameter(layer.weight[:,1,:,:].unsqueeze(1))
    learn.model[0][0] = layer
learn = Learner(dls, xresnet18(), CrossEntropyLossFlat(), metrics=accuracy)
Now we need to grab our number of channels:
n_c = dls.one_batch()[0].shape[1]; n_c
1
alter_learner(learn, n_c)
Now we can find our learning rate and fit!
learn.lr_find()
SuggestedLRs(lr_min=0.03630780577659607, lr_steep=0.0020892962347716093)
learn.fit_one_cycle(5, 1e-2)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.052515 | 0.323332 | 0.881510 | 00:11 |
1 | 0.401659 | 0.156771 | 0.924479 | 00:11 |
2 | 0.210763 | 0.696115 | 0.798177 | 00:11 |
3 | 0.119320 | 0.239732 | 0.923177 | 00:11 |
4 | 0.063962 | 0.029512 | 0.993490 | 00:11 |
learn.fit_one_cycle(5, 1e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.030099 | 0.034156 | 0.986979 | 00:11 |
1 | 0.031116 | 0.021291 | 0.996094 | 00:12 |
2 | 0.032417 | 0.017661 | 0.997396 | 00:11 |
3 | 0.025745 | 0.017490 | 0.994792 | 00:11 |
4 | 0.022410 | 0.016554 | 0.997396 | 00:11 |
Not bad for zero data augmentation! But let's see if augmentation can help us out here!
We can use the SpectrogramTransformer class to prepare some transforms for us:
DBMelSpec = SpectrogramTransformer(mel=True, to_db=True)
Let's take a look at our original settings:
aud2spec.settings
{'mel': 'True', 'to_db': 'False', 'sample_rate': 16000, 'n_fft': 1024, 'win_length': 1024, 'hop_length': 128, 'f_min': 50.0, 'f_max': 8000.0, 'pad': 0, 'n_mels': 128, 'window_fn': <function _VariableFunctionsClass.hann_window>, 'power': 2.0, 'normalized': False, 'wkwargs': None, 'stype': 'power', 'top_db': None, 'sr': 16000, 'nchannels': 1}
And we'll narrow this down a bit
aud2spec = DBMelSpec(n_mels=128, f_max=10000, n_fft=1024, hop_length=128, top_db=100)
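The to_db step converts the power spectrogram to decibels, and top_db clamps the dynamic range. Here is a rough sketch of the usual librosa/torchaudio behaviour (fastaudio delegates to those libraries, so this is only an illustration of the idea, not its exact code):

```python
import math

# Sketch of a power-to-decibel conversion with a top_db floor, in the style
# of librosa/torchaudio. Values are 10 * log10(power), and nothing is allowed
# to sit more than top_db below the loudest value.
def power_to_db(powers, top_db=100):
    db = [10 * math.log10(max(p, 1e-10)) for p in powers]
    floor = max(db) - top_db
    return [max(d, floor) for d in db]

print(power_to_db([1.0, 0.1, 1e-12]))
```
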
For our transforms, we'll use:

  * RemoveSilence, which trims out silence with a bit of padding, pad_ms (default is 20)
  * CropSignal, which crops the signal to a duration and adds padding if needed
  * aud2spec, our SpectrogramTransformer with its parameters
  * MaskTime and MaskFreq, which apply masks via einsum operations
Let's look a bit more at the padding CropSignal uses. There are three different types:

  * AudioPadType.Zeros: the default, random zeros before and after
  * AudioPadType.Repeat: repeat the signal until it reaches the proper length (great for acoustic scene classification and voice recognition, terrible for speech recognition)
  * AudioPadType.ZerosAfter: the default for many other libraries; just pad with zeros until you reach the specified length

Now let's rebuild our DataBlock:
item_tfms = [RemoveSilence(), ResizeSignal(1000), aud2spec, MaskTime(size=4), MaskFreq(size=10)]
aud_digit = DataBlock(blocks=(AudioBlock, CategoryBlock),
get_items=get_audio_files,
splitter=RandomSplitter(),
item_tfms = item_tfms,
get_y=get_y)
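The three padding behaviours described above can be sketched in plain Python (a simplified illustration on a list of samples, not fastaudio's tensor implementation):

```python
import random

# Simplified sketches of the three AudioPadType behaviours.
def pad_zeros(sig, length):          # random zeros before and after
    extra = length - len(sig)
    before = random.randint(0, extra)
    return [0] * before + sig + [0] * (extra - before)

def pad_repeat(sig, length):         # tile the signal until long enough
    out = sig * (length // len(sig) + 1)
    return out[:length]

def pad_zeros_after(sig, length):    # zeros only at the end
    return sig + [0] * (length - len(sig))

print(pad_repeat([1, 2, 3], 7))       # → [1, 2, 3, 1, 2, 3, 1]
print(pad_zeros_after([1, 2, 3], 5))  # → [1, 2, 3, 0, 0]
```

You can see why Repeat suits speaker recognition (the voice keeps going) but would garble speech recognition, where word order matters.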
dls = aud_digit.dataloaders(path_dig, bs=128)
Let's look at some augmented data:
dls.show_batch(max_n=3)
Let's try training again. Also, since we keep having to make the same adjustment to our model, let's make an audio_learner function similar to cnn_learner:
def audio_learner(dls, arch, loss_func, metrics):
"Prepares a `Learner` for audio processing"
learn = Learner(dls, arch, loss_func, metrics=metrics)
n_c = dls.one_batch()[0].shape[1]
if n_c == 1: alter_learner(learn)
return learn
learn = audio_learner(dls, xresnet18(), CrossEntropyLossFlat(), accuracy)
learn.lr_find()
SuggestedLRs(lr_min=0.04365158379077912, lr_steep=0.002511886414140463)
learn.fit_one_cycle(10, 3e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 4.650460 | 1.274344 | 0.776042 | 00:17 |
1 | 1.882244 | 0.165905 | 0.928385 | 00:17 |
2 | 0.958342 | 0.023563 | 0.993490 | 00:17 |
3 | 0.537469 | 0.020922 | 0.994792 | 00:17 |
4 | 0.317845 | 0.012210 | 0.996094 | 00:17 |
5 | 0.193368 | 0.030548 | 0.990885 | 00:17 |
6 | 0.118598 | 0.014498 | 0.996094 | 00:17 |
7 | 0.074519 | 0.002216 | 1.000000 | 00:17 |
8 | 0.047171 | 0.003482 | 1.000000 | 00:17 |
9 | 0.030317 | 0.002047 | 1.000000 | 00:17 |
learn.fit_one_cycle(10, 3e-4)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.003665 | 0.003438 | 1.000000 | 00:17 |
1 | 0.004991 | 0.003278 | 1.000000 | 00:17 |
2 | 0.005746 | 0.017678 | 0.997396 | 00:17 |
3 | 0.005073 | 0.002702 | 0.998698 | 00:17 |
4 | 0.003955 | 0.002549 | 1.000000 | 00:17 |
5 | 0.003551 | 0.001628 | 1.000000 | 00:17 |
6 | 0.003751 | 0.001688 | 1.000000 | 00:17 |
7 | 0.003312 | 0.003636 | 0.998698 | 00:17 |
8 | 0.003430 | 0.003046 | 1.000000 | 00:17 |
9 | 0.003002 | 0.002288 | 1.000000 | 00:17 |
With the help of some of our data augmentation, we were able to score a bit higher!
Now let's look at the MFCC option we mentioned earlier. MFCCs are a "linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency" (Wikipedia). But what does that mean?
Let's try it out!
aud2mfcc = AudioToMFCC(n_mfcc=40, melkwargs={'n_fft':2048, 'hop_length':256,
'n_mels':128})
item_tfms = [ResizeSignal(1000), aud2mfcc]
There's a shortcut for replacing the item transforms in a DataBlock:
aud_digit.item_tfms
(#8) [ToTensor: encodes: (PILMask,object) -> encodes (PILBase,object) -> encodes decodes: ,Resample: encodes: (AudioTensor,object) -> encodes decodes: ,DownmixMono: encodes: (AudioTensor,object) -> encodes decodes: ,RemoveSilence: encodes: (AudioTensor,object) -> encodes decodes: ,ResizeSignal: encodes: (AudioTensor,object) -> encodes decodes: ,AudioToSpec: encodes: (AudioTensor,object) -> encodes decodes: ,MaskTime: encodes: (AudioSpectrogram,object) -> encodes decodes: ,MaskFreq: encodes: (AudioSpectrogram,object) -> encodes decodes: ]
aud_digit.item_tfms = item_tfms
dls = aud_digit.dataloaders(path_dig, bs=128)
dls.show_batch(max_n=3)
Now let's build our learner and train again!
learn = audio_learner(dls, xresnet18(), CrossEntropyLossFlat(), accuracy)
learn.lr_find()
SuggestedLRs(lr_min=0.07585775852203369, lr_steep=0.0020892962347716093)
learn.fit_one_cycle(5, 1e-2)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.969720 | 1.324741 | 0.812500 | 00:09 |
1 | 0.805437 | 0.774530 | 0.738281 | 00:09 |
2 | 0.427395 | 0.071427 | 0.970052 | 00:09 |
3 | 0.251631 | 0.052916 | 0.980469 | 00:09 |
4 | 0.156765 | 0.026019 | 0.989583 | 00:09 |
Now we can begin to see why choosing your augmentation is important!
The last transform we'll discuss is the Delta transform: a "local estimate of the derivative of the input data along the selected axis." This allows multiple-channel inputs from one signal:
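As a rough sketch, a delta feature is just a local difference along time, stacked alongside the original as an extra channel (simplified: real implementations such as librosa.feature.delta fit a local regression over a window rather than a bare difference):

```python
# Simplified delta: the first difference along time, padded to keep length.
def delta(frames):
    return [b - a for a, b in zip(frames, frames[1:])] + [0]

base = [1, 3, 6, 10]
d1 = delta(base)   # first derivative estimate ("velocity")
d2 = delta(d1)     # delta-delta ("acceleration")
print(d1)  # → [2, 3, 4, 0]
print(d2)  # → [1, 1, -4, 0]
```

Stacking base, d1, and d2 as channels gives the model how the features change over time, not just their values.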
item_tfms = [ResizeSignal(1000), aud2mfcc, Delta()]
aud_digit.item_tfms = item_tfms
dls = aud_digit.dataloaders(path_dig, bs=128)
dls.show_batch(max_n=3)
Let's try training one more time:
learn = audio_learner(dls, xresnet18(), CrossEntropyLossFlat(), accuracy)
learn.lr_find()
SuggestedLRs(lr_min=0.15848932266235352, lr_steep=0.0014454397605732083)
learn.fit_one_cycle(5, 1e-2)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.581891 | 0.626583 | 0.882812 | 00:13 |
1 | 0.678377 | 0.197535 | 0.912760 | 00:13 |
2 | 0.367090 | 0.094837 | 0.962240 | 00:13 |
3 | 0.220476 | 0.022505 | 0.992188 | 00:13 |
4 | 0.141336 | 0.033642 | 0.988281 | 00:13 |
Let's try fitting for a few more epochs:
learn.fit_one_cycle(5, 1e-2/10)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.029490 | 0.022874 | 0.993490 | 00:13 |
1 | 0.027482 | 0.022706 | 0.992188 | 00:13 |
2 | 0.025360 | 0.015615 | 0.994792 | 00:13 |
3 | 0.026397 | 0.026350 | 0.990885 | 00:13 |
4 | 0.027179 | 0.022311 | 0.993490 | 00:13 |