%reload_ext autoreload
%autoreload 2
%matplotlib inline
This notebook will show you the fastest way to get started with fastai audio by demonstrating only the most essential functionality. In the examples folder, we have included a number of other notebooks that show more features and teach you about audio in general. If you'd like to follow along in a Colab notebook, please click here and copy it into your own Google Drive.
First, import fastai audio; this will import all the dependencies you need to work with audio.
from audio import *
Here we create an `AudioItem` to load an audio file and listen to it by passing the filename (either a `str` or a `PosixPath`) to `open_audio()`. We can also see some information about the audio.
path_example = Path('data/misc/whale/Right_whale.wav')
sound = open_audio(path_example)
sound
This clip is 87.73 seconds long. Audio is a continuous wave that is "sampled" by measuring the amplitude of the wave at a given point in time. How many times you sample per second is called the "sample rate" and can be thought of as the resolution of the audio. In our example, the audio was sampled 44100 times per second, so our data is a rank-1 tensor of length 44100 × time in seconds = 3,869,019 samples.
If any of this is new to you, definitely check out our Intro to Audio notebook in the examples folder.
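The relationship between sample rate, clip length, and number of samples can be verified with a quick sanity check (the numbers below are the ones reported for this clip):

```python
# Sanity check: samples = sample rate * duration in seconds
sample_rate = 44100          # samples per second
n_samples = 3_869_019        # length of the signal tensor
duration_s = n_samples / sample_rate
print(f"{duration_s:.2f} seconds")  # 87.73 seconds
```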
sound.shape
#sig means signal, it's a rank one tensor with the amplitudes sampled from the raw sound wave
sound.sig
#sr means sample rate
sound.sr
#path is a reference to the location of the sound file
sound.path
We'll work with a fairly small dataset that has 10 speakers, 5 male and 5 female, with the goal of recognizing who is speaking.
We can download the data into our default fastai data directory
data_url = 'http://www.openslr.org/resources/45/ST-AEDS-20180100_1-OS'
data_folder = datapath4file(url2name(data_url))
untar_data(data_url, dest=data_folder)
We first create an `AudioList`. This extends fastai's `ItemList`, so you can use other methods like `from_csv()` to load your data as well.
audios = AudioList.from_folder(data_folder)
Because audio data can be so variable, we provide a convenience function `.stats()` that will display a list of sample rates (and how many files have each), as well as a plot of the lengths, in seconds, of the audio files in your `AudioList`. You can also specify `prec` to set the number of digits the file lengths are rounded to before plotting the graph (default is 0). Expect it to take about 2 seconds per 5000 files in your dataset; a progress bar is provided.
len_dict = audios.stats(prec=1)
`stats` will pass you back a dictionary with the file lengths and file names, so that you may do with it what you want.
One option is to call `get_outliers`, which will return a sorted list of (filename, length) tuples for files that are more than `devs` (a float) standard deviations from the mean length. This can be helpful for weeding out bad data.
outliers = get_outliers(len_dict, devs=3)
print("Total Outliers:", len(outliers))
outliers[:10]
The `stats` method showed us that this dataset has only one sample rate. If you have multiple sample rates, you will need to resample to a single sample rate by setting `resample_to` in the configuration settings. If you want to do any customization, you'll need to pass a config object to the `AudioList` constructor, so before we go any further, here's how to use it.
All config settings are managed through an `AudioConfig` object. It also contains within it a `SpectrogramConfig` object that holds settings related to spectrograms and MFCCs (mel-frequency cepstral coefficients). The inner config can be changed just like the outer one by nesting, e.g. `config.sg_cfg.top_db = 80`.
config = AudioConfig()
config
As you can see, there are tons of features here, most of which you will not need to adjust to get pretty good results. If you plan on doing a lot of work on audio, or have a dataset with lots of silence or a wide variety of audio lengths, check out our Features Notebook in the examples folder; it shows when and how to adjust each of these settings.
For now we will only cover the most essential features: `resample_to`, `max_to_pad`, and `duration`.
duration and max_to_pad

Eventually, our audio will become spectrograms (visual representations of audio that can be passed to an image classifier). Like images, it is important that our spectrograms be the same size so that the GPU can handle them efficiently. Since audio clips rarely have precisely equal length, we give you two options for generating fixed-width spectrograms. Which one is best for you will depend on the nature of your data. If your data varies in length by even a moderate amount, you will want to use `duration`.
Specify the `duration` setting of your config. This will compute the spectrogram using the entire clip regardless of length, but at train time will grab random sections that are `duration` milliseconds long. If `duration` is greater than the length of the clip, it will pad your spectrogram with zeros to be the same length as the others.
Set the `max_to_pad` attribute of your config (in milliseconds) to be the length you want your audio to be. This will pad or trim the underlying audio, and then generate spectrograms from the resulting audio. It will zero-pad clips that are too short, and trim clips that are too long, throwing away the remaining data.
For this dataset, let's use `duration` so we don't throw away data from the longer clips, and let's use 4000ms (4s).
config.duration = 4000
resample_to

It is also important that all of the data has the same sample rate. If one spectrogram has a sample rate of 44100 and another's is 16000, the x-axes of the spectrograms will represent different amounts of time, and thus they won't be comparable. So if you see more than one sample rate when you call the `.stats()` method above, you will need to set `resample_to` to an int representing the sample rate you wish to use. It is best practice to use common sample rates (44100, 22050, 16000, or 8000) as they will be faster to resample.
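To build intuition for what resampling does, here is a deliberately naive sketch using linear interpolation (the library's actual resampling, via torchaudio, uses higher-quality filtering; `naive_resample` is a made-up name for illustration only):

```python
def naive_resample(sig, orig_sr, new_sr):
    # Naive linear-interpolation resample: map each output sample back to a
    # fractional position in the original signal and interpolate its neighbours.
    n_out = int(len(sig) * new_sr / orig_sr)
    out = []
    for i in range(n_out):
        pos = i * orig_sr / new_sr            # fractional index in original
        lo = int(pos)
        hi = min(lo + 1, len(sig) - 1)
        frac = pos - lo
        out.append(sig[lo] * (1 - frac) + sig[hi] * frac)
    return out

sig = [float(i) for i in range(44100)]        # 1 s of fake audio at 44.1 kHz
down = naive_resample(sig, 44100, 8000)
print(len(down))  # 8000 samples: still 1 second of audio, at the new rate
```

The clip still represents the same amount of time; only the number of samples per second changes, which is exactly why mixed sample rates make spectrograms incomparable.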
For our data, there is no need to resample, but if we did, the code to downsample to 8000 would just be `config.resample_to = 8000`.
Now we follow the normal fastai data block API, making sure to pass our config to the `AudioList`.
label_pattern = r'_([mf]\d+)_'
audios = AudioList.from_folder(data_folder, config=config).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)
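To see what `label_from_re` extracts with this pattern, you can test the regex directly. The filename below is hypothetical, made up only to illustrate the speaker-id-between-underscores style this pattern expects:

```python
import re

label_pattern = r'_([mf]\d+)_'
# hypothetical filename: 'm0003' is the speaker label the pattern should capture
fname = 'st_aeds_m0003_00017.wav'
match = re.search(label_pattern, fname)
print(match.group(1))  # m0003
```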
Fastai audio performs on-the-fly data augmentation directly on spectrograms. Try uncommenting the second line and playing around with the transform manager; for more detail, check out the Features Notebook.
tfms = None
#tfms = get_spectro_transforms(mask_time=False, mask_freq=True, roll=False, num_rows=12)
db = audios.transform(tfms).databunch(bs=64)
db.show_batch(20)
When audio is longer than the duration you've selected for training, a random section is clipped, and those items will tell you which time portion of the original audio clip the spectrogram and displayed audio represent. It will appear as '2.53s-6.53s of original clip'. Clips that are shorter than `duration` are padded with zeros; this will appear as a blue-green bar on the right-hand side of the spectrogram.
An audio learner takes a databunch, `base_arch` (optional, defaults to resnet18 for now), and `metrics` (optional, defaults to accuracy) and returns a `cnn_learner`. For now it is just a wrapper, but additional functionality is coming soon.
learn = audio_learner(db)
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(5, slice(2e-3, 2e-2))
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
With 30 seconds of compute, and no preprocessing or fine tuning, you just created a voice-recognition system with 99% accuracy. But this is really just scratching the surface, so please check out our other notebooks in the examples folder and see what else is possible.
This library builds on the work of many others. It is of course built on top of fastai, so thank you to Jeremy, Rachel, Stas, Sylvain, and countless others. It is a fork of https://github.com/zcaceres/fastai-audio, so we owe a lot to @aamir7117, @marii, @simonjhb, @ste, @ThomM, and @zachcaceres. And it is built on top of torchaudio, which helps us do many things much faster than would otherwise be possible. Thanks as well to those who have been active in the fastai audio thread.
Also, we would love feedback, bug reports, feature requests, and whatever else you have to offer. We welcome contributors of all skill levels. If you need to get in touch for any reason, please post in the fastai audio thread or contact us via PM (@baz or @madeupmasters). Let's build an audio machine learning community!