View the full project on GitHub
Render the notebook with nbviewer
This project builds a scalable, attention-based speech recognition platform in Keras/TensorFlow for inference on the Nvidia Jetson. This real-world application of automatic speech recognition was inspired by my career in mental health, and it begins a journey toward a platform for real-time inference and feedback on therapeutic interventions. The ultimate intent is a tool that gives therapists real-time feedback on the efficacy of their interventions, but the approach has many applications in mobile, robotics, and other areas where cloud-based deep learning is not desirable.
This notebook explores three common ways of visualizing and mathematically representing audio for use in machine learning models, then walks through the construction of a series of increasingly complex character-level phonetics sequencing models. I chose recurrent neural networks because they harness the power of deep neural networks for time-sequencing problems and train quickly on GPUs compared to other models. I chose character-level phonetics modeling because it provides a more accurate depiction of language and allows building a system that can pick up on the nuances of human-to-human communication in deeply personal conversations. Finally, the notebook explores measures of model performance and makes predictions with the trained models.
The final production model has a word error rate of roughly 16% and a cosine similarity score of about 79%.
Speech recognition models are built around a statistical optimization problem sometimes called the fundamental equation of speech recognition: given a sequence of acoustic observations, find the most likely character or word sequence. Using Bayes' theorem, we look for the sequence that maximizes the posterior probability of the characters given the observations, and speech recognition becomes a search over this model for the best character sequence.
Character-level speech recognition can be broken into two parts: the acoustic model, which describes the distribution over acoustic observations O given the character sequence C, and the language model, which depends only on the character sequence and assigns a probability to every possible character sequence. This sequence-to-sequence model combines both the acoustic and language models into one neural network, though pretrained acoustic models are available from Kaldi if you would like to speed up training.
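In the usual notation, with O the acoustic observations and C a candidate character sequence, the search described above can be written as

\hat{C} = \arg\max_C P(C \mid O) = \arg\max_C P(O \mid C)\, P(C)

where P(O | C) is the acoustic model, P(C) is the language model, and the evidence P(O) is dropped because it is constant for any given utterance.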
My goal was to build a character-level ASR system using an encoder/decoder based recurrent neural network with an attention mechanism in TensorFlow that can run inference on an Nvidia Jetson with a word error rate of <20%.
# Common, File Based, and Math Imports
import pandas as pd
import numpy as np
import collections
import os
from os.path import isdir, join
from pathlib import Path
from subprocess import check_output
import sys
import math
import pickle
from glob import glob
import random
from random import sample
import json
from mpl_toolkits.axes_grid1 import make_axes_locatable
from numpy.lib.stride_tricks import as_strided
from tqdm import tqdm
# Audio processing
from scipy import signal
from scipy.fftpack import dct
import soundfile
from python_speech_features import mfcc
import scipy.io.wavfile as wav
from scipy.fftpack import fft
# Neural Network
import keras
from keras.utils.generic_utils import get_custom_objects
from keras import backend as K
from keras import regularizers, callbacks
from keras.constraints import max_norm
from keras.models import Model, Sequential, load_model
from keras.layers import Input, Lambda, Dense, Dropout, Flatten, Embedding, merge, Activation, GRUCell, LSTMCell,SimpleRNNCell
from keras.layers import Convolution2D, MaxPooling2D, Convolution1D, Conv1D, SimpleRNN, GRU, LSTM, CuDNNLSTM, CuDNNGRU, Conv2D
from keras.layers.advanced_activations import LeakyReLU, PReLU, ThresholdedReLU, ELU
from keras.layers import BatchNormalization, TimeDistributed, Bidirectional
from keras.layers import activations, Wrapper
from keras.regularizers import l2
from keras.optimizers import Adam, SGD, RMSprop, Adagrad, Adadelta, Adamax, Nadam
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from keras import constraints, initializers, regularizers
from keras.engine.topology import Layer
import keras.losses
from keras.backend.tensorflow_backend import set_session
from keras.engine import InputSpec
import tensorflow as tf
# Model metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Visualization
import IPython.display as ipd
from IPython.display import Markdown, display, Audio
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
py.init_notebook_mode(connected=True)
color = sns.color_palette()
sns.set_style('darkgrid')
%matplotlib inline
# Setting Random Seeds
np.random.seed(95)
RNG_SEED = 95
# Suppressing some of Tensorflow's warnings
tf.logging.set_verbosity(tf.logging.ERROR)
Using TensorFlow backend.
# Simple matrix multiplication test to check if tf is using GPU device.
with tf.device('/gpu:0'):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# If there is an output then it is able to access the device.
with tf.Session() as sess:
print (sess.run(c))
[[22. 28.]
 [49. 64.]]
The primary dataset is the LibriSpeech ASR corpus, which includes 1,000 hours of recorded speech. A 960-hour subset of the audio files was used for model development and training. The dataset consists of 16 kHz recordings of spoken English derived from read audiobooks from the LibriVox project. Some issues with this dataset are the age of some of the works (the Declaration of Independence probably doesn't relate well to modern spoken English), the considerable overlap in words spoken between the books, the lack of white noise and other non-voice sounds to help the model differentiate spoken words from background noise, and the absence of conversational English. An overview of the difficulties of working with data such as this can be found here.
The dataset is prepared using a set of scripts borrowed from Baidu Research's Deep Speech GitHub Repo.
The dataset consists of 16 kHz audio files between 2 and 15 seconds long. Using the prepared scripts, the audio files were converted to single-channel (mono) WAV files (.wav extension) with a 64k bit rate and a 16 kHz sample rate, encoded in PCM format, and then cut or padded to an equal length of 10 seconds. Pre-processing of the text transcriptions included removing all punctuation other than apostrophes and transforming all characters to lowercase. Full instructions on how to download the dataset, convert the .flac files to .wav, and build the corpus are found in the README on the GitHub repository or in the wiki.
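Each line of the resulting corpus files is a standalone JSON record holding the audio path, the clip duration in seconds, and the transcription. The field names below are the ones the data generator expects; the path and values themselves are only illustrative:

{"key": "LibriSpeech/train-clean-100/19/198/19-198-0001.wav", "duration": 12.3, "text": "this is an example transcription"}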
train_corpus = pd.read_json('train_corpus.json', lines=True)
valid_corpus = pd.read_json('valid_corpus.json', lines=True)
test_corpus = pd.read_json('test_corpus.json', lines=True)
train_duration_mean = train_corpus.duration.mean()
valid_duration_mean = valid_corpus.duration.mean()
test_duration_mean = test_corpus.duration.mean()
print('Train Set Duration Mean:', train_duration_mean)
print('Valid Set Duration Mean:', valid_duration_mean)
print('Test Set Duration Mean:', test_duration_mean)
Train Set Duration Mean: 12.301810444600907
Valid Set Duration Mean: 6.795830092509413
Test Set Duration Mean: 6.958454892966327
train_duration_median = train_corpus.duration.median()
valid_duration_median = valid_corpus.duration.median()
test_duration_median = test_corpus.duration.median()
print('Train Set Duration Median:', train_duration_median)
print('Valid Set Duration Median:', valid_duration_median)
print('Test Set Duration Median:', test_duration_median)
Train Set Duration Median: 13.79
Valid Set Duration Median: 5.53
Test Set Duration Median: 5.455
# Function for shuffling data which is important as neural networks make multiple passes through the data
def shuffle_dataset(audio_paths, durations, texts):
p = np.random.permutation(len(audio_paths))
audio_paths = [audio_paths[i] for i in p]
durations = [durations[i] for i in p]
texts = [texts[i] for i in p]
return audio_paths, durations, texts
# Function for sorting data by duration
def sort_dataset(audio_paths, durations, texts):
p = np.argsort(durations).tolist()
audio_paths = [audio_paths[i] for i in p]
durations = [durations[i] for i in p]
texts = [texts[i] for i in p]
return audio_paths, durations, texts
# Mapping each character that could be spoken at each time step
char_map_str = """
' 0
<SPACE> 1
a 2
b 3
c 4
d 5
e 6
f 7
g 8
h 9
i 10
j 11
k 12
l 13
m 14
n 15
o 16
p 17
q 18
r 19
s 20
t 21
u 22
v 23
w 24
x 25
y 26
z 27
"""
# This leaves the "blank" character mapped to number 28
char_map = {}
index_map = {}
for line in char_map_str.strip().split('\n'):
ch, index = line.split()
char_map[ch] = int(index)
index_map[int(index)+1] = ch
index_map[2] = ' '
# Function for converting text to an integer sequence
def text_to_int_seq(text):
int_sequence = []
for c in text:
if c == ' ':
ch = char_map['<SPACE>']
else:
ch = char_map[c]
int_sequence.append(ch)
return int_sequence
# Function for converting an integer sequence to text
def int_seq_to_text(int_sequence):
text = []
for c in int_sequence:
ch = index_map[c]
text.append(ch)
return text
# Function for calculating feature dimensions.
def calc_feat_dim(window, max_freq):
return int(0.001 * window * max_freq) + 1
This class provides options for training models on either MFCCs or spectrograms of the data, and it uses spectrograms by default.
class AudioGenerator():
def __init__(self, step=10, window=20, max_freq=8000, mfcc_dim=13,
minibatch_size=20, desc_file=None, spectrogram=True, max_duration=10.0,
sort_by_duration=False):
# Initializing variables
self.feat_dim = calc_feat_dim(window, max_freq)
self.mfcc_dim = mfcc_dim
self.feats_mean = np.zeros((self.feat_dim,))
self.feats_std = np.ones((self.feat_dim,))
self.rng = random.Random(RNG_SEED)
if desc_file is not None:
self.load_metadata_from_desc_file(desc_file)
self.step = step
self.window = window
self.max_freq = max_freq
self.cur_train_index = 0
self.cur_valid_index = 0
self.cur_test_index = 0
self.max_duration=max_duration
self.minibatch_size = minibatch_size
self.spectrogram = spectrogram
self.sort_by_duration = sort_by_duration
def get_batch(self, partition):
# Obtain a batch of audio files
if partition == 'train':
audio_paths = self.train_audio_paths
cur_index = self.cur_train_index
texts = self.train_texts
elif partition == 'valid':
audio_paths = self.valid_audio_paths
cur_index = self.cur_valid_index
texts = self.valid_texts
elif partition == 'test':
audio_paths = self.test_audio_paths
cur_index = self.cur_test_index
texts = self.test_texts
else:
raise Exception("Invalid partition. Must be train/validation/test")
features = [self.normalize(self.featurize(a)) for a in
audio_paths[cur_index:cur_index+self.minibatch_size]]
# Calculate size
max_length = max([features[i].shape[0]
for i in range(0, self.minibatch_size)])
max_string_length = max([len(texts[cur_index+i])
for i in range(0, self.minibatch_size)])
# Initialize arrays
X_data = np.zeros([self.minibatch_size, max_length,
self.feat_dim*self.spectrogram + self.mfcc_dim*(not self.spectrogram)])
labels = np.ones([self.minibatch_size, max_string_length]) * 28
input_length = np.zeros([self.minibatch_size, 1])
label_length = np.zeros([self.minibatch_size, 1])
for i in range(0, self.minibatch_size):
# Calculate input_length
feat = features[i]
input_length[i] = feat.shape[0]
X_data[i, :feat.shape[0], :] = feat
# Calculate label_length
label = np.array(text_to_int_seq(texts[cur_index+i]))
labels[i, :len(label)] = label
label_length[i] = len(label)
# Output arrays
outputs = {'ctc': np.zeros([self.minibatch_size])}
inputs = {'the_input': X_data,
'the_labels': labels,
'input_length': input_length,
'label_length': label_length
}
return (inputs, outputs)
def shuffle_dataset_by_partition(self, partition):
# More shuffling
if partition == 'train':
self.train_audio_paths, self.train_durations, self.train_texts = shuffle_dataset(
self.train_audio_paths, self.train_durations, self.train_texts)
elif partition == 'valid':
self.valid_audio_paths, self.valid_durations, self.valid_texts = shuffle_dataset(
self.valid_audio_paths, self.valid_durations, self.valid_texts)
else:
raise Exception("Invalid partition. "
"Must be train/val")
def sort_dataset_by_duration(self, partition):
# Sort the chosen partition by duration
if partition == 'train':
self.train_audio_paths, self.train_durations, self.train_texts = sort_dataset(
self.train_audio_paths, self.train_durations, self.train_texts)
elif partition == 'valid':
self.valid_audio_paths, self.valid_durations, self.valid_texts = sort_dataset(
self.valid_audio_paths, self.valid_durations, self.valid_texts)
else:
raise Exception("Invalid partition. "
"Must be train/val")
def next_train(self):
# Get a batch of training data
while True:
ret = self.get_batch('train')
self.cur_train_index += self.minibatch_size
if self.cur_train_index >= len(self.train_texts) - self.minibatch_size:
self.cur_train_index = 0
self.shuffle_dataset_by_partition('train')
yield ret
def next_valid(self):
# Get a batch of validation data
while True:
ret = self.get_batch('valid')
self.cur_valid_index += self.minibatch_size
if self.cur_valid_index >= len(self.valid_texts) - self.minibatch_size:
self.cur_valid_index = 0
self.shuffle_dataset_by_partition('valid')
yield ret
def next_test(self):
# Get a batch of testing data
while True:
ret = self.get_batch('test')
self.cur_test_index += self.minibatch_size
if self.cur_test_index >= len(self.test_texts) - self.minibatch_size:
self.cur_test_index = 0
yield ret
# Load datasets
def load_train_data(self, desc_file='train_corpus.json'):
self.load_metadata_from_desc_file(desc_file, 'train')
self.fit_train()
if self.sort_by_duration:
self.sort_dataset_by_duration('train')
def load_validation_data(self, desc_file='valid_corpus.json'):
self.load_metadata_from_desc_file(desc_file, 'validation')
if self.sort_by_duration:
self.sort_dataset_by_duration('valid')
def load_test_data(self, desc_file='test_corpus.json'):
self.load_metadata_from_desc_file(desc_file, 'test')
if self.sort_by_duration:
self.sort_dataset_by_duration('test')
def load_metadata_from_desc_file(self, desc_file, partition):
# Get metadata from json corpus
audio_paths, durations, texts = [], [], []
with open(desc_file) as json_line_file:
for line_num, json_line in enumerate(json_line_file):
try:
spec = json.loads(json_line)
if float(spec['duration']) > self.max_duration:
continue
audio_paths.append(spec['key'])
durations.append(float(spec['duration']))
texts.append(spec['text'])
except Exception as e:
print('Error reading line #{}: {}'
.format(line_num, json_line))
if partition == 'train':
self.train_audio_paths = audio_paths
self.train_durations = durations
self.train_texts = texts
elif partition == 'validation':
self.valid_audio_paths = audio_paths
self.valid_durations = durations
self.valid_texts = texts
elif partition == 'test':
self.test_audio_paths = audio_paths
self.test_durations = durations
self.test_texts = texts
else:
raise Exception("Invalid partition. "
"Must be train/validation/test")
def fit_train(self, k_samples=100):
# Estimate descriptive stats for training set based on sample of 100 instances
k_samples = min(k_samples, len(self.train_audio_paths))
samples = self.rng.sample(self.train_audio_paths, k_samples)
feats = [self.featurize(s) for s in samples]
feats = np.vstack(feats)
self.feats_mean = np.mean(feats, axis=0)
self.feats_std = np.std(feats, axis=0)
def featurize(self, audio_clip):
# Create features from data, either spectrogram or mfcc
if self.spectrogram:
return spectrogram_from_file(
audio_clip, step=self.step, window=self.window,
max_freq=self.max_freq)
else:
(rate, sig) = wav.read(audio_clip)
return mfcc(sig, rate, numcep=self.mfcc_dim)
def normalize(self, feature, eps=1e-14):
# Scale the data to improve neural network performance and reduce the size of the gradients
return (feature - self.feats_mean) / (self.feats_std + eps)
There are three primary methods of extracting features for speech recognition: raw audio waveforms, spectrograms, and MFCCs. For this project, I will be creating a character-level sequencing model, which lets a model trained on a dataset with a limited vocabulary generalize better to unique or rare words. The downsides are that these models are more computationally expensive, more difficult to interpret, and more susceptible to vanishing or exploding gradients because the sequences can be quite long.
# Defining 3 different ways of converting audio files to spectrograms
def spectrogram(samples, fft_length=256, sample_rate=2, hop_length=128):
# Create a spectrogram from audio signals
assert not np.iscomplexobj(samples), "You shall not pass in complex numbers"
window = np.hanning(fft_length)[:, None]
window_norm = np.sum(window**2)
scale = window_norm * sample_rate
trunc = (len(samples) - fft_length) % hop_length
x = samples[:len(samples) - trunc]
# Reshape to include the overlap
nshape = (fft_length, (len(x) - fft_length) // hop_length + 1)
nstrides = (x.strides[0], x.strides[0] * hop_length)
x = as_strided(x, shape=nshape, strides=nstrides)
# Window stride sanity check
assert np.all(x[:, 1] == samples[hop_length:(hop_length + fft_length)])
# Broadcast window, and then compute fft over columns and square mod
x = np.fft.rfft(x * window, axis=0)
x = np.absolute(x)**2
# Scale 2.0 for everything except dc and fft_length/2
x[1:-1, :] *= (2.0 / scale)
x[(0, -1), :] /= scale
freqs = float(sample_rate) / fft_length * np.arange(x.shape[0])
return x, freqs
def spectrogram_from_file(filename, step=10, window=20, max_freq=None, eps=1e-14):
# Calculate log(linear spectrogram) from FFT energy
with soundfile.SoundFile(filename) as sound_file:
audio = sound_file.read(dtype='float32')
sample_rate = sound_file.samplerate
if audio.ndim >= 2:
audio = np.mean(audio, 1)
if max_freq is None:
max_freq = sample_rate / 2
if max_freq > sample_rate / 2:
raise ValueError("max_freq can not be > than 0.5 of "
" sample rate")
if step > window:
raise ValueError("step size can not be > than window size")
hop_length = int(0.001 * step * sample_rate)
fft_length = int(0.001 * window * sample_rate)
pxx, freqs = spectrogram(
audio, fft_length=fft_length, sample_rate=sample_rate,
hop_length=hop_length)
ind = np.where(freqs <= max_freq)[0][-1] + 1
return np.transpose(np.log(pxx[:ind, :] + eps))
def log_spectrogram_feature(samples, sample_rate, window_size=20, step_size=10, eps=1e-14):
nperseg = int(round(window_size * sample_rate / 1e3))
noverlap = int(round(step_size * sample_rate / 1e3))
freqs, times, spec = signal.spectrogram(samples,
fs=sample_rate,
window='hann',
nperseg=nperseg,
noverlap=noverlap,
detrend=False)
freqs = (freqs*2)
return freqs, times, np.log(spec.T.astype(np.float64) + eps)
def vis_train_features(index):
# Function for visualizing a single audio file based on index chosen
# Get spectrogram
audio_gen = AudioGenerator(spectrogram=True)
audio_gen.load_train_data()
vis_audio_path = audio_gen.train_audio_paths[index]
vis_spectrogram_feature = audio_gen.normalize(audio_gen.featurize(vis_audio_path))
# Get mfcc
audio_gen = AudioGenerator(spectrogram=False)
audio_gen.load_train_data()
vis_mfcc_feature = audio_gen.normalize(audio_gen.featurize(vis_audio_path))
# Obtain text label
vis_text = audio_gen.train_texts[index]
# Obtain raw audio
sample_rate, samples = wav.read(vis_audio_path)
# Print total number of training examples
print('There are %d total training examples.' % len(audio_gen.train_audio_paths))
# Return labels for plotting
return vis_text, vis_mfcc_feature, vis_spectrogram_feature, vis_audio_path, sample_rate, samples
# Creating visualisations for audio file at index number 2012
vis_text, vis_mfcc_feature, vis_spectrogram_feature, vis_audio_path, sample_rate, samples, = vis_train_features(index=2012)
There are 64220 total training examples.
This method uses the raw waveform of the audio file: a 1D vector of the amplitude over time, X = [x1, x2, x3, ...].
This is used by the Pannous Sequence to Sequence models built in Caffe and TensorFlow.
def plot_raw_audio(sample_rate, samples):
# Plot the raw audio signal
time = np.arange(0, float(samples.shape[0]), 1) / sample_rate
fig = plt.figure(figsize=(12,5))
ax = fig.add_subplot(111)
ax.plot(time, samples, linewidth=1, alpha=0.7, color='#76b900')
plt.title('Raw Audio Signal')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.show()
# Plot the raw audio file
plot_raw_audio(sample_rate, samples)
# Print the transcript corresponding to the audio file
display(Markdown('**Audio File Transcription** : ' + str(vis_text)))
# Play the raw audio file
Audio(vis_audio_path)
Audio File Transcription : in front of the table benches arranged in zigzag form like the circumvallations of a retrenchment formed a succession of bastions and curtains
This is what we will use by default for this project. A spectrogram transforms the raw audio waveform into a 2D tensor (using the Fourier transform) where the first dimension corresponds to time (the horizontal axis) and the second dimension corresponds to frequency (the vertical axis). We lose a little information in this conversion because we keep only the log of the power of the FFT, which can be written as log |FFT(X)|^2. This gives us 161 features, so each feature covers roughly 50 Hz of bandwidth. The full transformation process is documented here.
Spectrograms are used in Baidu's Deep Speech system.
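As a quick sanity check on those numbers (assuming the 16 kHz sample rate and 20 ms analysis window used throughout this notebook):

# Rough check of the spectrogram feature dimensions (assumes 16 kHz audio, 20 ms window)
sample_rate = 16000                               # Hz, LibriSpeech audio
window = 20                                       # ms, AudioGenerator default
fft_length = int(0.001 * window * sample_rate)    # 320 samples per frame
n_features = fft_length // 2 + 1                  # 161 rfft bins, matching calc_feat_dim(20, 8000)
bin_width = (sample_rate / 2) / (n_features - 1)  # 50.0 Hz per frequency bin
print(fft_length, n_features, bin_width)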
def plot_spectrogram_feature(vis_spectrogram_feature):
# Plot a normalized spectrogram
fig = plt.figure(figsize=(12,5))
ax1 = fig.add_subplot(111)
im = ax1.imshow(vis_spectrogram_feature.T, cmap=plt.cm.viridis, aspect='auto', origin='lower')
plt.title('Normalized Log Spectrogram')
plt.ylabel('Frequency')
plt.xlabel('Time (s)')
divider = make_axes_locatable(ax1)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(im, cax=cax)
plt.show()
# Plot the spectrogram for the selected file
plot_spectrogram_feature(vis_spectrogram_feature)
# Print shape of the spectrogram for the selected file
display(Markdown('**Shape of the Spectrogram** : ' + str(vis_spectrogram_feature.shape)))
Shape of the Spectrogram : (904, 161)
With 161 features per frame covering frequencies from 0 Hz up to the 8 kHz Nyquist limit, each feature spans roughly 50 Hz. Human pitch discrimination is around 3.6 Hz, so our hearing is far more precise than what this transformation preserves. This graph looks a little noisy, so let's look at it with a finer grain of detail:
freqs, times, log_spectrogram = log_spectrogram_feature(samples, sample_rate)
mean = np.mean(log_spectrogram, axis=0)
std = np.std(log_spectrogram, axis=0)
log_spectrogram = (log_spectrogram - mean) / std
def plot_log_spectrogram_feature(freqs, times, log_spectrogram):
fig = plt.figure(figsize=(12,5))
ax2 = fig.add_subplot(111)
ax2.imshow(log_spectrogram.T, aspect='auto', origin='lower', cmap=plt.cm.viridis,
extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::20])
ax2.set_xticks(times[::20])
ax2.set_title('Normalized Log Spectrogram')
ax2.set_ylabel('Frequency')
ax2.set_xlabel('Time (s)')
plot_log_spectrogram_feature(freqs, times, log_spectrogram)
data = [go.Surface(z=log_spectrogram.T, colorscale='Viridis')]
layout = go.Layout(
title='3D Spectrogram',
autosize=True,
margin=dict(l=50, r=50, b=50, t=50))
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
data = [go.Contour(z=log_spectrogram.T, colorscale='Viridis')]
layout = go.Layout(
title='Contour Graph',
autosize=True)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
Like the spectrogram, this turns the audio waveform into a 2D array. It works by mapping the powers of the Fourier transform of the signal onto the mel scale and then taking the discrete cosine transform of the log mel powers. This produces a 2D array with reduced dimensions compared to spectrograms, effectively compressing the spectrogram and speeding up training, since we are left with 13 features. The full process for deriving MFCCs from audio is outlined here.
This is used in Mozilla's implementation of Deep Speech in TensorFlow.
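A minimal sketch of that pipeline using the python_speech_features helper imported at the top of this notebook (the file path is a placeholder, not a file from the corpus):

# Compute 13 MFCC features for a single clip; 'example.wav' is a placeholder path
import scipy.io.wavfile as wav
from python_speech_features import mfcc
(rate, sig) = wav.read('example.wav')
mfcc_feat = mfcc(sig, rate, numcep=13)  # shape (n_frames, 13), the same layout AudioGenerator produces
print(mfcc_feat.shape)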
def plot_mfcc_feature(vis_mfcc_feature):
# Plot a normalized MFCC feature
fig = plt.figure(figsize=(12,5))
ax = fig.add_subplot(111)
im = ax.imshow(vis_mfcc_feature, cmap=plt.cm.viridis, aspect='auto')
plt.title('Normalized MFCC')
plt.ylabel('Time')
plt.xlabel('MFCC Coefficient')
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(im, cax=cax)
ax.set_xticks(np.arange(0, 13, 2), minor=False);
plt.show()
# Plot the MFCC of the selected file
plot_mfcc_feature(vis_mfcc_feature)
# Print the shape of the MFCC of the selected file
display(Markdown('**Shape of the MFCC** : ' + str(vis_mfcc_feature.shape)))
Shape of the MFCC : (904, 13)
The two most common tools for automatic speech recognition are Hidden Markov Models (HMMs) and deep neural networks. For this project, the architecture chosen is a recurrent deep neural network (RNN), as it is easy to implement and scales well. The most effective and sophisticated models implement "hybrid" DNN-HMM systems, but this is beyond the scope of this project. While HMMs using weighted finite state transducers are still considered the most powerful speech recognition tools, they were set aside for this program due to their complexity and increased computing requirements. HMMs also require an extensive vocabulary of phonemes and graphemes that could not be produced under the time constraints of this project.
Recurrent neurons are similar to feedforward neurons, except they also have connections pointing backward in time. At each time step, each neuron receives the current input as well as its own output from the previous time step, so each neuron has two sets of weights: one for the input and one for the output of the last time step. Each layer takes vectors as input and outputs a vector. The model works by running forward propagation through each time step t and then backpropagation through each time step. At each time step, the speaker is assumed to have spoken 1 of 29 possible characters (26 letters, 1 space character, 1 apostrophe, and 1 blank/empty character used to pad short files, since inputs vary in length). The model's output at each time step is a list of probabilities for each possible character.
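In equation form, a basic recurrent layer computes its output at time step t from the current input x_t and its previous output h_{t-1}:

h_t = f(W_x x_t + W_h h_{t-1} + b)

where W_x and W_h are the two weight sets described above and f is the activation function; the softmax layer then maps each h_t to a probability distribution over the 29 characters.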
The RNN combines an acoustic model and a language model. The acoustic model scores sequences of acoustic labels over a time frame, and the language model scores sequences of characters. A decoding graph then maps valid acoustic label sequences to the corresponding character sequences. Speech recognition is a path search through the decoding graph, where the score of a path is the sum of the score given to it by the decoding graph and the score given to it by the acoustic model. To put it simply, speech recognition is the process of finding the character sequence that maximizes both the language and acoustic model scores.
In this notebook, I have created several end-to-end RNNs for ASR. I have addressed the common issues with RNNs, exploding and vanishing gradients, through gradient clipping and the use of GRU and LSTM cells respectively. I have also employed batch normalization and recurrent dropout.
For more information on the use of deep learning in speech recognition, read George Dahl's paper.
# Custom CTC loss function (discussed below)
def ctc_lambda_func(args):
y_pred, labels, input_length, label_length = args
return K.ctc_batch_cost(labels, y_pred, input_length, label_length)
def add_ctc_loss(input_to_softmax):
the_labels = Input(name='the_labels', shape=(None,), dtype='float32')
input_lengths = Input(name='input_length', shape=(1,), dtype='int64')
label_lengths = Input(name='label_length', shape=(1,), dtype='int64')
output_lengths = Lambda(input_to_softmax.output_length)(input_lengths)
# CTC loss is implemented in a lambda layer
loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
[input_to_softmax.output, the_labels, output_lengths, label_lengths])
model = Model(
inputs=[input_to_softmax.input, the_labels, input_lengths, label_lengths],
outputs=loss_out)
return model
# Function for modifying CNN layers for sequence problems
def cnn_output_length(input_length, filter_size, border_mode, stride,
dilation=1):
# Compute the length of cnn output seq after 1D convolution across time
if input_length is None:
return None
assert border_mode in {'same', 'valid', 'causal'}
dilated_filter_size = filter_size + (filter_size - 1) * (dilation - 1)
if border_mode == 'same':
output_length = input_length
elif border_mode == 'valid':
output_length = input_length - dilated_filter_size + 1
elif border_mode == 'causal':
output_length = input_length
return (output_length + stride - 1) // stride
The loss function I am using is a custom implementation of Connectionist Temporal Classification (CTC), a sequential objective function that removes some of the modeling burden of cross-entropy, which forces the model to link every frame of input data to a label. CTC's label set includes a "blank" symbol, so if a frame of data doesn't contain an utterance, the system can output "blank" to indicate that there isn't enough information to classify an output. This also allows inputs and outputs of varying length, since short files can be padded with the "blank" character, and lets us model words with a character-level classification system. The function only scores the sequence of labels along a path, ignoring the alignment of the labels to the acoustic data.
More information on CTC can be found in Alex Graves' paper.
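Concretely, CTC sums the probability of every frame-level alignment \pi (blanks and repeats included) that collapses to the target label sequence l, and the loss is the negative log of that sum:

P(l \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} \prod_{t=1}^{T} y_{\pi_t}^{t}, \qquad L_{CTC} = -\log P(l \mid x)

where y_k^t is the softmax output for character k at frame t and \mathcal{B} is the collapsing function that removes repeated characters and blanks. This is the quantity K.ctc_batch_cost computes for each utterance in the batch.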
def train_model(input_to_softmax,
pickle_path,
save_model_path,
train_json='train_corpus.json',
valid_json='valid_corpus.json',
minibatch_size=16, # You will want to change this depending on the GPU you are training on
spectrogram=True,
mfcc_dim=13,
optimizer=Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False, clipnorm=1, clipvalue=.5),
epochs=30, # You will want to change this depending on the model you are training and data you are using
verbose=1,
sort_by_duration=False,
max_duration=10.0):
# Obtain batches of data
audio_gen = AudioGenerator(minibatch_size=minibatch_size,
spectrogram=spectrogram, mfcc_dim=mfcc_dim, max_duration=max_duration,
sort_by_duration=sort_by_duration)
# Load the datasets
audio_gen.load_train_data(train_json)
audio_gen.load_validation_data(valid_json)
# Calculate steps per epoch
num_train_examples=len(audio_gen.train_audio_paths)
steps_per_epoch = num_train_examples//minibatch_size
# Calculate validation steps
num_valid_samples = len(audio_gen.valid_audio_paths)
validation_steps = num_valid_samples//minibatch_size
# Add custom CTC loss function to the nn
model = add_ctc_loss(input_to_softmax)
# Dummy lambda function for loss since CTC loss is implemented above
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=optimizer)
# Make initial results/ directory for saving model pickles
if not os.path.exists('results'):
os.makedirs('results')
# Add callbacks
checkpointer = ModelCheckpoint(filepath='results/'+save_model_path, verbose=0)
terminator = callbacks.TerminateOnNaN()
time_machiner = callbacks.History()
logger = callbacks.CSVLogger('training.log')
stopper = callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
reducer = callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, verbose=0, mode='auto', min_delta=0.0001, cooldown=0, min_lr=0)
tensor_boarder = callbacks.TensorBoard(log_dir='./logs', batch_size=16,
write_graph=True, write_grads=True, write_images=True,)
# Fit/train model
hist = model.fit_generator(generator=audio_gen.next_train(), steps_per_epoch=steps_per_epoch,
epochs=epochs, validation_data=audio_gen.next_valid(), validation_steps=validation_steps,
callbacks=[checkpointer, terminator, logger, time_machiner, tensor_boarder, stopper, reducer], verbose=verbose)
# Save model loss
with open('results/'+pickle_path, 'wb') as f:
pickle.dump(hist.history, f)
The Adam optimizer was chosen as it has momentum and has been shown to work well in speech recognition.
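For reference, Adam keeps exponentially decayed estimates of the gradient's first and second moments and scales each parameter's step by them:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \quad \theta_t = \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)

where \hat{m}_t and \hat{v}_t are the bias-corrected moments. The momentum term m_t helps smooth the noisy per-batch CTC gradients, and the clipnorm/clipvalue arguments passed to Adam above additionally cap the gradient magnitude.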
The preferred method of exporting Keras models for inference is the built-in saver/checkpointer, and this is what the inference engine uses. It relies on h5py, which stores the data in the HDF5 binary format, letting you store large amounts of numerical data and easily manipulate it from NumPy. The models are set to save checkpoints to a .h5 file after each epoch and at the end of training. These are stored in the results/ directory and will be used by the inference engine in the Flask web app.
# Creating a TensorFlow session
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 1.0
set_session(tf.Session(config=config))
This notebook walks through the development of a series of models that get successively more complex. We will train all of the developmental models (models 0-7) for 20 epochs on a 100-hour subset of the data, then test an aggregate model (models 8 and 9) by training for 30 epochs using both spectrograms and MFCCs on a 460-hour subset, before training the final model architecture (model 10) for 30 epochs on the full 960-hour training set.
This model explores a simple RNN with one layer of Gated Recurrent Units, a simplified type of Long Short-Term Memory recurrent neuron with fewer parameters than a typical LSTM. GRUs work via a memory update gate and provide most of the performance of traditional LSTMs at a fraction of the computing cost.
For more information on the use of recurrent neural networks in speech recognition, read Alex Graves' paper.
To learn more about GRUs, you can check out this paper.
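For reference, the GRU's gating (biases omitted) can be written as:

z_t = \sigma(W_z x_t + U_z h_{t-1}), \quad r_t = \sigma(W_r x_t + U_r h_{t-1})

\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1})), \quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

The update gate z_t controls how much of the previous state is carried forward, which gives the GRU most of an LSTM's memory behavior with one fewer gate and fewer parameters.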
def regular_rnn_model(input_dim, output_dim=29):
# Input
input_data = Input(name='the_input', shape=(None, input_dim))
# Recurrent layer
simp_rnn = GRU(output_dim, return_sequences=True,
implementation=2, name='rnn')(input_data)
# Softmax Activation Layer
y_pred = Activation('softmax', name='softmax')(simp_rnn)
# Specifying the model
model = Model(inputs=input_data, outputs=y_pred)
model.output_length = lambda x: x
print(model.summary())
return model
model_0 = regular_rnn_model(input_dim=161) # 161 for Spectrogram/13 for MFCC
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 161)         0
_________________________________________________________________
rnn (GRU)                    (None, None, 29)          16617
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 16,617
Trainable params: 16,617
Non-trainable params: 0
_________________________________________________________________
None
train_model(input_to_softmax=model_0,
pickle_path='model_0.pickle',
save_model_path='model_0.h5',
spectrogram=True,
) # True for Spectrogram/False for MFCC
This model came a long way during training, but its performance is only alright, so let's see if we can improve it with a more complex model.
This model adds a layer of ordinary Dense neurons applied to every temporal slice of the input. It also uses batch normalization, which normalizes layer activations to have a mean close to 0 and a standard deviation close to 1.
This model uses LSTMs. These cells include forget and output gates, which allow finer control over the cell's memory by separately controlling what is forgotten and what is passed through to the next hidden layer of cells. This will also make it easier to implement 'peepholes' later, which let the cell look at both the previous output state and hidden state when making this determination.
More information on the use of LSTMs in speech recognition can be found in this paper from Google, and for more on TimeDistributed layers, check out the Keras documentation.
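For comparison with the GRU equations above, the LSTM cell (biases omitted) maintains a separate cell state c_t guarded by forget, input, and output gates:

f_t = \sigma(W_f x_t + U_f h_{t-1}), \quad i_t = \sigma(W_i x_t + U_i h_{t-1}), \quad o_t = \sigma(W_o x_t + U_o h_{t-1})

c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1}), \quad h_t = o_t \odot \tanh(c_t)

The forget gate f_t decides what to erase from the cell state and the output gate o_t decides how much of it to expose to the next layer, which is the extra control described above.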
def rnn_tdd_model(input_dim, units, activation, output_dim=29):
# Input
input_data = Input(name='the_input', shape=(None, input_dim))
# Recurrent layer
simp_rnn = LSTM(units, activation=activation,
return_sequences=True, implementation=2, name='rnn')(input_data)
bn_rnn = BatchNormalization()(simp_rnn)
# TimeDistributed Dense layer
time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
# Softmax activation layer
y_pred = Activation('softmax', name='softmax')(time_dense)
# Specifying the model
model = Model(inputs=input_data, outputs=y_pred)
model.output_length = lambda x: x
print(model.summary())
return model
model_1 = rnn_tdd_model(input_dim=161, # 161 for Spectrogram/13 for MFCC
units=200,
activation='relu')
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 161)         0
_________________________________________________________________
rnn (LSTM)                   (None, None, 200)         289600
_________________________________________________________________
batch_normalization_1 (Batch (None, None, 200)         800
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 29)          5829
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 296,229
Trainable params: 295,829
Non-trainable params: 400
_________________________________________________________________
None
train_model(input_to_softmax=model_1,
pickle_path='model_1.pickle',
save_model_path='model_1.h5',
spectrogram=True) # True for Spectrogram/False for MFCC
This model did significantly better, so let's see if deepening the network can improve our scores.
This model explores the addition of a Convolutional Neural Network to the RNN.
def cnn_rnn_td_model(input_dim, filters, activation, kernel_size, conv_stride,
conv_border_mode, units, output_dim=29):
# Input
input_data = Input(name='the_input', shape=(None, input_dim))
# Convolutional layer
conv_1d = Conv1D(filters, kernel_size,
strides=conv_stride,
padding=conv_border_mode,
activation=activation,
name='conv1d')(input_data)
# Batch normalization
bn_cnn = BatchNormalization(name='bn_conv1d')(conv_1d)
# Recurrent layer
simp_rnn = GRU(units, activation=activation,
return_sequences=True, implementation=2, name='rnn')(bn_cnn)
# Batch Normalization
bn_rnn = BatchNormalization()(simp_rnn)
# TimeDistributed Dense layer
time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
# Softmax activation layer
y_pred = Activation('softmax', name='softmax')(time_dense)
# Specifying the model
model = Model(inputs=input_data, outputs=y_pred)
model.output_length = lambda x: cnn_output_length(
x, kernel_size, conv_border_mode, conv_stride)
print(model.summary())
return model
model_2 = cnn_rnn_td_model(input_dim=161, # 161 for Spectrogram/13 for MFCC
filters=200,
kernel_size=11,
conv_stride=2,
conv_border_mode='valid',
activation='relu',
units=200)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 161)         0
_________________________________________________________________
conv1d (Conv1D)              (None, None, 200)         354400
_________________________________________________________________
bn_conv1d (BatchNormalizatio (None, None, 200)         800
_________________________________________________________________
rnn (GRU)                    (None, None, 200)         240600
_________________________________________________________________
batch_normalization_2 (Batch (None, None, 200)         800
_________________________________________________________________
time_distributed_2 (TimeDist (None, None, 29)          5829
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 602,429
Trainable params: 601,629
Non-trainable params: 800
_________________________________________________________________
None
train_model(input_to_softmax=model_2,
pickle_path='model_2.pickle',
save_model_path='model_2.h5',
spectrogram=True) # True for Spectrogram/False for MFCC
Adding a convolution layer greatly improved our score, but what about adding another RNN layer?
This model explores deepening of the network with additional recurrent layers.
def deep_rnn_tdd_model(input_dim, units, recur_layers, activation, output_dim=29):
# Input
input_data = Input(name='the_input', shape=(None, input_dim))
# 1st Recurrent layer
simp_rnn = GRU(units, activation=activation,
return_sequences=True, implementation=2, name='rnn_0')(input_data)
# Batch normalization
bn_rnn = BatchNormalization()(simp_rnn)
# Loop for additional layers
for i in range(recur_layers - 1):
name = 'rnn_' + str(i + 1)
simp_rnn = GRU(units, activation=activation,
return_sequences=True, implementation=2, name=name)(bn_rnn)
bn_rnn = BatchNormalization()(simp_rnn)
# TimeDistributed Dense layer
time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
# Softmax activation layer
y_pred = Activation('softmax', name='softmax')(time_dense)
# Specifying the model
model = Model(inputs=input_data, outputs=y_pred)
model.output_length = lambda x: x
print(model.summary())
return model
model_3 = deep_rnn_tdd_model(input_dim=161, units=200, recur_layers=2, activation='relu') # 161 for Spectrogram/13 for MFCC
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 161)         0
_________________________________________________________________
rnn_0 (GRU)                  (None, None, 200)         217200
_________________________________________________________________
batch_normalization_3 (Batch (None, None, 200)         800
_________________________________________________________________
rnn_1 (GRU)                  (None, None, 200)         240600
_________________________________________________________________
batch_normalization_4 (Batch (None, None, 200)         800
_________________________________________________________________
time_distributed_3 (TimeDist (None, None, 29)          5829
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 465,229
Trainable params: 464,429
Non-trainable params: 800
_________________________________________________________________
None
train_model(input_to_softmax=model_3,
pickle_path='model_3.pickle',
save_model_path='model_3.h5',
spectrogram=True) # True for Spectrogram/False for MFCC
This one did pretty well, but didn't perform quite as well as the convolution layer.
This model connects two hidden layers of opposite directions to the same output, making future input information reachable from the current state. To put it simply, it creates two layers of neurons: one that processes the sequence forward in time and one that processes it backward. This allows the output layer to draw on information from both past and future states, meaning it has knowledge of the letters located before and after the current utterance. This can lead to large improvements in accuracy but comes at the cost of increased latency.
Inspiration for bidirectional layers came from this paper.
Note: The original implementation of this model ran into the problem of exploding gradients (which can be recognized by your loss being nan) and clipnorm=1 was added to the Adam optimizer above to clip the gradients and address this issue. This blog post gives a great overview of the various approaches for dealing with exploding gradients.
def brnn_tdd_model(input_dim, units, activation, output_dim=29):
# Input
input_data = Input(name='the_input', shape=(None, input_dim))
# Bidirectional recurrent layer
brnn = Bidirectional(LSTM(units, activation=activation,
return_sequences=True, implementation=2, name='brnn'))(input_data)
# TimeDistributed Dense layer
time_dense = TimeDistributed(Dense(output_dim))(brnn)
# Softmax activation layer
y_pred = Activation('softmax', name='softmax')(time_dense)
# Specifying the model
model = Model(inputs=input_data, outputs=y_pred)
model.output_length = lambda x: x
print(model.summary())
return model
model_4 = brnn_tdd_model(input_dim=161, units=200, activation='relu') # 161 for Spectrogram/13 for MFCC
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 161)         0
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 400)         579200
_________________________________________________________________
time_distributed_4 (TimeDist (None, None, 29)          11629
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 590,829
Trainable params: 590,829
Non-trainable params: 0
_________________________________________________________________
None
train_model(input_to_softmax=model_4,
pickle_path='model_4.pickle',
save_model_path='model_4.h5',
spectrogram=True) # True for Spectrogram/False for MFCC
This also led to some improvement in the model, so let's see if we can combine these techniques for increased performance.
This model combines all of the ideas from the preceding models.
def cnn_deep_brnn_tdd_model(input_dim, filters, activation, kernel_size, conv_stride,
conv_border_mode, recur_layers, units, output_dim=29):
# Input
input_data = Input(name='the_input', shape=(None, input_dim))
# Convolutional layer
conv_1d = Conv1D(filters, kernel_size,
strides=conv_stride,
padding=conv_border_mode,
activation=activation,
name='conv1d')(input_data)
# Batch normalization
bn_cnn = BatchNormalization()(conv_1d)
# Bidirectional recurrent layer
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, name='brnn'))(bn_cnn)
# Batch normalization
bn_rnn = BatchNormalization()(brnn)
# Loop for additional layers
for i in range(recur_layers - 1):
name = 'brnn_' + str(i + 1)
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, implementation=2, name=name))(bn_rnn)
bn_rnn = BatchNormalization()(brnn)
# TimeDistributed Dense layer
time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
# Softmax activation layer
y_pred = Activation('softmax', name='softmax')(time_dense)
# Specifying the model
model = Model(inputs=input_data, outputs=y_pred)
model.output_length = lambda x: cnn_output_length(
x, kernel_size, conv_border_mode, conv_stride)
print(model.summary())
return model
model_5 = cnn_deep_brnn_tdd_model(input_dim=161, # 161 for Spectrogram/13 for MFCC
filters=200,
activation='relu',
kernel_size=11,
conv_stride=2,
conv_border_mode='valid',
recur_layers=2,
units=200)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 161)         0
_________________________________________________________________
conv1d (Conv1D)              (None, None, 200)         354400
_________________________________________________________________
batch_normalization_5 (Batch (None, None, 200)         800
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 400)         481200
_________________________________________________________________
batch_normalization_6 (Batch (None, None, 400)         1600
_________________________________________________________________
bidirectional_3 (Bidirection (None, None, 400)         721200
_________________________________________________________________
batch_normalization_7 (Batch (None, None, 400)         1600
_________________________________________________________________
time_distributed_5 (TimeDist (None, None, 29)          11629
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 1,572,429
Trainable params: 1,570,429
Non-trainable params: 2,000
_________________________________________________________________
None
train_model(input_to_softmax=model_5,
pickle_path='model_5.pickle',
save_model_path='model_5.h5',
spectrogram=True) # True for Spectrogram/False for MFCC
This ASR program scored very well, so let's see if we can squeeze a little more out of our model using a few tricks.
This model adds randomized dropout on the recurrent connections of the aggregate model to prevent it from overfitting.
Note: The recurrent dropout rate is 1%; anything larger led to exploding gradients. Because of this, the idea won't be pursued any further, but one proposed solution can be found in this paper and an example of it in action can be found in this paper.
def cnn_deep_brnn_dropout_model(input_dim, filters, activation, kernel_size, conv_stride,
conv_border_mode, recur_layers, units, output_dim=29):
# Input
input_data = Input(name='the_input', shape=(None, input_dim))
# Convolutional layer
conv_1d = Conv1D(filters, kernel_size,
strides=conv_stride,
padding=conv_border_mode,
activation=activation,
name='conv1d')(input_data)
# Batch normalization
bn_cnn = BatchNormalization()(conv_1d)
# Bidirectional recurrent layer
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, implementation=2, recurrent_dropout=0.01, name='brnn'))(bn_cnn)
# Batch normalization
bn_rnn = BatchNormalization()(brnn)
# Loop for additional layers
for i in range(recur_layers - 1):
name = 'brnn_' + str(i + 1)
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, implementation=2, name=name))(bn_rnn)
bn_rnn = BatchNormalization()(brnn)
# TimeDistributed Dense layer
time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
# Softmax activation layer
y_pred = Activation('softmax', name='softmax')(time_dense)
# Specifying the model
model = Model(inputs=input_data, outputs=y_pred)
model.output_length = lambda x: cnn_output_length(
x, kernel_size, conv_border_mode, conv_stride)
print(model.summary())
return model
model_6 = cnn_deep_brnn_dropout_model(input_dim=161, # 161 for Spectrogram/13 for MFCC
filters=200,
activation='relu',
kernel_size=11,
conv_stride=2,
conv_border_mode='valid',
recur_layers=2,
units=200)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 161)         0
_________________________________________________________________
conv1d (Conv1D)              (None, None, 200)         354400
_________________________________________________________________
batch_normalization_8 (Batch (None, None, 200)         800
_________________________________________________________________
bidirectional_4 (Bidirection (None, None, 400)         481200
_________________________________________________________________
batch_normalization_9 (Batch (None, None, 400)         1600
_________________________________________________________________
bidirectional_5 (Bidirection (None, None, 400)         721200
_________________________________________________________________
batch_normalization_10 (Batc (None, None, 400)         1600
_________________________________________________________________
time_distributed_6 (TimeDist (None, None, 29)          11629
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 1,572,429
Trainable params: 1,570,429
Non-trainable params: 2,000
_________________________________________________________________
None
train_model(input_to_softmax=model_6,
pickle_path='model_6.pickle',
save_model_path='model_6.h5',
spectrogram=True) # True for Spectrogram/False for MFCC
Despite the small fraction of connections dropped, this still led to improved performance.
This model adds dilated convolutions to the aggregate model. Dilation introduces gaps between the taps of the CNN's kernels, so the receptive field covers a wider span of the input rather than one contiguous window. This lets the convolutional layer pick up on more global context while still having only as many weights per layer as the standard form.
Inspiration for this technique came from IBM's Watson Team.
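As a rough illustration of that effect (the four-layer stack here is only illustrative; the model below uses fewer dilated layers), with the kernel size of 11 used in this notebook the receptive field grows quickly as the dilation rate doubles, while each layer keeps the same number of weights:

# Receptive field of stacked dilated causal 1D convolutions (illustrative layer count)
kernel_size = 11
receptive_field = 1
for i, dilation in enumerate([1, 2, 4, 8]):          # dilation_rate = 2**i, as in the loop below
    receptive_field += (kernel_size - 1) * dilation  # each layer adds (k - 1) * dilation frames of context
    print('after layer', i + 1, ': receptive field =', receptive_field, 'frames')
# prints 11, 31, 71, and 151 frames after layers 1 through 4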
def cnn_deep_brnn_dilated_model(input_dim, filters, activation, kernel_size, conv_stride,
conv_border_mode, recur_layers, dilation_rate, units, conv_layers, output_dim=29):
input_data = Input(name='the_input', shape=(None, input_dim))
conv_1d = Conv1D(filters, kernel_size,
strides=conv_stride,
padding=conv_border_mode,
activation=activation,
name='conv1d')(input_data)
# Batch normalization
bn_cnn = BatchNormalization()(conv_1d)
for i in range(conv_layers - 1):
conv_1d = Conv1D(filters, kernel_size,
padding='causal',
activation='relu',
dilation_rate=2**i,
name="conv_1d_"+str(i))(bn_cnn)
bn_cnn = BatchNormalization()(conv_1d)
# Bidirectional recurrent layer
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, implementation=2, name='brnn'))(bn_cnn)
# Batch normalization
bn_rnn = BatchNormalization()(brnn)
# Loop for additional layers
for i in range(recur_layers - 1):
name = 'brnn_' + str(i + 1)
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, implementation=2, name=name))(bn_rnn)
bn_rnn = BatchNormalization()(brnn)
time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
y_pred = Activation('softmax', name='softmax')(time_dense)
model = Model(inputs=input_data, outputs=y_pred)
model.output_length = lambda x: cnn_output_length(
x, kernel_size, 'causal', 1)
print(model.summary())
return model
model_7 = cnn_deep_brnn_dilated_model(input_dim=161, # 161 for Spectrogram/13 for MFCC
filters=200,
activation='relu',
kernel_size=11,
conv_stride=1,
conv_border_mode='causal',
recur_layers=2,
conv_layers=2,
dilation_rate=2,
units=200)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 161)         0
_________________________________________________________________
conv1d (Conv1D)              (None, None, 200)         354400
_________________________________________________________________
batch_normalization_11 (Batc (None, None, 200)         800
_________________________________________________________________
conv_1d_0 (Conv1D)           (None, None, 200)         440200
_________________________________________________________________
batch_normalization_12 (Batc (None, None, 200)         800
_________________________________________________________________
bidirectional_6 (Bidirection (None, None, 400)         481200
_________________________________________________________________
batch_normalization_13 (Batc (None, None, 400)         1600
_________________________________________________________________
bidirectional_7 (Bidirection (None, None, 400)         721200
_________________________________________________________________
batch_normalization_14 (Batc (None, None, 400)         1600
_________________________________________________________________
time_distributed_7 (TimeDist (None, None, 29)          11629
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 2,013,429
Trainable params: 2,011,029
Non-trainable params: 2,400
_________________________________________________________________
None
train_model(input_to_softmax=model_7,
pickle_path='model_7.pickle',
save_model_path='model_7.h5',
spectrogram=True) # True for Spectrogram/False for MFCC
This model also does pretty well (at the cost of increased training time) so let's see if we can combine all of these ideas into one very deep and very powerful speech recognition platform.
The aggregate Keras model is a fine-tuned implementation of model_5 (CNN + Deep BRNN + TDD). It consists of 1 convolutional layer, 2 GRU layers, and 1 Time Distributed Dense layer. The convolutional layer handles feature/pattern extraction, while the RNN layers develop predictions on those features. This model doesn't use dropout or dilated convolutions, as both led to gradient explosions in tests. We have also increased the number of neurons in each layer. The model is trained for 30 epochs on a 460 hour subset of the data, and it is the model deployed in the heyjetson.com project.
Inspiration for the aggregate architecture came from Baidu's Deep Speech 2 engine.
def agg_model(input_dim, filters, activation, kernel_size, conv_stride,
conv_border_mode, recur_layers, units, output_dim=29):
# Input
input_data = Input(name='the_input', shape=(None, input_dim))
# Convolutional layer
conv_1d = Conv1D(filters, kernel_size,
strides=conv_stride,
padding=conv_border_mode,
activation=activation,
name='conv1d')(input_data)
# Batch normalization
bn_cnn = BatchNormalization()(conv_1d)
# Bidirectional recurrent layer
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, name='brnn'))(bn_cnn)
# Batch normalization
bn_rnn = BatchNormalization()(brnn)
# Loop for additional layers
for i in range(recur_layers - 1):
name = 'brnn_' + str(i + 1)
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, implementation=2, name=name))(bn_rnn)
bn_rnn = BatchNormalization()(brnn)
# TimeDistributed Dense layer
time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
# Softmax activation layer
y_pred = Activation('softmax', name='softmax')(time_dense)
# Specifying the model
model = Model(inputs=input_data, outputs=y_pred)
model.output_length = lambda x: cnn_output_length(
x, kernel_size, conv_border_mode, conv_stride)
print(model.summary())
return model
model_8 = agg_model(input_dim=161, # 161 for Spectrogram/13 for MFCC
filters=256,
activation='relu',
kernel_size=11,
conv_stride=2,
conv_border_mode='valid',
recur_layers=2,
units=256)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 161)         0
conv1d (Conv1D)              (None, None, 256)         453632
batch_normalization_15 (Batc (None, None, 256)         1024
bidirectional_8 (Bidirection (None, None, 512)         787968
batch_normalization_16 (Batc (None, None, 512)         2048
bidirectional_9 (Bidirection (None, None, 512)         1181184
batch_normalization_17 (Batc (None, None, 512)         2048
time_distributed_8 (TimeDist (None, None, 29)          14877
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 2,442,781
Trainable params: 2,440,221
Non-trainable params: 2,560
_________________________________________________________________
None
train_model(input_to_softmax=model_8,
pickle_path='model_8.pickle',
save_model_path='model_8.h5',
spectrogram=True) # True for Spectrogram/False for MFCC
Let's train this model using MFCCs just to see if there is a difference in performance:
model_9 = agg_model(input_dim=13, # 161 for Spectrogram/13 for MFCC
filters=256,
activation='relu',
kernel_size=11,
conv_stride=2,
conv_border_mode='valid',
recur_layers=2,
units=256)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 13)          0
conv1d (Conv1D)              (None, None, 256)         36864
batch_normalization_18 (Batc (None, None, 256)         1024
bidirectional_10 (Bidirectio (None, None, 512)         787968
batch_normalization_19 (Batc (None, None, 512)         2048
bidirectional_11 (Bidirectio (None, None, 512)         1181184
batch_normalization_20 (Batc (None, None, 512)         2048
time_distributed_9 (TimeDist (None, None, 29)          14877
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 2,026,013
Trainable params: 2,023,453
Non-trainable params: 2,560
_________________________________________________________________
None
train_model(input_to_softmax=model_9,
pickle_path='model_9.pickle',
save_model_path='model_9.h5',
spectrogram=False) # True for Spectrogram/False for MFCC
Looks like using MFCCs led to a model that didn't perform quite as well, but it did come with a training speed up.
Now we will train a deeper version of this model architecture on the full 960 hour data set. For this, we will reintroduce dilated convolutions and recurrent dropout.
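Dilated convolutions widen the temporal context each filter sees without adding parameters. As a rough sanity check (a sketch that ignores strides), this snippet computes the receptive field of a stack of 1D convolutions; kernel size 5 with dilation rates 1, 1, 2 mirrors the configuration passed to the model below.
def conv_stack_receptive_field(kernel_size, dilation_rates):
    # Each layer adds (kernel_size - 1) * dilation time steps of context
    return 1 + sum((kernel_size - 1) * d for d in dilation_rates)
# First conv plus two dilated convs with rates 2**0 and 2**1
print(conv_stack_receptive_field(5, [1, 1, 2]))  # 17 input frames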
def hey_jetson(input_dim, filters, activation, kernel_size, conv_stride,
conv_border_mode, recur_layers, dilation_rate, units, conv_layers, output_dim=29):
# Input
input_data = Input(name='the_input', shape=(None, input_dim))
# Convolutional layer
conv_1d = Conv1D(filters, kernel_size,
strides=conv_stride,
padding=conv_border_mode,
activation=activation,
name='conv1d')(input_data)
# Batch normalization
bn_cnn = BatchNormalization()(conv_1d)
for i in range(conv_layers - 1):
conv_1d = Conv1D(filters, kernel_size,
padding=conv_border_mode,
activation=activation,
dilation_rate=2**i,
name="conv_1d_"+str(i))(bn_cnn)
bn_cnn = BatchNormalization()(conv_1d)
# Bidirectional recurrent layer
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, implementation=2, recurrent_dropout=0.01, name='brnn'))(bn_cnn)
# Batch normalization
bn_rnn = BatchNormalization()(brnn)
# Loop for additional layers
for i in range(recur_layers - 1):
name = 'brnn_' + str(i + 1)
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, implementation=2, name=name))(bn_rnn)
bn_rnn = BatchNormalization()(brnn)
# TimeDistributed Dense layer
time_distributed_dense = TimeDistributed(Dense(1024))(bn_rnn)
time_dense = TimeDistributed(Dense(output_dim))(time_distributed_dense)
# Softmax activation layer
y_pred = Activation('softmax', name='softmax')(time_dense)
# Specifying the model
model = Model(inputs=input_data, outputs=y_pred)
model.output_length = lambda x: cnn_output_length(
x, kernel_size, conv_border_mode, conv_stride)
print(model.summary())
return model
model_10 = hey_jetson(input_dim=161, # 161 for Spectrogram/13 for MFCC
filters=256,
activation='relu',
kernel_size=5,
conv_stride=2,
recur_layers=7,
conv_border_mode='causal',
conv_layers=3,
dilation_rate=2,
units=256)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 161)         0
conv1d (Conv1D)              (None, None, 256)         206336
batch_normalization_1 (Batch (None, None, 256)         1024
conv_1d_0 (Conv1D)           (None, None, 256)         327936
batch_normalization_2 (Batch (None, None, 256)         1024
conv_1d_1 (Conv1D)           (None, None, 256)         327936
batch_normalization_3 (Batch (None, None, 256)         1024
bidirectional_1 (Bidirection (None, None, 512)         787968
batch_normalization_4 (Batch (None, None, 512)         2048
bidirectional_2 (Bidirection (None, None, 512)         1181184
batch_normalization_5 (Batch (None, None, 512)         2048
bidirectional_3 (Bidirection (None, None, 512)         1181184
batch_normalization_6 (Batch (None, None, 512)         2048
bidirectional_4 (Bidirection (None, None, 512)         1181184
batch_normalization_7 (Batch (None, None, 512)         2048
bidirectional_5 (Bidirection (None, None, 512)         1181184
batch_normalization_8 (Batch (None, None, 512)         2048
bidirectional_6 (Bidirection (None, None, 512)         1181184
batch_normalization_9 (Batch (None, None, 512)         2048
bidirectional_7 (Bidirection (None, None, 512)         1181184
batch_normalization_10 (Batc (None, None, 512)         2048
time_distributed_1 (TimeDist (None, None, 1024)        525312
time_distributed_2 (TimeDist (None, None, 29)          29725
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 9,309,725
Trainable params: 9,301,021
Non-trainable params: 8,704
_________________________________________________________________
None
train_model(input_to_softmax=model_10,
pickle_path='model_10.pickle',
save_model_path='model_10.h5',
spectrogram=True) # True for Spectrogram/False for MFCC
Now, we will add a single Attention layer that takes as input the output from the encoder (the RNN layers). The decoder portion of the model can then "attend" to different parts of the audio clip at each time step, so the model learns what to focus on based on the input and on what it has predicted so far. Attention lets the network refer back to the input sequence by giving it access to its internal memory: the hidden states of the encoder.
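Before wading through the full layer definition, here is a minimal NumPy sketch of the multiplicative (dot-product) self-attention computation that the class below uses by default; the batch size, sequence length, and feature dimension are made-up toy values.
import numpy as np
# Toy encoder output: 1 utterance, 4 time steps, 3 hidden features (hypothetical shapes)
h = np.random.rand(1, 4, 3)
# Learned weight matrix for multiplicative attention (randomly initialized here)
Wa = np.random.rand(3, 3)
# e_{t, t'} = h_t^T Wa h_{t'}: relevance of every time step to every other time step
e = np.einsum('btf,fg,bsg->bts', h, Wa, h)
# Softmax over the last axis turns scores into attention weights a_{t, t'}
a = np.exp(e - e.max(axis=-1, keepdims=True))
a = a / a.sum(axis=-1, keepdims=True)
# Context vectors: weighted sums of the encoder states the decoder "attends" to
context = np.einsum('bts,bsf->btf', a, h)
print(a.shape, context.shape)  # (1, 4, 4) (1, 4, 3)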
Let's define the class that will make up the Attention layer.
class Attention(keras.layers.Layer):
ATTENTION_TYPE_ADD = 'additive'
ATTENTION_TYPE_MUL = 'multiplicative'
def __init__(self,
units=512,
attention_width=None,
attention_type=ATTENTION_TYPE_MUL,
return_attention=False,
history_only=False,
kernel_initializer='glorot_normal',
bias_initializer='zeros',
kernel_regularizer=None,
bias_regularizer=None,
kernel_constraint=None,
bias_constraint=None,
use_additive_bias=True,
use_attention_bias=True,
attention_activation=None,
attention_regularizer_weight=0.0,
**kwargs):
"""Layer initialization.
For additive attention, see: https://arxiv.org/pdf/1806.01264.pdf
:param units: The dimension of the vectors used to calculate the attention weights.
:param attention_width: The width of local attention.
:param attention_type: 'additive' or 'multiplicative'.
:param return_attention: Whether to return the attention weights for visualization.
:param history_only: Only use historical pieces of data.
:param kernel_initializer: The initializer for weight matrices.
:param bias_initializer: The initializer for biases.
:param kernel_regularizer: The regularization for weight matrices.
:param bias_regularizer: The regularization for biases.
:param kernel_constraint: The constraint for weight matrices.
:param bias_constraint: The constraint for biases.
:param use_additive_bias: Whether to use bias while calculating the relevance of input features in additive mode.
:param use_attention_bias: Whether to use bias while calculating the weights of attention.
:param attention_activation: The activation used for calculating the weights of attention.
:param attention_regularizer_weight: The weights of attention regularizer.
:param kwargs: Parameters for parent class.
"""
super(Attention, self).__init__(**kwargs)
self.supports_masking = True
self.units = units
self.attention_width = attention_width
self.attention_type = attention_type
self.return_attention = return_attention
self.history_only = history_only
if history_only and attention_width is None:
self.attention_width = int(1e9)
self.use_additive_bias = use_additive_bias
self.use_attention_bias = use_attention_bias
self.kernel_initializer = keras.initializers.get(kernel_initializer)
self.bias_initializer = keras.initializers.get(bias_initializer)
self.kernel_regularizer = keras.regularizers.get(kernel_regularizer)
self.bias_regularizer = keras.regularizers.get(bias_regularizer)
self.kernel_constraint = keras.constraints.get(kernel_constraint)
self.bias_constraint = keras.constraints.get(bias_constraint)
self.attention_activation = keras.activations.get(attention_activation)
self.attention_regularizer_weight = attention_regularizer_weight
self._backend = keras.backend.backend()
if attention_type == Attention.ATTENTION_TYPE_ADD:
self.Wx, self.Wt, self.bh = None, None, None
self.Wa, self.ba = None, None
elif attention_type == Attention.ATTENTION_TYPE_MUL:
self.Wa, self.ba = None, None
else:
raise NotImplementedError('No implementation for attention type : ' + attention_type)
def get_config(self):
config = {
'units': self.units,
'attention_width': self.attention_width,
'attention_type': self.attention_type,
'return_attention': self.return_attention,
'history_only': self.history_only,
'use_additive_bias': self.use_additive_bias,
'use_attention_bias': self.use_attention_bias,
'kernel_initializer': keras.initializers.serialize(self.kernel_initializer),
'bias_initializer': keras.initializers.serialize(self.bias_initializer),
'kernel_regularizer': keras.regularizers.serialize(self.kernel_regularizer),
'bias_regularizer': keras.regularizers.serialize(self.bias_regularizer),
'kernel_constraint': keras.constraints.serialize(self.kernel_constraint),
'bias_constraint': keras.constraints.serialize(self.bias_constraint),
'attention_activation': keras.activations.serialize(self.attention_activation),
'attention_regularizer_weight': self.attention_regularizer_weight,
}
base_config = super(Attention, self).get_config()
return dict(list(base_config.items()) + list(config.items()))
def build(self, input_shape):
if self.attention_type == Attention.ATTENTION_TYPE_ADD:
self._build_additive_attention(input_shape)
elif self.attention_type == Attention.ATTENTION_TYPE_MUL:
self._build_multiplicative_attention(input_shape)
super(Attention, self).build(input_shape)
def _build_additive_attention(self, input_shape):
feature_dim = int(input_shape[2])
self.Wt = self.add_weight(shape=(feature_dim, self.units),
name='{}_Add_Wt'.format(self.name),
initializer=self.kernel_initializer,
regularizer=self.kernel_regularizer,
constraint=self.kernel_constraint)
self.Wx = self.add_weight(shape=(feature_dim, self.units),
name='{}_Add_Wx'.format(self.name),
initializer=self.kernel_initializer,
regularizer=self.kernel_regularizer,
constraint=self.kernel_constraint)
if self.use_additive_bias:
self.bh = self.add_weight(shape=(self.units,),
name='{}_Add_bh'.format(self.name),
initializer=self.bias_initializer,
regularizer=self.bias_regularizer,
constraint=self.bias_constraint)
self.Wa = self.add_weight(shape=(self.units, 1),
name='{}_Add_Wa'.format(self.name),
initializer=self.kernel_initializer,
regularizer=self.kernel_regularizer,
constraint=self.kernel_constraint)
if self.use_attention_bias:
self.ba = self.add_weight(shape=(1,),
name='{}_Add_ba'.format(self.name),
initializer=self.bias_initializer,
regularizer=self.bias_regularizer,
constraint=self.bias_constraint)
def _build_multiplicative_attention(self, input_shape):
feature_dim = int(input_shape[2])
self.Wa = self.add_weight(shape=(feature_dim, feature_dim),
name='{}_Mul_Wa'.format(self.name),
initializer=self.kernel_initializer,
regularizer=self.kernel_regularizer,
constraint=self.kernel_constraint)
if self.use_attention_bias:
self.ba = self.add_weight(shape=(1,),
name='{}_Mul_ba'.format(self.name),
initializer=self.bias_initializer,
regularizer=self.bias_regularizer,
constraint=self.bias_constraint)
def call(self, inputs, mask=None, **kwargs):
input_len = K.shape(inputs)[1]
if self.attention_type == Attention.ATTENTION_TYPE_ADD:
e = self._call_additive_emission(inputs)
elif self.attention_type == Attention.ATTENTION_TYPE_MUL:
e = self._call_multiplicative_emission(inputs)
if self.attention_activation is not None:
e = self.attention_activation(e)
e = K.exp(e - K.max(e, axis=-1, keepdims=True))
if self.attention_width is not None:
if self.history_only:
lower = K.arange(0, input_len) - (self.attention_width - 1)
else:
lower = K.arange(0, input_len) - self.attention_width // 2
lower = K.expand_dims(lower, axis=-1)
upper = lower + self.attention_width
indices = K.expand_dims(K.arange(0, input_len), axis=0)
e = e * K.cast(lower <= indices, K.floatx()) * K.cast(indices < upper, K.floatx())
if mask is not None:
mask = K.cast(mask, K.floatx())
mask = K.expand_dims(mask)
e = K.permute_dimensions(K.permute_dimensions(e * mask, (0, 2, 1)) * mask, (0, 2, 1))
# a_{t} = \text{softmax}(e_t)
s = K.sum(e, axis=-1, keepdims=True)
a = e / (s + K.epsilon())
# l_t = \sum_{t'} a_{t, t'} x_{t'}
v = K.batch_dot(a, inputs)
if self.attention_regularizer_weight > 0.0:
self.add_loss(self._attention_regularizer(a))
if self.return_attention:
return [v, a]
return v
def _call_additive_emission(self, inputs):
input_shape = K.shape(inputs)
batch_size, input_len = input_shape[0], input_shape[1]
# h_{t, t'} = \tanh(x_t^T W_t + x_{t'}^T W_x + b_h)
q = K.expand_dims(K.dot(inputs, self.Wt), 2)
k = K.expand_dims(K.dot(inputs, self.Wx), 1)
if self.use_additive_bias:
h = K.tanh(q + k + self.bh)
else:
h = K.tanh(q + k)
# e_{t, t'} = W_a h_{t, t'} + b_a
if self.use_attention_bias:
e = K.reshape(K.dot(h, self.Wa) + self.ba, (batch_size, input_len, input_len))
else:
e = K.reshape(K.dot(h, self.Wa), (batch_size, input_len, input_len))
return e
def _call_multiplicative_emission(self, inputs):
# e_{t, t'} = x_t^T W_a x_{t'} + b_a
e = K.batch_dot(K.dot(inputs, self.Wa), K.permute_dimensions(inputs, (0, 2, 1)))
if self.use_attention_bias:
e += self.ba[0]
return e
def compute_output_shape(self, input_shape):
output_shape = input_shape
if self.return_attention:
attention_shape = (input_shape[0], output_shape[1], input_shape[1])
return [output_shape, attention_shape]
return output_shape
def compute_mask(self, inputs, mask=None):
if self.return_attention:
return [mask, None]
return mask
def _attention_regularizer(self, attention):
batch_size = K.cast(K.shape(attention)[0], K.floatx())
input_len = K.shape(attention)[-1]
indices = K.expand_dims(K.arange(0, input_len), axis=0)
diagonal = K.expand_dims(K.arange(0, input_len), axis=-1)
eye = K.cast(K.equal(indices, diagonal), K.floatx())
return self.attention_regularizer_weight * K.sum(K.square(K.batch_dot(
attention,
K.permute_dimensions(attention, (0, 2, 1))) - eye)) / batch_size
Now we will train the final attention based model architecture on the full 960 hour data set.
def keras_model(input_dim, filters, activation, kernel_size, conv_stride,
conv_border_mode, recur_layers, dilation_rate, units, conv_layers, output_dim=29):
# Input
input_data = Input(name='the_input', shape=(None, input_dim))
# Inital Convolutional layer
conv_1d = Conv1D(filters, kernel_size,
strides=conv_stride,
padding=conv_border_mode,
activation=activation,
name='conv1d')(input_data)
# Batch normalization
bn_cnn = BatchNormalization()(conv_1d)
# Loop for additional layers
for i in range(conv_layers - 1):
conv_1d = Conv1D(filters, kernel_size,
padding=conv_border_mode,
activation=activation,
dilation_rate=2**i,
name="conv_1d_"+str(i))(bn_cnn)
bn_cnn = BatchNormalization()(conv_1d)
# Initial Bidirectional recurrent layer
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, implementation=2, recurrent_dropout=0.02, name='brnn'))(bn_cnn)
# Batch normalization
bn_rnn = BatchNormalization()(brnn)
# Loop for additional layers
for i in range(recur_layers - 1):
name = 'brnn_' + str(i + 1)
brnn = Bidirectional(GRU(units, activation=activation,
return_sequences=True, implementation=2, name=name))(bn_rnn)
bn_rnn = BatchNormalization()(brnn)
# Attention layer
attentive = Attention()(bn_rnn)
# TimeDistributed Dense layers
time_distributed_dense = TimeDistributed(Dense(1024))(attentive)
time_dense = TimeDistributed(Dense(output_dim))(time_distributed_dense)
# Softmax activation layer
y_pred = Activation('softmax', name='softmax')(time_dense)
# Specifying the model
model = Model(inputs=input_data, outputs=y_pred)
model.output_length = lambda x: cnn_output_length(
x, kernel_size, conv_border_mode, conv_stride)
print(model.summary())
return model
hey_jetson = keras_model(input_dim=161, # 161 for Spectrogram/13 for MFCC
filters=256,
activation='relu',
kernel_size=5,
conv_stride=2,
recur_layers=7,
conv_border_mode='causal',
conv_layers=3,
dilation_rate=2,
units=256)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
the_input (InputLayer)       (None, None, 161)         0
conv1d (Conv1D)              (None, None, 256)         206336
batch_normalization_1 (Batch (None, None, 256)         1024
conv_1d_0 (Conv1D)           (None, None, 256)         327936
batch_normalization_2 (Batch (None, None, 256)         1024
conv_1d_1 (Conv1D)           (None, None, 256)         327936
batch_normalization_3 (Batch (None, None, 256)         1024
bidirectional_1 (Bidirection (None, None, 512)         787968
batch_normalization_4 (Batch (None, None, 512)         2048
bidirectional_2 (Bidirection (None, None, 512)         1181184
batch_normalization_5 (Batch (None, None, 512)         2048
bidirectional_3 (Bidirection (None, None, 512)         1181184
batch_normalization_6 (Batch (None, None, 512)         2048
bidirectional_4 (Bidirection (None, None, 512)         1181184
batch_normalization_7 (Batch (None, None, 512)         2048
bidirectional_5 (Bidirection (None, None, 512)         1181184
batch_normalization_8 (Batch (None, None, 512)         2048
bidirectional_6 (Bidirection (None, None, 512)         1181184
batch_normalization_9 (Batch (None, None, 512)         2048
bidirectional_7 (Bidirection (None, None, 512)         1181184
batch_normalization_10 (Batc (None, None, 512)         2048
attention_1 (Attention)      (None, None, 512)         262145
time_distributed_1 (TimeDist (None, None, 1024)        525312
time_distributed_2 (TimeDist (None, None, 29)          29725
softmax (Activation)         (None, None, 29)          0
=================================================================
Total params: 9,571,870
Trainable params: 9,563,166
Non-trainable params: 8,704
_________________________________________________________________
None
train_model(input_to_softmax=hey_jetson,
pickle_path='model_11.pickle',
save_model_path='model_11.h5',
spectrogram=True) # True for Spectrogram/False for MFCC
# Outputting a graph of the model architecture for inclusion in the app and repo
keras.utils.plot_model(hey_jetson, to_file='./app/static/images/model_11.png', show_shapes=True, show_layer_names=True)
Now you can visualize the model in TensorBoard by typing tensorboard --logdir ./logs into a terminal in the repository directory and then navigating to http://localhost:6006 in your browser.
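The tensorboard command above assumes training wrote event files under ./logs. If your copy of train_model does not already do this, a callback along these lines would populate that directory; whether train_model forwards callbacks to the underlying fit call is an assumption here, not something shown above.
from keras.callbacks import TensorBoard
# Hypothetical sketch: log training curves and the graph under ./logs
tb_callback = TensorBoard(log_dir='./logs', write_graph=True)
# model.fit_generator(..., callbacks=[tb_callback])  # assuming access to the fit call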
Training on our production model has finally concluded!
With each epoch taking around 9 hours using spectrograms, the total training time on an Nvidia GTX 1070 (8 GB) for each model using the final architecture was roughly 11 days.
# Load saved model pickles
all_pickles = sorted(glob("results/*.pickle"))
# Extract model names
model_names = [item[8:-7] for item in all_pickles]
# Extract loss history
valid_loss = [pickle.load( open( i, "rb" ) )['val_loss'] for i in all_pickles]
train_loss = [pickle.load( open( i, "rb" ) )['loss'] for i in all_pickles]
# Identify number of epochs each model ran for
num_epochs = [len(valid_loss[i]) for i in range(len(valid_loss))]
fig = plt.figure(figsize=(16,5))
# Plot the training loss vs. epochs
ax1 = fig.add_subplot(121)
for i in range(len(all_pickles)):
ax1.plot(np.linspace(1, num_epochs[i], num_epochs[i]),
train_loss[i], label=model_names[i])
ax1.legend()
ax1.set_xlim([1, max(num_epochs)])
plt.xlabel('Epoch')
plt.ylabel('Training Loss')
# Plot the validation loss vs. epochs
ax2 = fig.add_subplot(122)
for i in range(len(all_pickles)):
ax2.plot(np.linspace(1, num_epochs[i], num_epochs[i]),
valid_loss[i], label=model_names[i])
ax2.legend()
ax2.set_xlim([1, max(num_epochs)])
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.show()
Language modeling, the component of a speech recognition system that assigns prior probabilities to character or word sequences, captures the system's knowledge of which sequences are probable. This system uses a class-based language model, which lets it narrow the search through the recognizer's vocabulary: it will rarely see a sentence like "the dog the ate sand the water", so it learns that 'the' is unlikely to come after the word 'sand'. We do this by assigning a probability to every possible sentence and then picking the continuation with the highest probability of occurring. Language model smoothing (often called discounting) overcomes the problem that an unsmoothed model assigns a probability of 0 to anything it hasn't witnessed in training; it does so by distributing non-zero probabilities over all possible occurrences in proportion to the unigram probabilities of words. This overcomes the limitations of traditional n-gram modeling and is made possible by the added dimension of time sequencing in the recurrent neural network.
The best performing model is considered to be the one that assigns the highest probabilities to the word sequences found in a test set, since it wastes less probability mass on sequences that do not actually occur. More information on comparing models can be found in this paper.
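As a toy illustration of both ideas, here is a minimal sketch of an additively smoothed character bigram model; the training string, held-out string, and smoothing constant are made up for the example.
import math
from collections import Counter
train_text = "the dog ate the food"  # hypothetical training text
test_text = "the dog ate sand"       # hypothetical held-out text
k = 0.5                              # additive (Lidstone) smoothing constant
vocab = sorted(set(train_text + test_text))
bigram_counts = Counter(zip(train_text, train_text[1:]))
unigram_counts = Counter(train_text)
def smoothed_bigram_prob(prev, cur):
    # Unseen character pairs get a small non-zero probability instead of 0
    return (bigram_counts[(prev, cur)] + k) / (unigram_counts[prev] + k * len(vocab))
# A model scores better on held-out text when this log probability is higher (less negative)
log_prob = sum(math.log(smoothed_bigram_prob(a, b)) for a, b in zip(test_text, test_text[1:]))
print(round(log_prob, 2))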
We'll also benchmark how long it takes the model to produce the predictions.
def get_predictions(index, partition, input_to_softmax, model_path):
# Load the train and test data
data_gen = AudioGenerator(spectrogram = spectrogram)
data_gen.load_train_data()
data_gen.load_validation_data()
data_gen.load_test_data()
# Obtain ground truth transcriptions and audio features
if partition == 'validation':
transcription = data_gen.valid_texts[index]
audio_path = data_gen.valid_audio_paths[index]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
elif partition == 'train':
transcription = data_gen.train_texts[index]
audio_path = data_gen.train_audio_paths[index]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
elif partition == 'test':
transcription = data_gen.test_texts[index]
audio_path = data_gen.test_audio_paths[index]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
else:
raise Exception('Invalid partition! Must be "train", "test", or "validation"')
# Obtain predictions
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
# Display ground truth transcription and predicted transcription.
print('True transcription:\n' + '\n' + transcription)
print('Predicted transcription:\n' + '\n' + ''.join(int_seq_to_text(pred_ints)))
%time get_predictions(index=95, partition='train', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
%time get_predictions(index=95, partition='validation', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
%time get_predictions(index=95, partition='test', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
%time get_predictions(index=95, partition='test', input_to_softmax=model_10, model_path='./results/model_10.h5')
%time get_predictions(index=95, partition='test', input_to_softmax=model_8, model_path='./results/model_8.h5')
True transcription:
in the absence of a hypodermic syringe the remedy may be given by the rectum
Predicted transcription:
inse absens of the hapademec shaenge sevemety may be gave in vye of recttim
CPU times: user 2.29 s, sys: 89.4 ms, total: 2.37 s
Wall time: 2.48 s
# Dilated Deep CNN + Deep Bidirectional RNN + Time Distributed Dense
%time get_predictions(index=95, partition='test', input_to_softmax=model_7, model_path='./results/model_7.h5')
True transcription:
in the absence of a hypodermic syringe the remedy may be given by the rectum
Predicted transcription:
en tet fom af e heacedemo fovang the vemat o mo de givven by tha re tem
CPU times: user 2.39 s, sys: 105 ms, total: 2.49 s
Wall time: 2.56 s
# Deep CNN + Deep Bidirectional RNN + Time Distributed Dense w/ Dropout
%time get_predictions(index=95, partition='test', input_to_softmax=model_6, model_path='./results/model_6.h5')
True transcription:
in the absence of a hypodermic syringe the remedy may be given by the rectum
Predicted transcription:
eensteedsoms of the had edem ofs hemine the veman youmo be given by thevectim
CPU times: user 2.33 s, sys: 105 ms, total: 2.43 s
Wall time: 2.5 s
# Model_5 on the 100 hour subset
%time get_predictions(index=95, partition='test', input_to_softmax=model_5, model_path='./results/model_5.h5')
True transcription:
in the absence of a hypodermic syringe the remedy may be given by the rectum
Predicted transcription:
insth at srens of he had a demme sivvems thof remmantd ye mobe gaven by the vectame
CPU times: user 2.34 s, sys: 80.1 ms, total: 2.42 s
Wall time: 2.49 s
# Initial RNN model
%time get_predictions(index=95, partition='test', input_to_softmax=model_0, model_path='./results/model_0.h5')
True transcription:
in the absence of a hypodermic syringe the remedy may be given by the rectum
Predicted transcription:

CPU times: user 2.1 s, sys: 88.9 ms, total: 2.19 s
Wall time: 2.21 s
We've come a long way. Our final model comes close to the actual spoken transcription, while the first few models predicted nothing at all, or the same letter for every utterance. Now, let's quantify the final model's performance.
First we need to obtain the ground truth transcriptions and the predicted transcriptions for the validation and test sets. Then we can use several measures to determine accuracy.
# Function for extracting the ground truth transcriptions from the audio files.
def get_ground_truth(partition):
ground_truth_list = []
data_gen = AudioGenerator(spectrogram = spectrogram)
if partition == 'train':
data_gen.load_train_data()
for i in range(0, 61956):
transcription = data_gen.train_texts[i]
ground_truth_list.append(transcription)
elif partition == 'validation':
data_gen.load_validation_data()
for i in range(0, 4277):
transcription = data_gen.valid_texts[i]
ground_truth_list.append(transcription)
elif partition == 'test':
data_gen.load_test_data()
for i in range(0, 4176):
transcription = data_gen.test_texts[i]
ground_truth_list.append(transcription)
ground_truth = np.asarray(ground_truth_list)
return ground_truth
# Extracting the validation set ground truth transcriptions
valid_ground_truth = get_ground_truth('validation')
valid_ground_truth
array(['looking about me i saw a gentleman in a neat black dress smiling and his hand extended to me with great cordiality', 'he must have realized i was a stranger and wished to tender his hospitality to me i accepted it gratefully i clasped his hand he pressed mine', "we gazed for a moment silently into each other's eyes", ..., 'that penance hath no blame which magdalen found sweet purging our shame self punishment is virtue all men know', 'heaven help that body which a little mind housed in a head lacking ears tongue and eyes and senseless but for smell can tyrannise', 'due to thee their praise of maiden pure of teeming motherhood'], dtype='<U199')
# Extracting the test set ground truth transcriptions
test_ground_truth = get_ground_truth('test')
test_ground_truth
array(['stuff it into you his belly counselled him', 'after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels', 'hello bertie any good in your mind', ..., 'thou canst wait through sorrow and sickness to bring souls to touch and think it soon when others cry too late', 'i thank all who have loved me in their hearts with thanks and love from mine', 'i love thee freely as men strive for right i love thee purely as they turn from praise'], dtype='<U212')
# Saving the ground truth transcriptions
np.save('results/valid_ground_truth', valid_ground_truth)
np.save('results/test_ground_truth', test_ground_truth)
# Function for extracting the predicted transcriptions from the audio files.
def get_pred_transcription(partition, input_to_softmax, model_path):
prediction_list = []
data_gen = AudioGenerator(spectrogram = spectrogram)
data_gen.load_test_data()
data_gen.load_validation_data()
data_gen.load_train_data()
if partition == 'train':
for i in range(0, 61956):
audio_path = data_gen.train_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
prediction_list.append(pred_trans)
if i%2000 == 0: print('Processed {}'.format(i))
elif partition == 'validation':
for i in range(0, 4277):
audio_path = data_gen.valid_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
prediction_list.append(pred_trans)
if i%200 == 0: print('Processed {}'.format(i))
elif partition == 'test':
for i in range(0, 4176):
audio_path = data_gen.test_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
prediction_list.append(pred_trans)
if i%200 == 0: print('Processed {}'.format(i))
pred_transcription = np.asarray(prediction_list)
return pred_transcription
# Extracting the validation set predicted transcriptions
valid_pred_transcription = get_pred_transcription(partition='validation',
input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
valid_pred_transcription
# Extracting the test set predicted transcriptions
test_pred_transcription = get_pred_transcription(partition='test',
input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
test_pred_transcription
# Saving the predicted transcriptions
np.save('results/valid_pred_transcription', valid_pred_transcription)
np.save('results/test_pred_transcription', test_pred_transcription)
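Before scoring whole partitions, here is a minimal sketch of the similarity measure used below: fit a CountVectorizer on the ground-truth transcription, project the prediction into that vocabulary, and take the cosine of the two count vectors. The two sentences are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
truth = "the remedy may be given by the rectum"    # hypothetical ground truth
pred = "the remedy may be gave in by the rectum"   # hypothetical prediction
cv = CountVectorizer()
truth_vec = cv.fit_transform([truth])  # vocabulary comes from the ground truth
pred_vec = cv.transform([pred])        # words unseen in the truth are simply ignored
print(cosine_similarity(truth_vec, pred_vec))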
def get_sim(partition, input_to_softmax, model_path):
# Getting the cosine similarity using Count Vectorizer
sim_list = []
data_gen = AudioGenerator(spectrogram = spectrogram)
data_gen.load_test_data()
data_gen.load_validation_data()
data_gen.load_train_data()
if partition == 'train':
for i in range(0, 61956):
transcription = data_gen.train_texts[i]
audio_path = data_gen.train_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
cv = CountVectorizer()
ground_truth_vec = cv.fit_transform([transcription])
pred_transcription_vec = cv.transform([pred_trans])
sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)
sim_list.append(sim)
if i%2000 == 0: print('Processed {}'.format(i))
elif partition == 'validation':
for i in range(0, 4277):
transcription = data_gen.valid_texts[i]
audio_path = data_gen.valid_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
cv = CountVectorizer()
ground_truth_vec = cv.fit_transform([transcription])
pred_transcription_vec = cv.transform([pred_trans])
sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)
sim_list.append(sim)
if i%200 == 0: print('Processed {}'.format(i))
elif partition == 'test':
for i in range(0, 4176):
transcription = data_gen.test_texts[i]
audio_path = data_gen.test_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
cv = CountVectorizer()
ground_truth_vec = cv.fit_transform([transcription])
pred_transcription_vec = cv.transform([pred_trans])
sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)
sim_list.append(sim)
if i%200 == 0: print('Processed {}'.format(i))
sim_array = np.asarray(sim_list)
return sim_array
# Extracting the validation count vectorizer cosine similarities
valid_sim = get_sim(partition='validation',
input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
valid_sim
valid_sim.mean()
# Extracting the test count vectorizer cosine similarities
test_sim = get_sim(partition='test',
input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
test_sim
test_sim.mean()
It looks like we have about an 80% similarity between the predictions and ground truth transcriptions in the validation set and about a 78% similarity in the test set when using count vectorization, so let's see if TF-IDF vectorization produces different results:
def get_tfidf_sim(partition, input_to_softmax, model_path):
# Getting the cosine similarity using Tfidf Vectorizer
sim_list = []
data_gen = AudioGenerator(spectrogram = spectrogram)
data_gen.load_test_data()
data_gen.load_validation_data()
data_gen.load_train_data()
if partition == 'train':
for i in range(0, 61956):
transcription = data_gen.train_texts[i]
audio_path = data_gen.train_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
tfidf = TfidfVectorizer()
ground_truth_vec = tfidf.fit_transform([transcription])
pred_transcription_vec = tfidf.transform([pred_trans])
sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)
sim_list.append(sim)
if i%2000 == 0: print('Processed {}'.format(i))
elif partition == 'validation':
for i in range(0, 4277):
transcription = data_gen.valid_texts[i]
audio_path = data_gen.valid_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
tfidf = TfidfVectorizer()
ground_truth_vec = tfidf.fit_transform([transcription])
pred_transcription_vec = tfidf.transform([pred_trans])
sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)
sim_list.append(sim)
if i%200 == 0: print('Processed {}'.format(i))
elif partition == 'test':
for i in range(0, 4176):
transcription = data_gen.test_texts[i]
audio_path = data_gen.test_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
tfidf = TfidfVectorizer()
ground_truth_vec = tfidf.fit_transform([transcription])
pred_transcription_vec = tfidf.transform([pred_trans])
sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)
sim_list.append(sim)
if i%200 == 0: print('Processed {}'.format(i))
sim_array = np.asarray(sim_list)
return sim_array
# Extracting the validation tfidf cosine similarities
valid_tfidf_sim = get_tfidf_sim(partition='validation',
input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
valid_tfidf_sim
valid_tfidf_sim.mean()
# Extracting the test tfidf cosine similarities
test_tfidf_sim = get_tfidf_sim(partition='test',
input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
test_tfidf_sim
test_tfidf_sim.mean()
It looks like we have about an 80% similarity between the predictions and ground truth transcriptions in the validation set and around a 78% similarity in the test set when using TF-IDF vectorization as well.
Word error rate is defined as (substitutions + deletions + insertions) / # of words in the ground truth transcription.
More info on this metric can be found at Wikipedia.
def wer_calc(ref, pred):
# Calculate word error rate via edit (Levenshtein) distance
d = np.zeros((len(ref) + 1) * (len(pred) + 1), dtype=np.uint16)
d = d.reshape((len(ref) + 1, len(pred) + 1))
for i in range(len(ref) + 1):
for j in range(len(pred) + 1):
if i == 0:
d[0][j] = j
elif j == 0:
d[i][0] = i
for i in range(1, len(ref) + 1):
for j in range(1, len(pred) + 1):
if ref[i - 1] == pred[j - 1]:
d[i][j] = d[i - 1][j - 1]
else:
substitution = d[i - 1][j - 1] + 1
insertion = d[i][j - 1] + 1
deletion = d[i - 1][j] + 1
d[i][j] = min(substitution, insertion, deletion)
result = float(d[len(ref)][len(pred)]) / len(ref) * 100
return result
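wer_calc computes an edit-distance-based error rate over whatever sequences it is given; here is a quick, made-up sanity check on word lists (splitting the strings into words first):
reference = "the remedy may be given by the rectum".split()
hypothesis = "the remedy may be gave in by the rectum".split()
# One substitution (given -> gave) and one insertion (in) over 8 reference words
print(wer_calc(reference, hypothesis))  # 25.0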
# Function for extracting the predicted transcriptions from the audio files and calculating word error rate on them
def get_wer(partition, input_to_softmax, model_path):
wer_list = []
data_gen = AudioGenerator(spectrogram = spectrogram)
data_gen.load_test_data()
data_gen.load_validation_data()
data_gen.load_train_data()
if partition == 'train':
for i in range(0, 61956):
transcription = data_gen.train_texts[i]
audio_path = data_gen.train_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
error_rate = wer_calc(transcription, pred_trans)
wer_list.append(error_rate)
if i%2000 == 0: print('Processed {}'.format(i))
elif partition == 'validation':
for i in range(0, 4277):
transcription = data_gen.valid_texts[i]
audio_path = data_gen.valid_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
error_rate = wer_calc(transcription, pred_trans)
wer_list.append(error_rate)
if i%200 == 0: print('Processed {}'.format(i))
elif partition == 'test':
for i in range(0, 4176):
transcription = data_gen.test_texts[i]
audio_path = data_gen.test_audio_paths[i]
data_point = data_gen.normalize(data_gen.featurize(audio_path))
input_to_softmax.load_weights(model_path)
prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
output_length = [input_to_softmax.output_length(data_point.shape[0])]
pred_ints = (K.eval(K.ctc_decode(
prediction, output_length)[0][0])+1).flatten().tolist()
pred_trans = ''.join(int_seq_to_text(pred_ints))
error_rate = wer_calc(transcription, pred_trans)
wer_list.append(error_rate)
if i%200 == 0: print('Processed {}'.format(i))
wer_array = np.asarray(wer_list)
return wer_array
# Extracting the validation word error rates
valid_wer = get_wer(partition='validation',
input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
valid_wer
# Calculating the word error rate in the validation set
valid_wer.mean()
# Extracting the test word error rates
test_wer = get_wer(partition='test',
input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
test_wer
# Calculating the word error rate in the test set
test_wer.mean()
We did pretty well! Our model achieved a word error rate of about 16% in the validation set and 18% in the test set. The error rate is still fairly high compared to some of the models explored in the reference papers, but that gap comes down to the scope of the project: the lack of a pretrained language model, the single 8 GB GPU, and the limited time available prevented me from exploring further.
Let's time inference on the production system running live on the server to get a feel for how quickly we can serve up predictions:
%time get_predictions(index=12, partition='test', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
%time get_predictions(index=21, partition='test', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
%time get_predictions(index=2012, partition='test', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')
This concludes the model construction demo. You have now trained a strong-performing recurrent neural network for speech recognition, from scratch, with a word error rate of <20%, and built an ASR model ready for deployment in production environments. If you would like to deploy it, instructions for building and deploying this model on the Nvidia Jetson using the Flask RESTful web app framework for Python are included in the GitHub Repository.
Next steps for this project, and things you can try on your own, include:
I want to thank the following people/organizations for their support and training:
If you would like to contribute to this project, please fork and submit a pull request. I am always open to feedback and would love help with this project.