## Comparison of spectral envelope parametrization methods¶

This notebook investigates the difference between two spectral envelope parametrization / recovering methods below:

1. WORLD's code_spectral_envelope and decode_spectral_envelope
2. pysptk.sp2mc and pysptk.mc2sp, which resembles SPTK's mcep function for input type is 4 (i.e. periodgram) and its inverse transform.

I measuered normalized mean square error (NMSE) between spectral envelope and reconstructed spectral envelope. Synthesized audio examples are available in the notebook.

### Requirements¶

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib

In [2]:
import matplotlib
import seaborn
seaborn.set(style="dark")
rcParams['figure.figsize'] = (16, 4)

In [3]:
import pysptk
import pyworld
import librosa
import librosa.display
from IPython.display import Audio


## 0. Preparation¶

### Utlis¶

In [4]:
def decompose(x, fs, period=5.0):
"""Decompose speech signal into f0, spectral envelope and aperiodicity using WORLD
"""
f0, timeaxis = pyworld.harvest(x, fs, frame_period=period, f0_floor=71.0, f0_ceil=800.0)
sp = pyworld.cheaptrick(x, f0, timeaxis, fs)
ap = pyworld.d4c(x, f0, timeaxis, fs)
return f0, timeaxis, sp, ap

def vis(sp_param, approx_sp, x, y, top_title):
"""Visualize compressed spectral parameter, recovered spectral envelope and reconstructed speech wavevform
"""
figure(figsize=(16,14))
subplot(3,1,1)
librosa.display.specshow(sp_param.T, sr=fs, hop_length=hop_length, x_axis="time", y_axis="linear")
title(top_title)
subplot(3,1,2)
librosa.display.specshow(10*log(approx_sp).T, sr=fs, hop_length=hop_length, x_axis="time", y_axis="linear")
title("20log|H(w)|")
subplot(3,1,3)
plot(y, "r-+", label="reconstructed speech signal")
plot(x, label="original speech signal")
xlabel("sample")
legend(fontsize=14)

def mcepalpha(fs):
"""Determine frequency warping parameter from fs (simplest implementation)

See https://bitbucket.org/happyalu/mcep_alpha_calc/ for details.
"""
if fs == 16000:
return 0.41
elif fs == 22050:
return 0.455
elif fs == 44100:
return 0.544
elif fs == 48000:
return 0.554
else:
raise NotImplementedError


### Input data¶

In [5]:
path = pysptk.util.example_audio_file() # from cmu_arctic
x = x.astype(np.float64)

librosa.display.waveplot(x, sr=fs)
Audio(x, rate=fs)

Out[5]:
In [6]:
frame_period = 5.0
hop_length = int(fs * frame_period * 0.001)
fftlen = pyworld.get_cheaptrick_fft_size(fs)

f0, timeaxis, sp, ap = decompose(x, fs, frame_period)

print(hop_length)
print(sp.shape)
print(fftlen)

80
(801, 513)
1024


### F0¶

In [7]:
plot(timeaxis, f0, linewidth=2, label="F0 contour estimated by Harvest")
xlabel("Time [sec]")
ylabel("Frequency  [Hz]")
legend(fontsize=18);


### Spectral envelope¶

In [8]:
librosa.display.specshow(10*log(sp).T, sr=fs, hop_length=hop_length, x_axis="time", y_axis="linear")
colorbar()

Out[8]:
<matplotlib.colorbar.Colorbar at 0x7f9131cd92e8>

### Aperiodicity¶

In [9]:
librosa.display.specshow(20*log(ap).T, sr=fs, hop_length=hop_length, x_axis="time", y_axis="linear")
colorbar()

Out[9]:
<matplotlib.colorbar.Colorbar at 0x7f9130361550>

## 1. Synthesis from coded spectral envelope by WORLD¶

In [10]:
dim = 60
coded_sp = pyworld.code_spectral_envelope(sp, fs, dim)
decoded_sp = pyworld.decode_spectral_envelope(coded_sp, fs, fftlen)
y = pyworld.synthesize(f0, decoded_sp, ap, fs, frame_period)

nmse = np.linalg.norm(np.log(sp) - np.log(decoded_sp)) / np.linalg.norm(np.log(sp))
print("NMSE: {}".format(nmse))

vis(coded_sp, decoded_sp, x, y, "Mel-cepstrum")
Audio(y, rate=fs)

NMSE: 0.25376685579810726

Out[10]: