
10 Environmental Robustness

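A robust speech recognition system is one whose performance does not degrade much even when the environment is mismatched with the training data. This chapter covers how to make a system more robust to environmental change, from microphones and echo cancellation to speech signal enhancement.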

10.1 The Acoustical Environment

Def: the set of transformations that affect the speech signal from the time it leaves the speaker’s mouth until it is in digital form.

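It consists of two main parts:

Additive noise
1. Stationary noise, e.g. fan noise or car-engine noise. Stationary noise can be removed well with standard digital-signal-processing filtering.
2. Nonstationary noise, e.g. door slams, speech, or singing. If it is correlated with a known signal, AEC (adaptive echo canceling) can handle it well.
3. A noisy environment can also change how people speak: speakers unconsciously raise their volume and pitch. This is known as the Lombard effect (a phenomenon by which a speaker increases his vocal effort in the presence of background noise).
Channel distortion
1. Reverberation. In practice, walls and other obstacles reflect sound, so the microphone receives the speaker's direct sound and its echoes at different times. This produces reverberation unless both the speaker and the microphone are in an anechoic chamber or an open space. Reducing reverberation makes the system more robust.
2. The frequency response of the microphone
3. Sampling distortion in the A/D circuitry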

10.2 Acoustical Transducers

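Acoustical transducers: devices that convert between acoustic signals and electrical signals, e.g. microphones and loudspeakers.

Condenser microphone: one of the capacitor's two plates is movable. The distance between the plates changes with the sound pressure, which changes the capacitance; with suitable circuitry this yields a varying output voltage.
Microphone directionality:
1. Omnidirectional microphones. The response does not depend on the direction of the incoming sound wave.
2. Bidirectional microphones. They pick up sound pressure from two directions only.
3. Unidirectional microphones. They pick up sound pressure from a single direction only.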

10.3 Adaptive Echo Cancellation (AEC)

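Purpose of AEC: noise reduction, namely removing from the microphone signal the sound produced by the system itself. As shown in the figure above, $r(n)$ is the signal picked up by the microphone, $d(n)$ is the sound played by the loudspeaker, $s(n)$ is the user's voice, $v(n)$ is noise, and $\hat{d}(n)$ is the filter's estimate of the system sound.
Performance metric: echo-return loss enhancement (ERLE); the larger the better. It is computed as:

$ERLE(dB) = 10 \log_{10} \frac{E\{d^2(n)\}}{E\{(d(n) - \hat{d}(n))^2\}}$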
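
As a rough illustration of the ERLE definition, the sketch below replaces the expectations with sample averages over a finite signal. The function name `erle_db` and the toy signals are assumptions made for this example only.

```python
import math


def erle_db(d, d_hat):
    """ERLE in dB, with E{.} approximated by the sample mean."""
    if len(d) != len(d_hat) or not d:
        raise ValueError("d and d_hat must be non-empty and of equal length")
    echo_power = sum(v * v for v in d) / len(d)
    residual_power = sum((v - w) ** 2 for v, w in zip(d, d_hat)) / len(d)
    if echo_power == 0.0:
        raise ValueError("d must have non-zero power")
    if residual_power == 0.0:
        return math.inf  # perfect cancellation
    return 10.0 * math.log10(echo_power / residual_power)


# Toy example: the estimate captures 90% of the echo amplitude,
# so the residual is 0.1 * d and the power ratio is 100.
d = [math.sin(0.1 * n) for n in range(1000)]
d_hat = [0.9 * v for v in d]
print(round(erle_db(d, d_hat), 1))  # 20.0
```

A residual ten times smaller in amplitude than the echo corresponds to 20 dB of ERLE, which matches the "larger is better" reading of the metric.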

10.3.1 The LMS Algorithm

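The Least Mean Square (LMS) algorithm is very simple. It assumes that the system signal $d(n)$ can be obtained as a weighted sum of current and past samples of the input signal $x(n)$:

$d(n) = \sum_{k=0}^{L-1}g_k x(n-k) + u(n) = G^T X(n) + u(n)$

where $u(n)$ is a noise signal that is independent of $x(n)$.

We can therefore compute the output of the adaptive filter in the same form:

$y(n) = \sum_{k=0}^{L-1}w_k x(n-k) = W^T X(n)$

so the error between the estimate and the system signal is

$e(n) = d(n) - y(n) = d(n) - W^T X(n)$

The LMS algorithm reduces this error step by step through iteration. Applying gradient descent to the instantaneous squared error $e^2(n)$ gives the weight update rule: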

$W(n+1) = W(n) + \epsilon e(n)X(n)$
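
The update rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a production echo canceller: the function name `lms`, the three-tap path `g`, the step size, and the synthetic white-noise input are all assumptions of the example.

```python
import random


def lms(x, d, num_taps, step):
    """Run LMS over the whole signal; return (final weights, error sequence)."""
    w = [0.0] * num_taps
    errors = []
    for n in range(len(x)):
        # X(n) = [x(n), x(n-1), ..., x(n-L+1)], zero-padded before the start.
        x_vec = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * xk for wk, xk in zip(w, x_vec))         # y(n) = W^T X(n)
        e = d[n] - y                                         # e(n) = d(n) - y(n)
        w = [wk + step * e * xk for wk, xk in zip(w, x_vec)]  # W(n+1)
        errors.append(e)
    return w, errors


random.seed(0)
g = [0.5, -0.3, 0.1]  # hypothetical system weights g_k
x = [random.gauss(0.0, 1.0) for _ in range(5000)]
d = [
    sum(g[k] * x[n - k] for k in range(len(g)) if n - k >= 0)
    + random.gauss(0.0, 0.01)  # u(n), independent of x(n)
    for n in range(len(x))
]

w, errors = lms(x, d, num_taps=3, step=0.01)
print([round(wk, 2) for wk in w])
```

With a small step size and enough samples, the learned weights `w` should end up close to the assumed `g`, and the late entries of `errors` should be much smaller than the early ones.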

10.3.2 Convergence Properties of the LMS Algorithm

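The choice of $\epsilon$ is critical: if it is too small, the algorithm converges slowly; if it is too large, the error is large.

The admissible range of $\epsilon$ is:

$0 < \epsilon <\frac{1}{\lambda}$

where $\lambda$ is the largest eigenvalue of the autocorrelation matrix of the input $X(n)$.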

10.3.3 Normalized LMS Algorithm

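Normalized LMS algorithm: the step size $\epsilon$ is updated dynamically:

$\epsilon(n) = \frac{\epsilon}{\delta + L\hat{\sigma}_x^2(n)}$

where $\delta$ only keeps the denominator from being zero, and $\hat{\sigma}_x^2(n)$ is an estimate of the input signal power. It can be updated with an exponential window:

$\hat{\sigma}_x^2(n) = (1-\beta)\hat{\sigma}_x^2(n-1) + \beta x^2(n)$

or with a rectangular window:

$\hat{\sigma}_x^2(n) = \hat{\sigma}_x^2(n-1) + \frac{1}{N}(x^2(n) - x^2(n-N))$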
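
The normalized step size can be sketched as follows. Several choices here are assumptions of the example rather than part of the notes: the power estimate uses an exponential window in which the new sample is weighted by `beta` (so the estimate tracks the signal power), the estimate is initialized from the first `1 / beta` samples, and the names `nlms`, `step`, `beta`, and `delta` are illustrative.

```python
import random


def nlms(x, d, num_taps, step=0.2, beta=0.01, delta=1e-6):
    """Normalized LMS with an exponentially windowed input-power estimate."""
    w = [0.0] * num_taps
    # Assumption: start the power estimate from the first ~1/beta samples.
    warmup = max(1, min(len(x), int(round(1.0 / beta))))
    power = sum(v * v for v in x[:warmup]) / warmup
    errors = []
    for n in range(len(x)):
        x_vec = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        power = (1.0 - beta) * power + beta * x[n] * x[n]
        eps_n = step / (delta + num_taps * power)  # epsilon(n)
        e = d[n] - sum(wk * xk for wk, xk in zip(w, x_vec))
        w = [wk + eps_n * e * xk for wk, xk in zip(w, x_vec)]
        errors.append(e)
    return w, errors


random.seed(1)
g = [0.5, -0.3, 0.1]  # hypothetical system weights
# Input scaled by 10: a fixed LMS step tuned for unit power would be too large.
x = [10.0 * random.gauss(0.0, 1.0) for _ in range(5000)]
d = [
    sum(g[k] * x[n - k] for k in range(len(g)) if n - k >= 0)
    + random.gauss(0.0, 0.01)
    for n in range(len(x))
]

w, errors = nlms(x, d, num_taps=3)
print([round(wk, 2) for wk in w])
```

Because the step is divided by the estimated input power, the same `step` value should behave similarly whether the input is quiet or loud, which is the practical motivation for the normalization.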

10.4 Multimicrophone Speech Enhancement

10.4.1 Microphone Arrays

Purposes of using a microphone array:

Sound source localization
Improving the SNR of the signal

Method: beamforming
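
The notes do not say which beamformer is meant. One of the simplest forms is delay-and-sum, sketched below under strong assumptions: the per-microphone delays are known, they are whole numbers of samples, and the noise is independent across microphones. The names `delay_and_sum` and `mean_power` and the synthetic signals are illustrative only.

```python
import random


def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average."""
    length = min(len(c) - d for c, d in zip(channels, delays))
    count = len(channels)
    return [
        sum(c[n + d] for c, d in zip(channels, delays)) / count
        for n in range(length)
    ]


def mean_power(signal):
    return sum(v * v for v in signal) / len(signal)


random.seed(2)
num_samples = 4000
delays = [0, 2, 4, 6]  # assumed arrival delay (in samples) at each microphone
source = [random.gauss(0.0, 1.0) for _ in range(num_samples)]

channels = []
for delay in delays:
    channel = []
    for n in range(num_samples + max(delays)):
        clean = source[n - delay] if 0 <= n - delay < num_samples else 0.0
        channel.append(clean + random.gauss(0.0, 1.0))  # independent noise
    channels.append(channel)

output = delay_and_sum(channels, delays)
noise_single = mean_power([channels[0][n] - source[n] for n in range(num_samples)])
noise_array = mean_power([output[n] - source[n] for n in range(num_samples)])
print(noise_single > noise_array)
```

After alignment the source adds coherently while independent noise averages down, so the residual noise power at the output should be clearly lower than at a single microphone.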

Blind source separation(BSS) is a set of techniques that assume no information about the mixing process or the sources, apart from their mutual statistical independence, hence is termed blind.

Method: independent component analysis (ICA), a set of techniques to solve the BSS problem that estimate a set of linear filters to separate the mixed signals under the assumption that the original sources are statistically independent.

10.5 Environment Compensation Preprocessing

For the single-microphone case, this section introduces several ways to compensate for additive noise and channel distortion.

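Spectral subtraction
- Assumption: the signal $y(m)$ is the sum of a speech signal and an additive noise signal:
  
  $y(m) = x(m) + n(m)$
  
  where $x(m)$ and $n(m)$ are independent, so in the frequency domain
  
  $|Y(f)|^2 \approx |X(f)|^2 + |N(f)|^2$
- Procedure:
  1. Noise estimation. While no speech is present, estimate the noise power spectrum over $M$ frames.
    $|\hat{N}(f)|^2 = \frac{1}{M}\sum_{i=0}^{M-1}|Y_i(f)|^2$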
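  2. Compute the power spectrum of the speech signal:
  
  $|\hat{X}(f)|^2 = |Y(f)|^2 - |\hat{N}(f)|^2 = |Y(f)|^2\left(1 - \frac{1}{SNR(f)}\right)$
  
  where
  
  $SNR(f) = \frac{|Y(f)|^2}{|\hat{N}(f)|^2}$
  
  The procedure above distorts the signal. The distortion can be reduced, at the cost of slightly less noise attenuation, by smoothing the SNR:
  
  Smoothing over both time and frequency can be done to obtain more accurate SNR measurements and thus less distortion.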
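
The two steps of spectral subtraction can be sketched on precomputed power spectra (one list of $|Y(f)|^2$ values per frame). The function names are illustrative, and the `floor` parameter is an assumption of this sketch: a plain subtraction can go negative in bins where the noise estimate exceeds the observed power, so the result is clamped.

```python
def estimate_noise_power(noise_frames):
    """Average |Y_i(f)|^2 over M speech-free frames, per frequency bin."""
    m = len(noise_frames)
    num_bins = len(noise_frames[0])
    return [sum(frame[f] for frame in noise_frames) / m for f in range(num_bins)]


def spectral_subtraction(frame_power, noise_power, floor=0.0):
    """Estimate |X(f)|^2 = |Y(f)|^2 - |N(f)|^2, clamped at floor * |Y(f)|^2."""
    return [max(y - n, floor * y) for y, n in zip(frame_power, noise_power)]


# Two speech-free frames with three frequency bins each.
noise_power = estimate_noise_power([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])
print(noise_power)  # [2.0, 2.0, 2.0]

# A noisy frame: the last bin is below the noise estimate and gets clamped.
print(spectral_subtraction([10.0, 2.5, 1.0], noise_power))  # [8.0, 0.5, 0.0]
```

A small positive `floor` keeps a little of the noisy spectrum in weak bins, trading some noise attenuation for less distortion.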
Frequency-Domain MMSE from Stereo Data
Wiener Filtering
Cepstral Mean Normalization (CMN)

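Given a signal $x(n)$, short-time cepstral analysis yields a sequence of $T$ cepstral vectors $X=\{x_0, x_1, \ldots, x_{T-1}\}$, whose mean is

$\bar{x} = \frac{1}{T}\sum_{t=0}^{T-1}x_t$

CMN simply computes the normalized cepstral vector $\hat{x}_t$ as:

$\hat{x}_t = x_t - \bar{x}$

Suppose the signal $y(n)$ is obtained by passing the input signal $x(n)$ through a filter $h(n)$. We can then compute the cepstral sequence of $y(n)$, $Y=\{y_0, y_1, \ldots, y_{T-1}\}$. Following the mel-cepstrum computation, define the vector $h$ as

$h = C\left(\ln|H(\omega_0)|^2 \cdots \ln|H(\omega_M)|^2\right)$

where $C$ is the DCT matrix. Then

$y_t = x_t + h$

so the mean of the cepstral sequence $Y$ is

$\bar{y} = \frac{1}{T}\sum_{t=0}^{T-1}y_t=\frac{1}{T}\sum_{t=0}^{T-1}(x_t + h) = \bar{x} + h$

and its normalized vector is

$\hat{y}_t = y_t - \bar{y} = \hat{x}_t$

That is, the fixed channel term $h$ cancels out.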
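
A small numeric check of this derivation: adding the same offset $h$ to every cepstral vector should leave the CMN output unchanged. The function name `cmn` and the toy two-dimensional vectors are assumptions made for illustration.

```python
def cmn(cepstra):
    """Subtract the utterance mean from every cepstral vector."""
    num_frames = len(cepstra)
    dim = len(cepstra[0])
    mean = [sum(v[i] for v in cepstra) / num_frames for i in range(dim)]
    return [[v[i] - mean[i] for i in range(dim)] for v in cepstra]


x = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]  # toy cepstral vectors x_t
h = [0.5, -1.0]                           # hypothetical channel term
y = [[a + b for a, b in zip(v, h)] for v in x]  # y_t = x_t + h

x_hat = cmn(x)
y_hat = cmn(y)
print(x_hat)  # [[-1.0, 0.0], [1.0, -2.0], [0.0, 2.0]]
```

Here `y_hat` should match `x_hat` up to floating-point rounding, which is the invariance that the equations above express.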

Properties:
1. Applying CMN to the cepstral vectors does not modify the delta or delta-delta cepstrum.
2. It has been found that this procedure does not degrade the recognition rate on utterances from the same acoustical environment, as long as they are longer than 2–4 seconds.
3. When a system is trained on one microphone and tested on another, CMN can provide significant robustness.
4. One drawback of CMN is that it does not discriminate between silence and voice in computing the utterance mean.
Real-Time Cepstral Normalization

Update the mean estimate dynamically:

$\bar{x}_t = (1-\alpha) \bar{x}_{t-1} + \alpha x_t$
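
The recursive mean can be applied frame by frame, so no look-ahead over the whole utterance is needed. The sketch below makes two assumptions that the notes do not specify: the mean is initialized to the first frame unless `initial_mean` is given, and each frame is normalized with the mean that already includes it. The name `running_cmn` is illustrative.

```python
def running_cmn(cepstra, alpha=0.05, initial_mean=None):
    """Normalize each frame with an exponentially updated cepstral mean."""
    if not cepstra:
        return []
    mean = list(initial_mean) if initial_mean is not None else list(cepstra[0])
    normalized = []
    for frame in cepstra:
        # mean_t = (1 - alpha) * mean_{t-1} + alpha * x_t
        mean = [(1.0 - alpha) * m + alpha * c for m, c in zip(mean, frame)]
        normalized.append([c - m for c, m in zip(frame, mean)])
    return normalized


# A constant offset is removed gradually when the mean starts from zero.
frames = [[2.0, -1.0]] * 200
out = running_cmn(frames, alpha=0.05, initial_mean=[0.0, 0.0])
print(abs(out[0][0]) > abs(out[-1][0]))  # True
```

A larger `alpha` adapts faster to a new channel but also removes more of the slowly varying speech information, so the value is a trade-off to tune.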
The Use of Gaussian Mixture Models

10.6 Environmental Model Adaptation

Retraining on Corrupted Speech
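- Methods:
  1. Collect a large amount of training data in the target noise environment and retrain the model (too costly).
  2. Record a small amount of noise, add it to the existing training set, and retrain the model. The results are shown in the figure below:
  3. Collect data and train online (only suitable for small-vocabulary systems).
  4. Retrain the model on training data with different noises and different SNRs, which makes the model more robust to noisy environments.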
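
Adding recorded noise to clean training data at several SNRs (methods 2 and 4 above) can be sketched as follows. The helper names, the tiling of a short noise recording, and the sine wave standing in for a clean utterance are assumptions of the example; the retraining step itself is not shown.

```python
import math
import random


def mean_power(signal):
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio is snr_db, then add it."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Repeat the (possibly shorter) noise recording to cover the utterance.
    tiled = [noise[n % len(noise)] for n in range(len(speech))]
    noise_power = mean_power(tiled)
    if noise_power == 0.0:
        raise ValueError("noise must have non-zero power")
    target_ratio = 10.0 ** (snr_db / 10.0)
    gain = math.sqrt(mean_power(speech) / (noise_power * target_ratio))
    return [s + gain * v for s, v in zip(speech, tiled)]


random.seed(3)
speech = [math.sin(0.05 * n) for n in range(2000)]    # stand-in for clean speech
noise = [random.gauss(0.0, 1.0) for _ in range(500)]  # short noise recording

# One noisy copy of the utterance per SNR condition.
noisy_sets = {snr: mix_at_snr(speech, noise, snr) for snr in (20, 10, 0)}
print(sorted(noisy_sets))  # [0, 10, 20]
```

Each entry of `noisy_sets` would then be added to the training set, so the model sees the same utterance under several noise levels.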
Model Adaptation
1. MAP is an unstructured method. It can offer results similar to those of matched conditions, but it requires a significant amount of adaptation data.
2. MLLR can achieve reasonable performance with about a minute of speech for minor mismatches. For severe mismatches, however, MLLR requires a large number of transformations, which in turn require a larger amount of adaptation data.
Parallel Model Combination

The processing is shown in the figure below:
Vector Taylor Series
Retraining on Compensated Features

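If the signal first goes through front-end processing (such as CMN) and the model is then retrained, performance improves further, as shown in the figure below: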

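10.7 Modeling Nonstationary Noise

The processing so far targets stationary noise. For nonstationary noise, especially certain recurring noises such as mouse clicks, keyboard typing, computer fan noise, and door slams, modeling the noise explicitly can in theory handle it well.

The training procedure is shown in the figure below: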