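A robust speech recognition system degrades only slightly when the operating environment is mismatched with the training data. This chapter covers microphones, echo cancellation, speech signal enhancement, and related techniques for making a system more robust to environmental changes.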
Def (acoustical environment): the set of transformations that affect the speech signal from the time it leaves the speaker's mouth until it is in digital form.
It has two main components:
Additive noise
Channel distortion
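Acoustical transducers: devices that convert between acoustic and electrical signals, e.g. microphones and loudspeakers.
Condenser microphone: one of the capacitor's two plates is movable. Sound pressure changes the distance between the plates and hence the capacitance, and the accompanying circuit turns that change into a varying output voltage.
Microphone directionality: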
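The Least Mean Square (LMS) algorithm is very simple. It assumes that the system signal $d(n)$ can be obtained as a weighted sum of the microphone input signal $x(n)$:
$$d(n) = \sum_{k=0}^{L-1} g_k\, x(n-k) + u(n) = G^T X(n) + u(n)$$
where $G = [g_0, \ldots, g_{L-1}]^T$, $X(n) = [x(n), \ldots, x(n-L+1)]^T$, and $u(n)$ is a noise signal independent of $x(n)$.
The filter output can therefore be estimated in the same form:
$$y(n) = \sum_{k=0}^{L-1} w_k\, x(n-k) = W^T X(n)$$
so the error between the estimate and the system signal is
$$e(n) = d(n) - y(n) = d(n) - W^T X(n)$$
The LMS algorithm reduces this error step by step through repeated iteration. Applying gradient descent gives the weight update
$$W(n+1) = W(n) + \epsilon\, e(n)\, X(n)$$
The choice of $\epsilon$ is critical: if it is too small the algorithm converges slowly, and if it is too large the error is large.
The admissible range of $\epsilon$ is
$$0 < \epsilon < \frac{1}{\lambda}$$
where $\lambda$ is the largest eigenvalue of the autocorrelation matrix of the input $X(n)$.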
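The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production echo canceller: the function name `lms_filter`, the three-tap system `g_true`, the noise level, and the step size `0.01` are all assumptions chosen for the demo.

```python
import numpy as np

def lms_filter(x, d, num_taps, step_size):
    """Adapt weights W so that y(n) = W^T X(n) tracks d(n).

    Returns the final weights, the filter output y, and the error e.
    """
    w = np.zeros(num_taps)
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    # Zero-pad the past so X(n) is defined for the first samples.
    x_pad = np.concatenate([np.zeros(num_taps - 1), x])
    for n in range(len(x)):
        # X(n) = [x(n), x(n-1), ..., x(n-L+1)]
        x_vec = x_pad[n:n + num_taps][::-1]
        y[n] = w @ x_vec
        e[n] = d[n] - y[n]
        # W(n+1) = W(n) + eps * e(n) * X(n)
        w = w + step_size * e[n] * x_vec
    return w, y, e

# Hypothetical test system: d(n) = G^T X(n) + u(n)
rng = np.random.default_rng(0)
g_true = np.array([0.5, -0.3, 0.1])       # assumed "unknown" system G
x = rng.standard_normal(5000)             # white input, unit variance
u = 0.01 * rng.standard_normal(5000)      # independent noise u(n)
d = np.convolve(x, g_true)[:len(x)] + u

w_est, y, err = lms_filter(x, d, num_taps=3, step_size=0.01)
print("estimated weights:", w_est)
```

With a white, unit-variance input the autocorrelation matrix is close to the identity, so the chosen step size sits well inside the bound $0 < \epsilon < 1/\lambda$ and the weights settle near `g_true`.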
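Normalized LMS algorithm: the step size $\epsilon$ is updated dynamically,
$$\epsilon(n) = \frac{\epsilon}{\delta + L\,\hat{\sigma}_x^2(n)}$$
where $\delta$ only serves to keep the denominator from being zero and $\hat{\sigma}_x^2(n)$ is an estimate of the input signal energy. It is updated either with an exponential window,
$$\hat{\sigma}_x^2(n) = (1-\beta)\,\hat{\sigma}_x^2(n-1) + \beta\, x^2(n)$$
or with a rectangular window,
$$\hat{\sigma}_x^2(n) = \hat{\sigma}_x^2(n-1) + \frac{1}{N}\left(x^2(n) - x^2(n-N)\right)$$
Blind source separation (BSS) is a set of techniques that assume no information about the mixing process or the sources apart from their mutual statistical independence, hence the term "blind".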
Method: independent component analysis (ICA), a set of techniques for solving the BSS problem that estimate a set of linear filters to separate the mixed signals, under the assumption that the original sources are statistically independent.
For the single-microphone setting, this part introduces several compensation techniques for additive noise and channel distortion.
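Spectral subtraction
Assumption: the signal $y(m)$ is the sum of a speech signal and an additive noise signal:
$$y(m) = x(m) + n(m)$$
where $x(m)$ and $n(m)$ are independent, so in the frequency domain
$$|Y(f)|^2 \approx |X(f)|^2 + |N(f)|^2$$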
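Procedure:
1. Noise estimation. While no speech is present, estimate the noise power spectrum over $M$ frames:
$$|\hat{N}(f)|^2 = \frac{1}{M}\sum_{i=0}^{M-1}|Y_i(f)|^2$$
2. Compute the power spectrum of the speech signal:
$$|\hat{X}(f)|^2 = |Y(f)|^2 - |\hat{N}(f)|^2 = |Y(f)|^2\left(1 - \frac{1}{SNR(f)}\right)$$
where
$$SNR(f) = \frac{|Y(f)|^2}{|\hat{N}(f)|^2}$$
This procedure distorts the signal. The distortion can be reduced, at the cost of slightly less noise attenuation, by smoothing the SNR over time:
$$SNR(f,t) = \gamma\, SNR(f,t-1) + (1-\gamma)\frac{|Y(f)|^2}{|\hat{N}(f)|^2}$$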
Smoothing over both time and frequency can be done to obtain more accurate SNR measurements and thus less distortion.
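The two steps of spectral subtraction can be sketched for a single frame as follows. This is a rough illustration under stated assumptions: the function names, the spectral floor `floor=0.01` (used to keep the subtracted power non-negative), and the reuse of the noisy phase for resynthesis are not from the text above, and the sinusoid stands in for a speech frame.

```python
import numpy as np

def estimate_noise_psd(noise_frames):
    """Average |Y_i(f)|^2 over M speech-free frames (one frame per row)."""
    return np.mean(np.abs(np.fft.rfft(noise_frames, axis=1)) ** 2, axis=0)

def spectral_subtract(frame, noise_psd, floor=0.01):
    """Return a time-domain frame with the noise power spectrum subtracted."""
    spectrum = np.fft.rfft(frame)
    noisy_psd = np.abs(spectrum) ** 2
    # |X|^2 = |Y|^2 - |N|^2; the floor (an assumption) avoids negative power.
    clean_psd = np.maximum(noisy_psd - noise_psd, floor * noisy_psd)
    # Apply the resulting gain to the noisy spectrum, keeping its phase.
    gain = np.sqrt(clean_psd / np.maximum(noisy_psd, 1e-12))
    return np.fft.irfft(gain * spectrum, n=len(frame))

# Hypothetical demo: a sinusoid stands in for speech, white noise for n(m).
rng = np.random.default_rng(0)
frame_len = 256
noise_frames = 0.5 * rng.standard_normal((50, frame_len))   # M = 50 frames
noise_psd = estimate_noise_psd(noise_frames)

t = np.arange(frame_len)
clean = np.sin(2 * np.pi * 16 * t / frame_len)
noisy = clean + 0.5 * rng.standard_normal(frame_len)
enhanced = spectral_subtract(noisy, noise_psd)

print("error energy before:", np.sum((noisy - clean) ** 2))
print("error energy after: ", np.sum((enhanced - clean) ** 2))
```

In a real system the frames would be windowed and overlap-added, and the SNR would be smoothed as described above; both are omitted here to keep the sketch short.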
Frequency-Domain MMSE from Stereo Data
Wiener Filtering
Cepstral Mean Normalization (CMN)
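For a given signal $x(n)$, short-time cepstral analysis yields a sequence of $T$ cepstral vectors $X = \{x_0, x_1, \ldots, x_{T-1}\}$, whose mean is
$$\bar{x} = \frac{1}{T}\sum_{t=0}^{T-1} x_t$$
CMN simply computes the normalized cepstral vector $\hat{x}_t$ as
$$\hat{x}_t = x_t - \bar{x}$$
Suppose the signal $y(n)$ is obtained by passing the input signal $x(n)$ through a filter $h(n)$. Its cepstral sequence $Y = \{y_0, y_1, \ldots, y_{T-1}\}$ can then be computed, and following the mel-cepstrum computation we can define the vector $h$ as
$$h = C\left(\ln|H(\omega_0)|^2 \;\cdots\; \ln|H(\omega_M)|^2\right)^T$$
where $C$ is the DCT matrix. It follows that
$$y_t = x_t + h$$
so the mean of the cepstral sequence $Y$ is
$$\bar{y} = \frac{1}{T}\sum_{t=0}^{T-1} y_t = \frac{1}{T}\sum_{t=0}^{T-1}(x_t + h) = \bar{x} + h$$
and its normalized vector is
$$\hat{y}_t = y_t - \bar{y} = \hat{x}_t$$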
Properties:
1. Applying CMN to the cepstrum vectors does not modify the delta or delta-delta cepstrum.
2. It has been found that this procedure does not degrade the recognition rate on utterances from the same acoustical environment, as long as they are longer than 2–4 seconds.
3. When a system is trained on one microphone and tested on another, CMN can provide significant robustness.
4. One drawback of CMN is that it does not distinguish silence from voice when computing the utterance mean.
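The cancellation of a fixed channel by CMN can be checked numerically. In the sketch below, random vectors stand in for real cepstra, and the names `cepstral_mean_normalize`, `x`, and `h` are illustrative assumptions rather than part of any library.

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """Subtract the utterance-level mean from a (T, D) array of cepstral vectors."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 13))   # stand-in for clean cepstra x_t (T=200, D=13)
h = rng.standard_normal(13)          # hypothetical channel vector h
y = x + h                            # y_t = x_t + h

x_hat = cepstral_mean_normalize(x)
y_hat = cepstral_mean_normalize(y)

# The constant offset h disappears after normalization.
print("channel removed:", np.allclose(x_hat, y_hat))
# Frame-to-frame differences (a simple proxy for delta features) are unchanged.
print("deltas unchanged:", np.allclose(np.diff(y, axis=0), np.diff(y_hat, axis=0)))
```

Because the mean is taken over the whole utterance, silence frames contribute to it as much as speech frames do, which is the drawback noted in the last property.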
Real-Time Cepstral Normalization
The mean estimate is updated dynamically:
$$\bar{x}_t = (1-\alpha)\,\bar{x}_{t-1} + \alpha\, x_t$$
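The recursive mean update lends itself to frame-by-frame processing. The following is a minimal sketch: the function name `running_cmn`, the default `alpha=0.02`, and the zero initial mean are assumptions for illustration, and the constant input is a toy case chosen so that the behavior is easy to follow.

```python
import numpy as np

def running_cmn(cepstra, alpha=0.02, initial_mean=None):
    """Normalize each frame with an exponentially updated mean.

    cepstra: (T, D) array. Returns the normalized frames and the final mean.
    """
    dim = cepstra.shape[1]
    mean = np.zeros(dim) if initial_mean is None else np.array(initial_mean, dtype=float)
    out = np.empty_like(cepstra, dtype=float)
    for t, x_t in enumerate(cepstra):
        # mean_t = (1 - alpha) * mean_{t-1} + alpha * x_t
        mean = (1 - alpha) * mean + alpha * x_t
        out[t] = x_t - mean
    return out, mean

# Toy input: every frame equals a constant offset, as a pure channel bias would.
frames = np.full((2000, 3), 5.0)
normalized, final_mean = running_cmn(frames)
print("final mean estimate:", final_mean)
print("last normalized frame:", normalized[-1])
```

A smaller `alpha` gives a smoother but slower-adapting mean, and a sensible `initial_mean` (for example, one carried over from a previous utterance) shortens the start-up transient.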
The Use of Gaussian Mixture Models
Retraining on Corrupted Speech
Methods:
Model Adaptation
Parallel Model Combination
The procedure is shown in the figure below:
Vector Taylor Series
Retraining on Compensated Features
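If the signal is first passed through front-end processing (such as CMN) and the model is then retrained on the result, performance improves further, as shown in the figure below:
The techniques so far all target stationary noise. For non-stationary noise, especially recurring noise types such as mouse clicks, keyboard strokes, computer fan noise, and door slams, explicitly modeling the noise can in principle handle it well.
The training procedure is shown in the figure below: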