The applications of deep learning in robust speaker recognition

Tan, Zhili

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/84437

Title:	The applications of deep learning in robust speaker recognition
Authors:	Tan, Zhili
Degree:	Ph.D.
Issue Date:	2018
Abstract:	Speaker verification aims to verify whether a test utterance is spoken by a target speaker. Since 2011, the i-vector approach, together with probabilistic linear discriminant analysis (PLDA), has dominated this field. Under this framework, each utterance is represented by a low-dimensional i-vector that captures speaker- and channel-dependent characteristics, and the PLDA model aims to separate the speaker variability from channel variability in the i-vector space. Meanwhile, in recent years, deep learning has achieved great success in many areas, including speech recognition, computer vision, speech synthesis and music recognition. This thesis explores the applications of deep learning in speaker verification, especially under the i-vector/PLDA framework. To address the limitations of hand-crafted acoustic features, this thesis proposes a deep architecture formed by stacking a deep belief network (DBN) on top of a denoising autoencoder (DAE) for noise-robust speaker identification. After backpropagation fine-tuning, the network, referred to as the denoising autoencoder-deep neural network (DAE-DNN), outputs the posterior probabilities of speakers, and the top hidden layer outputs speaker-dependent bottleneck (BN) features. The autoencoder aims to reconstruct the clean spectra of a noisy test utterance using the spectra of the noisy test utterance and its SNR as input. With this denoising capability, the output from the bottleneck layer can be considered a low-dimensional representation of the denoised utterance. These frame-based bottleneck features are then used to train an i-vector extractor and a PLDA model for speaker identification. Experimental results based on a noise-contaminated YOHO corpus show that the bottleneck features outperform conventional MFCCs under low SNR conditions and that the fusion of the two features leads to further performance gain, suggesting that the two features are complementary to each other.
A limitation of the above network is that the BN feature vectors tend to be very similar across the whole utterance, causing numerical difficulty when training the UBM and the i-vector extractor. This problem, however, can be overcome by training the DAE-DNN to produce senone posteriors instead of speaker posteriors. The resulting DAE-DNN produces not only denoised BN features, but also senone posteriors from which a senone i-vector extractor can be trained and senone i-vectors can be extracted. Because the frame-based BN features are now aligned to senone clusters instead of acoustic clusters, the resulting i-vectors characterize how individual speakers pronounce different phones, which allows more precise comparisons between speakers. Through extensive evaluations on NIST 2012 SRE, this thesis demonstrates that senone i-vectors outperform conventional GMM i-vectors. More interestingly, the results suggest that the BN features are not only phonetically discriminative but also contain sufficient speaker information to produce BN-based senone i-vectors that outperform the conventional senone i-vectors. This thesis also shows that DAE training is more beneficial to BN feature extraction than to senone posterior estimation. Although the denoised BN-based senone i-vectors significantly improve noise robustness compared to the MFCC-GMM ones, adverse acoustic conditions and duration variability in utterances can still have a detrimental effect on PLDA scores. This thesis also proposes and investigates several DNN-based PLDA score compensation, transformation and calibration algorithms for enhancing the noise robustness of i-vector/PLDA systems. Unlike conventional calibration methods where the required score shift is a linear function of SNR or log-duration, the DNN approach learns the complex relationship between the score shifts and the combination of i-vector pairs and uncalibrated scores.
Furthermore, with the flexibility of DNNs, it is possible to explicitly train a DNN to recover the clean scores without having to estimate the score shifts. To alleviate the overfitting problem, multi-task learning is applied to incorporate auxiliary information such as SNRs and speaker ID of training utterances into the DNN. Experiments on NIST 2012 SRE show that score calibration derived from multi-task DNNs can improve the performance of the conventional score-shift approach significantly, especially under noisy conditions.
Subjects:	Hong Kong Polytechnic University -- Dissertations; Automatic speech recognition; Machine learning
Pages:	xv, 109 pages : color illustrations
Appears in Collections:	Thesis

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/9625


