Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/87445
Title: Semi-supervised and adversarial domain adaptation for speaker recognition
Authors: Li, Longxin
Degree: M.Phil.
Issue Date: 2020
Abstract: The rapid development of technology has driven society into a new era of AI, in which speaker recognition is one of the essential techniques. Because voiceprints are unique to individuals, speaker recognition has been used to enhance the security of banking and personal security systems. Despite the great convenience provided by speaker recognition technology, some fundamental problems remain unsolved, including (1) insufficient labeled samples from new acoustic environments for training supervised machine learning models and (2) domain mismatch among different acoustic environments. These problems can cause severe performance degradation in speaker recognition systems. We propose two methods to address them. First, to reduce domain mismatch in speaker verification systems, we propose an unsupervised domain adaptation method. Second, to enhance speaker identification performance, we introduce a contrastive adversarial domain adaptation network that creates a domain-invariant feature space. The first method addresses the data sparsity issue by applying spectral clustering to in-domain unlabeled data to obtain hypothesized speaker labels for adapting an out-of-domain PLDA mixture model to the target domain. To further refine the target PLDA mixture model, spectral clustering is iteratively applied to the new PLDA score matrix to produce a new set of hypothesized speaker labels. A gender-aware deep neural network (DNN) is trained to produce gender posteriors given an i-vector; these posteriors then replace the posterior probabilities of the indicator variables in the PLDA mixture model. Gender-dependent inter-dataset variability compensation (GD-IDVC) is applied to reduce the mismatch between the i-vectors obtained from the in-domain and out-of-domain datasets. Evaluations on NIST 2016 SRE show that, at the end of the iterative re-training, the PLDA mixture model becomes fully adapted to the new domain. Results also show that PLDA scores can be readily incorporated into spectral clustering, yielding high-quality speaker clusters that could not be achieved by agglomerative hierarchical clustering.
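
The iterative re-training described above hinges on turning pairwise PLDA scores into hypothesized speaker clusters. The following is a minimal sketch, not the thesis implementation, of that clustering step using scikit-learn's SpectralClustering; the pairwise score matrix is assumed to be precomputed, and the sigmoid mapping from scores to a non-negative affinity matrix is an illustrative choice.

```python
# Minimal sketch (not the thesis code) of one clustering round:
# spectral clustering applied to a pairwise PLDA score matrix to obtain
# hypothesized speaker labels for unlabeled in-domain i-vectors.
import numpy as np
from sklearn.cluster import SpectralClustering

def hypothesized_labels(plda_scores: np.ndarray, n_speakers: int) -> np.ndarray:
    """plda_scores: (N, N) pairwise PLDA log-likelihood-ratio matrix for N i-vectors."""
    # Map scores to a non-negative affinity matrix and enforce symmetry.
    affinity = 1.0 / (1.0 + np.exp(-plda_scores))
    affinity = 0.5 * (affinity + affinity.T)
    clusterer = SpectralClustering(n_clusters=n_speakers,
                                   affinity="precomputed",
                                   assign_labels="kmeans")
    return clusterer.fit_predict(affinity)

# Usage with dummy scores (for illustration only):
# labels = hypothesized_labels(np.random.randn(100, 100), n_speakers=10)
```

In the full loop, the hypothesized labels would be used to re-estimate the PLDA mixture model, the score matrix would be recomputed with the adapted model, and the clustering step would be repeated until convergence.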
The second method aims to reduce the mismatch between male and female speakers through adversarial domain adaptation. It mitigates an intrinsic drawback of the domain adversarial network (DAN) by splitting the feature extractor into two contrastive branches, with one branch dedicated to class-dependence in the latent space and the other focusing on domain-invariance. The feature extractor achieves these contrastive goals by sharing the first and last hidden layers while keeping the branches decoupled in the middle hidden layers. To ensure that the feature extractor produces class-discriminative embedded features, the label predictor is adversarially trained to produce equal posterior probabilities across all of its outputs instead of one-hot outputs. We refer to the resulting network as a contrastive adversarial domain adaptation network (CADAN). We evaluated the domain-invariance of the embedded features through a series of speaker identification experiments under both clean and noisy conditions. Results show that the embedded features produced by CADAN improve speaker identification accuracy by 8.9% and 77.6% over the conventional DAN under clean and noisy conditions, respectively.
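
As an illustration of the architecture described above, here is a minimal PyTorch sketch of a contrastive feature extractor with a shared input layer, two decoupled middle branches, and a shared output layer, together with a uniform-posterior loss for the adversarially trained label predictor. All layer sizes and module names are assumptions for illustration, not taken from the thesis.

```python
# A minimal sketch (assumed layer sizes, not the thesis code) of a CADAN-style
# feature extractor: shared first layer, two contrastive middle branches, shared last layer.
import torch
import torch.nn as nn

class ContrastiveFeatureExtractor(nn.Module):
    def __init__(self, in_dim=512, hid_dim=256, emb_dim=128):
        super().__init__()
        self.shared_in = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        # Branch dedicated to class-dependence in the latent space.
        self.class_branch = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU())
        # Branch focusing on domain-invariance.
        self.domain_branch = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU())
        # Shared last hidden layer producing the embedded features.
        self.shared_out = nn.Linear(2 * hid_dim, emb_dim)

    def forward(self, x):
        h = self.shared_in(x)
        z = torch.cat([self.class_branch(h), self.domain_branch(h)], dim=-1)
        return self.shared_out(z)

def uniform_posterior_loss(logits: torch.Tensor) -> torch.Tensor:
    """Adversarial target for the label predictor: push its posteriors toward a
    uniform distribution (cross-entropy against 1/K for every class)."""
    log_p = torch.log_softmax(logits, dim=-1)
    return -log_p.mean()
```

During adversarial training, this loss would replace the usual one-hot cross-entropy for the label predictor so that the feature extractor is encouraged to retain class-discriminative information in its embedded features.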
Subjects: Speech processing systems
Pattern recognition systems
Hong Kong Polytechnic University -- Dissertations
Pages: vi, 64 pages : color illustrations
Appears in Collections:Thesis

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.