Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/93563
Title: Deep speaker embedding for robust speaker verification
Authors: Tu, Youzhi
Degree: Ph.D.
Issue Date: 2022
Abstract: Speaker verification (SV) aims to determine whether the speaker identity of a test utterance matches that of a target speaker. In SV, the identity of a variable-length utterance is typically represented by a fixed-dimensional vector. This vector or its modeling process is referred to as speaker embedding. Although state-of-the-art deep speaker embedding has achieved outstanding performance, deploying SV systems in adverse acoustic environments still faces a number of challenges. First, today's SV systems rely on the condition that the training and test data share the same distribution. Once this condition is violated, domain mismatch occurs. The problem is exacerbated when the speaker embeddings violate the Gaussianity constraint. Second, because the temporal feature maps produced by the last frame-level layer are highly non-stationary, it is not desirable to use their global statistics as speaker embeddings. Third, current speaker embedding networks have no mechanism that lets the frame-level information flow directly into the embedding layer, causing information loss in the pooling layer.
This thesis develops three strategies to address the above challenges. First, to jointly address domain mismatch and the Gaussianity requirement of probabilistic linear discriminant analysis (PLDA) models, the author proposes a variational domain adversarial learning framework with two specialized networks: the variational domain adversarial neural network (VDANN) and the information-maximized VDANN (InfoVDANN). Both networks leverage domain adversarial training to produce speaker-discriminative and domain-invariant embeddings and apply variational autoencoders (VAEs) to perform Gaussian regularization. The InfoVDANN, in particular, avoids posterior collapse in VDANNs by preserving the mutual information (MI) between the domain-invariant embeddings and the speaker embeddings. Second, to mitigate the effect of non-stationarity in the temporal feature maps, the author proposes short-time spectral pooling (STSP) and attentive STSP to transform the temporal feature maps into the spectral domain through the short-time Fourier transform (STFT). The zero- and low-frequency components are retained to preserve speaker information. A segment-level attention mechanism is designed to produce spectral representations with fewer variations, which results in better robustness to the non-stationary effect in the feature maps. Third, to allow information in the frame-level layers to flow directly to the speaker embedding layer, MI-enhanced training based on a semi-supervised deep InfoMax (DIM) framework is proposed. Because the dimensionality of the frame-level features is much larger than that of the speaker embeddings, the author proposes to squeeze the frame-level features via global pooling before MI estimation. The proposed method, called squeeze-DIM, effectively balances the dimensionality of the frame-level features and the speaker embeddings.
The proposed methods are evaluated on VoxCeleb1, VOiCES 2019, SRE16, and SRE18-CMN2. Results show that the VDANN and InfoVDANN outperform the DANN baseline, indicating the effectiveness of Gaussian regularization and MI maximization. Attentive STSP achieves the largest performance gains, suggesting that applying segment-level attention and leveraging the low spectral components of temporal feature maps can produce discriminative speaker embeddings. Finally, squeeze-DIM outperforms DIM regularization, suggesting that the squeeze operation facilitates MI maximization.
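As a rough illustration of the short-time spectral pooling idea described in the abstract, the sketch below applies an STFT along the time axis of a frame-level feature map and keeps only the zero- and low-frequency components. It is not the thesis's formulation: the function name, the Hamming window, the window and hop sizes, the use of magnitude spectra, and the plain averaging over segments (in place of the segment-level attention) are all assumptions made for this example.

```python
import numpy as np


def short_time_spectral_pooling(feature_map, win_len=8, hop=4, num_freqs=2):
    """Pool a (channels, frames) feature map into a fixed-length vector.

    Hypothetical sketch: window the time axis into short segments, take the
    magnitude spectrum of each segment, keep the lowest `num_freqs` bins
    (bin 0 is the zero-frequency component), and average over segments.
    """
    channels, frames = feature_map.shape
    if frames < win_len:
        raise ValueError("feature map is shorter than one analysis window")

    window = np.hamming(win_len)
    starts = range(0, frames - win_len + 1, hop)

    # Shape: (channels, segments, win_len)
    segments = np.stack(
        [feature_map[:, s:s + win_len] * window for s in starts], axis=1
    )

    # Magnitude spectrum of each segment: (channels, segments, win_len // 2 + 1)
    spectrum = np.abs(np.fft.rfft(segments, axis=-1))

    # Retain only the zero- and low-frequency components.
    low = spectrum[:, :, :num_freqs]

    # Assumption: uniform averaging over segments stands in for attention.
    pooled = low.mean(axis=1)  # (channels, num_freqs)
    return pooled.reshape(-1)


# Example: 3 channels, 20 frames of constant activations.
embedding_input = short_time_spectral_pooling(np.ones((3, 20)))
```

The output length is channels times `num_freqs` regardless of the number of frames, which is what allows a variable-length utterance to be mapped to a fixed-dimensional representation.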
Subjects: Voice computing
Automatic speech recognition
Speech processing systems
Hong Kong Polytechnic University -- Dissertations
Pages: xviii, 128 pages : color illustrations
Appears in Collections: Thesis

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.