Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/88468
Title: Robust speaker recognition using deep neural networks
Authors: Lin, Weiwei
Degree: Ph.D.
Issue Date: 2020
Abstract: Speaker recognition refers to recognizing a person by his/her voice. Although state-of-the-art speaker recognition systems have shown remarkable performance, several problems remain unsolved. Firstly, the performance of speaker recognition systems degrades significantly when there is a domain mismatch between training data and test data. Domain mismatch is prevalent and is expected to happen during system deployment. It can occur when the new environment has some specific noise or involves speakers who speak languages different from those of the training speakers. Directly using the existing system in these situations could result in poor performance. Secondly, the statistics pooling layer in state-of-the-art systems does not have enough representation power to capture the complex characteristics of frame-level features. The statistics pooling layer only uses the mean and standard deviation of frame-level features. However, the mean and standard deviation are insufficient for summarizing a complex distribution. Thirdly, state-of-the-art systems still rely on a PLDA backend, which makes deployment difficult and hinders the potential of the DNN frontend. This thesis proposes several solutions to the problems mentioned above. To reduce the domain mismatch, this thesis proposes adaptation methods for both the DNN frontend and the PLDA backend. The proposed backend adaptation uses an auto-encoder to minimize the domain mismatch between i-vectors, while the frontend adaptation focuses on producing speaker embeddings that are both discriminative and domain-invariant. Using the proposed adaptation framework, we achieve EERs of 8.69% and 7.95% on NIST SRE 2016 and 2018, respectively, which are significantly better than those of previously proposed DNN adaptation methods.
For better frame-level information aggregation in the DNN, this thesis proposes an attention-based statistics pooling method, which uses an expectation-maximization (EM)-like algorithm to produce multiple means and standard deviations for summarizing the distribution of frame-level features. Applying the proposed attention mechanism to a 121-layer DenseNet, we achieve an EER of 1.1% on VoxCeleb1 and an EER of 4.77% on the VOiCES 2019 evaluation set. To facilitate end-to-end speaker recognition, this thesis proposes several strategies to eliminate the need for a backend model. Experiments on NIST SRE 2016 and 2018 show that with the proposed strategies, the DNN can achieve state-of-the-art performance using simple cosine similarity while requiring only half of the computational cost of the x-vector network.
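The two baseline operations named in the abstract, conventional statistics pooling (mean and standard deviation only) and cosine-similarity scoring, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names and toy frame values are hypothetical, and in a real system the pooling is applied to DNN hidden-layer outputs and the embedding is taken from later layers, not directly from the pooled vector.

```python
import math


def statistics_pooling(frames):
    """Collapse a variable-length list of frame-level feature vectors into
    one fixed-length vector: per-dimension means followed by per-dimension
    (population) standard deviations."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


def cosine_score(a, b):
    """Cosine similarity between two fixed-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy utterances: 2-dimensional frames, different lengths.
utt_a = [[1.0, 2.0], [3.0, 4.0]]
utt_b = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]]

emb_a = statistics_pooling(utt_a)  # length 4 regardless of frame count
emb_b = statistics_pooling(utt_b)
score = cosine_score(emb_a, emb_b)
```

Because only one mean and one standard deviation per dimension survive, two utterances with very different frame distributions can pool to similar vectors, which is the limitation the attention-based pooling described above is meant to address.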
Subjects: Automatic speech recognition
Hong Kong Polytechnic University -- Dissertations
Pages: xvii, 113 pages : color illustrations
Appears in Collections: Thesis

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.