A framework for adapting DNN speaker embedding across languages

Lin, W; Mak, MW; Li, N; Su, D; Yu, D

doi:10.1109/TASLP.2020.3030499

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/107121

Title:	A framework for adapting DNN speaker embedding across languages
Authors:	Lin, W; Mak, MW; Li, N; Su, D; Yu, D
Issue Date:	2020
Source:	IEEE/ACM transactions on audio, speech, and language processing, 2020, v. 28, p. 2810-2822
Abstract:	Language mismatch remains a major hindrance to the extensive deployment of speaker verification (SV) systems. Current language adaptation methods in SV mainly rely on linear projection in embedding space; i.e., adaptation is carried out after the speaker embeddings have been created, which underutilizes the powerful representation of deep neural networks. This article proposes a maximum mean discrepancy (MMD) based framework for adapting deep neural network (DNN) speaker embedding across languages, featuring multi-level domain loss, separate batch normalization, and consistency regularization. We refer to the framework as MSC. We show that (1) minimizing domain discrepancy at both frame- and utterance-levels performs significantly better than at utterance-level alone; (2) separating the source-domain data from the target-domain in batch normalization improves adaptation performance; and (3) data augmentation can be utilized in the unlabelled target-domain through consistency regularization. By combining these findings, we achieve an EER of 8.69% and 7.95% in NIST SRE 2016 and 2018, respectively, which are significantly better than the previously proposed DNN adaptation methods. Our framework also works well with backend adaptation. By combining the proposed framework with backend adaptation, we achieve an 11.8% improvement over the backend adaptation in SRE18. When applying our framework to a 121-layer DenseNet, we achieve an EER of 7.81% and 7.02% in NIST SRE 2016 and 2018, respectively.
Keywords:	Data augmentation; Domain adaptation; Maximum mean discrepancy; Speaker verification (SV); Transfer learning
Publisher:	Institute of Electrical and Electronics Engineers
Journal:	IEEE/ACM transactions on audio, speech, and language processing
ISSN:	2329-9290
EISSN:	2329-9304
DOI:	10.1109/TASLP.2020.3030499
Rights:	© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The following publication W. Lin, M.-W. Mak, N. Li, D. Su and D. Yu, "A Framework for Adapting DNN Speaker Embedding Across Languages," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2810-2822, 2020 is available at https://doi.org/10.1109/TASLP.2020.3030499.
Appears in Collections:	Journal/Magazine Article
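
Illustrative note: the abstract above builds its domain loss on maximum mean discrepancy (MMD). The sketch below shows only the general idea of a squared-MMD estimate between two sets of vectors, such as source-language and target-language embeddings. It is not the authors' implementation: the Gaussian kernel, the bandwidth `sigma`, the biased estimator, and the toy data are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two sets of vectors."""
    return (mean_kernel(source, source, sigma)
            + mean_kernel(target, target, sigma)
            - 2.0 * mean_kernel(source, target, sigma))


# Toy 2-D "embeddings" (made up): one set near the origin, one shifted away.
source = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
shifted = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1]]

same = mmd_squared(source, source)   # identical sets
far = mmd_squared(source, shifted)   # mismatched sets
```

In this toy setting the estimate is zero for identical sets and grows as the two sets move apart, which is the property a domain loss exploits. According to the abstract, the framework minimizes such a discrepancy at both the frame and utterance levels inside the network; how that is done is described in the paper itself, not here.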

Files in This Item:

File	Description	Size	Format
Lin_Framework_Adapting_Dnn.pdf	Pre-Published version	796.25 kB	Adobe PDF

Open Access Information

Status	open access
File Version	Final Accepted Manuscript

Page views:	92 (last week: 2), as of Apr 12, 2026
Downloads:	158, as of Apr 12, 2026
Scopus citations:	15, as of May 8, 2026
Web of Science citations:	13, as of Apr 23, 2026
