Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/111709
View/Download Full Text
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | -
dc.creator | Tu, Y | -
dc.creator | Mak, MW | -
dc.date.accessioned | 2025-03-13T02:22:10Z | -
dc.date.available | 2025-03-13T02:22:10Z | -
dc.identifier.uri | http://hdl.handle.net/10397/111709 | -
dc.description | 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, Czechia, August 30 - September 3, 2021 | en_US
dc.language.iso | en | en_US
dc.publisher | International Speech Communication Association | en_US
dc.rights | Copyright © 2021 ISCA | en_US
dc.rights | The following publication Tu, Y., Mak, M.-W. (2021) Mutual Information Enhanced Training for Speaker Embedding. Proc. Interspeech 2021, 91-95 is available at https://doi.org/10.21437/Interspeech.2021-1436. | en_US
dc.title | Mutual information enhanced training for speaker embedding | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 91 | -
dc.identifier.epage | 95 | -
dc.identifier.doi | 10.21437/Interspeech.2021-1436 | -
dcterms.abstract | Mutual information (MI) is useful in unsupervised and self-supervised learning. Maximizing the MI between low-level features and the learned embeddings can preserve meaningful information in the embeddings, which can translate into performance gains. This strategy is called deep InfoMax (DIM) in representation learning. In this paper, we follow the DIM framework so that the speaker embeddings can capture more information from the frame-level features. However, a straightforward implementation of DIM poses a dimensionality-imbalance problem, because the dimensionality of the frame-level features is much larger than that of the speaker embeddings. This imbalance can lead to unreliable MI estimation and can even be detrimental to speaker verification. To overcome this problem, we propose to squeeze the frame-level features through global pooling before MI estimation. We call the proposed method squeeze-DIM. Although the squeeze operation inevitably introduces some information loss, we empirically show that squeeze-DIM achieves performance gains on both the VoxCeleb1 and VOiCES-19 tasks. This suggests that the squeeze operation facilitates MI estimation and maximization in a dimensionally balanced space, which helps learn more informative speaker embeddings. (A code sketch of the squeeze-DIM idea follows this metadata table.) | -
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, p. 91-95 | -
dcterms.issued | 2021 | -
dc.identifier.scopus | 2-s2.0-85119248642 | -
dc.relation.conference | Conference of the International Speech Communication Association [INTERSPEECH] | -
dc.description.validate | 202503 bcch | -
dc.description.oa | Version of Record | en_US
dc.identifier.FolderNumber | OA_Others | en_US
dc.description.fundingSource | RGC | en_US
dc.description.fundingSource | Others | en_US
dc.description.fundingText | National Natural Science Foundation of China (NSFC) | en_US
dc.description.pubStatus | Published | en_US
dc.description.oaCategory | VoR allowed | en_US
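To make the squeeze-DIM idea in the abstract concrete, below is a minimal sketch in PyTorch. It is an illustration only, not the authors' implementation: the class name SqueezeDimLoss, the choice of mean pooling as the "squeeze", and the Jensen-Shannon MI estimator (a common choice in Deep InfoMax) are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SqueezeDimLoss(nn.Module):
    # Hypothetical sketch: DIM-style MI maximization between squeezed
    # frame-level features and the utterance-level speaker embedding.
    def __init__(self, feat_dim: int, emb_dim: int, hidden: int = 256):
        super().__init__()
        # Discriminator T(x, z) scores (squeezed-feature, embedding) pairs.
        self.disc = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, frame_feats: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, frames, feat_dim); emb: (batch, emb_dim)
        # "Squeeze": global mean pooling over frames collapses the
        # (frames x feat_dim) feature map to feat_dim, balancing the
        # dimensionality before MI estimation. (The paper evaluates
        # several global pooling methods; mean pooling is one option.)
        squeezed = frame_feats.mean(dim=1)

        # Positive pairs: features and embedding from the same utterance.
        pos = self.disc(torch.cat([squeezed, emb], dim=-1))
        # Negative pairs: embeddings shuffled across the batch.
        neg = self.disc(torch.cat([squeezed, emb.roll(1, dims=0)], dim=-1))

        # Jensen-Shannon lower bound on MI (as in Deep InfoMax); returning
        # its negative turns MI maximization into a loss term.
        mi_lower_bound = (-F.softplus(-pos)).mean() - F.softplus(neg).mean()
        return -mi_lower_bound

In training, a term like this would typically be added to the usual speaker-classification loss, with the pooling method (mean, max, or statistics pooling) treated as a design choice.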
Appears in Collections: Conference Paper
Files in This Item:
File | Description | Size | Format
tu21_interspeech.pdf | - | 297.96 kB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record

Page views: 6 (as of Apr 14, 2025)
Downloads: 2 (as of Apr 14, 2025)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.