Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/111709
View/Download Full Text
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | -
dc.creator | Tu, Y | -
dc.creator | Mak, MW | -
dc.date.accessioned | 2025-03-13T02:22:10Z | -
dc.date.available | 2025-03-13T02:22:10Z | -
dc.identifier.uri | http://hdl.handle.net/10397/111709 | -
dc.description | 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, Czechia, August 30 - September 3, 2021 | en_US
dc.language.iso | en | en_US
dc.publisher | International Speech Communication Association | en_US
dc.rights | Copyright © 2021 ISCA | en_US
dc.rights | The following publication Tu, Y., Mak, M.-W. (2021) Mutual Information Enhanced Training for Speaker Embedding. Proc. Interspeech 2021, 91-95 is available at https://doi.org/10.21437/Interspeech.2021-1436. | en_US
dc.title | Mutual information enhanced training for speaker embedding | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 91 | -
dc.identifier.epage | 95 | -
dc.identifier.doi | 10.21437/Interspeech.2021-1436 | -
dcterms.abstract | Mutual information (MI) is useful in unsupervised and self-supervised learning. Maximizing the MI between low-level features and the learned embeddings can preserve meaningful information in the embeddings, which can translate into performance gains. This strategy is called deep InfoMax (DIM) in representation learning. In this paper, we follow the DIM framework so that the speaker embeddings can capture more information from the frame-level features. However, a straightforward implementation of DIM poses a dimensionality-imbalance problem, because the dimensionality of the frame-level features is much larger than that of the speaker embeddings. This imbalance can lead to unreliable MI estimation and can even be detrimental to speaker verification. To overcome this problem, we propose to squeeze the frame-level features through global pooling before MI estimation. We call the proposed method squeeze-DIM. Although the squeeze operation inevitably introduces some information loss, we empirically show that squeeze-DIM achieves performance gains on both the VoxCeleb1 and VOiCES-19 tasks. This suggests that the squeeze operation facilitates MI estimation and maximization in a dimensionally balanced space, which helps learn more informative speaker embeddings. (A code sketch of the squeeze-DIM idea follows this metadata table.) | -
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, p. 91-95 | -
dcterms.issued | 2021 | -
dc.identifier.scopus | 2-s2.0-85119248642 | -
dc.relation.conference | Conference of the International Speech Communication Association [INTERSPEECH] | -
dc.description.validate | 202503 bcch | -
dc.description.oa | Version of Record | en_US
dc.identifier.FolderNumber | OA_Others | en_US
dc.description.fundingSource | RGC | en_US
dc.description.fundingSource | Others | en_US
dc.description.fundingText | National Natural Science Foundation of China (NSFC) | en_US
dc.description.pubStatus | Published | en_US
dc.description.oaCategory | VoR allowed | en_US
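To make the squeeze-DIM idea in the abstract concrete, below is a minimal sketch in PyTorch. It is an illustration only, not the authors' implementation: the class name SqueezeDimLoss, the choice of mean pooling as the "squeeze", and the Jensen-Shannon MI estimator (a common choice in Deep InfoMax) are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SqueezeDimLoss(nn.Module):
    # Hypothetical sketch: DIM-style MI maximization between squeezed
    # frame-level features and the utterance-level speaker embedding.
    def __init__(self, feat_dim: int, emb_dim: int, hidden: int = 256):
        super().__init__()
        # Discriminator T(x, z) scores (squeezed-feature, embedding) pairs.
        self.disc = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, frame_feats: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, frames, feat_dim); emb: (batch, emb_dim)
        # "Squeeze": global mean pooling over frames collapses the
        # (frames x feat_dim) feature map to feat_dim, balancing the
        # dimensionality before MI estimation. (The paper evaluates
        # several global pooling methods; mean pooling is one option.)
        squeezed = frame_feats.mean(dim=1)

        # Positive pairs: features and embedding from the same utterance.
        pos = self.disc(torch.cat([squeezed, emb], dim=-1))
        # Negative pairs: embeddings shuffled across the batch.
        neg = self.disc(torch.cat([squeezed, emb.roll(1, dims=0)], dim=-1))

        # Jensen-Shannon lower bound on MI (as in Deep InfoMax); returning
        # its negative turns MI maximization into a loss term.
        mi_lower_bound = (-F.softplus(-pos)).mean() - F.softplus(neg).mean()
        return -mi_lower_bound

In training, a term like this would typically be added to the usual speaker-classification loss, with the pooling method (mean, max, or statistics pooling) treated as a design choice.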
Appears in Collections: Conference Paper
Files in This Item:
File | Description | Size | Format
tu21_interspeech.pdf | - | 297.96 kB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record

Page views: 6 (as of Apr 14, 2025)
Downloads: 2 (as of Apr 14, 2025)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.