Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/114610
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | -
dc.creator | Jin, Z | -
dc.creator | Tu, Y | -
dc.creator | Mak, MW | -
dc.date.accessioned | 2025-08-18T03:02:13Z | -
dc.date.available | 2025-08-18T03:02:13Z | -
dc.identifier.uri | http://hdl.handle.net/10397/114610 | -
dc.description | Interspeech 2024, 1-5 September 2024, Kos, Greece | en_US
dc.language.iso | en | en_US
dc.publisher | International Speech Communication Association | en_US
dc.rights | The following publication Jin, Z., Tu, Y., Mak, M.-W. (2024) Self-Supervised Learning with Multi-Head Multi-Mode Knowledge Distillation for Speaker Verification. Proc. Interspeech 2024, 4723-4727 is available at https://doi.org/10.21437/Interspeech.2024-360. | en_US
dc.subject | Cross-distillation | en_US
dc.subject | DINO | en_US
dc.subject | Knowledge distillation | en_US
dc.subject | Self-supervised learning | en_US
dc.subject | Speaker verification | en_US
dc.title | Self-supervised learning with multi-head multi-mode knowledge distillation for speaker verification | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 4723 | -
dc.identifier.epage | 4727 | -
dc.identifier.doi | 10.21437/Interspeech.2024-360 | -
dcterms.abstract | Training speaker verification (SV) systems without labeled data is challenging. To tackle the challenge, we propose Multi-Head, Multi-Mode (MeMo) self-supervised learning based on knowledge distillation. Unlike DINO, the teacher in MeMo uses two distinct architectures to learn collaboratively, and so does the student. MeMo employs two distillation modes: self- and cross-distillation, with the teacher and student having the same and different architectures, respectively. To reduce the output discrepancy caused by different architectures, we divide the projection head into self- and cross-heads so that each head is responsible for distillation in its respective mode. We also discover that contrastive learning at the embedding level is supportive only in the early training stages. To address this issue, we propose dynamically stopping the contrastive learning while continuing knowledge distillation. MeMo achieves an impressive EER of 3.10% on VoxCeleb1 using a small ECAPA-TDNN backbone. (An illustrative sketch of the distillation scheme follows this record.) | -
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2024, p. 4723-4727 | -
dcterms.issued | 2024 | -
dc.identifier.scopus | 2-s2.0-85214827410 | -
dc.description.validate | 202508 bcch | -
dc.description.oa | Version of Record | en_US
dc.identifier.FolderNumber | OA_Others | en_US
dc.description.fundingSource | RGC | en_US
dc.description.pubStatus | Published | en_US
dc.description.oaCategory | VoR allowed | en_US
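
The abstract above outlines a multi-head, multi-mode distillation scheme: two distinct architectures in both teacher and student, self-distillation within the same architecture, cross-distillation across architectures through separate projection heads, and an embedding-level contrastive term that is switched off later in training. The following is a minimal, hypothetical PyTorch sketch of that idea only, not the authors' code: the MLP encoders stand in for real speaker backbones such as ECAPA-TDNN, and the class names (Branch, MeMoSketch), head sizes, temperatures, EMA rate, centering coefficients, and the boolean flag replacing the paper's dynamic stopping criterion are all illustrative assumptions.

```python
# Hypothetical sketch of MeMo-style multi-head, multi-mode distillation (not the authors' code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """DINO-style distillation: cross-entropy between the centred, sharpened
    teacher distribution and the student's log-probabilities."""
    t = F.softmax((teacher_logits - center) / t_teacher, dim=-1).detach()
    s = F.log_softmax(student_logits / t_student, dim=-1)
    return -(t * s).sum(dim=-1).mean()


class Branch(nn.Module):
    """One architecture: an encoder plus separate self- and cross-distillation heads."""

    def __init__(self, feat_dim=80, emb_dim=192, out_dim=1024):
        super().__init__()
        # Stand-in MLP encoder; the paper uses real speaker backbones (e.g. ECAPA-TDNN).
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, emb_dim))
        self.self_head = nn.Sequential(nn.Linear(emb_dim, 256), nn.GELU(),
                                       nn.Linear(256, out_dim))   # self-distillation head
        self.cross_head = nn.Sequential(nn.Linear(emb_dim, 256), nn.GELU(),
                                        nn.Linear(256, out_dim))  # cross-distillation head

    def forward(self, x):
        emb = self.encoder(x)
        return emb, self.self_head(emb), self.cross_head(emb)


class MeMoSketch(nn.Module):
    """Two student branches (architectures A and B) with an EMA teacher copy."""

    def __init__(self, out_dim=1024, ema=0.996):
        super().__init__()
        self.student = nn.ModuleList([Branch(out_dim=out_dim), Branch(out_dim=out_dim)])
        self.teacher = copy.deepcopy(self.student)
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.register_buffer("center", torch.zeros(out_dim))
        self.ema = ema

    @torch.no_grad()
    def update_teacher(self):
        # Momentum (EMA) update of the teacher from the student, as in DINO.
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.ema).add_(ps, alpha=1.0 - self.ema)

    def forward(self, view_s, view_t, use_contrastive=True):
        with torch.no_grad():
            t_out = [branch(view_t) for branch in self.teacher]  # (emb, self, cross) per arch
        s_out = [branch(view_s) for branch in self.student]

        loss = 0.0
        for a in range(2):
            b = 1 - a
            # Self-distillation: same architecture, matched through the self-heads.
            loss = loss + dino_loss(s_out[a][1], t_out[a][1], self.center)
            # Cross-distillation: the other architecture's teacher, through the cross-heads.
            loss = loss + dino_loss(s_out[a][2], t_out[b][2], self.center)

        if use_contrastive:
            # Embedding-level InfoNCE-like term; the paper stops this dynamically once it
            # no longer helps (modelled here as a simple flag).
            emb_s = F.normalize(s_out[0][0], dim=-1)
            emb_t = F.normalize(t_out[0][0], dim=-1)
            logits = emb_s @ emb_t.t() / 0.07
            labels = torch.arange(emb_s.size(0), device=logits.device)
            loss = loss + F.cross_entropy(logits, labels)

        # DINO-style running centre estimated from the teacher's self-head outputs.
        with torch.no_grad():
            batch_center = torch.cat([t_out[0][1], t_out[1][1]]).mean(dim=0)
            self.center.mul_(0.9).add_(batch_center, alpha=0.1)
        return loss


if __name__ == "__main__":
    model = MeMoSketch()
    x1, x2 = torch.randn(8, 80), torch.randn(8, 80)  # two augmented views of a batch
    loss = model(x1, x2, use_contrastive=True)       # set False after the early stage
    loss.backward()
    model.update_teacher()
    print(float(loss))
```

In this sketch the dynamic stopping of contrastive learning is reduced to the use_contrastive flag; in practice a training-progress criterion would decide when to set it to False while the distillation terms continue.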
Appears in Collections: Conference Paper
Files in This Item:
File | Description | Size | Format
jin24c_interspeech.pdf |  | 862.66 kB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record