Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/114610
DC Field | Value | Language |
---|---|---|
dc.contributor | Department of Electrical and Electronic Engineering | - |
dc.creator | Jin, Z | - |
dc.creator | Tu, Y | - |
dc.creator | Mak, MW | - |
dc.date.accessioned | 2025-08-18T03:02:13Z | - |
dc.date.available | 2025-08-18T03:02:13Z | - |
dc.identifier.uri | http://hdl.handle.net/10397/114610 | - |
dc.description | Interspeech 2024, 1-5 September 2024, Kos, Greece | en_US |
dc.language.iso | en | en_US |
dc.publisher | International Speech Communication Association | en_US |
dc.rights | The following publication Jin, Z., Tu, Y., Mak, M.-W. (2024) Self-Supervised Learning with Multi-Head Multi-Mode Knowledge Distillation for Speaker Verification. Proc. Interspeech 2024, 4723-4727 is available at https://doi.org/10.21437/Interspeech.2024-360. | en_US |
dc.subject | Cross-distillation | en_US |
dc.subject | DINO | en_US |
dc.subject | Knowledge distillation | en_US |
dc.subject | Self-supervised learning | en_US |
dc.subject | Speaker verification | en_US |
dc.title | Self-supervised learning with multi-head multi-mode knowledge distillation for speaker verification | en_US |
dc.type | Conference Paper | en_US |
dc.identifier.spage | 4723 | - |
dc.identifier.epage | 4727 | - |
dc.identifier.doi | 10.21437/Interspeech.2024-360 | - |
dcterms.abstract | Training speaker verification (SV) systems without labeled data is challenging. To tackle this challenge, we propose Multi-Head, Multi-Mode (MeMo) self-supervised learning based on knowledge distillation. Unlike DINO, the teacher in MeMo comprises two distinct architectures that learn collaboratively, and so does the student. MeMo employs two distillation modes: self-distillation and cross-distillation, in which the teacher and student have the same and different architectures, respectively. To reduce the output discrepancy caused by the differing architectures, we split the projection head into a self-head and a cross-head, each responsible for distillation in its respective mode. We also discover that contrastive learning at the embedding level is helpful only in the early training stages. To address this issue, we propose dynamically stopping the contrastive learning while continuing knowledge distillation. MeMo achieves an impressive EER of 3.10% on VoxCeleb1 using a small ECAPA-TDNN backbone. (See the loss-structure sketch after this record.) | - |
dcterms.accessRights | open access | en_US |
dcterms.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2024, p. 4723-4727 | - |
dcterms.issued | 2024 | - |
dc.identifier.scopus | 2-s2.0-85214827410 | - |
dc.description.validate | 202508 bcch | - |
dc.description.oa | Version of Record | en_US |
dc.identifier.FolderNumber | OA_Others | en_US |
dc.description.fundingSource | RGC | en_US |
dc.description.pubStatus | Published | en_US |
dc.description.oaCategory | VoR allowed | en_US |
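
The abstract above describes the MeMo training objective only in prose. Below is a minimal, hedged sketch of how the two distillation modes and the split projection heads could be wired together, assuming PyTorch and DINO-style temperature-sharpened distillation. All names (`TwoHeadBranch`, `memo_step`, `distill_loss`, the temperatures, and the cosine-based contrastive term) are illustrative assumptions and are not taken from the authors' implementation.

```python
# Hedged sketch of multi-head, multi-mode knowledge distillation, loosely
# following the MeMo description in the abstract. Hyperparameters and helper
# names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    """DINO-style cross-entropy between a sharpened teacher distribution
    and the student's log-probabilities."""
    t = F.softmax(teacher_logits / tau_t, dim=-1).detach()
    s = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

class TwoHeadBranch(nn.Module):
    """One backbone with separate projection heads for self- and
    cross-distillation, as suggested by the 'self-head / cross-head' split."""
    def __init__(self, backbone, emb_dim=192, out_dim=8192):
        super().__init__()
        self.backbone = backbone
        self.self_head = nn.Linear(emb_dim, out_dim)
        self.cross_head = nn.Linear(emb_dim, out_dim)

    def forward(self, x):
        e = self.backbone(x)  # speaker embedding
        return e, self.self_head(e), self.cross_head(e)

def memo_step(student_a, student_b, teacher_a, teacher_b,
              x_student, x_teacher, use_contrastive=True):
    """One training step combining self-distillation (same-architecture
    teacher/student pairs) and cross-distillation (different-architecture
    pairs); the embedding-level contrastive term can be switched off once
    it stops helping, mirroring the dynamic-stopping idea."""
    ea, sa_self, sa_cross = student_a(x_student)
    eb, sb_self, sb_cross = student_b(x_student)
    with torch.no_grad():  # teachers receive no gradients
        _, ta_self, ta_cross = teacher_a(x_teacher)
        _, tb_self, tb_cross = teacher_b(x_teacher)

    # Self-distillation: teacher and student share the same architecture (a->a, b->b).
    loss = distill_loss(sa_self, ta_self) + distill_loss(sb_self, tb_self)
    # Cross-distillation: teacher and student use different architectures (b->a, a->b).
    loss = loss + distill_loss(sa_cross, tb_cross) + distill_loss(sb_cross, ta_cross)

    if use_contrastive:
        # Embedding-level contrastive term, stopped dynamically later in training.
        loss = loss + 1.0 - F.cosine_similarity(ea, eb, dim=-1).mean()
    return loss
```

In DINO-style training the teachers would typically be exponential-moving-average copies of the students rather than independently trained networks, which is why their outputs are wrapped in `torch.no_grad()` above; whether MeMo follows exactly this update rule is not stated in this record.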
Appears in Collections: Conference Paper
Files in This Item:
File | Description | Size | Format
---|---|---|---
jin24c_interspeech.pdf | | 862.66 kB | Adobe PDF