Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/114106
Title: Mutual information-enhanced contrastive learning with margin for maximal speaker separability
Authors: Li, Z 
Mak, MW 
Pilanci, M
Meng, H
Issue Date: 2025
Source: IEEE transactions on audio, speech and language processing, 2025, v. 33, p. 2961-2972
Abstract: Contrastive learning across different augmentations of the same utterance can enhance the ability of speaker representations to distinguish unseen speakers. This paper introduces a supervised contrastive learning objective that optimizes a speaker embedding space using the label information of the training data. Besides augmenting different segments of an utterance to form a positive pair, our approach generates multiple positive pairs by augmenting different utterances from the same speaker. However, employing contrastive learning for speaker verification (SV) presents two challenges: (1) the softmax loss is ineffective in reducing intra-class variation, and (2) previous research has shown that contrastive learning captures the information shared across the augmented views of an object but may discard task-relevant non-shared information, suggesting that it is essential to retain non-shared speaker information across the augmented views when constructing a speaker representation space. To overcome the first challenge, we incorporate an additive angular margin into the contrastive loss. For the second challenge, we maximize the mutual information (MI) between the squeezed low-level features and the speaker representations to retain the non-shared information. Evaluations on VoxCeleb, CN-Celeb, and CU-MARVEL confirm that the new learning objective enables ECAPA-TDNN to learn an embedding space with robust speaker discrimination.
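
For readers who want a concrete picture of the margin-enhanced supervised contrastive objective described in the abstract, the following PyTorch sketch applies an additive angular margin to the positive pairs of a supervised contrastive loss. The function name, margin, and temperature values are illustrative assumptions rather than the authors' released implementation, and the mutual-information term between low-level features and speaker embeddings is omitted.

```python
# Hypothetical sketch: supervised contrastive loss with an additive angular
# margin on positive pairs. Hyper-parameters are illustrative only.
import torch
import torch.nn.functional as F


def aam_supcon_loss(embeddings: torch.Tensor,
                    labels: torch.Tensor,
                    margin: float = 0.2,
                    temperature: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss where each positive (same-speaker) pair
    receives an additive angular margin, tightening intra-speaker variation.

    embeddings: (N, D) speaker embeddings from augmented utterances.
    labels:     (N,) integer speaker labels.
    """
    z = F.normalize(embeddings, dim=1)               # unit-length embeddings
    cos = (z @ z.t()).clamp(-1 + 1e-7, 1 - 1e-7)     # pairwise cosine similarity

    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye

    # Additive angular margin: replace cos(theta) by cos(theta + m) for positives.
    theta = torch.acos(cos)
    cos_margin = torch.cos(theta + margin)
    logits = torch.where(pos_mask, cos_margin, cos) / temperature

    # Exclude self-similarity from the softmax denominator.
    logits = logits.masked_fill(eye, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average the log-likelihood over each anchor's positives.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count
    return loss.mean()


if __name__ == "__main__":
    z = torch.randn(8, 192)                          # e.g. ECAPA-TDNN embeddings
    y = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])       # two augmented views per speaker
    print(aam_supcon_loss(z, y))
```

In this sketch, each mini-batch contains several augmented utterances per speaker, so every anchor has multiple positives, matching the multi-positive construction described in the abstract; the margin is applied only to positive pairs so that same-speaker similarities must exceed the margin-shifted threshold before the loss saturates.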
Keywords: Additive angular margin
Contrastive learning
Mutual information
Speaker verification
Publisher: Institute of Electrical and Electronics Engineers
Journal: IEEE transactions on audio, speech and language processing 
ISSN: 1558-7916
EISSN: 1558-7924
DOI: 10.1109/TASLPRO.2025.3583485
Rights: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
The following publication Z. Li, M. -W. Mak, M. Pilanci and H. Meng, "Mutual Information-Enhanced Contrastive Learning With Margin for Maximal Speaker Separability," in IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 2961-2972, 2025 is available at https://doi.org/10.1109/TASLPRO.2025.3583485.
Appears in Collections: Journal/Magazine Article

Files in This Item:
File: Li_Mutual_Information_Enhanced.pdf
Description: Pre-Published version
Size: 2.05 MB
Format: Adobe PDF
Open Access Information
Status: Open access
File Version: Final Accepted Manuscript
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.