Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/106862
Title: Contrastive self-supervised speaker embedding with sequential disentanglement
Authors: Tu, Y; Mak, MW; Chien, JT
Issue Date: 2024
Source: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, v. 32, p. 2704-2715
Abstract: Contrastive self-supervised learning has been widely used in speaker embedding to address the labeling challenge. Contrastive speaker embedding assumes that the contrast between the positive and negative pairs of speech segments is attributed to speaker identity only. However, this assumption is incorrect because speech signals contain not only speaker identity but also linguistic content. In this paper, we propose a contrastive learning framework with sequential disentanglement to remove linguistic content by incorporating a disentangled sequential variational autoencoder (DSVAE) into the conventional contrastive learning framework. The DSVAE aims to disentangle speaker factors from content factors in an embedding space so that the speaker factors become the main contributor to the contrastive loss. Because content factors have been removed from contrastive learning, the resulting speaker embeddings will be content-invariant. The learned embeddings are also robust to language mismatch. It is shown that the proposed method consistently outperforms the conventional contrastive speaker embedding on the VoxCeleb1 and CN-Celeb datasets. This finding suggests that applying sequential disentanglement is beneficial to learning speaker-discriminative embeddings.
Keywords: Contrastive learning; Disentangled representation learning; Speaker embedding; Speaker verification; Variational autoencoder
Publisher: Institute of Electrical and Electronics Engineers
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
ISSN: 2329-9290
EISSN: 2329-9304
DOI: 10.1109/TASLP.2024.3402077
Rights: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The following publication Y. Tu, M. -W. Mak and J. -T. Chien, "Contrastive Self-Supervised Speaker Embedding With Sequential Disentanglement," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2704-2715, 2024 is available at https://doi.org/10.1109/TASLP.2024.3402077.
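The abstract describes computing the contrastive loss on speaker factors only, after a DSVAE-style encoder has separated them from content factors. The following is a minimal NumPy sketch of that idea, not the paper's implementation: `split_embedding` is a hypothetical stand-in for the disentangling encoder, and the loss is a standard NT-Xent (InfoNCE) objective applied to the speaker part of the embedding.

```python
import numpy as np

def split_embedding(z, speaker_dim):
    """Toy stand-in for a disentangling encoder: treat the first
    `speaker_dim` dimensions as speaker factors and the rest as
    content factors (the split is illustrative, not the paper's)."""
    return z[..., :speaker_dim], z[..., speaker_dim:]

def nt_xent_loss(anchors, positives, temperature=0.1):
    """NT-Xent / InfoNCE loss computed on speaker factors only.
    anchors, positives: (N, D) arrays; row i of each is a positive
    pair, and all other rows act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal

# Two augmented views of 8 utterances, 192-dim embeddings (arbitrary sizes).
rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 192))
z2 = rng.normal(size=(8, 192))
s1, _ = split_embedding(z1, speaker_dim=128)  # keep speaker factors only
s2, _ = split_embedding(z2, speaker_dim=128)
loss = nt_xent_loss(s1, s2)
```

Because the content factors are discarded before the loss is evaluated, only the speaker factors can drive the contrast between positive and negative pairs, which is the intuition the abstract states.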
Appears in Collections: | Journal/Magazine Article |
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| Tu_Contrastive_Self-Supervised_Speaker.pdf | Pre-Published version | 2.58 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.