Disentangling speech representations learning with latent diffusion for speaker verification

Li, Z; Mak, MW; Chien, JT; Pilanci, M; Jin, Z; Meng, H

doi:10.1109/TASLPRO.2025.3610023

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/117433

DC Field	Value	Language
dc.contributor	Department of Electrical and Electronic Engineering	-
dc.creator	Li, Z	-
dc.creator	Mak, MW	-
dc.creator	Chien, JT	-
dc.creator	Pilanci, M	-
dc.creator	Jin, Z	-
dc.creator	Meng, H	-
dc.date.accessioned	2026-02-25T03:50:08Z	-
dc.date.available	2026-02-25T03:50:08Z	-
dc.identifier.uri	http://hdl.handle.net/10397/117433	-
dc.language.iso	en	en_US
dc.publisher	Institute of Electrical and Electronics Engineers	en_US
dc.rights	© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.	en_US
dc.rights	The following publication Z. Li, M. -W. Mak, J. -T. Chien, M. Pilanci, Z. Jin and H. Meng, 'Disentangling Speech Representations Learning With Latent Diffusion for Speaker Verification,' in IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 3896-3907, 2025 is available at https://doi.org/10.1109/TASLPRO.2025.3610023.	en_US
dc.subject	Disentangled speech representation	en_US
dc.subject	Latent diffusion model	en_US
dc.subject	Pre-trained speech model	en_US
dc.subject	Speaker verification	en_US
dc.subject	Variational autoencoder	en_US
dc.title	Disentangling speech representations learning with latent diffusion for speaker verification	en_US
dc.type	Journal/Magazine Article	en_US
dc.identifier.spage	3896	-
dc.identifier.epage	3907	-
dc.identifier.volume	33	-
dc.identifier.doi	10.1109/TASLPRO.2025.3610023	-
dcterms.abstract	Disentangled speech representation learning for speaker verification aims to separate spoken content and speaker timbre into distinct representations. However, existing variational autoencoder (VAE)–based methods for speech disentanglement rely on latent variables that lack semantic meaning, limiting their effectiveness for speaker verification. To address this limitation, we propose a diffusion-based method that disentangles and separates speaker features and speech content in the latent space. Building upon the VAE framework, we employ a speaker encoder to learn latent variables representing speaker features while using frame-specific latent variables to capture content. Unlike previous sequential VAE approaches, our method utilizes a conditional diffusion model in the latent space to derive speaker-aware representations. Experiments on the VoxCeleb and CN-Celeb datasets demonstrate that our method effectively isolates speaker features from speech content using pre-trained speech representations. The learned embeddings are robust to language mismatches since the speaker embeddings become content-invariant after content removal. Additionally, we design contrastive learning experiments showing that our training objective can enhance the learning of speaker-discriminative embeddings without relying on classification-based loss.	-
dcterms.accessRights	open access	en_US
dcterms.bibliographicCitation	IEEE transactions on audio, speech and language processing, 2025, v. 33, p. 3896-3907	-
dcterms.isPartOf	IEEE transactions on audio, speech and language processing	-
dcterms.issued	2025	-
dc.identifier.scopus	2-s2.0-105017701029	-
dc.identifier.eissn	2998-4173	-
dc.description.validate	202602 bcjz	-
dc.description.oa	Accepted Manuscript	en_US
dc.identifier.SubFormID	G001106/2025-11	en_US
dc.description.fundingSource	RGC	en_US
dc.description.fundingSource	Others	en_US
dc.description.fundingText	This work was supported in part by the Research Grants Council of Hong Kong, Theme-based Research Scheme under Grant T45-407/19-N, in part by GRF under Grant 15228223, and in part by the Research Student Attachment Programme of HKPolyU.	en_US
dc.description.pubStatus	Published	en_US
dc.description.oaCategory	Green (AAM)	en_US
Appears in Collections:	Journal/Magazine Article

Files in This Item:

File	Description	Size	Format
Li_Disentangling_Speech_Representations.pdf	Pre-Published version	2.96 MB	Adobe PDF

Open Access Information

Status	open access
File Version	Final Accepted Manuscript

