Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/117433
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | -
dc.creator | Li, Z | -
dc.creator | Mak, MW | -
dc.creator | Chien, JT | -
dc.creator | Pilanci, M | -
dc.creator | Jin, Z | -
dc.creator | Meng, H | -
dc.date.accessioned | 2026-02-25T03:50:08Z | -
dc.date.available | 2026-02-25T03:50:08Z | -
dc.identifier.uri | http://hdl.handle.net/10397/117433 | -
dc.language.iso | en | en_US
dc.publisher | Institute of Electrical and Electronics Engineers | en_US
dc.rights | © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | en_US
dc.rights | The following publication Z. Li, M. -W. Mak, J. -T. Chien, M. Pilanci, Z. Jin and H. Meng, 'Disentangling Speech Representations Learning With Latent Diffusion for Speaker Verification,' in IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 3896-3907, 2025 is available at https://doi.org/10.1109/TASLPRO.2025.3610023. | en_US
dc.subject | Disentangled speech representation | en_US
dc.subject | Latent diffusion model | en_US
dc.subject | Pre-trained speech model | en_US
dc.subject | Speaker verification | en_US
dc.subject | Variational autoencoder | en_US
dc.title | Disentangling speech representations learning with latent diffusion for speaker verification | en_US
dc.type | Journal/Magazine Article | en_US
dc.identifier.spage | 3896 | -
dc.identifier.epage | 3907 | -
dc.identifier.volume | 33 | -
dc.identifier.doi | 10.1109/TASLPRO.2025.3610023 | -
dcterms.abstract | Disentangled speech representation learning for speaker verification aims to separate spoken content and speaker timbre into distinct representations. However, existing variational autoencoder (VAE)–based methods for speech disentanglement rely on latent variables that lack semantic meaning, limiting their effectiveness for speaker verification. To address this limitation, we propose a diffusion-based method that disentangles and separates speaker features and speech content in the latent space. Building upon the VAE framework, we employ a speaker encoder to learn latent variables representing speaker features while using frame-specific latent variables to capture content. Unlike previous sequential VAE approaches, our method utilizes a conditional diffusion model in the latent space to derive speaker-aware representations. Experiments on the VoxCeleb and CN-Celeb datasets demonstrate that our method effectively isolates speaker features from speech content using pre-trained speech representations. The learned embeddings are robust to language mismatches since the speaker embeddings become content-invariant after content removal. Additionally, we design contrastive learning experiments showing that our training objective can enhance the learning of speaker-discriminative embeddings without relying on classification-based loss. | -
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | IEEE transactions on audio, speech and language processing, 2025, v. 33, p. 3896-3907 | -
dcterms.isPartOf | IEEE transactions on audio, speech and language processing | -
dcterms.issued | 2025 | -
dc.identifier.scopus | 2-s2.0-105017701029 | -
dc.identifier.eissn | 2998-4173 | -
dc.description.validate | 202602 bcjz | -
dc.description.oa | Accepted Manuscript | en_US
dc.identifier.SubFormID | G001106/2025-11 | en_US
dc.description.fundingSource | RGC | en_US
dc.description.fundingSource | Others | en_US
dc.description.fundingText | This work was supported in part by the Research Grants Council of Hong Kong, Theme-based Research Scheme under Grant T45-407/19-N, in part by GRF under Grant 15228223, and in part by the Research Student Attachment Programme of HKPolyU. | en_US
dc.description.pubStatus | Published | en_US
dc.description.oaCategory | Green (AAM) | en_US
Appears in Collections: Journal/Magazine Article
Files in This Item:
File | Description | Size | Format
Li_Disentangling_Speech_Representations.pdf | Pre-Published version | 2.96 MB | Adobe PDF
Open Access Information
Status: open access
File Version: Final Accepted Manuscript
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.