Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/106862
View/Download Full Text
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | -
dc.creator | Tu, Y | -
dc.creator | Mak, MW | -
dc.creator | Chien, JT | -
dc.date.accessioned | 2024-06-06T06:06:02Z | -
dc.date.available | 2024-06-06T06:06:02Z | -
dc.identifier.issn | 2329-9290 | -
dc.identifier.uri | http://hdl.handle.net/10397/106862 | -
dc.language.iso | en | en_US
dc.publisher | Institute of Electrical and Electronics Engineers | en_US
dc.rights | © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | en_US
dc.rights | The following publication Y. Tu, M.-W. Mak and J.-T. Chien, "Contrastive Self-Supervised Speaker Embedding With Sequential Disentanglement," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2704-2715, 2024 is available at https://doi.org/10.1109/TASLP.2024.3402077. | en_US
dc.subject | Contrastive learning | en_US
dc.subject | Disentangled representation learning | en_US
dc.subject | Speaker embedding | en_US
dc.subject | Speaker verification | en_US
dc.subject | Variational autoencoder | en_US
dc.title | Contrastive self-supervised speaker embedding with sequential disentanglement | en_US
dc.type | Journal/Magazine Article | en_US
dc.identifier.spage | 2704 | -
dc.identifier.epage | 2715 | -
dc.identifier.volume | 32 | -
dc.identifier.doi | 10.1109/TASLP.2024.3402077 | -
dcterms.abstract | Contrastive self-supervised learning has been widely used in speaker embedding to address the labeling challenge. Contrastive speaker embedding assumes that the contrast between the positive and negative pairs of speech segments is attributed to speaker identity only. However, this assumption is incorrect because speech signals contain not only speaker identity but also linguistic content. In this paper, we propose a contrastive learning framework with sequential disentanglement to remove linguistic content by incorporating a disentangled sequential variational autoencoder (DSVAE) into the conventional contrastive learning framework. The DSVAE aims to disentangle speaker factors from content factors in an embedding space so that the speaker factors become the main contributor to the contrastive loss. Because content factors have been removed from contrastive learning, the resulting speaker embeddings will be content-invariant. The learned embeddings are also robust to language mismatch. It is shown that the proposed method consistently outperforms the conventional contrastive speaker embedding on the VoxCeleb1 and CN-Celeb datasets. This finding suggests that applying sequential disentanglement is beneficial to learning speaker-discriminative embeddings. | -
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | IEEE/ACM transactions on audio, speech, and language processing, 2024, v. 32, p. 2704-2715 | -
dcterms.isPartOf | IEEE/ACM transactions on audio, speech, and language processing | -
dcterms.issued | 2024 | -
dc.identifier.scopus | 2-s2.0-85193518794 | -
dc.identifier.eissn | 2329-9304 | -
dc.description.validate | 202406 bcch | -
dc.description.oa | Accepted Manuscript | en_US
dc.identifier.FolderNumber | a2778 | en_US
dc.identifier.SubFormID | 48311 | en_US
dc.description.fundingSource | RGC | en_US
dc.description.pubStatus | Published | en_US
dc.description.oaCategory | Green (AAM) | en_US
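
The abstract above describes a contrastive objective computed only on the speaker factors that a disentangled sequential VAE (DSVAE) separates from content factors. Below is a minimal illustrative sketch of that idea in PyTorch; the names nt_xent_loss and dsvae_encoder, the temperature value, and the loss composition are assumptions for illustration, not the paper's actual implementation.

# Minimal sketch (PyTorch) of contrastive learning on disentangled speaker
# factors: the contrastive (NT-Xent) loss sees only the speaker factor of a
# factorized embedding, so content factors cannot contribute to the contrast.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent loss between two (N, D) batches of paired speaker factors."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                      # (2N, D)
    sim = z @ z.t() / tau                               # scaled cosine similarities
    n = z1.size(0)
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # exclude self-similarity
    # The positive for row i is row i + n (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Hypothetical usage: `dsvae_encoder` is assumed to split each segment into a
# speaker factor and content factors; only the speaker factor enters the
# contrastive loss, while an ELBO-style term (not shown) trains the DSVAE.
# spk1, _content1 = dsvae_encoder(segment_view1)
# spk2, _content2 = dsvae_encoder(segment_view2)
# loss = nt_xent_loss(spk1, spk2)
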
Appears in Collections: Journal/Magazine Article
Files in This Item:
File | Description | Size | Format
Tu_Contrastive_Self-Supervised_Speaker.pdf | Pre-Published version | 2.58 MB | Adobe PDF
Open Access Information
Status: open access
File Version: Final Accepted Manuscript

Page views: 5 (as of Jun 30, 2024)
Downloads: 15 (as of Jun 30, 2024)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.