Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/114611
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | -
dc.creator | Truong, DT | -
dc.creator | Tao, R | -
dc.creator | Nguyen, T | -
dc.creator | Luong, HT | -
dc.creator | Lee, KA | -
dc.creator | Chng, ES | -
dc.date.accessioned | 2025-08-18T03:02:14Z | -
dc.date.available | 2025-08-18T03:02:14Z | -
dc.identifier.uri | http://hdl.handle.net/10397/114611 | -
dc.description | Interspeech 2024, 1-5 September 2024, Kos, Greece | en_US
dc.language.iso | en | en_US
dc.publisher | International Speech Communication Association | en_US
dc.rights | The following publication Truong, D.-T., Tao, R., Nguyen, T., Luong, H.-T., Lee, K.A., Chng, E.S. (2024) Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection. Proc. Interspeech 2024, 537-541 is available at https://doi.org/10.21437/Interspeech.2024-659. | en_US
dc.subject | ASVspoof challenges | en_US
dc.subject | Attention learning | en_US
dc.subject | Synthetic speech detection | en_US
dc.title | Temporal-channel modeling in multi-head self-attention for synthetic speech detection | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 537 | -
dc.identifier.epage | 541 | -
dc.identifier.doi | 10.21437/Interspeech.2024-659 | -
dcterms.abstract | Recent synthetic speech detectors leveraging the Transformer model achieve superior performance compared to their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationships among input tokens. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA’s capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech. (An illustrative sketch of this idea follows the record below.) | -
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2024, p. 537-541 | -
dcterms.issued | 2024 | -
dc.identifier.scopus | 2-s2.0-85211361807 | -
dc.description.validate | 202508 bcch | -
dc.description.oaVersion | Version of Record | en_US
dc.identifier.FolderNumber | OA_Others | en_US
dc.description.fundingSource | Others | en_US
dc.description.fundingText | The National Research Foundation Singapore under the AI Singapore Programme (AISG Award No.: AISG-TC-2023-011-SGIL) | en_US
dc.description.pubStatus | Published | en_US
dc.description.oaCategory | VoR allowed | en_US
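
Illustrative sketch. The record contains no source code; the following is a minimal, hypothetical PyTorch sketch of the mechanism the abstract describes: a channel-attention branch added alongside standard temporal multi-head self-attention so the block can capture both temporal and channel dependencies. The class name TCMBlock, the residual fusion by summation, and all dimensions are illustrative assumptions, not the authors' implementation; consult the paper at the DOI above for the actual TCM design.

# Hypothetical sketch based only on the abstract above; NOT the authors' code.
import torch
import torch.nn as nn

class TCMBlock(nn.Module):
    """Toy temporal-channel block: standard (temporal) MHSA plus a
    channel-attention branch obtained by attending over the transposed
    sequence, fused by a residual sum."""

    def __init__(self, d_model: int, n_heads: int, seq_len: int):
        super().__init__()
        # Temporal branch: tokens attend to each other along time.
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Channel branch: feature channels attend to each other; each
        # channel's length-seq_len time profile serves as its embedding,
        # so inputs must have exactly seq_len time steps.
        self.channel_attn = nn.MultiheadAttention(seq_len, 1, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels)
        t_out, _ = self.temporal_attn(x, x, x)      # temporal dependencies
        xc = x.transpose(1, 2)                      # (batch, channels, time)
        c_out, _ = self.channel_attn(xc, xc, xc)    # channel dependencies
        c_out = c_out.transpose(1, 2)               # back to (batch, time, channels)
        return self.norm(x + t_out + c_out)         # fuse both branches

if __name__ == "__main__":
    block = TCMBlock(d_model=64, n_heads=4, seq_len=100)
    feats = torch.randn(2, 100, 64)                 # e.g. front-end acoustic features
    print(block(feats).shape)                       # torch.Size([2, 100, 64])

In this sketch the attention weights of the channel branch are computed between feature channels rather than between time steps; how the paper actually couples the two views (and where the reported 0.03M extra parameters come from) is specified in the publication itself.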
Appears in Collections: Conference Paper

Files in This Item:
File | Description | Size | Format
truong24b_interspeech.pdf | - | 390.58 kB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.