Temporal-channel modeling in multi-head self-attention for synthetic speech detection

Truong, DT; Tao, R; Nguyen, T; Luong, HT; Lee, KA; Chng, ES

doi:10.21437/Interspeech.2024-659

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/114611

Title:	Temporal-channel modeling in multi-head self-attention for synthetic speech detection
Authors:	Truong, DT; Tao, R; Nguyen, T; Luong, HT; Lee, KA; Chng, ES
Issue Date:	2024
Source:	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2024, p. 537-541
Abstract:	Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA’s capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.
Keywords:	ASVspoof challenges; Attention learning; Synthetic speech detection
Publisher:	International Speech Communication Association
DOI:	10.21437/Interspeech.2024-659
Description:	Interspeech 2024, 1-5 September 2024, Kos, Greece
Rights:	The following publication Truong, D.-T., Tao, R., Nguyen, T., Luong, H.-T., Lee, K.A., Chng, E.S. (2024) Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection. Proc. Interspeech 2024, 537-541 is available at https://doi.org/10.21437/Interspeech.2024-659.
Appears in Collections:	Conference Paper

Files in This Item:

File	Description	Size	Format
truong24b_interspeech.pdf		390.58 kB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record
