Aggregating frame-level information in the spectral domain with self-attention for speaker embedding

Tu, Y; Mak, MW

doi:10.1109/TASLP.2022.3153267

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/95740

DC Field	Value	Language
dc.contributor	Department of Electronic and Information Engineering	en_US
dc.creator	Tu, Y	en_US
dc.creator	Mak, MW	en_US
dc.date.accessioned	2022-10-05T03:56:44Z	-
dc.date.available	2022-10-05T03:56:44Z	-
dc.identifier.issn	2329-9290	en_US
dc.identifier.uri	http://hdl.handle.net/10397/95740	-
dc.language.iso	en	en_US
dc.publisher	Institute of Electrical and Electronics Engineers	en_US
dc.rights	© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.	en_US
dc.rights	The following publication Y. Tu and M. -W. Mak, "Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 944-957, 2022 is available at https://dx.doi.org/10.1109/TASLP.2022.3153267.	en_US
dc.subject	Self-attention	en_US
dc.subject	Short-time Fourier transform	en_US
dc.subject	Speaker embedding	en_US
dc.subject	Speaker verification	en_US
dc.subject	Statistics pooling	en_US
dc.title	Aggregating frame-level information in the spectral domain with self-attention for speaker embedding	en_US
dc.type	Journal/Magazine Article	en_US
dc.identifier.spage	944	en_US
dc.identifier.epage	957	en_US
dc.identifier.volume	30	en_US
dc.identifier.doi	10.1109/TASLP.2022.3153267	en_US
dcterms.abstract	Most pooling methods in state-of-the-art speaker embedding networks are implemented in the temporal domain. However, due to the high non-stationarity in the feature maps produced from the last frame-level layer, it is not advantageous to use the global statistics (e.g., means and standard deviations) of the temporal feature maps as aggregated embeddings. This motivates us to explore stationary spectral representations and perform aggregation in the spectral domain. In this paper, we propose attentive short-time spectral pooling (attentive STSP) from a Fourier perspective to exploit the local stationarity of the feature maps. In attentive STSP, for each utterance, we compute the spectral representations through a weighted average of the windowed segments within each spectrogram by attention weights and aggregate their lowest spectral components to form the speaker embedding. Because most of the feature map energy is concentrated in the low-frequency region of the spectral domain, attentive STSP facilitates the information aggregation by retaining the low spectral components only. Attentive STSP is shown to consistently outperform attentive pooling on VoxCeleb1, VOiCES19-eval, SRE16-eval, and SRE18-CMN2-eval. This observation suggests that applying segment-level attention and leveraging low spectral components can produce discriminative speaker embeddings.	en_US
dcterms.accessRights	open access	en_US
dcterms.bibliographicCitation	IEEE/ACM transactions on audio, speech, and language processing, 2022, v. 30, p. 944-957	en_US
dcterms.isPartOf	IEEE/ACM transactions on audio, speech, and language processing	en_US
dcterms.issued	2022	-
dc.identifier.scopus	2-s2.0-85125712201	-
dc.identifier.eissn	2329-9304	en_US
dc.description.validate	202210 bckw	en_US
dc.description.oa	Accepted Manuscript	en_US
dc.identifier.FolderNumber	a1720	-
dc.identifier.SubFormID	45834	-
dc.description.fundingSource	RGC	en_US
dc.description.pubStatus	Published	en_US
dc.description.oaCategory	Green (AAM)	en_US
Appears in Collections:	Journal/Magazine Article
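
The abstract describes attentive STSP only in words, so the following is a minimal NumPy sketch of the pooling idea, not the authors' implementation. It assumes a frame-level feature map of shape (channels, frames), a Hann window, magnitude spectra, and one externally supplied attention score per windowed segment. The function name `attentive_stsp_sketch` and the parameters `win_len`, `hop`, and `num_low` are hypothetical, as is the choice to concatenate the retained components across channels.

```python
import numpy as np


def attentive_stsp_sketch(feature_map, attn_scores, win_len=8, hop=4, num_low=2):
    """Illustrative pooling sketch (assumed details, not the published method).

    feature_map: array of shape (channels, frames) from the last frame-level layer.
    attn_scores: array of shape (num_segments,), one unnormalized score per segment.
    Returns a vector of length channels * num_low.
    """
    channels, frames = feature_map.shape
    num_segments = 1 + (frames - win_len) // hop
    if attn_scores.shape != (num_segments,):
        raise ValueError(f"expected {num_segments} attention scores")

    # Cut each channel into overlapping segments and apply a Hann window.
    window = np.hanning(win_len)
    segments = np.stack(
        [feature_map[:, i * hop : i * hop + win_len] * window
         for i in range(num_segments)],
        axis=1,
    )  # shape: (channels, num_segments, win_len)

    # Short-time Fourier transform along time: one magnitude spectrum per segment.
    spectrogram = np.abs(np.fft.rfft(segments, axis=-1))

    # Softmax over segments turns the scores into attention weights.
    weights = np.exp(attn_scores - attn_scores.max())
    weights /= weights.sum()

    # Attention-weighted average of the segments within each spectrogram.
    pooled = np.einsum("n,cnk->ck", weights, spectrogram)

    # Keep only the lowest spectral components and flatten across channels.
    return pooled[:, :num_low].reshape(-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((3, 20))  # 3 channels, 20 frames
    scores = rng.standard_normal(4)       # 4 segments for win_len=8, hop=4
    print(attentive_stsp_sketch(feats, scores).shape)
```

In a trained network the attention scores would come from a learned scoring layer rather than being passed in directly; they are taken as an input here to keep the sketch self-contained.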

Files in This Item:

File	Description	Size	Format
att_stsp_j.pdf	Pre-Published version	1.37 MB	Adobe PDF

Open Access Information

Status	open access
File Version	Final Accepted Manuscript

Usage and Citation Statistics

Metric	Count	As of
Page views	81	Apr 14, 2025
Page views (last week)	0	-
Downloads	108	Apr 14, 2025
Scopus citations	11	Sep 12, 2025
Web of Science citations	7	Oct 10, 2024
