Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/95740
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Electronic and Information Engineering | en_US |
| dc.creator | Tu, Y | en_US |
| dc.creator | Mak, MW | en_US |
| dc.date.accessioned | 2022-10-05T03:56:44Z | - |
| dc.date.available | 2022-10-05T03:56:44Z | - |
| dc.identifier.issn | 2329-9290 | en_US |
| dc.identifier.uri | http://hdl.handle.net/10397/95740 | - |
| dc.language.iso | en | en_US |
| dc.publisher | Institute of Electrical and Electronics Engineers | en_US |
| dc.rights | © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | en_US |
| dc.rights | The following publication Y. Tu and M. -W. Mak, "Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 944-957, 2022 is available at https://dx.doi.org/10.1109/TASLP.2022.3153267. | en_US |
| dc.subject | Self-attention | en_US |
| dc.subject | Short-time Fourier transform | en_US |
| dc.subject | Speaker embedding | en_US |
| dc.subject | Speaker verification | en_US |
| dc.subject | Statistics pooling | en_US |
| dc.title | Aggregating frame-level information in the spectral domain with self-attention for speaker embedding | en_US |
| dc.type | Journal/Magazine Article | en_US |
| dc.identifier.spage | 944 | en_US |
| dc.identifier.epage | 957 | en_US |
| dc.identifier.volume | 30 | en_US |
| dc.identifier.doi | 10.1109/TASLP.2022.3153267 | en_US |
| dcterms.abstract | Most pooling methods in state-of-the-art speaker embedding networks are implemented in the temporal domain. However, due to the high non-stationarity in the feature maps produced from the last frame-level layer, it is not advantageous to use the global statistics (e.g., means and standard deviations) of the temporal feature maps as aggregated embeddings. This motivates us to explore stationary spectral representations and perform aggregation in the spectral domain. In this paper, we propose attentive short-time spectral pooling (attentive STSP) from a Fourier perspective to exploit the local stationarity of the feature maps. In attentive STSP, for each utterance, we compute the spectral representations through a weighted average of the windowed segments within each spectrogram by attention weights and aggregate their lowest spectral components to form the speaker embedding. Because most of the feature map energy is concentrated in the low-frequency region of the spectral domain, attentive STSP facilitates the information aggregation by retaining the low spectral components only. Attentive STSP is shown to consistently outperform attentive pooling on VoxCeleb1, VOiCES19-eval, SRE16-eval, and SRE18-CMN2-eval. This observation suggests that applying segment-level attention and leveraging low spectral components can produce discriminative speaker embeddings. | en_US |
| dcterms.accessRights | open access | en_US |
| dcterms.bibliographicCitation | IEEE/ACM transactions on audio, speech, and language processing, 2022, v. 30, p. 944-957 | en_US |
| dcterms.isPartOf | IEEE/ACM transactions on audio, speech, and language processing | en_US |
| dcterms.issued | 2022 | - |
| dc.identifier.scopus | 2-s2.0-85125712201 | - |
| dc.identifier.eissn | 2329-9304 | en_US |
| dc.description.validate | 202210 bckw | en_US |
| dc.description.oa | Accepted Manuscript | en_US |
| dc.identifier.FolderNumber | a1720 | - |
| dc.identifier.SubFormID | 45834 | - |
| dc.description.fundingSource | RGC | en_US |
| dc.description.pubStatus | Published | en_US |
| dc.description.oaCategory | Green (AAM) | en_US |
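
The pooling idea summarized in the abstract (window the frame-level feature map into short segments, take each segment's spectrum, average the segment spectra with attention weights, and keep only the lowest spectral components) might be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: the abstract does not specify the attention network, the window, or the statistics used, so the linear scoring vector `attn_vec`, the Hann window, the use of magnitudes, and the `win_len`, `hop`, and `n_low` parameters are all assumptions made here for the example.

```python
import cmath
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def low_bins(segment, n_low):
    """Magnitudes of the n_low lowest DFT bins of a Hann-windowed segment."""
    size = len(segment)
    # Periodic Hann window (an assumption; the abstract names no window).
    win = [0.5 - 0.5 * math.cos(2 * math.pi * t / size) for t in range(size)]
    return [
        abs(sum(x * w * cmath.exp(-2j * math.pi * k * t / size)
                for t, (x, w) in enumerate(zip(segment, win))))
        for k in range(n_low)
    ]


def attentive_stsp_sketch(feats, attn_vec, win_len=8, hop=4, n_low=2):
    """Simplified attentive short-time spectral pooling.

    feats:    list of C channels, each a list of T frame-level values.
    attn_vec: list of C scoring weights (hypothetical stand-in for a
              learned attention network).
    Returns (embedding of length C * n_low, per-segment attention weights).
    """
    n_frames = len(feats[0])
    starts = range(0, n_frames - win_len + 1, hop)
    # One attention score per segment, computed from the channel-wise
    # segment means; softmax turns the scores into weights that sum to 1.
    scores = [
        sum(a * sum(ch[s:s + win_len]) / win_len
            for a, ch in zip(attn_vec, feats))
        for s in starts
    ]
    weights = softmax(scores)
    embedding = []
    for ch in feats:
        spectra = [low_bins(ch[s:s + win_len], n_low) for s in starts]
        # Attention-weighted average over segments, lowest bins only.
        for k in range(n_low):
            embedding.append(sum(w * sp[k] for w, sp in zip(weights, spectra)))
    return embedding, weights


# Toy input: 3 channels, 40 frames of synthetic "features".
feats = [[math.sin(0.3 * t * (c + 1)) for t in range(40)] for c in range(3)]
embedding, weights = attentive_stsp_sketch(feats, attn_vec=[0.5, -0.2, 0.1])
print(len(embedding), len(weights))
```

Computing only the lowest `n_low` bins directly, rather than a full FFT, reflects the abstract's point that most of the feature-map energy sits in the low-frequency region; a practical system would use a learned attention module and an FFT library instead.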
Appears in Collections: Journal/Magazine Article
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| att_stsp_j.pdf | Pre-Published version | 1.37 MB | Adobe PDF |
Open Access Information
Status open access
File Version Final Accepted Manuscript

Page views: 65 (last week: 0), as of Oct 13, 2024
Downloads: 78, as of Oct 13, 2024
Scopus citations: 9, as of Oct 17, 2024
Web of Science citations: 7, as of Oct 10, 2024

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.