Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/106888
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | en_US
dc.creator | Lin, W | en_US
dc.creator | Mak, MW | en_US
dc.creator | Yi, L | en_US
dc.date.accessioned | 2024-06-07T00:58:39Z | -
dc.date.available | 2024-06-07T00:58:39Z | -
dc.identifier.uri | http://hdl.handle.net/10397/106888 | -
dc.language.iso | en | en_US
dc.publisher | International Speech Communication Association (ISCA) | en_US
dc.rights | © ISCA | en_US
dc.rights | The following publication Lin, W., Mak, M.W., Yi, L. (2020) Learning Mixture Representation for Deep Speaker Embedding Using Attention. Proc. The Speaker and Language Recognition Workshop (Odyssey 2020), 210-214 is available at https://doi.org/10.21437/Odyssey.2020-30. | en_US
dc.title | Learning mixture representation for deep speaker embedding using attention | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 210 | en_US
dc.identifier.epage | 214 | en_US
dc.identifier.doi | 10.21437/Odyssey.2020-30 | en_US
dcterms.abstract | Almost all speaker recognition systems involve a step that converts a sequence of frame-level features into a fixed-dimensional representation. In the context of deep neural networks, this step is referred to as statistics pooling. In state-of-the-art speaker recognition systems, statistics pooling is implemented by concatenating the mean and standard deviation of a sequence of frame-level features. However, a single mean and standard deviation are very limited descriptive statistics for an acoustic sequence, even with a powerful feature extractor such as a convolutional neural network. In this paper, we propose a novel statistics pooling method that produces more descriptive statistics through a mixture representation. Our method is inspired by the expectation-maximization (EM) algorithm for Gaussian mixture models (GMMs). However, unlike in GMMs, the mixture assignments are given by an attention mechanism rather than by the Euclidean distances between frame-level features and explicit centers. Applying the proposed attention mechanism to a 121-layer DenseNet, we achieve an EER of 1.1% on VoxCeleb1 and an EER of 4.77% on the VOiCES 2019 evaluation set. | en_US
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | The Speaker and Language Recognition Workshop (Odyssey 2020), 1-5 November 2020, Tokyo, Japan, p. 210-214 | en_US
dcterms.issued | 2020 | -
dc.description.validate | 202405 bcch | en_US
dc.description.oa | Version of Record | en_US
dc.identifier.FolderNumber | EIE-0131 | -
dc.description.fundingSource | RGC | en_US
dc.description.pubStatus | Published | en_US
dc.identifier.OPUS | 20509205 | -
dc.description.oaCategory | VoR allowed | en_US
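
The abstract above describes an attention-based mixture statistics pooling layer: soft mixture assignments, produced by an attention network rather than by distances to explicit GMM centers, weight the per-component means and standard deviations that are concatenated into the utterance-level embedding. As a rough illustration only, the PyTorch sketch below shows one way such a layer could be wired up; the component count, the attention network, and the normalisation details are assumptions made for this example, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveMixturePooling(nn.Module):
    """Hypothetical sketch of attention-based mixture statistics pooling.

    Frame-level features are softly assigned to C mixture components by a
    small attention network (playing the role of GMM responsibilities), and
    a weighted mean and standard deviation are computed per component and
    concatenated into a single utterance-level vector.
    """

    def __init__(self, feat_dim: int, num_components: int = 4, attn_dim: int = 128):
        super().__init__()
        # Attention network producing one logit per mixture component per frame.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, num_components),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, frames, feat_dim) frame-level features from the encoder.
        logits = self.attention(h)                                  # (B, T, C)
        # Soft assignments: softmax over components for each frame, then
        # normalised over frames so each component's weights sum to one.
        gamma = F.softmax(logits, dim=-1)
        gamma = gamma / gamma.sum(dim=1, keepdim=True).clamp_min(1e-8)

        # Weighted first- and second-order statistics per component.
        gamma = gamma.transpose(1, 2)                               # (B, C, T)
        mean = torch.bmm(gamma, h)                                  # (B, C, D)
        sq_mean = torch.bmm(gamma, h ** 2)                          # (B, C, D)
        std = (sq_mean - mean ** 2).clamp_min(1e-8).sqrt()

        # Concatenate all component means and stds into one embedding input.
        return torch.cat([mean, std], dim=-1).flatten(start_dim=1)  # (B, 2*C*D)


if __name__ == "__main__":
    pool = AttentiveMixturePooling(feat_dim=256, num_components=4)
    frames = torch.randn(8, 300, 256)        # 8 utterances, 300 frames each
    stats = pool(frames)
    print(stats.shape)                       # torch.Size([8, 2048])
```

The two-step normalisation of the attention weights (over components per frame, then over frames per component) is the part that most directly mirrors the E-step responsibilities of a GMM; other normalisation schemes are equally plausible under this sketch's assumptions.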
Appears in Collections: Conference Paper
Files in This Item:
File | Description | Size | Format
lin20c_odyssey.pdf | | 254.06 kB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.