Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/106888
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Electrical and Electronic Engineering | en_US |
| dc.creator | Lin, W | en_US |
| dc.creator | Mak, MW | en_US |
| dc.creator | Yi, L | en_US |
| dc.date.accessioned | 2024-06-07T00:58:39Z | - |
| dc.date.available | 2024-06-07T00:58:39Z | - |
| dc.identifier.uri | http://hdl.handle.net/10397/106888 | - |
| dc.language.iso | en | en_US |
| dc.publisher | International Speech Communication Association (ISCA) | en_US |
| dc.rights | © ISCA | en_US |
| dc.rights | The following publication Lin, W., Mak, M.W., Yi, L. (2020) Learning Mixture Representation for Deep Speaker Embedding Using Attention. Proc. The Speaker and Language Recognition Workshop (Odyssey 2020), 210-214 is available at https://doi.org/10.21437/Odyssey.2020-30. | en_US |
| dc.title | Learning mixture representation for deep speaker embedding using attention | en_US |
| dc.type | Conference Paper | en_US |
| dc.identifier.spage | 210 | en_US |
| dc.identifier.epage | 214 | en_US |
| dc.identifier.doi | 10.21437/Odyssey.2020-30 | en_US |
| dcterms.abstract | Almost all speaker recognition systems involve a step that converts a sequence of frame-level features to a fixed-dimension representation. In the context of deep neural networks, it is referred to as statistics pooling. In state-of-the-art speaker recognition systems, statistics pooling is implemented by concatenating the mean and standard deviation of a sequence of frame-level features. However, a single mean and standard deviation are very limited descriptive statistics for an acoustic sequence, even with a powerful feature extractor such as a convolutional neural network. In this paper, we propose a novel statistics pooling method that can produce more descriptive statistics through a mixture representation. Our method is inspired by the expectation-maximization (EM) algorithm in Gaussian mixture models (GMMs). However, unlike the GMMs, the mixture assignments are given by an attention mechanism instead of the Euclidean distances between frame-level features and explicit centers. Applying the proposed attention mechanism to a 121-layer DenseNet, we achieve an EER of 1.1% on VoxCeleb1 and an EER of 4.77% on the VOiCES 2019 evaluation set. | en_US |
| dcterms.accessRights | open access | en_US |
| dcterms.bibliographicCitation | The Speaker and Language Recognition Workshop (Odyssey 2020), 1-5 November 2020, Tokyo, Japan, p. 210-214 | en_US |
| dcterms.issued | 2020 | - |
| dc.description.validate | 202405 bcch | en_US |
| dc.description.oa | Version of Record | en_US |
| dc.identifier.FolderNumber | EIE-0131 | - |
| dc.description.fundingSource | RGC | en_US |
| dc.description.pubStatus | Published | en_US |
| dc.identifier.OPUS | 20509205 | - |
| dc.description.oaCategory | VoR allowed | en_US |
| Appears in Collections: | Conference Paper | |
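The abstract describes replacing single-mean/std statistics pooling with a mixture representation whose component assignments come from an attention mechanism rather than distances to explicit GMM centers. The following is a minimal illustrative sketch of that idea, not the paper's implementation: the attention projection `W`, the softmax-over-time assignment, and all shapes are assumptions introduced here for illustration.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_stats_pooling(X, W):
    """Sketch of attention-based mixture statistics pooling.

    X: (T, D) sequence of frame-level features.
    W: (D, K) hypothetical attention projection, one column per mixture.
    Returns a fixed-dimension vector concatenating the attention-weighted
    mean and standard deviation of each mixture, shape (2*K*D,).
    """
    # Soft mixture assignments: softmax over the time axis, so each
    # mixture's weights sum to 1 across frames (the attention analogue
    # of GMM posteriors mentioned in the abstract).
    A = softmax(X @ W, axis=0)              # (T, K)
    stats = []
    for k in range(A.shape[1]):
        w = A[:, k:k + 1]                   # (T, 1) assignment to mixture k
        mu = (w * X).sum(axis=0)            # weighted mean, (D,)
        var = (w * (X - mu) ** 2).sum(axis=0)
        stats.append(mu)
        stats.append(np.sqrt(var + 1e-8))   # weighted standard deviation
    return np.concatenate(stats)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))   # 50 frames, 8-dim features
W = rng.standard_normal((8, 4))    # 4 mixture components
emb = mixture_stats_pooling(X, W)
print(emb.shape)                   # fixed-dimension embedding: 2 * 4 * 8
```

With K = 1 and uniform attention weights this reduces to the standard mean-plus-std pooling the abstract contrasts against; larger K yields the richer mixture statistics the paper proposes.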
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| lin20c_odyssey.pdf |  | 254.06 kB | Adobe PDF | View/Open |