Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/106903
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | en_US
dc.creator | Lin, W | en_US
dc.creator | Mak, MW | en_US
dc.date.accessioned | 2024-06-07T00:58:46Z | -
dc.date.available | 2024-06-07T00:58:46Z | -
dc.identifier.isbn | 978-1-7138-2069-7 | en_US
dc.identifier.uri | http://hdl.handle.net/10397/106903 | -
dc.description | 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, 25-29 October 2020, Shanghai, China | en_US
dc.language.iso | en | en_US
dc.publisher | International Speech Communication Association (ISCA) | en_US
dc.rights | Copyright © 2020 ISCA | en_US
dc.rights | The following publication Lin, W., Mak, M.-W. (2020) Wav2Spk: A Simple DNN Architecture for Learning Speaker Embeddings from Waveforms. Proc. Interspeech 2020, 3211-3215 is available at https://doi.org/10.21437/Interspeech.2020-1287. | en_US
dc.title | Wav2Spk: a simple DNN architecture for learning speaker embeddings from waveforms | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 3211 | en_US
dc.identifier.epage | 3215 | en_US
dc.identifier.doi | 10.21437/Interspeech.2020-1287 | en_US
dcterms.abstract | Speaker recognition has seen impressive advances with the advent of deep neural networks (DNNs). However, state-of-the-art speaker recognition systems still rely on human-engineered features such as mel-frequency cepstral coefficients (MFCCs). We believe that such handcrafted features limit the representational power of DNNs. Moreover, additional steps such as voice activity detection (VAD) and cepstral mean and variance normalization (CMVN) are required after computing the MFCCs. In this paper, we show that MFCC extraction, VAD, and CMVN can be replaced by tools available in standard deep learning toolboxes, namely a stack of strided convolutions, temporal gating, and instance normalization. With these tools, we show that directly learning speaker embeddings from waveforms outperforms an x-vector network that uses MFCCs or filter-bank outputs as features. We achieve an EER of 1.95% on the VoxCeleb1 test set using an end-to-end training scheme, which, to the best of our knowledge, is the best performance reported using raw waveforms. Moreover, the proposed method is complementary to x-vector systems: fusing the proposed method with x-vectors trained on filter-bank features produces an EER of 1.55%. (An illustrative sketch of this waveform front-end follows this record.) | en_US
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, 25-29 October 2020, Shanghai, China, p. 3211-3215 | en_US
dcterms.issued | 2020 | -
dc.identifier.scopus | 2-s2.0-85098222384 | -
dc.relation.conference | International Speech Communication Association [Interspeech] | en_US
dc.description.validate | 202405 bcch | en_US
dc.description.oa | Version of Record | en_US
dc.identifier.FolderNumber | EIE-0160 | -
dc.description.fundingSource | RGC | en_US
dc.description.fundingSource | Others | en_US
dc.description.fundingText | NSFC | en_US
dc.description.pubStatus | Published | en_US
dc.identifier.OPUS | 55969063 | -
dc.description.oaCategory | VoR allowed | en_US
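
The abstract above describes replacing the MFCC-VAD-CMVN pipeline with standard deep-learning building blocks: a stack of strided convolutions for feature extraction, temporal gating in place of VAD, and instance normalization in place of CMVN. The PyTorch sketch below is a minimal illustration of how these three pieces fit together; it is not the authors' code, and the channel width, kernel sizes, and strides are assumptions chosen for illustration, not the configuration from the paper.

```python
import torch
import torch.nn as nn

class WaveEncoder(nn.Module):
    """Sketch of a waveform front-end: strided convs stand in for MFCC
    extraction, a learned temporal gate stands in for VAD, and instance
    normalization stands in for CMVN. All hyperparameters are assumptions."""

    def __init__(self, channels: int = 512):
        super().__init__()
        # Strided 1-D convolutions downsample the raw waveform into
        # frame-level features (kernel sizes/strides are illustrative).
        self.convs = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, stride=2), nn.ReLU(),
        )
        # Temporal gating: a per-frame sigmoid gate that can suppress
        # uninformative (e.g., non-speech) frames, playing the role of VAD.
        self.gate = nn.Conv1d(channels, channels, kernel_size=1)
        # Instance normalization: per-utterance, per-channel mean/variance
        # normalization over time, playing the role of CMVN.
        self.norm = nn.InstanceNorm1d(channels)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        x = self.convs(wav.unsqueeze(1))      # (batch, 1, samples) -> (batch, channels, frames)
        x = x * torch.sigmoid(self.gate(x))   # gate each frame
        return self.norm(x)                   # normalized frame-level features

# Usage: encode a batch of 2-second, 16 kHz waveforms.
feats = WaveEncoder()(torch.randn(4, 32000))
print(feats.shape)  # torch.Size([4, 512, 799])
```

A pooling layer and a speaker classifier (omitted here) would then turn these frame-level features into an utterance-level embedding, analogous to an x-vector pipeline.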
Appears in Collections: Conference Paper
Files in This Item:
File | Description | Size | Format
lin20i_interspeech.pdf | - | 266.71 kB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record

Page views: 46 (as of Feb 23, 2025)
Downloads: 27 (as of Feb 23, 2025)
Scopus™ citations: 30 (as of Feb 20, 2025)
Web of Science™ citations: 26 (as of Feb 20, 2025)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.