Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/106902
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | en_US
dc.creator | Lin, W | en_US
dc.creator | Mak, MW | en_US
dc.creator | Chien, JT | en_US
dc.date.accessioned | 2024-06-07T00:58:45Z | -
dc.date.available | 2024-06-07T00:58:45Z | -
dc.identifier.isbn | 978-1-7138-2069-7 | en_US
dc.identifier.uri | http://hdl.handle.net/10397/106902 | -
dc.description | 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, 25-29 October 2020, Shanghai, China | en_US
dc.language.iso | en | en_US
dc.publisher | International Speech Communication Association (ISCA) | en_US
dc.rights | Copyright © 2020 ISCA | en_US
dc.rights | The following publication Lin, W., Mak, M.-W., Chien, J.-T. (2020) Strategies for End-to-End Text-Independent Speaker Verification. Proc. Interspeech 2020, 4308-4312 is available at https://doi.org/10.21437/Interspeech.2020-2092. | en_US
dc.title | Strategies for end-to-end text-independent speaker verification | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 4308 | en_US
dc.identifier.epage | 4312 | en_US
dc.identifier.doi | 10.21437/Interspeech.2020-2092 | en_US
dcterms.abstract | State-of-the-art speaker verification (SV) systems typically consist of two distinct components: a deep neural network (DNN) for creating speaker embeddings and a backend for improving the embeddings’ discriminative ability. The question which arises is: Can we train an SV system without a backend? We believe that the backend is to compensate for the fact that the network is trained entirely on short speech segments. This paper shows that with several modifications to the x-vector system, DNN embeddings can be directly used for verification. The proposed modifications include: (1) a mask-pooling layer that augments the training samples by randomly masking the frame-level activations and then computing temporal statistics, (2) a sampling scheme that produces diverse training samples by randomly splicing several speech segments from each utterance, and (3) additional convolutional layers designed to reduce the temporal resolution to save computational cost. Experiments on NIST SRE 2016 and 2018 show that our method can achieve state-of-the-art performance with simple cosine similarity and requires only half of the computational cost of the x-vector network. | en_US
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, 25-29 October 2020, Shanghai, China, p. 4308-4312 | en_US
dcterms.issued | 2020 | -
dc.identifier.scopus | 2-s2.0-85098160006 | -
dc.relation.conference | International Speech Communication Association [Interspeech] | en_US
dc.description.validate | 202405 bcch | en_US
dc.description.oa | Version of Record | en_US
dc.identifier.FolderNumber | EIE-0159 | -
dc.description.fundingSource | RGC | en_US
dc.description.pubStatus | Published | en_US
dc.identifier.OPUS | 55969000 | -
dc.description.oaCategory | VoR allowed | en_US
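The abstract describes a mask-pooling layer (randomly masking frame-level activations before computing temporal statistics) and backend-free scoring with cosine similarity. The record contains no code; the following is an illustrative sketch only, with all function names, shapes, and the masking rate chosen here as assumptions, not taken from the paper:

```python
import numpy as np

def mask_pooling(frames, mask_prob=0.2, rng=None):
    """Illustrative mask-pooling: randomly drop frame-level activations,
    then pool temporal statistics (mean and std) over the surviving frames.
    `frames` is a (T, D) array of T frame-level activations of dimension D."""
    rng = rng or np.random.default_rng(0)
    T, _ = frames.shape
    keep = rng.random(T) >= mask_prob        # boolean mask over frames
    if not keep.any():                       # guard: retain at least one frame
        keep[rng.integers(T)] = True
    kept = frames[keep]
    mean = kept.mean(axis=0)
    std = kept.std(axis=0)
    return np.concatenate([mean, std])       # (2*D,) utterance-level statistics

def cosine_score(a, b):
    """Backend-free verification score: cosine similarity of two embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy usage: 200 frames of 64-dim activations -> 128-dim statistics vector.
frames = np.random.default_rng(1).standard_normal((200, 64))
emb = mask_pooling(frames)
print(emb.shape, cosine_score(emb, emb))
```

In the paper this pooling acts on DNN activations during training as a data-augmentation strategy; the sketch above only demonstrates the masking-then-statistics computation itself.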
Appears in Collections: Conference Paper
Files in This Item:
File | Description | Size | Format
lin20l_interspeech.pdf | | 831.58 kB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.