Wav2Spk : a simple DNN architecture for learning speaker embeddings from waveforms

Lin, W; Mak, MW

doi:10.21437/Interspeech.2020-1287

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/106903

Title:	Wav2Spk : a simple DNN architecture for learning speaker embeddings from waveforms
Authors:	Lin, W; Mak, MW
Issue Date:	2020
Source:	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, 25-29 October 2020, Shanghai, China, p. 3211-3215
Abstract:	Speaker recognition has seen impressive advances with the advent of deep neural networks (DNNs). However, state-of-the-art speaker recognition systems still rely on human-engineered features such as mel-frequency cepstrum coefficients (MFCC). We believe that the handcrafted features limit the potential of the powerful representation of DNNs. Besides, there are also additional steps such as voice activity detection (VAD) and cepstral mean and variance normalization (CMVN) after computing the MFCC. In this paper, we show that MFCC, VAD, and CMVN can be replaced by the tools available in the standard deep learning toolboxes, such as a stack of strided convolutions, temporal gating, and instance normalization. With these tools, we show that directly learning speaker embeddings from waveforms outperforms an x-vector network that uses MFCC or filter-bank output as features. We achieve an EER of 1.95% on the VoxCeleb1 test set using an end-to-end training scheme, which, to the best of our knowledge, is the best performance reported using raw waveforms. What’s more, the proposed method is complementary to x-vector systems. The fusion of the proposed method with x-vectors trained on filter-bank features produces an EER of 1.55%.
Publisher:	International Speech Communication Association (ISCA)
ISBN:	978-1-7138-2069-7
DOI:	10.21437/Interspeech.2020-1287
Description:	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, 25-29 October 2020, Shanghai, China
Rights:	Copyright © 2020 ISCA. The following publication: Lin, W., Mak, M.-W. (2020) Wav2Spk: A Simple DNN Architecture for Learning Speaker Embeddings from Waveforms. Proc. Interspeech 2020, 3211-3215, is available at https://doi.org/10.21437/Interspeech.2020-1287.
Appears in Collections:	Conference Paper
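
Illustrative sketch (not part of the record, and not the authors' implementation): the abstract names three standard building blocks that stand in for MFCC, VAD, and CMVN, namely strided convolutions, temporal gating, and instance normalization. The dependency-free Python below shows one plausible reading of each on a toy signal. The function names, the single convolution layer, the sigmoid form of the gate, and the order of the last two steps are all assumptions made for illustration; the paper's actual layer configuration is not given in this record.

```python
import math

def strided_conv1d(signal, kernel, stride):
    """Valid 1-D cross-correlation of one channel, hopping by `stride` samples."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[start + j] for j in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]

def temporal_gate(frames, weights, bias):
    """Scale each frame by a sigmoid gate in (0, 1) computed from that frame.

    Assumed form: a learned soft frame selector, playing the role of VAD.
    """
    gated = []
    for frame in frames:
        score = sum(w * v for w, v in zip(weights, frame)) + bias
        gate = 1.0 / (1.0 + math.exp(-score))
        gated.append([gate * v for v in frame])
    return gated

def instance_norm(frames, eps=1e-5):
    """Normalize each channel over time within one utterance (CMVN-like)."""
    num_frames, num_channels = len(frames), len(frames[0])
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for c in range(num_channels):
        column = [frame[c] for frame in frames]
        mean = sum(column) / num_frames
        var = sum((v - mean) ** 2 for v in column) / num_frames
        scale = 1.0 / math.sqrt(var + eps)
        for t in range(num_frames):
            out[t][c] = (frames[t][c] - mean) * scale
    return out

def encode(waveform, kernels, stride, gate_weights, gate_bias):
    """Toy front end: one strided conv per channel, then gating, then normalization."""
    channels = [strided_conv1d(waveform, k, stride) for k in kernels]
    frames = [list(values) for values in zip(*channels)]  # (time, channel)
    frames = temporal_gate(frames, gate_weights, gate_bias)
    return instance_norm(frames)

# Toy usage: hand-picked kernels stand in for learned filters.
waveform = [math.sin(0.1 * n) for n in range(64)]
kernels = [[1.0, 0.0, -1.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
features = encode(waveform, kernels, stride=2, gate_weights=[1.0, -1.0], gate_bias=0.0)
```

In a trained system the kernels and gate parameters would be learned jointly with the speaker-embedding network rather than fixed by hand, and several convolution layers would be stacked to reach a frame rate comparable to MFCC.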

Files in This Item:

File	Description	Size	Format
lin20i_interspeech.pdf		266.71 kB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record


Page views	46 (as of Feb 23, 2025)
Downloads	27 (as of Feb 23, 2025)
Scopus citations	30 (as of Feb 20, 2025)
Web of Science citations	26 (as of Feb 20, 2025)
