Golden Gemini is all you need : finding the sweet spots for speaker verification

Liu, T; Lee, KA; Wang, Q; Li, H

doi:10.1109/TASLP.2024.3385277

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/109492

Title:	Golden Gemini is all you need : finding the sweet spots for speaker verification
Authors:	Liu, T; Lee, KA; Wang, Q; Li, H
Issue Date:	2024
Source:	IEEE/ACM transactions on audio, speech, and language processing, 2024, v. 32, p. 2324-2337
Abstract:	Residual neural networks (ResNet) demonstrate impressive performance in automatic speaker verification (ASV). They treat the time and frequency dimensions equally, following the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. We address this issue and postulate the Golden-Gemini Hypothesis, which posits the prioritization of temporal resolution over frequency resolution for ASV. The hypothesis is verified by conducting a systematic study on the impact of temporal and frequency resolutions on performance, using a trellis diagram to represent the stride space. We further identify two optimal points, namely Golden Gemini, which serve as a guiding principle for designing 2D ResNet-based ASV models. By following this principle, a state-of-the-art ResNet baseline model gains a significant performance improvement on VoxCeleb, SITW, and CNCeleb datasets with 7.70%/11.76% average EER/minDCF reductions, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to it as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, namely the Gemini DF-ResNet, using a cutting-edge model.
Keywords:	2D CNN; ResNet; Speaker recognition; Speaker verification; Stride configuration; Temporal resolution
Journal:	IEEE/ACM transactions on audio, speech, and language processing
DOI:	10.1109/TASLP.2024.3385277
Rights:	© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/. The following publication T. Liu, K. A. Lee, Q. Wang and H. Li, "Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2324-2337, 2024 is available at https://doi.org/10.1109/TASLP.2024.3385277.
Appears in Collections:	Journal/Magazine Article
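
Note on the stride configuration: the abstract describes a trade-off between temporal and frequency resolution that is controlled by the per-stage strides of a 2D ResNet. The sketch below is only an illustration of that idea, not the authors' code. It assumes four residual stages, each downsampling with a 3x3 convolution with padding 1 (so the output length is the ceiling of the input length divided by the stride), and an input of 200 frames by 80 mel bins. The function names and both stride tuples are hypothetical. In particular, the asymmetric tuple is not claimed to be one of the Golden Gemini operating points, which this record does not specify.

```python
from math import prod


def downsample_factors(time_strides, freq_strides):
    """Total downsampling factor along time and along frequency."""
    return prod(time_strides), prod(freq_strides)


def output_shape(n_frames, n_mels, time_strides, freq_strides):
    """Feature-map size (time, frequency) after all stages.

    Assumes each stage uses a 3x3 convolution with padding 1, for which
    the output length equals ceil(input_length / stride).
    """
    t, f = n_frames, n_mels
    for st, sf in zip(time_strides, freq_strides):
        t = -(-t // st)  # ceiling division along the time axis
        f = -(-f // sf)  # ceiling division along the frequency axis
    return t, f


# Symmetric strides: time and frequency are treated equally (assumed default).
symmetric = output_shape(200, 80, (1, 2, 2, 2), (1, 2, 2, 2))

# Hypothetical asymmetric strides: time is downsampled less than frequency,
# so more temporal resolution is preserved. Illustrative values only.
asymmetric = output_shape(200, 80, (1, 1, 2, 2), (1, 2, 2, 2))

print(symmetric, asymmetric)
```

Under these assumptions, keeping a stride of 1 along time in one more stage doubles the number of time steps in the final feature map while leaving the frequency axis unchanged.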

Files in This Item:

File	Description	Size	Format
Liu_Golden_Gemini_All.pdf		3.48 MB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record
