Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/113412
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | -
dc.creator | Tu, Y | -
dc.creator | Mak, MW | -
dc.creator | Lee, KA | -
dc.creator | Lin, W | -
dc.date.accessioned | 2025-06-06T00:42:13Z | -
dc.date.available | 2025-06-06T00:42:13Z | -
dc.identifier.issn | 0925-2312 | -
dc.identifier.uri | http://hdl.handle.net/10397/113412 | -
dc.language.iso | en | en_US
dc.publisher | Elsevier BV | en_US
dc.subject | Conformer | en_US
dc.subject | Multi-resolution attention fusion | en_US
dc.subject | Speaker embedding | en_US
dc.subject | Speaker verification | en_US
dc.subject | Transformer | en_US
dc.title | ConFusionformer: locality-enhanced conformer through multi-resolution attention fusion for speaker verification | en_US
dc.type | Journal/Magazine Article | en_US
dc.identifier.volume | 644 | -
dc.identifier.doi | 10.1016/j.neucom.2025.130429 | -
dcterms.abstract | Conformers are capable of capturing both global and local dependencies in a sequence. Notably, the modeling of local information is critical to learning speaker characteristics. However, applying Conformers to speaker verification (SV) has not witnessed much success due to their inferior locality modeling capability and low computational efficiency. In this paper, we propose an improved Conformer, ConFusionformer, to address these two challenges. To increase model efficiency, the conventional Conformer block is modified by placing one feed-forward network between a self-attention module and a convolution module. The modified Conformer block has fewer model parameters, thus reducing the computation cost. The modification also enables a deeper network, boosting the SV performance. Moreover, multi-resolution attention fusion is introduced into the self-attention mechanism to improve locality modeling. Specifically, a low-resolution attention score map, produced from downsampled queries and keys, is restored to full resolution and fused with the original attention score map to exploit the local information within the restored local regions. The proposed ConFusionformer is shown to outperform the Conformer for SV on VoxCeleb, CNCeleb, SRE21, and SRE24, demonstrating the superiority of the ConFusionformer in speaker modeling. | -
dcterms.accessRights | embargoed access | en_US
dcterms.bibliographicCitation | Neurocomputing, 1 Sept 2025, v. 644, 130429 | -
dcterms.isPartOf | Neurocomputing | -
dcterms.issued | 2025-09 | -
dc.identifier.scopus | 2-s2.0-105005393269 | -
dc.identifier.eissn | 1872-8286 | -
dc.identifier.artn | 130429 | -
dc.description.validate | 202506 bcch | -
dc.identifier.FolderNumber | a3641 | en_US
dc.identifier.SubFormID | 50551 | en_US
dc.description.fundingSource | RGC | en_US
dc.description.pubStatus | Published | en_US
dc.date.embargo | 2027-09-01 | en_US
dc.description.oaCategory | Green (AAM) | en_US
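The multi-resolution attention fusion described in the abstract can be sketched as follows. This is not the authors' implementation: the pooling factor `pool`, the fusion weight `alpha`, average-pooling for downsampling, and nearest-neighbour restoration are all assumptions made for illustration; the paper's actual restoration and fusion scheme may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fused_attention_scores(q, k, pool=2, alpha=0.5):
    """Fuse a full-resolution attention score map with a low-resolution map
    computed from downsampled queries and keys, then restored to full size.

    q, k : (t, d) query and key matrices; t must be divisible by `pool`.
    pool : downsampling factor along time (an assumed hyperparameter).
    alpha: fusion weight between the two score maps (also assumed).
    """
    t, d = q.shape
    scale = 1.0 / np.sqrt(d)
    full = q @ k.T * scale                       # (t, t) full-resolution scores

    # Downsample queries and keys along time by average pooling.
    q_lo = q.reshape(t // pool, pool, d).mean(axis=1)
    k_lo = k.reshape(t // pool, pool, d).mean(axis=1)
    low = q_lo @ k_lo.T * scale                  # (t/pool, t/pool) low-res scores

    # Restore the low-resolution map to (t, t) by repeating each entry,
    # so every pool x pool block shares one coarse score (nearest-neighbour).
    restored = np.repeat(np.repeat(low, pool, axis=0), pool, axis=1)

    # Fuse the two score maps before the softmax.
    return softmax(alpha * full + (1.0 - alpha) * restored)
```

Because each restored entry covers a `pool` x `pool` block of positions, frames inside the same block receive a shared coarse score, which biases attention toward nearby regions; this is one plausible reading of how the fused map "exploits the local information within the restored local regions".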
Appears in Collections:Journal/Magazine Article
Open Access Information
Status embargoed access
Embargo End Date 2027-09-01
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.