Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/116349
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Electrical and Electronic Engineering | - |
| dc.creator | Meng, S | - |
| dc.creator | Wang, Y | - |
| dc.creator | Xu, H | - |
| dc.creator | Chau, LP | - |
| dc.date.accessioned | 2025-12-18T06:42:05Z | - |
| dc.date.available | 2025-12-18T06:42:05Z | - |
| dc.identifier.issn | 1566-2535 | - |
| dc.identifier.uri | http://hdl.handle.net/10397/116349 | - |
| dc.language.iso | en | en_US |
| dc.publisher | Elsevier | en_US |
| dc.subject | Contrastive learning | en_US |
| dc.subject | Cross-modality | en_US |
| dc.subject | Descriptor representation | en_US |
| dc.subject | Place recognition | en_US |
| dc.title | Contrastive learning-based place descriptor representation for cross-modality place recognition | en_US |
| dc.type | Journal/Magazine Article | en_US |
| dc.identifier.volume | 124 | - |
| dc.identifier.doi | 10.1016/j.inffus.2025.103351 | - |
| dcterms.abstract | Place recognition in LiDAR maps plays a vital role in assisting localization, especially in GPS-denied circumstances. While many efforts have been made toward pure LiDAR-based place recognition, these approaches are often hindered by high computational costs and the operational burden they place on the driving agent. To alleviate these limitations, we explore an alternative approach for large-scale cross-modal localization by matching real-time RGB images to pre-existing LiDAR 3D point cloud maps. Specifically, we present a unified place descriptor representation learning method across modalities using a Siamese architecture, which reformulates place recognition as a similarity-based retrieval task. To address the inherent modality differences between visual images and point clouds, we first transform unordered point clouds into a range-view representation, facilitating effective cross-modal metric learning. Subsequently, we introduce a Transformer-Mamba Mixer module that integrates selective scanning and attention mechanisms to capture both intra-context and inter-context embeddings, enabling the generation of place descriptors. To further enrich the global location descriptors, we propose a semantic-promoted descriptor enhancer grounded in semantic distribution estimation. Finally, a contrastive learning paradigm is employed to perform cross-modal place recognition, identifying the most similar descriptors across modalities. Extensive experiments demonstrate the superiority of our proposed method in comparison to state-of-the-art methods. The details are available at https://github.com/emilyemliyM/Cross-PRNet. | - |
| dcterms.accessRights | embargoed access | en_US |
| dcterms.bibliographicCitation | Information fusion, Dec. 2025, v. 124, 103351 | - |
| dcterms.isPartOf | Information fusion | - |
| dcterms.issued | 2025-12 | - |
| dc.identifier.scopus | 2-s2.0-105007426191 | - |
| dc.identifier.eissn | 1872-6305 | - |
| dc.identifier.artn | 103351 | - |
| dc.description.validate | 202512 bcjz | - |
| dc.description.oa | Not applicable | en_US |
| dc.identifier.SubFormID | G000429/2025-11 | en_US |
| dc.description.fundingSource | RGC | en_US |
| dc.description.fundingSource | Others | en_US |
| dc.description.fundingText | The research work was conducted in the JC STEM Lab of Machine Learning and Computer Vision funded by The Hong Kong Jockey Club Charities Trust. It was partially supported by the Research Grants Council of the Hong Kong SAR, China (Project No. PolyU 15215824). | en_US |
| dc.description.pubStatus | Published | en_US |
| dc.date.embargo | 2027-12-31 | en_US |
| dc.description.oaCategory | Green (AAM) | en_US |
Appears in Collections: Journal/Magazine Article
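
The abstract describes transforming unordered point clouds into a range-view representation before cross-modal metric learning. Below is a minimal sketch of such a spherical (range-view) projection; the image resolution and field-of-view values are illustrative assumptions for a roughly 64-beam sensor, not parameters taken from the paper.

```python
import numpy as np

def range_view_projection(points, h=64, w=900, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR point cloud onto an (h, w) range image.

    fov_up / fov_down are vertical field-of-view bounds in degrees.
    These defaults are illustrative, not taken from the paper.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)          # range of each point

    yaw = np.arctan2(y, x)                          # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(depth, 1e-8))  # elevation angle

    fov_up_r = np.radians(fov_up)
    fov_down_r = np.radians(fov_down)
    fov = fov_up_r - fov_down_r

    # Map azimuth to columns and elevation to rows of the range image.
    u = 0.5 * (1.0 - yaw / np.pi) * w
    v = (1.0 - (pitch - fov_down_r) / fov) * h

    u = np.clip(np.floor(u), 0, w - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int64)

    # Write points far-to-near so the closest point wins each pixel.
    order = np.argsort(depth)[::-1]
    image = np.zeros((h, w), dtype=np.float32)
    image[v[order], u[order]] = depth[order]
    return image
```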
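The abstract further states that a contrastive learning paradigm matches descriptors across modalities, casting place recognition as similarity-based retrieval. The sketch below shows a symmetric InfoNCE-style objective over paired image and range-view descriptors, plus a nearest-descriptor retrieval step. The function names, the temperature value, and the batch-diagonal positive assumption are illustrative conventions, not the authors' implementation (which lives at https://github.com/emilyemliyM/Cross-PRNet).

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(img_desc, pc_desc, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired descriptors.

    img_desc, pc_desc: (B, D) global descriptors from the image and
    range-view branches; row i of each tensor describes the same place.
    The temperature is a common default, not taken from the paper.
    """
    img = F.normalize(img_desc, dim=1)
    pc = F.normalize(pc_desc, dim=1)

    logits = img @ pc.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)

    # Matching pairs sit on the diagonal (positives); all other
    # entries in each row/column act as negatives.
    loss_i2p = F.cross_entropy(logits, targets)
    loss_p2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2p + loss_p2i)

def retrieve(query_img_desc, map_pc_descs):
    """Rank map descriptors by cosine similarity to each query image."""
    q = F.normalize(query_img_desc, dim=1)
    m = F.normalize(map_pc_descs, dim=1)
    return (q @ m.t()).argsort(dim=1, descending=True)
```

At query time, `retrieve` returns the indices of the pre-built LiDAR map descriptors ordered by similarity; the top-ranked index is the recognized place.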