Merge3D : efficient 3D multimodal LLMs via joint 2D-3D token merging

Pan, T; Yang, X; Wang, X

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119530

DC Field	Value	Language
dc.contributor	Department of Data Science and Artificial Intelligence	en_US
dc.creator	Pan, T	en_US
dc.creator	Yang, X	en_US
dc.creator	Wang, X	en_US
dc.date.accessioned	2026-06-26T06:51:34Z	-
dc.date.available	2026-06-26T06:51:34Z	-
dc.identifier.uri	http://hdl.handle.net/10397/119530	-
dc.description	The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026, June 3 - June 7, 2026, Colorado Convention Center	en_US
dc.description	The following paper Tianbo Pan, Xingyi Yang, Xinchao Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 31066-31077 is available at https://openaccess.thecvf.com/content/CVPR2026/html/Pan_Merge3D_Efficient_3D_Multimodal_LLMs_via_Joint_2D-3D_Token_Merging_CVPR_2026_paper.html	en_US
dc.language.iso	en	en_US
dc.title	Merge3D : efficient 3D multimodal LLMs via joint 2D-3D token merging	en_US
dc.type	Conference Paper	en_US
dc.identifier.spage	31066	en_US
dc.identifier.epage	31077	en_US
dcterms.abstract	Multimodal Large Language Models (MLLMs) incorporating 3D geometry demonstrate significant power in 3D scene understanding. Their primary bottleneck, however, is the substantial computational burden associated with processing multi-view, lengthy visual token sequences. To surmount this challenge, we propose Merge3D, a geometry-aware token merging framework that integrates both 3D geometry and 2D semantic information. Conventional 2D compression methods, which rely solely on semantic signals, prove inadequate for 3D tasks, as they tend to discard spatially critical tokens and damage grounding performance. Merge3D bridges the modalities with a Semantic–Geometric Token Merger (SemGeo Merger): 2D attention is used to select semantically salient dominant tokens, while a hybrid 2D+3D similarity assigns and aggregates contextual tokens from spatially coherent 3D neighborhoods. This preserves 3D structural priors and inter-frame correspondences under aggressive compression. Merge3D achieves up to 70% visual token reduction and up to roughly 3× inference speedup, while retaining strong performance on 3D grounding, captioning, and spatial reasoning benchmarks such as Scan2Cap, CV-Bench, and BLINK.	en_US
dcterms.accessRights	embargoed access	en_US
dcterms.bibliographicCitation	The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026, June 3 - June 7, 2026, Colorado Convention Center, p. 31066-31077	en_US
dcterms.issued	2026	-
dc.relation.conference	IEEE/CVF Conference on Computer Vision and Pattern Recognition [CVPR]	en_US
dc.description.validate	202606 bcch	en_US
dc.description.oa	Not applicable	en_US
dc.identifier.FolderNumber	a4535b	-
dc.identifier.SubFormID	53070	-
dc.description.fundingSource	Others	en_US
dc.description.fundingText	This project is supported by the National Research Foundation, Singapore, and Cyber Security Agency of Singapore under its National Cybersecurity R&D Programme and CyberSG R&D Cyber Research Programme Office (Award: CRPO-GC1-NTU-002). This work was also supported by the Presidential Young Scholars Scheme (Project ID: P0058232) from The Hong Kong Polytechnic University and by TeleAI.	en_US
dc.description.pubStatus	Not yet published	en_US
dc.date.embargo	0000-00-00 (to be updated)	en_US
dc.description.oaCategory	Green (AAM)	en_US
Appears in Collections:	Conference Paper
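
Illustrative note on the abstract: the merging idea it describes (select dominant tokens by 2D attention, then assign and aggregate the remaining contextual tokens using a hybrid 2D+3D similarity) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name `merge_tokens`, the parameters `keep_ratio` and `alpha`, the cosine and exponential-distance similarity terms, and the mean aggregation are all hypothetical choices made for illustration; the record does not specify any of them.

```python
import math


def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def merge_tokens(feats, coords, attn, keep_ratio=0.3, alpha=0.5):
    """Hypothetical sketch of attention-guided, geometry-aware token merging.

    feats  : list of N feature vectors (the 2D semantic signal)
    coords : list of N 3D positions, one per token (the 3D geometric signal)
    attn   : list of N attention scores used to rank tokens
    keep_ratio : fraction of tokens kept as dominant (assumed parameter)
    alpha  : weight of the semantic term in the hybrid similarity (assumed)

    Returns (merged_features, dominant_indices).
    """
    n = len(feats)
    k = max(1, round(n * keep_ratio))

    # Step 1: the k tokens with the highest attention become dominant tokens.
    order = sorted(range(n), key=lambda i: -attn[i])
    dominant, contextual = order[:k], order[k:]

    sums = [list(feats[d]) for d in dominant]
    counts = [1] * k

    # Step 2: assign each contextual token to its most similar dominant token,
    # scoring with a weighted mix of feature similarity and 3D proximity.
    for c in contextual:
        def hybrid(slot, c=c):
            d = dominant[slot]
            semantic = cosine(feats[c], feats[d])
            geometric = math.exp(-math.dist(coords[c], coords[d]))
            return alpha * semantic + (1 - alpha) * geometric

        best = max(range(k), key=hybrid)
        sums[best] = [s + f for s, f in zip(sums[best], feats[c])]
        counts[best] += 1

    # Step 3: aggregate each group by averaging its member features.
    merged = [[s / counts[j] for s in sums[j]] for j in range(k)]
    return merged, dominant
```

With `keep_ratio=0.3`, the sketch keeps about 30% of the tokens, which corresponds to the "up to 70%" reduction figure quoted in the abstract; how the actual method chooses its budget is not stated in this record.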

Open Access Information

Status	embargoed access
Embargo End Date	0000-00-00 (to be updated)
