Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119530
Title: Merge3D : efficient 3D multimodal LLMs via joint 2D-3D token merging
Authors: Pan, T
Yang, X 
Wang, X
Issue Date: 2026
Source: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026, June 3 - June 7, 2026, Colorado Convention Center, p. 31066-31077
Abstract: Multimodal Large Language Models (MLLMs) incorporating 3D geometry demonstrate significant power in 3D scene understanding. Their primary bottleneck, however, is the substantial computational burden associated with processing multi-view, lengthy visual token sequences. To surmount this challenge, we propose Merge3D, a geometry-aware token merging framework that integrates both 3D geometry and 2D semantic information. Conventional 2D compression methods, which rely solely on semantic signals, prove inadequate for 3D tasks, as they tend to discard spatially critical tokens and damage grounding performance. Merge3D bridges the modalities with a Semantic–Geometric Token Merger (SemGeo Merger): 2D attention is used to select semantically salient dominant tokens, while a hybrid 2D+3D similarity assigns and aggregates contextual tokens from spatially coherent 3D neighborhoods. This preserves 3D structural priors and inter-frame correspondences under aggressive compression. Merge3D achieves up to 70% visual token reduction and up to ~3x inference speedup, while retaining strong performance on 3D grounding, captioning, and spatial reasoning benchmarks such as Scan2Cap, CV-Bench, and BLINK.
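Illustrative sketch (not from the paper): the abstract describes selecting dominant tokens by 2D attention and then assigning and aggregating the remaining contextual tokens with a hybrid 2D+3D similarity. The NumPy code below is one possible reading of that description, under stated assumptions. The function name merge_tokens, the parameters keep_ratio, alpha, and sigma, the cosine and Gaussian similarity terms, and the mean aggregation are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def merge_tokens(feats, attn, xyz, keep_ratio=0.3, alpha=0.5, sigma=1.0):
    """Hypothetical semantic-geometric token merging.

    feats: (N, D) visual token features (2D semantics)
    attn:  (N,)   per-token 2D attention scores
    xyz:   (N, 3) 3D position associated with each token
    Returns (merged_feats, dominant_indices).
    """
    n = feats.shape[0]
    k = max(1, int(round(n * keep_ratio)))

    # Dominant tokens: the k tokens with the highest attention score.
    order = np.argsort(-attn, kind="stable")
    dom = np.sort(order[:k])
    ctx = np.sort(order[k:])
    if ctx.size == 0:
        return feats[dom].astype(float), dom

    # Semantic term (assumed): cosine similarity between features.
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sem = unit[ctx] @ unit[dom].T

    # Geometric term (assumed): Gaussian kernel on squared 3D distance.
    d2 = ((xyz[ctx][:, None, :] - xyz[dom][None, :, :]) ** 2).sum(axis=-1)
    geo = np.exp(-d2 / (2.0 * sigma ** 2))

    # Hybrid similarity (assumed): convex combination of the two terms.
    sim = alpha * sem + (1.0 - alpha) * geo
    assign = sim.argmax(axis=1)  # position within `dom` for each contextual token

    # Aggregation (assumed): mean of each dominant token and its assigned tokens.
    merged = feats[dom].astype(float)
    counts = np.ones(k)
    np.add.at(merged, assign, feats[ctx])
    np.add.at(counts, assign, 1.0)
    return merged / counts[:, None], dom
```

With keep_ratio=0.3, the sketch keeps 30% of the input tokens, which corresponds to the 70% token reduction quoted in the abstract. The weighting between semantic and geometric similarity (alpha) is a free parameter here and says nothing about how the published method balances the two.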
Description: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026, June 3 - June 7, 2026, Colorado Convention Center
The following paper is available at https://openaccess.thecvf.com/content/CVPR2026/html/Pan_Merge3D_Efficient_3D_Multimodal_LLMs_via_Joint_2D-3D_Token_Merging_CVPR_2026_paper.html: Tianbo Pan, Xingyi Yang, Xinchao Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 31066-31077
Appears in Collections: Conference Paper

Open Access Information
Status: embargoed access
Embargo End Date: 0000-00-00 (to be updated)
