Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/119530
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Data Science and Artificial Intelligence | en_US |
| dc.creator | Pan, T | en_US |
| dc.creator | Yang, X | en_US |
| dc.creator | Wang, X | en_US |
| dc.date.accessioned | 2026-06-26T06:51:34Z | - |
| dc.date.available | 2026-06-26T06:51:34Z | - |
| dc.identifier.uri | http://hdl.handle.net/10397/119530 | - |
| dc.description | The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026, June 3 - June 7, 2026, Colorado Convention Center | en_US |
| dc.description | The following paper is available at https://openaccess.thecvf.com/content/CVPR2026/html/Pan_Merge3D_Efficient_3D_Multimodal_LLMs_via_Joint_2D-3D_Token_Merging_CVPR_2026_paper.html : Tianbo Pan, Xingyi Yang, Xinchao Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 31066-31077 | en_US |
| dc.language.iso | en | en_US |
| dc.title | Merge3D : efficient 3D multimodal LLMs via joint 2D-3D token merging | en_US |
| dc.type | Conference Paper | en_US |
| dc.identifier.spage | 31066 | en_US |
| dc.identifier.epage | 31077 | en_US |
| dcterms.abstract | Multimodal Large Language Models (MLLMs) incorporating 3D geometry demonstrate significant power in 3D scene understanding. Their primary bottleneck, however, is the substantial computational burden associated with processing multi-view, lengthy visual token sequences. To surmount this challenge, we propose Merge3D, a geometry-aware token merging framework that integrates both 3D geometry and 2D semantic information. Conventional 2D compression methods, which rely solely on semantic signals, prove inadequate for 3D tasks, as they tend to discard spatially critical tokens and damage grounding performance. Merge3D bridges the modalities with a Semantic–Geometric Token Merger (SemGeo Merger): 2D attention is used to select semantically salient dominant tokens, while a hybrid 2D+3D similarity assigns and aggregates contextual tokens from spatially coherent 3D neighborhoods. This preserves 3D structural priors and inter-frame correspondences under aggressive compression. Merge3D achieves up to 70% visual token reduction and up to roughly 3× inference speedup, while retaining strong performance on 3D grounding, captioning, and spatial reasoning benchmarks such as Scan2Cap, CV-Bench, and BLINK. | en_US |
| dcterms.accessRights | embargoed access | en_US |
| dcterms.bibliographicCitation | The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026, June 3 - June 7, 2026, Colorado Convention Center, p. 31066-31077 | en_US |
| dcterms.issued | 2026 | - |
| dc.relation.conference | IEEE/CVF Conference on Computer Vision and Pattern Recognition [CVPR] | en_US |
| dc.description.validate | 202606 bcch | en_US |
| dc.description.oa | Not applicable | en_US |
| dc.identifier.FolderNumber | a4535b | - |
| dc.identifier.SubFormID | 53070 | - |
| dc.description.fundingSource | Others | en_US |
| dc.description.fundingText | This project is supported by the National Research Foundation, Singapore, and Cyber Security Agency of Singapore under its National Cybersecurity R&D Programme and CyberSG R&D Cyber Research Programme Office (Award: CRPO-GC1-NTU-002). This work was also supported by the Presidential Young Scholars Scheme (Project ID: P0058232) from The Hong Kong Polytechnic University and by TeleAI. | en_US |
| dc.description.pubStatus | Not yet published | en_US |
| dc.date.embargo | 0000-00-00 (to be updated) | en_US |
| dc.description.oaCategory | Green (AAM) | en_US |
| Appears in Collections: | Conference Paper | |
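
The merging step summarized in the abstract (attention-based selection of dominant tokens, then hybrid 2D+3D assignment and aggregation of the remaining tokens) can be illustrated with a small sketch. This is not the authors' implementation: the function name `merge_tokens`, the weighting parameter `alpha`, the Gaussian kernel with width `sigma` on 3D distance, and mean aggregation are all assumptions chosen for illustration, since this record does not give the paper's actual formulation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(features, positions, attention, keep, alpha=0.5, sigma=1.0):
    """Hypothetical semantic-geometric token merging sketch.

    features  : list of 2D semantic feature vectors, one per visual token
    positions : list of 3D coordinates, one per visual token
    attention : list of per-token attention scores
    keep      : number of dominant tokens to retain
    alpha     : assumed weight between semantic and geometric similarity
    sigma     : assumed width of the Gaussian kernel on 3D distance
    """
    n = len(features)

    # Step 1: pick the `keep` tokens with the highest attention as dominant tokens.
    order = sorted(range(n), key=lambda i: attention[i], reverse=True)
    dominant = sorted(order[:keep])
    dominant_set = set(dominant)
    groups = {d: [d] for d in dominant}

    # Step 2: assign every other (contextual) token to the dominant token
    # with the highest hybrid similarity.
    for i in range(n):
        if i in dominant_set:
            continue

        def hybrid(d, i=i):
            semantic = cosine(features[i], features[d])
            dist = math.dist(positions[i], positions[d])
            geometric = math.exp(-(dist ** 2) / (2 * sigma ** 2))
            return alpha * semantic + (1 - alpha) * geometric

        groups[max(dominant, key=hybrid)].append(i)

    # Step 3: aggregate each group into one token (plain mean, an assumption).
    merged = []
    for d in dominant:
        members = groups[d]
        dim = len(features[d])
        merged.append(
            [sum(features[m][k] for m in members) / len(members) for k in range(dim)]
        )
    return merged, groups


# Toy example: four tokens forming two spatially separate clusters.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
attention = [0.9, 0.1, 0.8, 0.2]
merged, groups = merge_tokens(features, positions, attention, keep=2)
```

In this toy input the two high-attention tokens are kept and each remaining token is folded into the dominant token that is both semantically similar and close in 3D, so four tokens become two. Setting `alpha=1.0` in the sketch reduces it to a purely semantic assignment, the kind of 2D-only compression the abstract describes as inadequate for 3D tasks.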
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.


