Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119530
DC Field | Value | Language
dc.contributor | Department of Data Science and Artificial Intelligence | en_US
dc.creator | Pan, T | en_US
dc.creator | Yang, X | en_US
dc.creator | Wang, X | en_US
dc.date.accessioned | 2026-06-26T06:51:34Z | -
dc.date.available | 2026-06-26T06:51:34Z | -
dc.identifier.uri | http://hdl.handle.net/10397/119530 | -
dc.description | The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026, June 3 - June 7, 2026, Colorado Convention Center | en_US
dc.description | The following paper: Tianbo Pan, Xingyi Yang, Xinchao Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 31066-31077, is available at https://openaccess.thecvf.com/content/CVPR2026/html/Pan_Merge3D_Efficient_3D_Multimodal_LLMs_via_Joint_2D-3D_Token_Merging_CVPR_2026_paper.html | en_US
dc.language.iso | en | en_US
dc.title | Merge3D : efficient 3D multimodal LLMs via joint 2D-3D token merging | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 31066 | en_US
dc.identifier.epage | 31077 | en_US
dcterms.abstract | Multimodal Large Language Models (MLLMs) incorporating 3D geometry demonstrate significant power in 3D scene understanding. Their primary bottleneck, however, is the substantial computational burden associated with processing multi-view, lengthy visual token sequences. To surmount this challenge, we propose Merge3D, a geometry-aware token merging framework that integrates both 3D geometry and 2D semantic information. Conventional 2D compression methods, which rely solely on semantic signals, prove inadequate for 3D tasks, as they tend to discard spatially critical tokens and damage grounding performance. Merge3D bridges the modalities with a Semantic–Geometric Token Merger (SemGeo Merger): 2D attention is used to select semantically salient dominant tokens, while a hybrid 2D+3D similarity assigns and aggregates contextual tokens from spatially coherent 3D neighborhoods. This preserves 3D structural priors and inter-frame correspondences under aggressive compression. Merge3D achieves up to 70% visual token reduction and up to approximately 3x inference speedup, while retaining strong performance on 3D grounding, captioning, and spatial reasoning benchmarks such as Scan2Cap, CV-Bench, and BLINK. | en_US
dcterms.accessRights | embargoed access | en_US
dcterms.bibliographicCitation | The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026, June 3 - June 7, 2026, Colorado Convention Center, p. 31066-31077 | en_US
dcterms.issued | 2026 | -
dc.relation.conference | IEEE/CVF Conference on Computer Vision and Pattern Recognition [CVPR] | en_US
dc.description.validate | 202606 bcch | en_US
dc.description.oa | Not applicable | en_US
dc.identifier.FolderNumber | a4535b | -
dc.identifier.SubFormID | 53070 | -
dc.description.fundingSource | Others | en_US
dc.description.fundingText | This project is supported by the National Research Foundation, Singapore, and Cyber Security Agency of Singapore under its National Cybersecurity R&D Programme and CyberSG R&D Cyber Research Programme Office (Award: CRPO-GC1-NTU-002). This work was also supported by the Presidential Young Scholars Scheme (Project ID: P0058232) from The Hong Kong Polytechnic University and by TeleAI. | en_US
dc.description.pubStatus | Not yet published | en_US
dc.date.embargo | 0000-00-00 (to be updated) | en_US
dc.description.oaCategory | Green (AAM) | en_US
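
Illustrative note on the abstract: the two-stage idea it describes (pick salient "dominant" tokens from 2D attention, then assign the remaining "contextual" tokens to a dominant token using a hybrid 2D+3D similarity and aggregate them) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name `merge_tokens`, the parameters `alpha` and `sigma`, the use of cosine similarity for the 2D term, a Gaussian kernel on 3D distance for the geometric term, and mean aggregation are all hypothetical choices made for illustration; the record does not specify any of them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_tokens(features, positions, saliency, keep, alpha=0.5, sigma=1.0):
    """Toy semantic+geometric token merging (assumed design, see lead-in).

    features  : list of per-token feature vectors (2D semantic signal)
    positions : list of per-token 3D coordinates (geometric signal)
    saliency  : per-token attention score used to pick dominant tokens
    keep      : number of dominant tokens to retain
    alpha     : assumed weight between semantic and geometric similarity
    sigma     : assumed bandwidth of the Gaussian distance kernel
    """
    n = len(features)
    # Step 1: keep the `keep` most salient tokens as dominant tokens.
    by_saliency = sorted(range(n), key=lambda i: saliency[i], reverse=True)
    dominant = sorted(by_saliency[:keep])
    dominant_set = set(dominant)
    groups = {d: [d] for d in dominant}

    # Step 2: assign each remaining token to its most similar dominant token,
    # scoring with a weighted mix of feature similarity and 3D proximity.
    for i in range(n):
        if i in dominant_set:
            continue

        def hybrid(d, i=i):
            sem = cosine(features[i], features[d])
            dist = math.dist(positions[i], positions[d])
            geo = math.exp(-(dist ** 2) / (2 * sigma ** 2))
            return alpha * sem + (1 - alpha) * geo

        groups[max(dominant, key=hybrid)].append(i)

    # Step 3: aggregate each group by averaging its member features.
    merged = []
    for d in dominant:
        members = groups[d]
        dim = len(features[d])
        merged.append(
            [sum(features[m][k] for m in members) / len(members) for k in range(dim)]
        )
    return merged, groups


# Toy example: four tokens forming two spatial clusters, reduced to two tokens.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pos = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 5.0], [5.1, 5.0, 5.0]]
sal = [0.9, 0.2, 0.8, 0.1]
merged, groups = merge_tokens(feats, pos, sal, keep=2)
print(groups)
```

In this toy input, tokens 1 and 3 are folded into the dominant tokens 0 and 2 respectively, because each is close to its dominant token in both feature space and 3D space. Setting `alpha=1.0` would reduce the score to a purely semantic one, which is the kind of 2D-only compression the abstract argues is inadequate for 3D tasks.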
Appears in Collections: Conference Paper
Open Access Information
Status embargoed access
Embargo End Date 0000-00-00 (to be updated)
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.