Geometry-guided transformer for monocular 3D object detection

Zhang, M; Zhang, Y; Sun, J; Yung, KL; Yang, L

doi:10.1002/aisy.202500003

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/113968

Title:	Geometry-guided transformer for monocular 3D object detection
Authors:	Zhang, M; Zhang, Y; Sun, J; Yung, KL; Yang, L
Issue Date:	Nov-2025
Source:	Advanced intelligent systems, Nov. 2025, v. 7, no. 11, 2500003
Abstract:	Monocular 3D object detection aims to identify objects’ 3D positions and poses at low hardware and computational cost, which is crucial for scenarios such as autonomous driving and deep space exploration. Although research in this area has advanced rapidly with the integration of transformer structures, 3D features are still simply transformed from visual features, resulting in a mismatch between detection results and reality. Moreover, most existing methods converge slowly. To address these issues, a framework named geometry-guided monocular detection with transformer (GG-Mono) is proposed. It consists of three main components: 1) a mix-feature encoder module that incorporates pretrained depth estimation models to improve convergence speed and accuracy; 2) a geometry encoding module that supplements hybrid encoding with global geometry data; and 3) a GG decoder module that uses geometry queries to guide the decoding process. Extensive experiments show that the model outperforms all existing methods in detection accuracy, achieving 26.88% and 30.65% average precision of the 3D detection box (AP3D) on the validation and test datasets, respectively, which is 1.88% and 1.81% higher than the baseline, while significantly improving convergence speed (from 184 to 90 epochs). These results demonstrate the advantages of the proposed method for monocular 3D object detection.
Keywords:	3D object detection; Deep learning; Single-view geometry; Transformer
Publisher:	Wiley-VCH GmbH
Journal:	Advanced Intelligent Systems
ISSN:	2640-4567
DOI:	10.1002/aisy.202500003
Rights:	© 2025 The Author(s). Advanced Intelligent Systems published by Wiley-VCH GmbH. This is an open access article under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits use, distribution and reproduction in any medium, provided the original work is properly cited. The following publication Zhang, M., Zhang, Y., Sun, J., Yung, K.-L. and Yang, L. (2025), Geometry-Guided Transformer for Monocular 3D Object Detection. Adv. Intell. Syst., 7: 2500003 is available at https://doi.org/10.1002/aisy.202500003.
Appears in Collections:	Journal/Magazine Article

Files in This Item:

File	Description	Size	Format
Zhang_Geometry_Guided_Transformer.pdf		3.6 MB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record
