Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119106
Title: Leveraging pretrained diffusion model for semantic 3-D reconstruction from monocular remote sensing image
Authors: Xu, X; Deng, R; Cao, Q; Guo, Z; Chen, Y; Yan, J
Issue Date: 2026
Source: IEEE transactions on geoscience and remote sensing, 2026, v. 64, 5603516
Abstract: Semantic 3D reconstruction from monocular imagery serves as a cost-effective tool for many urban applications, such as energy system modeling, resilience analysis, and urban planning. However, the generalization of task-specific models for semantic 3D reconstruction remains limited by the available data scale and diversity. In contrast, visual foundation models (VFMs) are trained on large-scale, diverse datasets, enabling stronger adaptability and richer visual knowledge across different tasks. Unlike most VFMs that focus on discrimination or feature extraction, pretrained diffusion models (PDMs) are generative, combining high-level semantic understanding with the ability to produce high-fidelity details and textures. Building upon these advantages, this study proposes a novel task-adaptive framework that harnesses PDMs for semantic 3D reconstruction from monocular remote sensing images. Our framework employs low-rank adaptation to efficiently fine-tune the denoising network, effectively modeling the high-dimensional features required for semantic 3D reconstruction while only training a minimal fraction of parameters. We further design a lightweight, task-specific decoder to map these features into target elevation and semantic maps. In addition, we introduce an evidential height regression method, which incorporates uncertainty awareness into height estimation without introducing additional computational overhead. Experiments on the public US3D JAX and Open Data DC datasets demonstrate that our framework significantly outperforms other existing methods in both subtasks of height estimation and semantic segmentation, achieving high-fidelity semantic 3D reconstruction of remote sensing scenes. This technology holds significant potential for advancing urban modeling, enabling more accurate and efficient large-scale geographic analysis.
Keywords: Low-rank adaptation (LoRA); Pretrained diffusion model (PDM); Semantic 3-D reconstruction; Task adaptation; Visual foundation models
Publisher: Institute of Electrical and Electronics Engineers
Journal: IEEE transactions on geoscience and remote sensing 
ISSN: 0196-2892
EISSN: 1558-0644
DOI: 10.1109/TGRS.2026.3653117
Rights: © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
The following publication X. Xu, R. Deng, Q. Cao, Z. Guo, Y. Chen and J. Yan, 'Leveraging Pretrained Diffusion Model for Semantic 3-D Reconstruction From Monocular Remote Sensing Image,' in IEEE Transactions on Geoscience and Remote Sensing, vol. 64, pp. 1-16, 2026, Art no. 5603516 is available at https://doi.org/10.1109/TGRS.2026.3653117.
Appears in Collections: Journal/Magazine Article

Files in This Item:
File: Xu_Leveraging_Pretrained_Diffusion.pdf
Description: Pre-Published version
Size: 30.65 MB
Format: Adobe PDF
Open Access Information
Status: open access
File Version: Final Accepted Manuscript

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.