Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/117401
Title: Photorealistic fire scene video generation via multimodal large language model and pre-trained video diffusion model
Authors: Zheng, H 
Huang, X 
Issue Date: 2026
Source: Computational visual media, Date of Publication: 27 January 2026, Early Access, https://doi.org/10.26599/CVM.2025.9450511
Abstract: Text-to-video diffusion models have made significant progress. However, there is still a lack of dedicated research on generating fire scene videos with physical realism and visual fidelity. To address this gap, we propose text-to-video fire (T2VFire) scene generation. T2VFire uses GPT-4o as the core engine, which is integrated with an external fire-related knowledge base and a retrieval-augmented generation (RAG) mechanism that can be dynamically updated based on prompts. With the support of this knowledge, the system first expands the user's initial text description and generates a keyframe image. Then, through iterative prompt optimization, it guides a pretrained video diffusion model to generate fire scene videos with physical consistency. Experimental results show that T2VFire improves upon the physical consistency and visual realism of fire scene videos generated by current video generation models. This method provides a solid foundation for future smart firefighting and digital twin systems in building fire safety management.
Keywords: Diffusion models
Fire
Physicality
Text-to-Video (T2V)
Video
Publisher: Tsinghua University Press
Journal: Computational visual media 
ISSN: 2096-0433
EISSN: 2096-0662
DOI: 10.26599/CVM.2025.9450511
Rights: © The Author(s) 2026.
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The following publication is available at https://doi.org/10.26599/CVM.2025.9450511: H. Zheng and X. Huang, "Photorealistic fire scene video generation via multimodal large language model and pre-trained video diffusion model," Computational Visual Media.
Appears in Collections: Journal/Magazine Article

Files in This Item:
File: Zheng_Photorealistic_Fire_Scene.pdf
Size: 20.9 MB
Format: Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.