Photorealistic fire scene video generation via multimodal large language model and pre-trained video diffusion model

Zheng, H; Huang, X

doi:10.26599/CVM.2025.9450511

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/117401

Title:	Photorealistic fire scene video generation via multimodal large language model and pre-trained video diffusion model
Authors:	Zheng, H; Huang, X
Issue Date:	2026
Source:	Computational visual media, Date of Publication: 27 January 2026, Early Access, https://doi.org/10.26599/CVM.2025.9450511
Abstract:	Text-to-video diffusion models have made significant progress. However, there is still a lack of dedicated research on generating fire scene videos with physical realism and visual fidelity. To address this gap, we propose text-to-video fire (T2VFire) scene generation. T2VFire uses GPT-4o as the core engine, which is integrated with an external fire-related knowledge base and a retrieval-augmented generation (RAG) mechanism that can be dynamically updated based on prompts. With the support of this knowledge, the system first expands the user's initial text description and generates a keyframe image. Then, through iterative prompt optimization, it guides a pretrained video diffusion model to generate fire scene videos with physical consistency. Experimental results show that T2VFire improves upon the physical consistency and visual realism of fire scene videos generated by current video generation models. This method provides a solid foundation for future smart firefighting and digital twin systems in building fire safety management.
Keywords:	Diffusion models; Fire; Physicality; Text-to-Video (T2V); Video
Publisher:	Tsinghua University Press
Journal:	Computational visual media
ISSN:	2096-0433
EISSN:	2096-0662
DOI:	10.26599/CVM.2025.9450511
Rights:	© The Author(s) 2026. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The following publication H. Zheng and X. Huang, "Photorealistic fire scene video generation via multimodal large language model and pre-trained video diffusion model," in Computational Visual Media is available at https://doi.org/10.26599/CVM.2025.9450511.
Appears in Collections:	Journal/Magazine Article

Files in This Item:

File	Description	Size	Format
Zheng_Photorealistic_Fire_Scene.pdf		20.9 MB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record
