Minute-long videos with dual parallelisms

Wang, Z; Zheng, B; Yang, X; Tan, Z; Xu, Y; Wang, X

doi:10.1609/aaai.v40i12.38006

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119322

Title:	Minute-long videos with dual parallelisms
Authors:	Wang, Z; Zheng, B; Yang, X; Tan, Z; Xu, Y; Wang, X
Issue Date:	2026
Source:	In S Koenig, C Jenkins, & ME Taylor (Eds.), Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence, p. 10358-10366. Washington, DC: Association for the Advancement of Artificial Intelligence, 2026
Abstract:	Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize computation by partitioning both video frames and model layers across multiple GPUs. However, a naive parallel implementation is not feasible. Because all frames need to share the same noise level, they cannot be processed independently. Instead, every step must wait for all others to finish, which cancels out the speed benefits of parallel processing. We overcome this obstacle with a block-wise denoising scheme. Namely, we segment the video into sequential blocks, each with a different noise level. As a result, we process them in a pipeline across the GPUs. Each GPU, holding a subset of the model layers, processes a specific block of frames and passes the results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, each GPU uses a feature cache technique to reduce the overhead of smooth transitions by reusing only features involved in cross-frame computation from the prior block, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost on 8× RTX 4090 GPUs.
Publisher:	AAAI Press
ISBN:	1-57735-906-2; 978-1-57735-906-7
DOI:	10.1609/aaai.v40i12.38006
Description:	The 40th AAAI Conference on Artificial Intelligence, January 20-27, 2026, Singapore
Rights:	Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. The following publication Wang, Z., Zheng, B., Yang, X., Tan, Z., Xu, Y., & Wang, X. (2026). Minute-Long Videos with Dual Parallelisms. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10358–10366 is available at https://dx.doi.org/10.1609/aaai.v40i12.38006.
Appears in Collections:	Conference Paper

Files in This Item:

File	Description	Size	Format
Wang_Minute_Long_Videos.pdf		6.25 MB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record
