Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119322
Title: Minute-long videos with dual parallelisms
Authors: Wang, Z
Zheng, B
Yang, X 
Tan, Z
Xu, Y
Wang, X
Issue Date: 2026
Source: In S Koenig, C Jenkins, & ME Taylor (Eds.), Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence, p. 10358-10366. Washington, DC: Association for the Advancement of Artificial Intelligence, 2026
Abstract: Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize computation by partitioning both video frames and model layers across multiple GPUs. However, a naive parallel implementation is not feasible. Because all frames need to share the same noise level, they can't be processed independently. Instead, every step must wait for all others to finish, which cancels out the speed benefits of parallel processing. We overcome this obstacle with a block-wise denoising scheme. Namely, we segment the video into sequential blocks, each with a different noise level. As a result, we process them in a pipeline across the GPUs. Each GPU, holding a subset of the model layers, processes a specific block of frames and passes the results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, each GPU uses a feature cache technique to reduce the overhead of smooth transitions by reusing only features involved in cross-frame computation from the prior block, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54x lower latency and 1.48x lower memory cost on 8xRTX 4090 GPUs.
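The pipelining idea described in the abstract can be illustrated with a toy schedule. This is a minimal sketch and not the authors' implementation: it assumes every stage (a GPU holding one slice of the model layers) takes one uniform time unit per block, ignores communication cost, and models only a single pass of each block through all layers. The names `pipeline_schedule` and `sequential_ticks` are hypothetical.

```python
from typing import Dict, List


def pipeline_schedule(num_blocks: int, num_stages: int) -> List[Dict[int, int]]:
    """Return, for each tick, a mapping {stage index: block index}.

    Assumption: stage s stands for the GPU holding the s-th slice of
    model layers. Block b enters stage 0 at tick b, so it reaches
    stage s at tick b + s.
    """
    ticks: List[Dict[int, int]] = []
    for t in range(num_blocks + num_stages - 1):
        active: Dict[int, int] = {}
        for s in range(num_stages):
            b = t - s  # block currently sitting at stage s
            if 0 <= b < num_blocks:
                active[s] = b
        ticks.append(active)
    return ticks


def sequential_ticks(num_blocks: int, num_stages: int) -> int:
    """Ticks needed if only one stage is ever busy at a time."""
    return num_blocks * num_stages


if __name__ == "__main__":
    schedule = pipeline_schedule(num_blocks=4, num_stages=3)
    for t, active in enumerate(schedule):
        print(f"tick {t}: {active}")
    print("pipelined ticks:", len(schedule))
    print("sequential ticks:", sequential_ticks(4, 3))
```

Under these assumptions, the pipelined schedule finishes in `num_blocks + num_stages - 1` ticks rather than `num_blocks * num_stages`, because different blocks occupy different stages at the same tick. That overlap is only possible when blocks do not have to wait on one another, which is the role the abstract assigns to the block-wise denoising scheme.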
Publisher: AAAI Press
ISBN: 1-57735-906-2
978-1-57735-906-7
DOI: 10.1609/aaai.v40i12.38006
Description: The 40th AAAI Conference on Artificial Intelligence, January 20-27, 2026, Singapore
Rights: Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
The following publication Wang, Z., Zheng, B., Yang, X., Tan, Z., Xu, Y., & Wang, X. (2026). Minute-Long Videos with Dual Parallelisms. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10358–10366 is available at https://dx.doi.org/10.1609/aaai.v40i12.38006.
Appears in Collections: Conference Paper

Files in This Item:
File: Wang_Minute_Long_Videos.pdf
Size: 6.25 MB
Format: Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.