Minute-long videos with dual parallelisms

Wang, Z; Zheng, B; Yang, X; Tan, Z; Xu, Y; Wang, X

doi:10.1609/aaai.v40i12.38006

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119322

DC Field	Value	Language
dc.contributor	Department of Data Science and Artificial Intelligence	en_US
dc.creator	Wang, Z	en_US
dc.creator	Zheng, B	en_US
dc.creator	Yang, X	en_US
dc.creator	Tan, Z	en_US
dc.creator	Xu, Y	en_US
dc.creator	Wang, X	en_US
dc.date.accessioned	2026-06-15T09:02:01Z	-
dc.date.available	2026-06-15T09:02:01Z	-
dc.identifier.isbn	1-57735-906-2	en_US
dc.identifier.isbn	978-1-57735-906-7	en_US
dc.identifier.uri	http://hdl.handle.net/10397/119322	-
dc.description	The 40th AAAI Conference on Artificial Intelligence, January 20-27, 2026, Singapore	en_US
dc.language.iso	en	en_US
dc.publisher	AAAI Press	en_US
dc.rights	Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.	en_US
dc.rights	The following publication Wang, Z., Zheng, B., Yang, X., Tan, Z., Xu, Y., & Wang, X. (2026). Minute-Long Videos with Dual Parallelisms. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10358–10366 is available at https://dx.doi.org/10.1609/aaai.v40i12.38006.	en_US
dc.title	Minute-long videos with dual parallelisms	en_US
dc.type	Conference Paper	en_US
dc.identifier.spage	10358	en_US
dc.identifier.epage	10366	en_US
dc.identifier.volume	40	en_US
dc.identifier.issue	12	en_US
dc.identifier.doi	10.1609/aaai.v40i12.38006	en_US
dcterms.abstract	Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize computation by partitioning both video frames and model layers across multiple GPUs. However, a naive parallel implementation is not feasible. Because all frames need to share the same noise level, they can't be processed independently. Instead, every step must wait for all others to finish, which cancels out the speed benefits of parallel processing. We overcome this obstacle with a block-wise denoising scheme. Namely, we segment the video into sequential blocks, each with a different noise level. As a result, we process them in a pipeline across the GPUs. Each GPU, holding a subset of the model layers, processes a specific block of frames and passes the results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, each GPU uses a feature cache technique to reduce the overhead of smooth transitions by reusing only features involved in cross-frame computation from the prior block, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54x lower latency and 1.48x lower memory cost on 8xRTX 4090 GPUs.	en_US
dcterms.accessRights	open access	en_US
dcterms.bibliographicCitation	In S Koenig, C Jenkins, & ME Taylor (Eds.), Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence, p. 10358-10366. Washington, DC: Association for the Advancement of Artificial Intelligence, 2026	en_US
dcterms.issued	2026	-
dc.relation.ispartofbook	Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence	en_US
dc.relation.conference	Conference on Artificial Intelligence [AAAI]	en_US
dc.publisher.place	Washington, DC	en_US
dc.description.validate	202606 bcch	en_US
dc.description.oa	Version of Record	en_US
dc.identifier.FolderNumber	a4498	-
dc.identifier.SubFormID	52971	-
dc.description.fundingSource	Others	en_US
dc.description.fundingText	This project is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (Award Number: MOE-T2EP20122-0006), and by the Presidential Young Scholars Scheme (Project ID: IDP0058232) from The Hong Kong Polytechnic University.	en_US
dc.description.pubStatus	Published	en_US
dc.description.oaCategory	Publisher permission	en_US
Appears in Collections:	Conference Paper
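
The abstract's pipelining argument can be illustrated with a toy schedule. The sketch below is not the authors' DualParal implementation: the function names, the one-tick-per-stage cost, and the fixed set of blocks are all assumptions made for illustration, and it ignores the noise schedule, the feature cache, and communication cost. It only shows why several blocks in flight keep the devices busy, while a single block at a single noise level forces each denoising step to wait for the previous one.

```python
from typing import Dict, Tuple

Schedule = Dict[Tuple[int, int, int], int]


def pipeline_schedule(num_blocks: int, num_steps: int, num_devices: int) -> Schedule:
    """Return the tick at which each (step, block, device) stage runs.

    Toy assumptions (not from the paper): every stage costs one tick, each
    device holds one contiguous slice of the model layers, and a block's next
    denoising step may start only after its previous step left the last device.
    """
    start: Schedule = {}
    device_free = [0] * num_devices  # first tick at which each device is idle
    for step in range(num_steps):
        for block in range(num_blocks):
            # A block cannot begin step k until step k-1 has fully finished.
            ready = 0
            if step > 0:
                ready = start[(step - 1, block, num_devices - 1)] + 1
            for device in range(num_devices):
                tick = max(ready, device_free[device])
                start[(step, block, device)] = tick
                device_free[device] = tick + 1
                ready = tick + 1  # output is handed to the next device
    return start


def makespan(num_blocks: int, num_steps: int, num_devices: int) -> int:
    """Total number of ticks until the last stage of the last block finishes."""
    schedule = pipeline_schedule(num_blocks, num_steps, num_devices)
    return max(schedule.values()) + 1


# Eight blocks in flight on four devices: the pipeline stays full.
pipelined = makespan(num_blocks=8, num_steps=4, num_devices=4)    # 35 ticks
# One block only: devices take turns, so nothing overlaps.
single_block = makespan(num_blocks=1, num_steps=4, num_devices=4)  # 16 ticks
# Same eight-block workload with all layers on one device.
one_device = 8 * 4 * 4                                             # 128 ticks
```

In this toy model the pipeline stays full whenever the number of blocks in flight is at least the number of devices, since a block's previous step has then left the last device by the time its next step is scheduled on the first one.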

Files in This Item:

File	Description	Size	Format
Wang_Minute_Long_Videos.pdf		6.25 MB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record
