Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/119322
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Data Science and Artificial Intelligence | en_US |
| dc.creator | Wang, Z | en_US |
| dc.creator | Zheng, B | en_US |
| dc.creator | Yang, X | en_US |
| dc.creator | Tan, Z | en_US |
| dc.creator | Xu, Y | en_US |
| dc.creator | Wang, X | en_US |
| dc.date.accessioned | 2026-06-15T09:02:01Z | - |
| dc.date.available | 2026-06-15T09:02:01Z | - |
| dc.identifier.isbn | 1-57735-906-2 | en_US |
| dc.identifier.isbn | 978-1-57735-906-7 | en_US |
| dc.identifier.uri | http://hdl.handle.net/10397/119322 | - |
| dc.description | The 40th AAAI Conference on Artificial Intelligence, January 20-27, 2026, Singapore | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | AAAI Press | en_US |
| dc.rights | Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. | en_US |
| dc.rights | The following publication Wang, Z., Zheng, B., Yang, X., Tan, Z., Xu, Y., & Wang, X. (2026). Minute-Long Videos with Dual Parallelisms. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10358–10366 is available at https://dx.doi.org/10.1609/aaai.v40i12.38006. | en_US |
| dc.title | Minute-long videos with dual parallelisms | en_US |
| dc.type | Conference Paper | en_US |
| dc.identifier.spage | 10358 | en_US |
| dc.identifier.epage | 10366 | en_US |
| dc.identifier.volume | 40 | en_US |
| dc.identifier.issue | 12 | en_US |
| dc.identifier.doi | 10.1609/aaai.v40i12.38006 | en_US |
| dcterms.abstract | Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize computation by partitioning both video frames and model layers across multiple GPUs. However, a naive parallel implementation is not feasible. Because all frames need to share the same noise level, they can't be processed independently. Instead, every step must wait for all others to finish, which cancels out the speed benefits of parallel processing. We overcome this obstacle with a block-wise denoising scheme. Namely, we segment the video into sequential blocks, each with a different noise level. As a result, we process them in a pipeline across the GPUs. Each GPU, holding a subset of the model layers, processes a specific block of frames and passes the results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, each GPU uses a feature cache technique to reduce the overhead of smooth transitions by reusing only features involved in cross-frame computation from the prior block, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54x lower latency and 1.48x lower memory cost on 8xRTX 4090 GPUs. | en_US |
| dcterms.accessRights | open access | en_US |
| dcterms.bibliographicCitation | In S Koenig, C Jenkins, & ME Taylor (Eds.), Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence, p. 10358-10366. Washington, DC: Association for the Advancement of Artificial Intelligence, 2026 | en_US |
| dcterms.issued | 2026 | - |
| dc.relation.ispartofbook | Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence | en_US |
| dc.relation.conference | Conference on Artificial Intelligence [AAAI] | en_US |
| dc.publisher.place | Washington, DC | en_US |
| dc.description.validate | 202606 bcch | en_US |
| dc.description.oa | Version of Record | en_US |
| dc.identifier.FolderNumber | a4498 | - |
| dc.identifier.SubFormID | 52971 | - |
| dc.description.fundingSource | Others | en_US |
| dc.description.fundingText | This project is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (Award Number: MOE-T2EP20122-0006), and by the Presidential Young Scholars Scheme (Project ID: IDP0058232) from The Hong Kong Polytechnic University. | en_US |
| dc.description.pubStatus | Published | en_US |
| dc.description.oaCategory | Publisher permission | en_US |
| Appears in Collections: | Conference Paper | |
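
The abstract describes blocks of frames moving through GPUs that each hold a subset of the model layers. A minimal sketch of that generic pipelining pattern follows. It is an illustration only, not the paper's implementation: `pipeline_schedule`, `num_blocks`, and `num_stages` are hypothetical names, each block makes a single pass through the stages, and the per-block noise levels, feature cache, and coordinated noise initialization are not modeled.

```python
def pipeline_schedule(num_blocks, num_stages):
    """Return, for each tick, the (stage, block) pairs that run concurrently.

    Assumption: stage s (one GPU holding a slice of layers) handles block b
    at tick b + s, so a block reaches a stage one tick after leaving the
    previous one.
    """
    ticks = []
    for t in range(num_blocks + num_stages - 1):
        # Stage s works on block t - s, if such a block exists.
        active = [(s, t - s) for s in range(num_stages) if 0 <= t - s < num_blocks]
        ticks.append(active)
    return ticks


if __name__ == "__main__":
    for t, active in enumerate(pipeline_schedule(num_blocks=4, num_stages=2)):
        print(f"tick {t}: {active}")
```

Under these assumptions the pipeline finishes in `num_blocks + num_stages - 1` ticks, compared with `num_blocks * num_stages` stage executions run one at a time.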
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| Wang_Minute_Long_Videos.pdf | | 6.25 MB | Adobe PDF | View/Open |



