Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/116821
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Computing | - |
| dc.creator | Wang, D | - |
| dc.creator | Liu, B | - |
| dc.creator | Lu, R | - |
| dc.creator | Zhang, Z | - |
| dc.creator | Zhu, S | - |
| dc.date.accessioned | 2026-01-21T03:52:56Z | - |
| dc.date.available | 2026-01-21T03:52:56Z | - |
| dc.identifier.isbn | 979-8-4007-1125-1 | - |
| dc.identifier.uri | http://hdl.handle.net/10397/116821 | - |
| dc.description | 16th ACM International Conference on Future and Sustainable Energy Systems, Rotterdam, Netherlands, June 17-20, 2025 | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | The Association for Computing Machinery | en_US |
| dc.rights | This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0). | en_US |
| dc.rights | ©2025 Copyright held by the owner/author(s). | en_US |
| dc.rights | The following publication Wang, D., Liu, B., Lu, R., Zhang, Z., & Zhu, S. (2025). StoreLLM: Energy Efficient Large Language Model Inference with Permanently Pre-stored Attention Matrices. Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems is available at https://doi.org/10.1145/3679240.3734604. | en_US |
| dc.subject | Hierarchy Storage System | en_US |
| dc.subject | KV Cache | en_US |
| dc.subject | Large Language Model | en_US |
| dc.title | StoreLLM : energy efficient large language model inference with permanently pre-stored attention matrices | en_US |
| dc.type | Conference Paper | en_US |
| dc.identifier.spage | 398 | - |
| dc.identifier.epage | 406 | - |
| dc.identifier.doi | 10.1145/3679240.3734604 | - |
| dcterms.abstract | Energy efficiency has become an important design issue in Large Language Model (LLM) inference systems. The main energy consumption goes to computing. Existing studies reduce the amount of computing or shift computing to regions with green energy. In this paper, we study an orthogonal perspective. We observe that the attention matrices of tokens remain largely unchanged across different LLM inferences. We argue that LLM inference systems can therefore over-compute the attention matrices across different inferences. As the energy of computing is substantially greater than the energy of storage access, we propose StoreLLM, an LLM inference system in which the attention matrices of tokens are pre-stored so that computing the attention matrices in any LLM inference can be substituted by storage access. Our analysis shows that it is possible to permanently pre-store the attention matrices of all tokens in storage, and we develop mechanisms to effectively maintain LLM inference performance. Our evaluation shows that StoreLLM outperforms the state-of-the-art LLM inference system LazyLLM by 1.45× in energy consumption at the cost of a 5.05% increase in delay. With further improvements, StoreLLM-MoE and StoreLLM-PTQ achieve 2.64× and 2.83× energy reductions compared to state-of-the-art LLM systems. | - |
| dcterms.accessRights | open access | en_US |
| dcterms.bibliographicCitation | In E-ENERGY '25: Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems, p. 398-406. New York, New York: The Association for Computing Machinery, 2025 | - |
| dcterms.issued | 2025 | - |
| dc.identifier.scopus | 2-s2.0-105016376962 | - |
| dc.relation.ispartofbook | E-ENERGY '25: Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems | - |
| dc.publisher.place | New York, New York | en_US |
| dc.description.validate | 202601 bcch | - |
| dc.description.oa | Version of Record | en_US |
| dc.identifier.FolderNumber | OA_Scopus/WOS | en_US |
| dc.description.fundingSource | RGC | en_US |
| dc.description.fundingSource | Others | en_US |
| dc.description.fundingText | Dan Wang’s work is supported in part by RGC GRF 15200321, 15201322, 15230624, ITC ITF-ITS/056/22MX, ITS/052/23MX, and PolyU 1-CDKK, G-SAC8. | en_US |
| dc.description.pubStatus | Published | en_US |
| dc.description.oaCategory | CC | en_US |
| Appears in Collections: | Conference Paper | |
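The abstract above describes substituting recomputation of per-token attention (key/value) matrices with storage access. The following is a minimal illustrative sketch of that general idea, not the authors' implementation; all names (embedding, W_K, W_V, kv_store, attention_with_prestored_kv) and dimensions are assumptions chosen for the example.

```python
# Minimal sketch (illustrative only): per-token key/value projections of a
# frozen model can be computed once in an offline pass, stored, and then
# looked up at inference time instead of being recomputed for every request.
import numpy as np

D_MODEL, D_HEAD, VOCAB = 64, 16, 1000
rng = np.random.default_rng(0)

# Frozen model parameters (stand-ins for a real LLM's weights).
embedding = rng.standard_normal((VOCAB, D_MODEL))
W_K = rng.standard_normal((D_MODEL, D_HEAD))
W_V = rng.standard_normal((D_MODEL, D_HEAD))

# One-time offline pass: pre-store the K/V rows of every token in the vocabulary.
kv_store = {
    tok: (embedding[tok] @ W_K, embedding[tok] @ W_V)
    for tok in range(VOCAB)
}

def attention_with_prestored_kv(prompt_tokens, query):
    """Single-head attention for `query` over `prompt_tokens`, fetching
    pre-stored K/V rows instead of recomputing embedding @ W_K / W_V."""
    K = np.stack([kv_store[t][0] for t in prompt_tokens])  # storage access
    V = np.stack([kv_store[t][1] for t in prompt_tokens])  # storage access
    scores = (K @ query) / np.sqrt(D_HEAD)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax over prompt tokens
    return weights @ V

# Example: attend over a short prompt with a random query vector.
out = attention_with_prestored_kv([3, 17, 42], rng.standard_normal(D_HEAD))
print(out.shape)  # (16,)
```

In a real system the pre-stored matrices would presumably reside in a storage hierarchy (cf. the "Hierarchy Storage System" subject above) rather than an in-memory dictionary, so each lookup is a flash or disk read whose energy cost is traded against the avoided computation.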
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| 3679240.3734604.pdf | | 4.71 MB | Adobe PDF | View/Open |