Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/116821
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Computing | - |
| dc.creator | Wang, D | - |
| dc.creator | Liu, B | - |
| dc.creator | Lu, R | - |
| dc.creator | Zhang, Z | - |
| dc.creator | Zhu, S | - |
| dc.date.accessioned | 2026-01-21T03:52:56Z | - |
| dc.date.available | 2026-01-21T03:52:56Z | - |
| dc.identifier.isbn | 979-8-4007-1125-1 | - |
| dc.identifier.uri | http://hdl.handle.net/10397/116821 | - |
| dc.description | 16th ACM International Conference on Future and Sustainable Energy Systems, Rotterdam, Netherlands, June 17-20, 2025 | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | The Association for Computing Machinery | en_US |
| dc.rights | This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0). | en_US |
| dc.rights | ©2025 Copyright held by the owner/author(s). | en_US |
| dc.rights | The following publication Wang, D., Liu, B., Lu, R., Zhang, Z., & Zhu, S. (2025). StoreLLM: Energy Efficient Large Language Model Inference with Permanently Pre-stored Attention Matrices. Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems is available at https://doi.org/10.1145/3679240.3734604. | en_US |
| dc.subject | Hierarchy Storage System | en_US |
| dc.subject | KV Cache | en_US |
| dc.subject | Large Language Model | en_US |
| dc.title | StoreLLM : energy efficient large language model inference with permanently pre-stored attention matrices | en_US |
| dc.type | Conference Paper | en_US |
| dc.identifier.spage | 398 | - |
| dc.identifier.epage | 406 | - |
| dc.identifier.doi | 10.1145/3679240.3734604 | - |
| dcterms.abstract | Energy efficiency has become an important design issue in Large Language Model (LLM) inference systems. The main energy consumption goes to computing. Existing studies reduce the amount of computing or shift computing to regions with green energy. In this paper, we study an orthogonal perspective. We observe that the attention matrices of tokens remain largely unchanged across different LLM inferences. We argue that LLM inference systems can therefore over-compute the attention matrices across different inferences. As the energy of computing is substantially greater than the energy of storage access, we propose StoreLLM, an LLM inference system in which the attention matrices of tokens are pre-stored so that computing the attention matrices in any LLM inference can be substituted by storage access. Our analysis shows that it is possible to permanently pre-store the attention matrices of all tokens in storage, and we develop mechanisms to effectively maintain LLM inference performance. Our evaluation shows that StoreLLM outperforms the state-of-the-art LLM inference system LazyLLM by 1.45× in energy consumption at the cost of a 5.05% increase in delay. With further improvements, StoreLLM-MoE and StoreLLM-PTQ achieve 2.64× and 2.83× energy reductions compared to state-of-the-art LLM systems. | - |
| dcterms.accessRights | open access | en_US |
| dcterms.bibliographicCitation | In E-ENERGY '25: Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems, p. 398-406. New York, New York: The Association for Computing Machinery, 2025 | - |
| dcterms.issued | 2025 | - |
| dc.identifier.scopus | 2-s2.0-105016376962 | - |
| dc.relation.ispartofbook | E-ENERGY '25: Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems | - |
| dc.publisher.place | New York, New York | en_US |
| dc.description.validate | 202601 bcch | - |
| dc.description.oa | Version of Record | en_US |
| dc.identifier.FolderNumber | OA_Scopus/WOS | en_US |
| dc.description.fundingSource | RGC | en_US |
| dc.description.fundingSource | Others | en_US |
| dc.description.fundingText | Dan Wang’s work is supported in part by RGC GRF 15200321, 15201322, 15230624, ITC ITF-ITS/056/22MX, ITS/052/23MX, and PolyU 1-CDKK, G-SAC8. | en_US |
| dc.description.pubStatus | Published | en_US |
| dc.description.oaCategory | CC | en_US |
| Appears in Collections: | Conference Paper | |
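The abstract above describes substituting recomputation of per-token attention (key/value) matrices with storage access. The following is a minimal illustrative sketch of that general idea, not the authors' implementation; all names (embedding, W_K, W_V, kv_store, attention_with_prestored_kv) and dimensions are assumptions chosen for the example.

```python
# Minimal sketch (illustrative only): per-token key/value projections of a
# frozen model can be computed once in an offline pass, stored, and then
# looked up at inference time instead of being recomputed for every request.
import numpy as np

D_MODEL, D_HEAD, VOCAB = 64, 16, 1000
rng = np.random.default_rng(0)

# Frozen model parameters (stand-ins for a real LLM's weights).
embedding = rng.standard_normal((VOCAB, D_MODEL))
W_K = rng.standard_normal((D_MODEL, D_HEAD))
W_V = rng.standard_normal((D_MODEL, D_HEAD))

# One-time offline pass: pre-store the K/V rows of every token in the vocabulary.
kv_store = {
    tok: (embedding[tok] @ W_K, embedding[tok] @ W_V)
    for tok in range(VOCAB)
}

def attention_with_prestored_kv(prompt_tokens, query):
    """Single-head attention for `query` over `prompt_tokens`, fetching
    pre-stored K/V rows instead of recomputing embedding @ W_K / W_V."""
    K = np.stack([kv_store[t][0] for t in prompt_tokens])  # storage access
    V = np.stack([kv_store[t][1] for t in prompt_tokens])  # storage access
    scores = (K @ query) / np.sqrt(D_HEAD)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax over prompt tokens
    return weights @ V

# Example: attend over a short prompt with a random query vector.
out = attention_with_prestored_kv([3, 17, 42], rng.standard_normal(D_HEAD))
print(out.shape)  # (16,)
```

In a real system the pre-stored matrices would presumably reside in a storage hierarchy (cf. the "Hierarchy Storage System" subject above) rather than an in-memory dictionary, so each lookup is a flash or disk read whose energy cost is traded against the avoided computation.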
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| 3679240.3734604.pdf | | 4.71 MB | Adobe PDF | View/Open |