Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/116821
Title: StoreLLM: Energy efficient large language model inference with permanently pre-stored attention matrices
Authors: Wang, D 
Liu, B 
Lu, R 
Zhang, Z 
Zhu, S 
Issue Date: 2025
Source: In E-ENERGY '25: Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems, p. 398-406. New York, NY: The Association for Computing Machinery, 2025
Abstract: Energy efficiency has become an important design issue in Large Language Model (LLM) inference systems. The main energy consumption goes to computing, and existing studies either reduce the amount of computing or move it to regions with green energy. In this paper, we study an orthogonal perspective. We observe that the attention matrices of tokens remain largely unchanged across different LLM inference requests, and we argue that attention matrices are therefore repeatedly over-computed across requests in LLM inference systems. As the energy of computing is substantially greater than the energy of storage access, we propose StoreLLM, an LLM inference system in which the attention matrices of tokens are pre-stored so that computing them in any LLM inference can be substituted by storage access. Our analysis shows that it is possible to permanently pre-store the attention matrices of all tokens in storage, and we develop mechanisms to effectively maintain LLM inference performance. Our evaluation shows that StoreLLM outperforms the state-of-the-art LLM inference system LazyLLM by 1.45× in energy consumption, at the cost of a 5.05% increase in delay. With further improvements, StoreLLM-MoE and StoreLLM-PTQ achieve 2.64× and 2.83× energy reductions compared with state-of-the-art LLM systems.
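
The abstract describes substituting attention-matrix computation with storage access. As a rough illustration only, and not the authors' implementation, the following minimal Python sketch assumes the pre-storable quantities are context-independent per-token key/value projections; the names TokenProjectionStore, precompute, and lookup_or_compute are hypothetical.

# Minimal sketch of the pre-store-then-look-up idea from the abstract.
# Assumption: per-token projections can be computed once and persisted,
# so inference reads storage instead of redoing matrix multiplications.
import numpy as np

D_MODEL, D_HEAD = 8, 8                           # toy dimensions
rng = np.random.default_rng(0)
W_K = rng.standard_normal((D_MODEL, D_HEAD))     # key projection weights
W_V = rng.standard_normal((D_MODEL, D_HEAD))     # value projection weights

class TokenProjectionStore:
    """Keeps pre-computed K/V projections per token id."""
    def __init__(self):
        self._store = {}                         # token_id -> (k_vec, v_vec)

    def precompute(self, token_id, embedding):
        # One-time cost: compute and persist the projections for this token.
        self._store[token_id] = (embedding @ W_K, embedding @ W_V)

    def lookup_or_compute(self, token_id, embedding):
        # Inference-time path: a storage access replaces the compute
        # whenever this token's projections were pre-stored.
        if token_id in self._store:
            return self._store[token_id]
        return embedding @ W_K, embedding @ W_V  # fallback: compute as usual

# Usage: pre-store a toy vocabulary once, then serve lookups at inference time.
store = TokenProjectionStore()
vocab_embeddings = {tid: rng.standard_normal(D_MODEL) for tid in range(16)}
for tid, emb in vocab_embeddings.items():
    store.precompute(tid, emb)
k_vec, v_vec = store.lookup_or_compute(3, vocab_embeddings[3])

The sketch only shows the substitution of compute by a lookup; the paper's actual mechanisms for maintaining inference performance, and its MoE and PTQ variants, are beyond this toy example.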
Keywords: Hierarchy Storage System
KV Cache
Large Language Model
Publisher: The Association for Computing Machinery
ISBN: 979-8-4007-1125-1
DOI: 10.1145/3679240.3734604
Description: 16th ACM International Conference on Future and Sustainable Energy Systems, Rotterdam, Netherlands, June 17-20, 2025
Rights: This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0).
©2025 Copyright held by the owner/author(s).
The following publication Wang, D., Liu, B., Lu, R., Zhang, Z., & Zhu, S. (2025). StoreLLM: Energy Efficient Large Language Model Inference with Permanently Pre-stored Attention Matrices. In Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems is available at https://doi.org/10.1145/3679240.3734604.
Appears in Collections:Conference Paper

Files in This Item:
File: 3679240.3734604.pdf
Size: 4.71 MB
Format: Adobe PDF
Open Access Information
Status: Open access
File Version: Version of Record
