Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/116821
DC Field | Value | Language
dc.contributor | Department of Computing | -
dc.creator | Wang, D | -
dc.creator | Liu, B | -
dc.creator | Lu, R | -
dc.creator | Zhang, Z | -
dc.creator | Zhu, S | -
dc.date.accessioned | 2026-01-21T03:52:56Z | -
dc.date.available | 2026-01-21T03:52:56Z | -
dc.identifier.isbn | 979-8-4007-1125-1 | -
dc.identifier.uri | http://hdl.handle.net/10397/116821 | -
dc.description | 16th ACM International Conference on Future and Sustainable Energy Systems, Rotterdam, Netherlands, June 17-20, 2025 | en_US
dc.language.iso | en | en_US
dc.publisher | The Association for Computing Machinery | en_US
dc.rights | This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0). | en_US
dc.rights | © 2025 Copyright held by the owner/author(s). | en_US
dc.rights | The following publication Wang, D., Liu, B., Lu, R., Zhang, Z., & Zhu, S. (2025). StoreLLM: Energy Efficient Large Language Model Inference with Permanently Pre-stored Attention Matrices. Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems is available at https://doi.org/10.1145/3679240.3734604. | en_US
dc.subject | Hierarchy Storage System | en_US
dc.subject | KV Cache | en_US
dc.subject | Large Language Model | en_US
dc.title | StoreLLM: energy efficient large language model inference with permanently pre-stored attention matrices | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 398 | -
dc.identifier.epage | 406 | -
dc.identifier.doi | 10.1145/3679240.3734604 | -
dcterms.abstract | Energy efficiency has become an important design issue in Large Language Model (LLM) inference systems. The main energy consumption goes to computing. Existing studies either reduce the amount of computing or shift computing to regions with green energy. In this paper, we study an orthogonal perspective. We observe that the attention matrices of tokens remain largely unchanged across different LLM inference requests. We argue that LLM inference systems therefore over-compute the attention matrices across requests. As the energy of computing is substantially greater than the energy of storage access, we propose StoreLLM, an LLM inference system in which the attention matrices of tokens are pre-stored so that computing the attention matrices in any LLM inference can be substituted by storage access. Our analysis shows that it is possible to permanently pre-store the attention matrices of all tokens in storage, and we develop mechanisms to effectively maintain LLM inference performance. Our evaluation shows that StoreLLM outperforms the state-of-the-art LLM inference system LazyLLM by 1.45× in energy consumption, at the cost of a 5.05% increase in delay. With further improvements, StoreLLM-MoE and StoreLLM-PTQ achieve 2.64× and 2.83× energy reductions compared to state-of-the-art LLM systems. | -
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | In E-ENERGY '25: Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems, pp. 398-406. New York, New York: The Association for Computing Machinery, 2025 | -
dcterms.issued | 2025 | -
dc.identifier.scopus | 2-s2.0-105016376962 | -
dc.relation.ispartofbook | E-ENERGY '25: Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems | -
dc.publisher.place | New York, New York | en_US
dc.description.validate | 202601 bcch | -
dc.description.oa | Version of Record | en_US
dc.identifier.FolderNumber | OA_Scopus/WOS | en_US
dc.description.fundingSource | RGC | en_US
dc.description.fundingSource | Others | en_US
dc.description.fundingText | Dan Wang’s work is supported in part by RGC GRF 15200321, 15201322, 15230624, ITC ITF-ITS/056/22MX, ITS/052/23MX, and PolyU 1-CDKK, G-SAC8. | en_US
dc.description.pubStatus | Published | en_US
dc.description.oaCategory | CC | en_US
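
The abstract above describes the paper's core idea: replacing per-request computation of token attention (key/value) matrices with storage accesses to permanently pre-stored matrices. The following minimal Python sketch illustrates only that lookup-or-compute pattern; the class name PreStoredKV, the per-token .npz file layout, and the projection step are illustrative assumptions and are not taken from the paper.

# Illustrative sketch only: names, shapes, and file layout are assumptions,
# not the paper's implementation. It shows the lookup-or-compute pattern the
# abstract describes: serve pre-stored per-token attention (K/V) matrices from
# storage and fall back to computing them only on a miss.
import os
import numpy as np

class PreStoredKV:
    def __init__(self, store_dir, w_k, w_v):
        self.store_dir = store_dir          # directory of pre-stored matrices
        self.w_k, self.w_v = w_k, w_v       # key/value projection weights
        os.makedirs(store_dir, exist_ok=True)

    def _path(self, token_id):
        return os.path.join(self.store_dir, f"{token_id}.npz")

    def get_kv(self, token_id, embedding):
        """Return (K, V) for a token: storage access if pre-stored, compute otherwise."""
        path = self._path(token_id)
        if os.path.exists(path):            # storage access replaces compute
            data = np.load(path)
            return data["k"], data["v"]
        k = embedding @ self.w_k            # fallback: compute the projections
        v = embedding @ self.w_v
        np.savez(path, k=k, v=v)            # persist so later requests hit storage
        return k, v

# Example usage with toy dimensions.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_head = 8, 4
    cache = PreStoredKV("kv_store",
                        rng.normal(size=(d_model, d_head)),
                        rng.normal(size=(d_model, d_head)))
    k, v = cache.get_kv(token_id=42, embedding=rng.normal(size=d_model))
    print(k.shape, v.shape)                 # (4,) (4,)
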
Appears in Collections: Conference Paper
Files in This Item:
File | Description | Size | Format
3679240.3734604.pdf |  | 4.71 MB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.