Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/114572
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Computing | en_US |
| dc.creator | Jiang, Z | en_US |
| dc.creator | Wu, P | en_US |
| dc.creator | Liang, Z | en_US |
| dc.creator | Chen, PQ | en_US |
| dc.creator | Yuan, X | en_US |
| dc.creator | Jia, Y | en_US |
| dc.creator | Tu, J | en_US |
| dc.creator | Li, C | en_US |
| dc.creator | Ng, PHF | en_US |
| dc.creator | Li, Q | en_US |
| dc.date.accessioned | 2025-08-11T06:20:00Z | - |
| dc.date.available | 2025-08-11T06:20:00Z | - |
| dc.identifier.isbn | 979-8-4007-1454-2 | en_US |
| dc.identifier.uri | http://hdl.handle.net/10397/114572 | - |
| dc.language.iso | en | en_US |
| dc.publisher | Association for Computing Machinery | en_US |
| dc.rights | This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0). | en_US |
| dc.rights | © 2025 Copyright held by the owner/author(s). | en_US |
| dc.rights | The following publication Jiang, Z., Wu, P., Liang, Z., Chen, P. Q., Yuan, X., Jia, Y., Tu, J., Li, C., Ng, P. H. F., & Li, Q. (2025). HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, Toronto, ON, Canada, pp. 5505-5515, is available at https://doi.org/10.1145/3711896.3737378. | en_US |
| dc.subject | Benchmark | en_US |
| dc.subject | Hierarchical reasoning | en_US |
| dc.subject | Large language models | en_US |
| dc.subject | Natural language processing | en_US |
| dc.title | HiBench: benchmarking LLMs capability on hierarchical structure reasoning | en_US |
| dc.type | Conference Paper | en_US |
| dc.identifier.spage | 5505 | en_US |
| dc.identifier.epage | 5515 | en_US |
| dc.identifier.doi | 10.1145/3711896.3737378 | en_US |
| dcterms.abstract | Structure reasoning is a fundamental capability of large language models (LLMs), enabling them to reason about structured commonsense and answer multi-hop questions. However, existing benchmarks for structure reasoning mainly focus on horizontal and coordinate structures (e.g., graphs), overlooking the hierarchical relationships within them. Hierarchical structure reasoning is crucial for human cognition, particularly in memory organization and problem-solving. It also plays a key role in various real-world tasks, such as information extraction and decision-making. To address this gap, we propose HiBench, the first framework designed to systematically benchmark the hierarchical reasoning capabilities of LLMs, from initial structure generation to final proficiency assessment. It encompasses six representative scenarios, covering both fundamental and practical aspects, and consists of 30 tasks with varying hierarchical complexity, totaling 39,519 queries. To evaluate LLMs comprehensively, we develop five capability dimensions that depict different facets of hierarchical structure understanding. Through extensive evaluation of 20 LLMs from 10 model families, we reveal key insights into their capabilities and limitations: 1) existing LLMs show proficiency in basic hierarchical reasoning tasks; 2) they still struggle with more complex structures and implicit hierarchical representations, especially in structural modification and textual reasoning. Based on these findings, we create a small yet well-designed instruction dataset, which enhances LLMs' performance on HiBench by an average of 88.84% (Llama-3.1-8B) and 31.38% (Qwen2.5-7B) across all tasks. The HiBench dataset and toolkit are available at https://github.com/jzzzzh/HiBench to encourage evaluation. | en_US |
| dcterms.accessRights | open access | en_US |
| dcterms.bibliographicCitation | KDD '25: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pp. 5505-5515 | en_US |
| dcterms.issued | 2025 | - |
| dc.relation.ispartofbook | KDD '25: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 | en_US |
| dc.description.validate | 202508 bcch | en_US |
| dc.description.oa | Version of Record | en_US |
| dc.identifier.FolderNumber | a3975 | - |
| dc.identifier.SubFormID | 51855 | - |
| dc.description.fundingSource | RGC | en_US |
| dc.description.fundingSource | Others | en_US |
| dc.description.fundingText | The research described in this paper has been partly supported by General Research Funds from the Hong Kong Research Grants Council (project nos. PolyU 15207322, 15200023, 15206024, and 15224524) and by internal research funds from The Hong Kong Polytechnic University (project nos. P0042693, P0048625, P0051361, P0052406, and P0052986). | en_US |
| dc.description.pubStatus | Published | en_US |
| dc.description.oaCategory | CC | en_US |
Appears in Collections: Conference Paper
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| 3711896.3737378.pdf | | 1.69 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.