EAP-GP : mitigating saturation effect in gradient-based automated circuit identification

Zhang, L; Dong, W; Zhang, Z; Yang, S; Hu, L; Liu, N; Zhou, P; Wang, D

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119597

DC Field	Value	Language
dc.contributor	Department of Computing	en_US
dc.creator	Zhang, L	en_US
dc.creator	Dong, W	en_US
dc.creator	Zhang, Z	en_US
dc.creator	Yang, S	en_US
dc.creator	Hu, L	en_US
dc.creator	Liu, N	en_US
dc.creator	Zhou, P	en_US
dc.creator	Wang, D	en_US
dc.date.accessioned	2026-07-03T03:12:50Z	-
dc.date.available	2026-07-03T03:12:50Z	-
dc.identifier.uri	http://hdl.handle.net/10397/119597	-
dc.description	The Thirty-ninth Annual Conference on Neural Information Processing Systems, NeurIPS 2025, San Diego, USA, Dec 01 2025	en_US
dc.language.iso	en	en_US
dc.rights	Posted with permission of the author.	en_US
dc.title	EAP-GP : mitigating saturation effect in gradient-based automated circuit identification	en_US
dc.type	Other Conference Contributions	en_US
dcterms.abstract	Understanding the internal mechanisms of transformer-based language models remains challenging. Mechanistic interpretability based on circuit discovery aims to reverse engineer neural networks by analyzing their internal processes at the level of computational subgraphs. In this paper, we revisit existing gradient-based circuit identification methods and find that their performance is affected by either the zero-gradient problem or saturation effects, where edge attribution scores become insensitive to input changes, resulting in noisy and unreliable attribution evaluations for circuit components. To address the saturation effect, we propose Edge Attribution Patching with GradPath (EAP-GP). EAP-GP introduces an integration path that starts from the input and adaptively follows the direction of the difference between the gradients of corrupted and clean inputs to avoid the saturated region. This approach enhances attribution reliability and improves the faithfulness of circuit identification. We evaluate EAP-GP on 6 datasets using GPT-2 Small, GPT-2 Medium, and GPT-2 XL. Experimental results demonstrate that EAP-GP outperforms existing methods in circuit faithfulness, achieving improvements of up to 17.7%. Comparisons with manually annotated ground-truth circuits demonstrate that EAP-GP achieves precision and recall comparable to or better than previous approaches, highlighting its effectiveness in identifying accurate circuits.	en_US
dcterms.accessRights	open access	en_US
dcterms.bibliographicCitation	In D Belgrave, C Zhang, H Lin, R Pascanu, P Koniusz, M Ghassemi & N Chen (Eds.), Advances in Neural Information Processing Systems 38. NeurIPS 2025 [Poster Presentation]. https://neurips.cc/virtual/2025/loc/san-diego/poster/116294	en_US
dcterms.issued	2025	-
dc.relation.conference	Neural Information Processing Systems [NeurIPS]	en_US
dc.description.validate	202606 bcch	en_US
dc.description.oa	Version of Record	en_US
dc.identifier.FolderNumber	a4574c	-
dc.identifier.SubFormID	53232	-
dc.description.fundingSource	Others	en_US
dc.description.fundingText	Di Wang is supported in part by the funding BAS/1/1689-01-01 and funding from KAUST - Center of Excellence for Generative AI, under award number 5940.	en_US
dc.description.pubStatus	Published	en_US
dc.description.oaCategory	Copyright retained by author	en_US
Appears in Collections:	Conference Paper

Files in This Item:

File	Description	Size	Format
EAP_GP_Mitigating_Saturat.pdf		585.6 kB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record
