Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119597
DC Field | Value | Language
dc.contributor | Department of Computing | en_US
dc.creator | Zhang, L | en_US
dc.creator | Dong, W | en_US
dc.creator | Zhang, Z | en_US
dc.creator | Yang, S | en_US
dc.creator | Hu, L | en_US
dc.creator | Liu, N | en_US
dc.creator | Zhou, P | en_US
dc.creator | Wang, D | en_US
dc.date.accessioned | 2026-07-03T03:12:50Z | -
dc.date.available | 2026-07-03T03:12:50Z | -
dc.identifier.uri | http://hdl.handle.net/10397/119597 | -
dc.description | The Thirty-ninth Annual Conference on Neural Information Processing Systems, NeurIPS 2025, San Diego, USA, Dec 01 2025 | en_US
dc.language.iso | en | en_US
dc.rights | Posted with permission of the author. | en_US
dc.title | EAP-GP : mitigating saturation effect in gradient-based automated circuit identification | en_US
dc.type | Other Conference Contributions | en_US
dcterms.abstract | Understanding the internal mechanisms of transformer-based language models remains challenging. Mechanistic interpretability based on circuit discovery aims to reverse engineer neural networks by analyzing their internal processes at the level of computational subgraphs. In this paper, we revisit existing gradient-based circuit identification methods and find that their performance is affected by either the zero-gradient problem or saturation effects, where edge attribution scores become insensitive to input changes, resulting in noisy and unreliable attribution evaluations for circuit components. To address the saturation effect, we propose Edge Attribution Patching with GradPath (EAP-GP). EAP-GP introduces an integration path that starts from the input and adaptively follows the direction of the difference between the gradients of corrupted and clean inputs to avoid the saturated region. This approach enhances attribution reliability and improves the faithfulness of circuit identification. We evaluate EAP-GP on 6 datasets using GPT-2 Small, GPT-2 Medium, and GPT-2 XL. Experimental results demonstrate that EAP-GP outperforms existing methods in circuit faithfulness, achieving improvements of up to 17.7%. Comparisons with manually annotated ground-truth circuits demonstrate that EAP-GP achieves precision and recall comparable to or better than previous approaches, highlighting its effectiveness in identifying accurate circuits. | en_US
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | In D Belgrave, C Zhang, H Lin, R Pascanu, P Koniusz, M Ghassemi & N Chen (Eds.), Advances in Neural Information Processing Systems 38. NeurIPS 2025 [Poster Presentation]. https://neurips.cc/virtual/2025/loc/san-diego/poster/116294 | en_US
dcterms.issued | 2025 | -
dc.relation.conference | Neural Information Processing Systems [NeurIPS] | en_US
dc.description.validate | 202606 bcch | en_US
dc.description.oa | Version of Record | en_US
dc.identifier.FolderNumber | a4574c | -
dc.identifier.SubFormID | 53232 | -
dc.description.fundingSource | Others | en_US
dc.description.fundingText | Di Wang is supported in part by the funding BAS/1/1689-01-01 and funding from KAUST- Center of Excellence for Generative AI, under award number 5940. | en_US
dc.description.pubStatus | Published | en_US
dc.description.oaCategory | Copyright retained by author | en_US
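
Note on the abstract: the saturation effect it mentions can be illustrated with a toy sketch. The code below is not the authors' EAP-GP or GradPath procedure, which this record does not specify. It assumes a hypothetical one-dimensional "model" (a sigmoid) and compares a single-gradient attribution taken at a saturated clean input with a plain straight-line path integration of gradients (the standard integrated-gradients idea). All function names and the inputs `clean = 10.0` and `corrupted = 0.0` are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def f(x: float) -> float:
    """Toy model output; it saturates (flattens) for large |x|."""
    return sigmoid(x)


def grad_f(x: float) -> float:
    """Analytic derivative of the sigmoid."""
    s = sigmoid(x)
    return s * (1.0 - s)


def single_gradient_attribution(clean: float, corrupted: float) -> float:
    """First-order estimate using only the gradient at the clean input."""
    return (clean - corrupted) * grad_f(clean)


def path_integrated_attribution(clean: float, corrupted: float, steps: int = 50) -> float:
    """Midpoint Riemann sum of gradients along the straight line
    from the corrupted input to the clean input."""
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        x = corrupted + alpha * (clean - corrupted)
        total += grad_f(x)
    return (clean - corrupted) * total / steps


clean, corrupted = 10.0, 0.0  # assumed toy inputs; the clean input lies in the saturated region
true_change = f(clean) - f(corrupted)
single = single_gradient_attribution(clean, corrupted)
integrated = path_integrated_attribution(clean, corrupted)

print(f"true output change:        {true_change:.4f}")
print(f"single-gradient estimate:  {single:.4f}")
print(f"path-integrated estimate:  {integrated:.4f}")
```

In this toy setting the gradient at the clean input is close to zero, so the single-gradient score badly underestimates the real output change, while averaging gradients along a path recovers it. Per the abstract, EAP-GP differs in that its path is chosen adaptively from gradient differences rather than fixed as a straight line.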
Appears in Collections: Conference Paper
Files in This Item:
File | Description | Size | Format
EAP_GP_Mitigating_Saturat.pdf | | 585.6 kB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.