Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/115657
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Electrical and Electronic Engineering | en_US |
| dc.creator | Meng, S | en_US |
| dc.creator | Wang, Y | en_US |
| dc.creator | Cui, Y | en_US |
| dc.creator | Chau, LP | en_US |
| dc.date.accessioned | 2025-10-16T01:53:59Z | - |
| dc.date.available | 2025-10-16T01:53:59Z | - |
| dc.identifier.issn | 0950-7051 | en_US |
| dc.identifier.uri | http://hdl.handle.net/10397/115657 | - |
| dc.language.iso | en | en_US |
| dc.publisher | Elsevier | en_US |
| dc.subject | Behavior decision | en_US |
| dc.subject | Multi-task | en_US |
| dc.subject | Segment-anything model | en_US |
| dc.title | Foundation model-assisted interpretable vehicle behavior decision making | en_US |
| dc.type | Journal/Magazine Article | en_US |
| dc.identifier.volume | 324 | en_US |
| dc.identifier.doi | 10.1016/j.knosys.2025.113868 | en_US |
| dcterms.abstract | Intelligent autonomous driving systems must achieve accurate perception and driving decisions to improve their effectiveness and adoption. Driving behavior decision making currently achieves high performance thanks to deep learning, but most existing approaches lack interpretability, reducing user trust and hindering widespread adoption. While some efforts pursue transparency through strategies such as heat maps, cost volumes, and auxiliary tasks, they often provide limited model interpretation or require additional annotations. In this paper, we present a novel unified framework that tackles these issues by integrating ego-vehicle behavior decisions with human-centric, language-based interpretation prediction from ego-view visual input. First, we propose a self-supervised, class-agnostic object segmentor module based on the Segment Anything Model and a lightweight 2-D adapter strategy to capture overall surrounding cues without any extra segmentation mask labels. Second, a semantic extractor is adopted to generate hierarchical semantic-level cues. A fusion module then produces refined global features by combining the class-agnostic object features and semantic-level features through a self-attention mechanism. Finally, vehicle behavior decisions and possible human-centric interpretations are generated jointly from the global fusion context. Experimental results across various settings on public datasets demonstrate the superiority and effectiveness of the proposed solution. | en_US |
| dcterms.accessRights | embargoed access | en_US |
| dcterms.bibliographicCitation | Knowledge-based systems, 3 Aug. 2025, v. 324, 113868 | en_US |
| dcterms.isPartOf | Knowledge-based systems | en_US |
| dcterms.issued | 2025-08-03 | - |
| dc.identifier.scopus | 2-s2.0-105008112195 | - |
| dc.identifier.artn | 113868 | en_US |
| dc.description.validate | 202510 bcel | en_US |
| dc.description.oa | Not applicable | en_US |
| dc.identifier.SubFormID | G000232/2025-07 | - |
| dc.description.fundingSource | RGC | en_US |
| dc.description.fundingSource | Others | en_US |
| dc.description.fundingText | The research work was conducted in the JC STEM Lab of Machine Learning and Computer Vision funded by The Hong Kong Jockey Club Charities Trust. And it was partially supported by the Research Grants Council of the Hong Kong SAR, China (Project No. PolyU 15215824). | en_US |
| dc.description.pubStatus | Published | en_US |
| dc.date.embargo | 2027-08-03 | en_US |
| dc.description.oaCategory | Green (AAM) | en_US |
| Appears in Collections: | Journal/Magazine Article | |
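
The fusion stage described in the abstract (self-attention over class-agnostic object features and semantic-level features, followed by joint generation of a behavior decision and a human-centric interpretation) can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the module name `FusionDecisionHead`, the token and embedding dimensions, the output sizes, and the simplification of language-based interpretation to explanation-template classification are all assumptions made for illustration.

```python
# Hypothetical sketch of the fusion-and-joint-prediction stage, assuming PyTorch.
# Names, dimensions, and head designs are illustrative, not from the paper.
import torch
import torch.nn as nn


class FusionDecisionHead(nn.Module):
    """Fuses class-agnostic object tokens with semantic-level tokens via
    self-attention, then jointly predicts a behavior decision and a
    human-centric interpretation (simplified here to template classification)."""

    def __init__(self, dim=256, num_heads=8, num_actions=4, num_explanations=21):
        super().__init__()
        # Self-attention over the concatenated token sequence, mirroring the
        # abstract's fusion of the two feature streams with self-attention.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.action_head = nn.Linear(dim, num_actions)        # behavior decision logits
        self.explain_head = nn.Linear(dim, num_explanations)  # explanation-template logits

    def forward(self, object_tokens, semantic_tokens):
        # object_tokens:   (B, N_obj, dim), e.g. from a SAM-based segmentor + adapter
        # semantic_tokens: (B, N_sem, dim), e.g. from a semantic extractor
        tokens = torch.cat([object_tokens, semantic_tokens], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(fused + tokens)   # residual connection around attention
        global_ctx = fused.mean(dim=1)      # pooled "global fusion context"
        return self.action_head(global_ctx), self.explain_head(global_ctx)


if __name__ == "__main__":
    head = FusionDecisionHead()
    obj = torch.randn(2, 16, 256)   # dummy class-agnostic object features
    sem = torch.randn(2, 49, 256)   # dummy hierarchical semantic features
    actions, explanations = head(obj, sem)
    print(actions.shape, explanations.shape)  # torch.Size([2, 4]) torch.Size([2, 21])
```

Mean pooling is only one plausible way to obtain the global context; the record does not describe the paper's actual pooling, head designs, or losses, so those details should be taken from the published article.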
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.