Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/109379
DC Field | Value | Language
dc.contributor | Department of Biomedical Engineering | en_US
dc.contributor | Research Institute for Smart Ageing | en_US
dc.creator | Li, Q | en_US
dc.creator | Yan, X | en_US
dc.creator | Xu, J | en_US
dc.creator | Yuan, R | en_US
dc.creator | Zhang, Y | en_US
dc.creator | Feng, R | en_US
dc.creator | Shen, Q | en_US
dc.creator | Zhang, X | en_US
dc.creator | Wang, S | en_US
dc.date.accessioned | 2024-10-07T08:32:30Z | -
dc.date.available | 2024-10-07T08:32:30Z | -
dc.identifier.uri | http://hdl.handle.net/10397/109379 | -
dc.language.iso | en | en_US
dc.publisher | Springer | en_US
dc.subject | Anatomical structure | en_US
dc.subject | Contrastive learning | en_US
dc.subject | Medical vision-language | en_US
dc.subject | Pre-training | en_US
dc.subject | Representation learning | en_US
dc.title | Anatomical structure-guided medical vision-language pre-training | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 80 | en_US
dc.identifier.epage | 90 | en_US
dc.identifier.doi | 10.1007/978-3-031-72120-5_8 | en_US
dcterms.abstract | Learning medical visual representations through vision-language pre-training has made remarkable progress. Despite promising performance, it still faces two challenges: local alignment lacks interpretability and clinical relevance, and the internal and external representation learning of image-report pairs is insufficient. To address these issues, we propose an Anatomical Structure-Guided (ASG) framework. Specifically, we parse raw reports into triplets <anatomical region, finding, existence> and fully utilize each element as supervision to enhance representation learning. For anatomical regions, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists, treating the regions as minimum semantic units to explore fine-grained local alignment. For findings and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample, and constructing soft labels for contrastive learning to improve the semantic association of different image-report pairs. We evaluate the proposed ASG framework on two downstream tasks across five public benchmarks. Experimental results demonstrate that our method outperforms state-of-the-art methods. | en_US
dcterms.accessRights | embargoed access | en_US
dcterms.bibliographicCitation | In MG Linguraru, Q Dou, A Feragen, S Giannarou, B Glocker, K Lekadir, & JA Schnabel (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2024: 27th International Conference, Marrakesh, Morocco, October 6–10, 2024, Proceedings, Part XI, p. 80-90. Cham, Switzerland: Springer, 2024 | en_US
dcterms.issued | 2024 | -
dc.relation.ispartofbook | Medical Image Computing and Computer Assisted Intervention – MICCAI 2024: 27th International Conference, Marrakesh, Morocco, October 6–10, 2024, Proceedings, Part XI | en_US
dc.relation.conference | Medical Image Computing and Computer Assisted Intervention [MICCAI] | en_US
dc.description.validate | 202410 bcc | en_US
dc.description.oa | Not applicable | en_US
dc.identifier.FolderNumber | a3073a | -
dc.identifier.SubFormID | 49382 | -
dc.description.fundingSource | Others | en_US
dc.description.fundingText | Start-up Fund of The Hong Kong Polytechnic University (No. P0045999); the Seed Fund of the Research Institute for Smart Ageing (No. P0050946) | en_US
dc.description.pubStatus | Published | en_US
dc.date.embargo | 2025-10-03 | en_US
dc.description.oaCategory | Green (AAM) | en_US
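The abstract above mentions constructing soft labels for contrastive learning to relate different image-report pairs. The following is a minimal, hypothetical PyTorch sketch of what such a soft-label contrastive objective could look like, assuming CLIP-style normalised embeddings and multi-hot finding/existence tag vectors; the function name, the tag-overlap weighting, and the temperature value are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def soft_label_contrastive_loss(img_emb, txt_emb, tags, temperature=0.07):
    """Soft-label CLIP-style loss (illustrative sketch, not the paper's code).

    img_emb, txt_emb: (B, D) L2-normalised image / report embeddings.
    tags:             (B, T) multi-hot finding/existence tags per sample.
    """
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities

    # Soft targets from shared tags: image-report pairs from *different*
    # samples that describe the same findings get non-zero target mass,
    # instead of the one-hot identity targets of standard CLIP.
    overlap = tags.float() @ tags.float().t()             # (B, B) shared-tag counts
    targets = overlap / overlap.sum(dim=1, keepdim=True).clamp(min=1e-8)

    # Symmetric soft cross-entropy; overlap is symmetric, so the same
    # targets serve both retrieval directions.
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Example usage with random tensors (B samples, D-dim embeddings, T tags):
B, D, T = 8, 512, 14
img = F.normalize(torch.randn(B, D), dim=1)
txt = F.normalize(torch.randn(B, D), dim=1)
tags = torch.randint(0, 2, (B, T))
loss = soft_label_contrastive_loss(img, txt, tags)

Replacing the one-hot identity targets of standard contrastive learning with tag-overlap soft targets lets semantically similar but unpaired image-report samples attract rather than repel each other, which is the motivation the abstract states for constructing soft labels.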
Appears in Collections: Conference Paper