Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/107968
Title: Graph neural networks in vision-language image understanding: a survey
Authors: Senior, H
Slabaugh, G
Yuan, S
Rossi, L 
Issue Date: Jan-2025
Source: Visual computer, Jan. 2025, v. 41, no. 1, p. 491-516
Abstract: 2D image understanding is a complex problem within computer vision, but it holds the key to providing human-level scene comprehension. It goes further than identifying the objects in an image; instead, it attempts to understand the scene. Solutions to this problem form the underpinning of a range of tasks, including image captioning, visual question answering (VQA), and image retrieval. Graphs provide a natural way to represent the relational arrangement between objects in an image, and thus, in recent years, graph neural networks (GNNs) have become a standard component of many 2D image understanding pipelines, especially in the VQA family of tasks. In this survey, we review this rapidly evolving field and provide a taxonomy of graph types used in 2D image understanding approaches, a comprehensive list of the GNN models used in this domain, and a roadmap of potential future developments. To the best of our knowledge, this is the first comprehensive survey covering image captioning, visual question answering, and image retrieval techniques that use GNNs as the main part of their architecture.
Keywords: Graph neural networks
Image captioning
Image retrieval
Visual question answering
Publisher: Springer
Journal: Visual computer 
ISSN: 0178-2789
EISSN: 1432-2315
DOI: 10.1007/s00371-024-03343-0
Rights: © The Author(s) 2024
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The following publication Senior, H., Slabaugh, G., Yuan, S. et al. Graph neural networks in vision-language image understanding: a survey. Vis Comput 41, 491–516 (2025) is available at https://doi.org/10.1007/s00371-024-03343-0.
Appears in Collections:Journal/Magazine Article

Files in This Item:
File: s00371-024-03343-0.pdf (2.94 MB, Adobe PDF)
Open Access Information:
Status: open access
File Version: Version of Record

Page views: 50 (as of Apr 14, 2025)
Downloads: 7 (as of Apr 14, 2025)
Scopus citations: 3 (as of Mar 27, 2025)
Web of Science citations: 3 (as of Mar 27, 2025)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.