Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/116688
dc.contributor: Department of Computing
dc.creator: Huo, Fushuo
dc.identifier.uri: https://theses.lib.polyu.edu.hk/handle/200/14075
dc.language.iso: English
dc.title: Towards robust multimodal learning in the open world
dc.type: Thesis
dcterms.abstract: The rapid evolution of machine learning has propelled neural networks to unprecedented success across diverse domains. In particular, multimodal learning has emerged as a transformative paradigm, leveraging complementary information from heterogeneous data streams (e.g., text, vision, audio) to advance contextual reasoning and intelligent decision-making. Despite these advancements, current neural-network-based models often fall short in open-world environments characterized by inherent unpredictability, where shifting environmental compositions, incomplete modality inputs, and spurious distributional relations critically undermine system reliability. While humans naturally adapt to such dynamic, ambiguous scenarios, artificial intelligence systems exhibit stark limitations in robustness, particularly when processing multimodal signals under real-world complexity. This thesis investigates the fundamental challenge of multimodal learning robustness in open-world settings, aiming to bridge the gap between controlled experimental performance and practical deployment requirements. Specifically, we study three aspects of multimodal learning robustness in the open world:
dcterms.abstract: (1) Humans can extrapolate new concepts from previously learned multimodal knowledge, an ability known as compositional generalization. Neural networks, in contrast, lack compositional generalization robustness, struggling to reliably handle unseen compositions due to rigid feature representations and over-reliance on training-data biases. (2) Humans can seamlessly reason from unimodal inputs based on memorized contextual multimodal information, remaining robust even when some modalities are absent. Neural networks, however, seldom achieve satisfactory results when inferring from unimodal inputs on the basis of integrated multimodal information. (3) With the development of large language models (LLMs), multimodal large language models (MLLMs), especially large vision-language models (LVLMs), have demonstrated comprehensive abilities that approach or even surpass human performance. However, most LVLMs are derived from LLMs by instruction tuning on multimodal datasets, so they usually inherit a strong language-modality prior or statistical bias from the underlying LLM. This prior is one of the main causes of the significant challenge known as 'hallucination', which arises even for simple queries.
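To make problem (3) concrete, the following is a minimal, illustrative sketch of how a language-modality prior can be probed and offset at decoding time: the next-token distribution conditioned on both the image and the prompt is compared against a text-only distribution, and tokens supported by the visual evidence are amplified. The function `lvlm_logits`, the vocabulary size, and the scaling factor `beta` are all hypothetical stand-ins; this is not the thesis's actual decoding strategy.

```python
import torch

def lvlm_logits(prompt_ids, image=None):
    """Hypothetical stand-in for an LVLM forward pass (random logits here)."""
    torch.manual_seed(0 if image is None else 1)  # placeholder for a real model
    return torch.randn(prompt_ids.shape[0], 32000)

def prior_adjusted_logits(prompt_ids, image, beta=1.0):
    grounded = lvlm_logits(prompt_ids, image=image)  # conditioned on the image
    text_only = lvlm_logits(prompt_ids, image=None)  # visual input withheld
    # Amplify tokens whose score rises when the image is present, suppressing
    # completions driven purely by the language prior.
    return grounded + beta * (grounded - text_only)

prompt_ids = torch.randint(0, 32000, (1, 16))
image = torch.randn(1, 3, 224, 224)
next_token = prior_adjusted_logits(prompt_ids, image).argmax(dim=-1)
print(next_token)
```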
dcterms.abstract: In summary, we study the above three problems to improve class-level and modality-level multimodal robustness, in terms of compositional generalization robustness (class-level), missing-modality robustness (modality-level), and modality-prior robustness (modality-level). Concretely, in Chapter 3 we propose a novel Progressive Cross-primitive Compatibility (ProCC) network that mimics the human learning process of recognizing multimodal compositions, improving modality composition ability. In Chapter 4 we propose customized cross-modal knowledge distillation (C²KD) to inherit multimodal knowledge during pre-training and to enhance inference robustness when some modalities are missing. In Chapter 5 we propose a training-free decoding strategy that alleviates the language-modality prior of LVLMs, mitigating hallucination without compromising the general abilities of the foundation model. Extensive experimental evaluations and ablation studies demonstrate the performance advantages of our methods, with demonstrable gains in robustness across multiple modalities.
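As an illustration of the cross-modal distillation idea summarized above (a minimal sketch only, not the C²KD implementation; the encoders, dimensions, and audio/vision split are assumptions), a multimodal teacher's soft predictions can supervise a unimodal student so that inference remains possible when one modality is missing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard KD objective: soft-target KL at temperature T plus hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical encoders: the teacher sees audio+vision, the student vision only.
teacher = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
student = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))

audio_vision = torch.randn(8, 256)   # concatenated multimodal features
vision_only = audio_vision[:, 128:]  # the modality still available at test time
labels = torch.randint(0, 10, (8,))

with torch.no_grad():
    teacher_logits = teacher(audio_vision)  # multimodal soft targets
loss = distillation_loss(student(vision_only), teacher_logits, labels)
loss.backward()
```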
dcterms.accessRights: open access
dcterms.educationLevel: Ph.D.
dcterms.extent: xix, 128 pages : color illustrations
dcterms.issued: 2025
dcterms.LCSH: Machine learning
dcterms.LCSH: Artificial intelligence
dcterms.LCSH: Hong Kong Polytechnic University -- Dissertations
Appears in Collections: Thesis