Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/116688
dc.contributor: Department of Computing
dc.creator: Huo, Fushuo
dc.identifier.uri: https://theses.lib.polyu.edu.hk/handle/200/14075
dc.language.iso: English
dc.title: Towards robust multimodal learning in the open world
dc.type: Thesis
dcterms.abstract: The rapid evolution of machine learning has propelled neural networks to unprecedented success across diverse domains. In particular, multimodal learning has emerged as a transformative paradigm, leveraging complementary information from heterogeneous data streams (e.g., text, vision, audio) to advance contextual reasoning and intelligent decision-making. Despite these advancements, current neural-network-based models often fall short in open-world environments characterized by inherent unpredictability, where shifting environmental compositions, incomplete modality inputs, and spurious distributional relations critically undermine system reliability. While humans naturally adapt to such dynamic, ambiguous scenarios, artificial intelligence systems exhibit stark limitations in robustness, particularly when processing multimodal signals under real-world complexity. This thesis investigates the fundamental challenge of multimodal learning robustness in open-world settings, aiming to bridge the gap between controlled experimental performance and practical deployment requirements. Specifically, we study three aspects of multimodal learning robustness in the open world:
dcterms.abstract: (1) Humans can extrapolate new concepts from previously learned multimodal knowledge, an ability known as compositional generalization. Neural networks, in contrast, lack compositional generalization robustness, struggling to reliably handle unseen compositions due to rigid feature representations and over-reliance on training-data biases. (2) Humans can seamlessly reason from unimodal inputs based on memorized contextual multimodal information, remaining robust even when some modalities are absent. Neural networks, however, seldom achieve satisfactory results when inferring from unimodal inputs on the basis of integrated multimodal information. (3) With the development of large language models (LLMs), multimodal large language models (MLLMs), especially large vision-language models (LVLMs), have demonstrated comprehensive abilities that approach or even surpass human performance. However, most LVLMs are derived from LLMs by instruction tuning on multimodal datasets, so they usually inherit a strong language-modality prior or statistical bias from the underlying LLM. This prior is one of the main causes of the significant challenge known as 'hallucination', which arises even for simple queries.
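To make problem (3) concrete, the following is a minimal, illustrative sketch of how a language-modality prior can be probed and offset at decoding time: the next-token distribution conditioned on both the image and the prompt is compared against a text-only distribution, and tokens supported by the visual evidence are amplified. The function `lvlm_logits`, the vocabulary size, and the scaling factor `beta` are all hypothetical stand-ins; this is not the thesis's actual decoding strategy.

```python
import torch

def lvlm_logits(prompt_ids, image=None):
    """Hypothetical stand-in for an LVLM forward pass (random logits here)."""
    torch.manual_seed(0 if image is None else 1)  # placeholder for a real model
    return torch.randn(prompt_ids.shape[0], 32000)

def prior_adjusted_logits(prompt_ids, image, beta=1.0):
    grounded = lvlm_logits(prompt_ids, image=image)  # conditioned on the image
    text_only = lvlm_logits(prompt_ids, image=None)  # visual input withheld
    # Amplify tokens whose score rises when the image is present, suppressing
    # completions driven purely by the language prior.
    return grounded + beta * (grounded - text_only)

prompt_ids = torch.randint(0, 32000, (1, 16))
image = torch.randn(1, 3, 224, 224)
next_token = prior_adjusted_logits(prompt_ids, image).argmax(dim=-1)
print(next_token)
```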
dcterms.abstract: In summary, we study the above three problems to improve class-level and modality-level multimodal robustness, in terms of compositional generalization robustness (class-level), missing-modality robustness (modality-level), and modality-prior robustness (modality-level). Concretely, in Chapter 3 we propose a novel Progressive Cross-primitive Compatibility (ProCC) network that mimics the human learning process of recognizing multimodal compositions, improving modality composition ability. In Chapter 4 we propose customized cross-modal knowledge distillation (C²KD) to inherit multimodal knowledge during pre-training and to enhance inference robustness when some modalities are missing. In Chapter 5 we propose a training-free decoding strategy that alleviates the language-modality prior of LVLMs, mitigating hallucination without compromising the general abilities of the foundation model. Extensive experimental evaluations and ablation studies demonstrate the performance advantages of our methods, with demonstrable gains in robustness across multiple modalities.
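As an illustration of the cross-modal distillation idea summarized above (a minimal sketch only, not the C²KD implementation; the encoders, dimensions, and audio/vision split are assumptions), a multimodal teacher's soft predictions can supervise a unimodal student so that inference remains possible when one modality is missing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard KD objective: soft-target KL at temperature T plus hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical encoders: the teacher sees audio+vision, the student vision only.
teacher = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
student = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))

audio_vision = torch.randn(8, 256)   # concatenated multimodal features
vision_only = audio_vision[:, 128:]  # the modality still available at test time
labels = torch.randint(0, 10, (8,))

with torch.no_grad():
    teacher_logits = teacher(audio_vision)  # multimodal soft targets
loss = distillation_loss(student(vision_only), teacher_logits, labels)
loss.backward()
```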
dcterms.accessRights: open access
dcterms.educationLevel: Ph.D.
dcterms.extent: xix, 128 pages : color illustrations
dcterms.issued: 2025
dcterms.LCSH: Machine learning
dcterms.LCSH: Artificial intelligence
dcterms.LCSH: Hong Kong Polytechnic University -- Dissertations
Appears in Collections: Thesis