Title: Multimedia content analysis via computational human visual model
Authors: Zhong, Shenghua
Degree: Ph.D.
Issue Date: 2013
Abstract: Multimedia content analysis refers to the computerized understanding of the semantic meaning of a multimedia document. Despite more than twenty years of extensive research, multimedia content analysis for real-world applications remains a well-known challenge in multimedia and computer vision. Because of the human-machine gap in multimedia content analysis, more and more researchers have focused on constructing computational human visual models that imitate human perception and intelligence. This thesis proposes a novel framework to solve various problems in multimedia content analysis via computational human visual models. We aim to provide human-like judgment by referencing the architecture of the human visual system and the procedure of intelligent perception. Techniques based on four important processes in the human visual system are designed as follows: 1) retinal image formation for object detection and recognition; 2) attention allocation for image saliency detection and quality assessment; 3) perceptual modeling for image annotation; and 4) visual cortex simulation for multimedia content analysis.
Retinal image formation is the first step of the human visual system. Integrated with the limited frame rate of retinal image formation, a novel Invariant moment & Curvelet coefficient (IMCC) feature space and two novel algorithms are proposed for water reflection detection and recognition. The proposed feature space and algorithms demonstrate impressive results in water reflection image classification, reflection axis detection, and retrieval of images with water reflection.
The fovea produces the most accurate information. Therefore, to encode detailed visual information, the eyes need to move so that the fovea is directed at the visual locations to be encoded. As a cognitive process of selectively concentrating on one aspect of the environment while ignoring other things, attention has been referred to as the allocation of processing resources. Constructing an attention model for multimedia data is useful for applications such as object segmentation, object recognition, and quality assessment. Our novel attention model, the Bi-directional saliency map (BSMP), integrates bottom-up saliency features and top-down target information. Empirical validations on standard datasets demonstrate its effectiveness in both bottom-up and top-down saliency detection. Furthermore, in the image quality assessment task, our BSMP-based technique outperforms the representative blurriness methods.
The visual process is based not only on immediate visual features but also on past experience of regularities. As a perceptual process in which the human brain gathers information from visual elements and their surroundings, contextual cueing provides spatial knowledge about the objects in an image. Unlike existing techniques that rely only on visual information, the proposed Fuzzy-based contextual-cueing label propagation (FCLP) model addresses the challenging problem of region-level annotation to improve the semantic understanding of images. The proposed technique shows a clear performance improvement in label-to-region assignment for images with multiple objects and complex backgrounds.
The visual cortex is the most important part of the human visual system and is responsible for processing visual information. Deep architecture, composed of multiple layers of parameterized nonlinear modules, is a representative paradigm that has achieved notable success in modeling the human visual system. By referencing the architecture of the visual cortex and the procedure of perception, we construct two novel deep network models for two classical intelligent tasks in multimedia content analysis. The first, Bilinear deep belief networks (BDBN), is proposed for image classification. The second, Field effect bilinear deep belief networks (FBDBN), is proposed to seek the recognition discriminant boundary and estimate missing features jointly. Extensive experiments on various standard datasets not only show the discriminative ability of our models in various tasks but also clearly demonstrate our intention of providing human-like image analysis by referencing the human visual system and perception procedure.
Computational human visual models have demonstrated good performance in multimedia content analysis. Further work will proceed in two directions: developing a general deep learning model that simulates the human visual system, and exploring deep learning models for video data analysis.
Subjects: Multimedia systems.
Computer vision.
Hong Kong Polytechnic University -- Dissertations
Pages: xxi, 183 p. : ill. ; 30 cm.
Appears in Collections: Thesis
