Title: Deep learning models for human parsing and action recognition : architectural design, model compression and data augmentation
Authors: Jiang, Yalong
Degree: Ph.D.
Issue Date: 2020
Abstract: Methods for human parsing and action recognition have long been critical techniques for visually describing human behaviours. Recent developments in Convolutional Neural Networks (CNNs) have brought significant improvements to these tasks, thanks to the availability of an increased amount of training data. In this study, I focus on three major problems which hinder the application of deep learning models to human parsing and action recognition. Firstly, existing human parsing models suffer from incomplete feature representations, which may lead to failures in some difficult cases. I propose two novel architectures with comprehensive feature representations to improve the robustness of models. The first architecture explores the relationship between human parsing and pose estimation. A module for pose estimation is integrated with a human parsing module to improve performance under complex backgrounds and variations in human poses. The second architecture adopts a CNN module for depth estimation which pre-processes input images for the segmentation module. It can improve pixel classification near boundaries. The availability of abundant labelled data in pose estimation and depth estimation boosts the performance in human parsing. Secondly, the inappropriate capacity of a CNN model and insufficient training data both contribute to failures in perceiving the semantic information of detailed regions. A high-capacity model cannot generalize to the variations in human parsing and action recognition. In my work, three novel methods to reduce the complexity of convolutional layers are proposed. The first method applies orthogonal weight normalization for weight initialization, which improves performance while reducing complexity. The second method adjusts the dependency among convolutional kernels by conducting principal component analysis on the kernels. The third method clusters the convolutional kernels in each layer based on the Euclidean distance and evaluates the contributions of different clusters by examining the changes in training and test accuracy upon removing the clusters. Higher computational efficiency and better performance can be achieved at the same time. This method can be applied to models which are pretrained on other tasks. Besides model compression, I further propose a method to evaluate the complexity of a human parsing task. The variations in scale and location, and the consistency of predictions from different models, are studied. Additionally, a layer-wise training scheme is proposed to better explore the potential of a CNN model. Thirdly, human parsing models are used to improve the robustness of action recognition models. I extend human parsing models to predict the correspondences between RGB images and surface-based representations of human bodies. The predictions are used to determine the task-irrelevant content in inputs, which increases the domain discrepancy. The proposed scheme reduces the discrepancy between the training data and the test data and improves the performance in action recognition. The above-mentioned methods are evaluated on the Pascal Person Part dataset and the Look into Person dataset for human parsing, the COCO dataset for pose estimation, the MegaDepth dataset for depth estimation, and the HMDB-51 dataset for action recognition.
Subjects: Hong Kong Polytechnic University -- Dissertations
Pattern recognition systems
Human activity recognition
Machine learning
Pages: xxvii, 209 pages : color illustrations
Appears in Collections: Thesis

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.