Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/97934
Title: Towards efficient and reliable human activity understanding
Authors: Xiang, Wangmeng
Degree: Ph.D.
Issue Date: 2023
Abstract: Human activity understanding has been an active research area due to its wide range of applications, e.g., sports analysis, healthcare, security monitoring, environmental protection, entertainment, self-driving vehicles, and human-computer interaction. Generally speaking, understanding human activities requires answering "who (person re-identification) is doing what (action recognition)". In this thesis, we investigate efficient and reliable methodologies for person re-identification and action recognition.

To reliably recognize human identity, in chapter 2 we propose a novel Part-aware Attention Network (PAN) for person re-identification, which uses part feature maps as queries to perform second-order information propagation from middle-level features. PAN operates on all spatial positions of the feature maps so that it can discover long-distance relations.

Considering that hard negative samples have a huge impact on action recognition performance, in chapter 3 we propose a Common Daily Action Dataset (CDAD), which contains positive and negative action pairs for reliable daily action understanding. The established CDAD dataset not only serves as a benchmark for several important daily action understanding tasks, including multi-label action recognition, temporal action localization and spatial-temporal action detection, but also provides a testbed for researchers to investigate the influence of highly similar negative samples when learning action understanding models.

Efficiently and effectively modeling the 3D self-attention of video data has been a great challenge for transformer-based action recognition. In chapter 4, we propose Temporal Patch Shift (TPS) for efficient spatiotemporal self-attention modeling, which largely increases the temporal modeling ability of 2D transformers without additional computation cost.

Previous skeleton-based action recognition methods are typically formulated as a classification task over one-hot labels, without fully utilizing the semantic relations between actions. To fully explore the action prior knowledge contained in language, in chapter 5 we propose Language Supervised Training (LST) for skeleton-based action recognition. More specifically, we take a large-scale language model as the knowledge engine to provide text descriptions for body-part actions, and apply a multi-modal training scheme to supervise the skeleton encoder for action representation learning.

In summary, this thesis presents three methods and one dataset for efficient and reliable human activity understanding. PAN uses part features to aggregate information from mid-level CNN features for person re-identification; CDAD collects positive and negative action pairs for reliable action recognition; TPS applies a patch shift operation for efficient spatial-temporal modeling in transformers for video action recognition; and LST deploys human part language descriptions to guide skeleton-based action recognition. Extensive experiments demonstrate their efficiency and reliability for human activity understanding.
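The part-aware attention of chapter 2 can be pictured with a minimal PyTorch sketch, assuming learned part queries cross-attend to every spatial position of a mid-level CNN feature map; the module and argument names here (PartQueryAttention, num_parts) are illustrative, not the thesis implementation.

```python
# Hedged sketch (not the thesis code): part queries attend to all spatial
# positions of a mid-level feature map to gather long-range context per part.
import torch
import torch.nn as nn

class PartQueryAttention(nn.Module):
    def __init__(self, dim, num_parts, num_heads=4):
        super().__init__()
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim))  # one query per body part
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat):                               # feat: (B, C, H, W) mid-level feature map
        b, c, h, w = feat.shape
        kv = feat.flatten(2).transpose(1, 2)               # (B, H*W, C): keys/values over all positions
        q = self.part_queries.unsqueeze(0).expand(b, -1, -1)  # (B, P, C)
        part_feats, _ = self.attn(q, kv, kv)               # each part query aggregates long-distance relations
        return part_feats                                   # (B, P, C) part-aware descriptors

# Usage: PartQueryAttention(256, num_parts=6)(torch.randn(2, 256, 24, 8)).shape -> (2, 6, 256)
```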
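Temporal Patch Shift (chapter 4) lets an unchanged 2D spatial self-attention mix information across frames by borrowing a subset of patch tokens from neighboring frames. The sketch below uses an illustrative shift pattern (every fourth patch position looks one frame back or ahead); the actual pattern in the thesis may differ, but the zero-parameter, zero-FLOP nature of the operation is the point.

```python
# Hedged sketch of a temporal patch-shift step, assuming video tokens laid out
# as (batch, frames, patches, channels). The shift pattern is illustrative only.
import torch

def temporal_patch_shift(x):
    # x: (B, T, N, C) video tokens; returns a tensor of the same shape.
    out = x.clone()
    out[:, 1:, 0::4] = x[:, :-1, 0::4]   # these patch positions take tokens from the previous frame
    out[:, :-1, 1::4] = x[:, 1:, 1::4]   # these patch positions take tokens from the next frame
    return out                            # feed into the unchanged 2D spatial self-attention

x = torch.randn(2, 8, 196, 768)           # e.g. 8 frames of 14x14 ViT patches
assert temporal_patch_shift(x).shape == x.shape
```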
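The language-supervised training of chapter 5 pairs skeleton features with text descriptions of body-part motion produced by a large language model. A CLIP-style symmetric contrastive loss, as sketched below, is one common way to realize such multi-modal supervision; the helper name clip_loss and the temperature value are assumptions, not the thesis implementation.

```python
# Hedged sketch: align skeleton-encoder embeddings with text-description
# embeddings via a symmetric contrastive loss (CLIP-style), used here to
# illustrate multi-modal supervision rather than reproduce the thesis method.
import torch
import torch.nn.functional as F

def clip_loss(skel_emb, text_emb, temperature=0.07):
    skel_emb = F.normalize(skel_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = skel_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(skel_emb.size(0), device=skel_emb.device)
    # symmetric cross-entropy: each skeleton matches its own description and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = clip_loss(torch.randn(8, 256), torch.randn(8, 256))  # toy batch of 8 paired embeddings
```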
Subjects: Computer vision; Image analysis; Motion perception (Vision); Pattern recognition systems; Hong Kong Polytechnic University -- Dissertations
Pages: xv, 146 pages : color illustrations
Appears in Collections: Thesis
Access: View full-text via https://theses.lib.polyu.edu.hk/handle/200/12263