Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/97934
Title: Towards efficient and reliable human activity understanding
Authors: Xiang, Wangmeng
Degree: Ph.D.
Issue Date: 2023
Abstract: Human activity understanding has been an active research area due to its wide range of applications, e.g., sports analysis, healthcare, security monitoring, environmental protection, entertainment, self-driving vehicles, and human-computer interaction. Generally speaking, understanding human activities requires answering "who (person re-identification) is doing what (action recognition)". In this thesis, we investigate efficient and reliable methodologies for person re-identification and action recognition.

To reliably recognize human identity, in chapter 2 we propose a novel Part-aware Attention Network (PAN) for person re-identification, which uses part feature maps as queries to perform second-order information propagation from middle-level features. PAN operates on all spatial positions of the feature maps so that it can discover long-distance relations.

Considering that hard negative samples have a huge impact on action recognition performance, in chapter 3 we propose a Common Daily Action Dataset (CDAD), which contains positive and negative action pairs for reliable daily action understanding. The established CDAD dataset not only serves as a benchmark for several important daily action understanding tasks, including multi-label action recognition, temporal action localization and spatial-temporal action detection, but also provides a testbed for researchers to investigate the influence of highly similar negative samples when learning action understanding models.

Efficiently and effectively modeling the 3D self-attention of video data has been a great challenge for transformer-based action recognition. In chapter 4, we propose Temporal Patch Shift (TPS) for efficient spatiotemporal self-attention modeling, which largely increases the temporal modeling ability of 2D transformers without additional computation cost.

Previous skeleton-based action recognition methods are typically formulated as a classification task over one-hot labels, without fully utilizing the semantic relations between actions. To fully explore the action prior knowledge contained in language, in chapter 5 we propose Language Supervised Training (LST) for skeleton-based action recognition. More specifically, we take a large-scale language model as the knowledge engine to provide text descriptions for body-part actions, and apply a multi-modal training scheme to supervise the skeleton encoder for action representation learning.

In summary, this thesis presents three methods and one dataset for efficient and reliable human activity understanding. PAN uses part features to aggregate information from mid-level CNN features for person re-identification; CDAD collects positive and negative action pairs for reliable action recognition; TPS applies a patch shift operation for efficient spatial-temporal modeling in transformers for video action recognition; and LST deploys human part language descriptions to guide skeleton-based action recognition. Extensive experiments demonstrate their efficiency and reliability for human activity understanding.
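The part-aware attention of chapter 2 can be pictured with a minimal PyTorch sketch, assuming learned part queries cross-attend to every spatial position of a mid-level CNN feature map; the module and argument names here (PartQueryAttention, num_parts) are illustrative, not the thesis implementation.

```python
# Hedged sketch (not the thesis code): part queries attend to all spatial
# positions of a mid-level feature map to gather long-range context per part.
import torch
import torch.nn as nn

class PartQueryAttention(nn.Module):
    def __init__(self, dim, num_parts, num_heads=4):
        super().__init__()
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim))  # one query per body part
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat):                               # feat: (B, C, H, W) mid-level feature map
        b, c, h, w = feat.shape
        kv = feat.flatten(2).transpose(1, 2)               # (B, H*W, C): keys/values over all positions
        q = self.part_queries.unsqueeze(0).expand(b, -1, -1)  # (B, P, C)
        part_feats, _ = self.attn(q, kv, kv)               # each part query aggregates long-distance relations
        return part_feats                                   # (B, P, C) part-aware descriptors

# Usage: PartQueryAttention(256, num_parts=6)(torch.randn(2, 256, 24, 8)).shape -> (2, 6, 256)
```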
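Temporal Patch Shift (chapter 4) lets an unchanged 2D spatial self-attention mix information across frames by borrowing a subset of patch tokens from neighboring frames. The sketch below uses an illustrative shift pattern (every fourth patch position looks one frame back or ahead); the actual pattern in the thesis may differ, but the zero-parameter, zero-FLOP nature of the operation is the point.

```python
# Hedged sketch of a temporal patch-shift step, assuming video tokens laid out
# as (batch, frames, patches, channels). The shift pattern is illustrative only.
import torch

def temporal_patch_shift(x):
    # x: (B, T, N, C) video tokens; returns a tensor of the same shape.
    out = x.clone()
    out[:, 1:, 0::4] = x[:, :-1, 0::4]   # these patch positions take tokens from the previous frame
    out[:, :-1, 1::4] = x[:, 1:, 1::4]   # these patch positions take tokens from the next frame
    return out                            # feed into the unchanged 2D spatial self-attention

x = torch.randn(2, 8, 196, 768)           # e.g. 8 frames of 14x14 ViT patches
assert temporal_patch_shift(x).shape == x.shape
```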
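The language-supervised training of chapter 5 pairs skeleton features with text descriptions of body-part motion produced by a large language model. A CLIP-style symmetric contrastive loss, as sketched below, is one common way to realize such multi-modal supervision; the helper name clip_loss and the temperature value are assumptions, not the thesis implementation.

```python
# Hedged sketch: align skeleton-encoder embeddings with text-description
# embeddings via a symmetric contrastive loss (CLIP-style), used here to
# illustrate multi-modal supervision rather than reproduce the thesis method.
import torch
import torch.nn.functional as F

def clip_loss(skel_emb, text_emb, temperature=0.07):
    skel_emb = F.normalize(skel_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = skel_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(skel_emb.size(0), device=skel_emb.device)
    # symmetric cross-entropy: each skeleton matches its own description and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = clip_loss(torch.randn(8, 256), torch.randn(8, 256))  # toy batch of 8 paired embeddings
```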
Subjects: Computer vision; Image analysis; Motion perception (Vision); Pattern recognition systems; Hong Kong Polytechnic University -- Dissertations
Pages: xv, 146 pages : color illustrations
Appears in Collections: Thesis
Access: View full-text via https://theses.lib.polyu.edu.hk/handle/200/12263