Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/109867
dc.contributor: Department of Electrical and Electronic Engineering
dc.creator: Lu, Chongkai
dc.identifier.uri: https://theses.lib.polyu.edu.hk/handle/200/13241
dc.language.iso: English
dc.title: Towards end-to-end temporal action detection in videos
dc.type: Thesis
dcterms.abstract: The exponential surge in video content in recent years has made video a dominant medium of social interaction. This abundance enables the curation of video datasets rich in insightful content, allowing researchers to study human behavior and deepen our understanding of the world. However, the casual, unscripted nature of everyday video recordings means much of the footage is irrelevant, necessitating efficient methods for extracting valuable content. Temporal Action Detection (TAD) addresses this challenge by distinguishing ‘foreground’ from ‘background’ segments in a video sequence based on the presence of target actions. Pivotal in processing untrimmed raw videos, TAD pinpoints segments of interest and extracts the pertinent information from these datasets.
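To make the foreground/background framing concrete, here is a minimal Python sketch of how TAD outputs are conventionally represented and evaluated: each detection is a (start, end, label, confidence) tuple over an untrimmed video, and agreement with an annotated instance is measured by temporal IoU. The segment values and class names below are made-up illustrations, not results from the thesis.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A TAD system reduces an untrimmed video to a sparse list of foreground
# segments; everything else is background. The values here are invented:
detections = [(12.4, 19.8, "high jump", 0.91), (104.0, 117.3, "pole vault", 0.77)]
print(temporal_iou((12.4, 19.8), (13.0, 20.0)))  # ~0.89 overlap with an annotation
```

Benchmarks typically report mean average precision at several temporal-IoU thresholds, which is why localizing segment boundaries precisely matters as much as classifying them.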
dcterms.abstract: Deep learning, a key subset of machine learning, has revolutionized various fields by enabling neural networks to learn from large datasets. Its primary advantage is the ability to learn from extensive data rather than relying on pre-existing human knowledge, leading to robust generalization and superior performance. Recently, deep-learning-based methods have become central to advancing video analysis techniques, particularly in TAD, where they are now the standard approach in the academic community.
dcterms.abstract: The main contribution of this work is the development of several deep-learning-based TAD frameworks that outperform previous methods and offer unique structural benefits. A core goal of this research is to enhance the efficiency and performance of TAD methods. Traditional TAD approaches are often complex and multi-staged, requiring significant engineering effort to fine-tune a model’s hyperparameters. In contrast, our research embraces the efficiency and streamlined processing at the heart of deep learning to develop end-to-end TAD models that integrate feature extraction and action detection in a single process.
dcterms.abstract: The first part of this dissertation addresses the input stage of TAD. In response to the challenge of processing long untrimmed videos, the dissertation introduces the Action Progression Network (APN). APN employs ‘action progression’ as a measurable indicator, enabling the use of a single frame or a brief video segment as the input. This innovation streamlines the TAD process, ensuring uniform computational efficiency irrespective of video duration. Additionally, APN is distinctively trained to target specific actions independently of background activities, substantially improving its generalization capabilities and diminishing its dependency on large datasets. APN has demonstrated exceptional precision in identifying actions with distinctive temporal evolution. This proficiency, coupled with its top-tier performance on public datasets, establishes APN as a groundbreaking development in enhancing both the efficiency and accuracy of TAD.
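The following is a minimal PyTorch sketch of the core idea, not the thesis’s actual architecture: a head that regresses a per-snippet ‘action progression’ value, so each frame or short snippet can be scored independently of overall video length. The module name, layer sizes, and the [0, 1] progression target are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ActionProgressionHead(nn.Module):
    """Toy head: maps one snippet's backbone feature to an action-progression
    estimate in [0, 1] plus per-action class scores."""

    def __init__(self, feat_dim: int, num_actions: int):
        super().__init__()
        self.progression = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # 0 ~ action just started, 1 ~ action about to end
        )
        self.classifier = nn.Linear(feat_dim, num_actions)

    def forward(self, feat: torch.Tensor):
        # feat: (N, feat_dim) features of N frames/snippets sampled anywhere
        return self.progression(feat).squeeze(-1), self.classifier(feat)

head = ActionProgressionHead(feat_dim=2048, num_actions=20)
snippet_feats = torch.randn(16, 2048)   # 16 snippets, independent of video length
progress, logits = head(snippet_feats)  # progress: (16,), logits: (16, 20)
```

One plausible way to decode such a signal, following the abstract’s description, is to mark a segment wherever the per-frame progression climbs from near 0 to near 1, turning a dense progression curve into action boundaries.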
dcterms.abstract: The second part of this dissertation focuses on optimizing the output stage of TAD. Traditional TAD models generate a multitude of initial results, typically requiring laborious post-processing, such as non-maximum suppression (NMS), for refinement. To streamline this process, we integrated the Detection Transformer (DETR) approach into TAD, enabling the model to directly produce finalized detection results via a one-to-one matching mechanism. This integration not only simplifies the overall detection workflow but also faithfully adheres to end-to-end principles. Our work further entails the adaptation and refinement of various DETR optimization techniques for TAD, involving a series of experiments with diverse configurations to improve both the performance and the accuracy of the models. The result of this extensive research and development is DITA: DETR with Improved Queries for End-to-End Temporal Action Detection. DITA subsumes the traditionally separate detection and post-processing steps within a single TAD model, achieving competitive performance on public datasets and demonstrating its robust capability in practical TAD applications.
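As an illustration of the one-to-one matching that lets a DETR-style detector skip NMS, here is a hedged NumPy/SciPy sketch of Hungarian matching between query predictions and ground-truth segments. The cost terms and the l1_weight value are simplified assumptions; DETR-style matchers typically also include an overlap term (e.g., generalized IoU), and this is not DITA’s exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries(pred_segments, pred_probs, gt_segments, gt_labels, l1_weight=5.0):
    """One-to-one (Hungarian) matching of query predictions to ground truth.

    pred_segments: (Q, 2) predicted (center, width), normalized to [0, 1]
    pred_probs:    (Q, C) per-class probabilities
    gt_segments:   (G, 2) ground-truth (center, width)
    gt_labels:     (G,)   ground-truth class indices
    """
    cls_cost = -pred_probs[:, gt_labels]                                  # (Q, G)
    l1_cost = np.abs(pred_segments[:, None] - gt_segments[None]).sum(-1)  # (Q, G)
    cost = cls_cost + l1_weight * l1_cost
    q_idx, g_idx = linear_sum_assignment(cost)  # globally optimal assignment
    # Query q_idx[k] is matched to ground truth g_idx[k]; unmatched queries
    # are supervised as a 'no action' class during training.
    return q_idx, g_idx
```

Because each ground-truth instance is assigned to exactly one query, duplicate predictions are penalized during training, so the model learns to emit finalized detections directly instead of relying on NMS to prune them.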
dcterms.abstract: In conclusion, the contributions of this work significantly propel the development of end-to-end TAD, making action detection in videos simple and efficient. The impressive performance of these frameworks on public datasets demonstrates their efficacy and real-world applicability. Looking ahead, we plan to integrate the insights from this work and draw inspiration from other methodologies to develop a truly comprehensive end-to-end TAD model. Further, we plan to delve deeper into the mechanics of deep learning models in video action detection, seeking knowledge beyond traditional model design. This exploration is anticipated to uncover new insights, enhancing the efficiency and effectiveness of TAD models.
dcterms.accessRights: open access
dcterms.educationLevel: Ph.D.
dcterms.extent: 135 pages : color illustrations
dcterms.issued: 2024
dcterms.LCSH: Image processing -- Digital techniques
dcterms.LCSH: Video recordings
dcterms.LCSH: Machine learning
dcterms.LCSH: Deep learning (Machine learning)
dcterms.LCSH: Hong Kong Polytechnic University -- Dissertations
Appears in Collections: Thesis