Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/115853
Title: Deep learning-based 3D human pose estimation for fashion applications
Authors: Peng, Jihua
Degree: Ph.D.
Issue Date: 2024
Abstract: 3D human pose estimation, a foundational task in computer vision, has received significant attention in recent years due to its crucial applications in robotics, healthcare, and sports science. It is also an important research topic in the fashion field, since it yields plausible human body regions for clothes parsing. This study addresses the limitations of existing state-of-the-art (SOTA) methods by proposing three new and efficient models for 3D pose estimation from different inputs, including video sequences and single images. As an application of the proposed methods, this study also demonstrates how 3D poses predicted from video inputs can be retargeted to game and fashion avatars.
Pose estimation covers both 2D and 3D pose estimation, and the latter is technically more challenging. For 3D pose estimation, most existing methods convert this challenging task into a local pose estimation problem by partitioning the human body joints into groups based on anatomical relationships. The joint features from the different groups are then fused to predict the pose of the whole body, which requires a joint feature fusion module. However, the fusion schemes adopted in existing methods involve learning a large number of parameters and are therefore computationally expensive. Thus, this study first proposes a novel grouped 3D pose estimation network with an optimized feature fusion (OFF) module that requires fewer parameters and computations than existing methods while being more accurate. The network further introduces a motion amplitude information (MAI) method and a camera intrinsic embedding (CIE) module, which provide better global information and 2D-to-3D conversion knowledge, thereby improving overall robustness and accuracy. In contrast to previous methods, the proposed network can be trained end-to-end in a single stage, and experimental results demonstrate that it outperforms previous state-of-the-art methods on two benchmarks.
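To make the grouped design concrete, the following PyTorch sketch shows one way such a network could be organized: joints are split into anatomical groups, each group is encoded separately, and the per-group features are fused with a single shared projection instead of a heavy pairwise fusion block. The grouping, layer sizes, and module names here are illustrative assumptions, not the thesis's actual OFF/MAI/CIE implementation.

import torch
import torch.nn as nn

# Hypothetical grouping of a 17-joint skeleton (Human3.6M-style indices).
JOINT_GROUPS = [
    [0, 1, 2, 3],       # pelvis + right leg
    [0, 4, 5, 6],       # pelvis + left leg
    [0, 7, 8, 9, 10],   # torso + head
    [8, 11, 12, 13],    # left arm
    [8, 14, 15, 16],    # right arm
]

class GroupedPoseFusion(nn.Module):
    """Grouped 2D-to-3D lifting with a lightweight feature fusion step."""
    def __init__(self, hidden=128, num_joints=17):
        super().__init__()
        self.group_encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(len(g) * 2, hidden), nn.ReLU())
            for g in JOINT_GROUPS
        )
        # One shared linear layer fuses all group features (few parameters).
        self.fuse = nn.Linear(hidden * len(JOINT_GROUPS), hidden)
        self.head = nn.Linear(hidden, num_joints * 3)

    def forward(self, pose_2d):                      # pose_2d: (B, 17, 2)
        feats = [enc(pose_2d[:, g, :].flatten(1))
                 for enc, g in zip(self.group_encoders, JOINT_GROUPS)]
        fused = torch.relu(self.fuse(torch.cat(feats, dim=1)))
        return self.head(fused).view(-1, 17, 3)      # predicted 3D joints

print(GroupedPoseFusion()(torch.randn(4, 17, 2)).shape)  # torch.Size([4, 17, 3])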
The first new method described above is based on a convolutional neural network (CNN) with grouped feature fusion. In view of the rapid advancement and outstanding performance of transformer-based deep learning models, another novel method, called the Kinematics and Trajectory Prior Knowledge-Enhanced Transformer (KTPFormer), is also proposed for 3D pose estimation from video inputs. This network contains two novel prior attention modules: Kinematic Prior Attention (KPA) and Trajectory Prior Attention (TPA). KPA models kinematic relationships in the human body by constructing a kinematic topology, while TPA builds a temporal topology to learn prior knowledge of joint motion trajectories across frames. In this way, the two prior attention mechanisms yield Q, K, and V vectors that carry prior knowledge for the vanilla self-attention mechanism, helping it model global dependencies and features more effectively. With a lightweight, plug-and-play design, KPA and TPA can be easily integrated into various state-of-the-art models to further improve performance by a significant margin, with only a small increase in computational overhead.
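The idea of injecting prior knowledge into the Q, K, V projections can be sketched as follows: a graph propagation over a fixed kinematic topology enriches the joint tokens before a vanilla multi-head self-attention consumes them. This is only an illustrative sketch in the spirit of KPA (a temporal analogue over frames would play the role of TPA); the class name, adjacency handling, and dimensions are assumptions rather than the KTPFormer implementation.

import torch
import torch.nn as nn

class KinematicPriorAttention(nn.Module):
    """Prior attention sketch: a kinematic-graph propagation enriches the
    tokens before they serve as Q, K, V for vanilla self-attention."""
    def __init__(self, dim, num_joints, adjacency, num_heads=8):
        super().__init__()
        adj = adjacency + torch.eye(num_joints)          # add self-loops
        self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True))
        self.graph_proj = nn.Linear(dim, dim)            # graph-propagation weights
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                # x: (B, num_joints, dim)
        prior = torch.relu(self.graph_proj(self.adj @ x))  # kinematic prior
        q = k = v = x + prior                            # Q, K, V now carry the prior
        out, _ = self.attn(q, k, v)
        return out

# Toy usage with a hypothetical 17-joint parent list.
parents = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]
A = torch.zeros(17, 17)
for j, p in enumerate(parents):
    if p >= 0:
        A[j, p] = A[p, j] = 1.0
block = KinematicPriorAttention(dim=64, num_joints=17, adjacency=A)
print(block(torch.randn(2, 17, 64)).shape)               # torch.Size([2, 17, 64])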
For single-image inputs, a third new network is designed in this study for 3D pose estimation, which combines graph and attention mechanisms. This method effectively models the topological information of the human body and learns global correlations among different body joints more efficiently.
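A single-frame lifting layer combining the two mechanisms might look like the sketch below: a graph-convolution branch captures the local body topology while a self-attention branch captures global joint correlations, with residual connections between them. The block structure and sizes are again assumptions for illustration, not the network described in the thesis.

import torch
import torch.nn as nn

class GraphAttentionBlock(nn.Module):
    """One lifting block: graph convolution for local topology,
    self-attention for global joint correlations."""
    def __init__(self, dim, adjacency, num_heads=4):
        super().__init__()
        adj = adjacency + torch.eye(adjacency.size(0))   # self-loops
        self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True))
        self.gcn = nn.Linear(dim, dim)                   # graph branch
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                                # x: (B, joints, dim)
        x = self.norm1(x + torch.relu(self.gcn(self.adj @ x)))  # local structure
        attn_out, _ = self.attn(x, x, x)                          # global correlations
        return self.norm2(x + attn_out)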
As a demonstration of a potential application of the proposed methods, motion retargeting is used to transfer the predicted 3D human poses from fashion images/videos to other people, so that different people can perform the same motion (e.g., a catwalk), realizing multi-person motion animation.
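A very simple form of retargeting can be sketched as follows: keep each bone's direction from the predicted source pose but rescale it to the target avatar's bone lengths, walking the kinematic tree from the root. Production retargeting would transfer joint rotations and handle foot contacts; this NumPy sketch, with a hypothetical parent list ordered parent-before-child, is illustrative only.

import numpy as np

def retarget_positions(src_pose, parents, tgt_bone_lengths):
    """Rescale each source bone direction to the target avatar's bone length.
    Assumes joints are ordered so every parent precedes its children."""
    tgt = np.zeros_like(src_pose)
    for j, p in enumerate(parents):
        if p < 0:
            tgt[j] = src_pose[j]                      # keep the root position
            continue
        d = src_pose[j] - src_pose[p]
        d /= (np.linalg.norm(d) + 1e-8)               # bone direction from source
        tgt[j] = tgt[p] + d * tgt_bone_lengths[j]     # re-grow bone at target length
    return tgt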
Pages: xvii, 154 pages : color illustrations
Appears in Collections: Thesis
