Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/108291
Title: A systematic vision-based methodology for holistic scene understanding in human-robot collaboration
Authors: Fan, Junming
Degree: Ph.D.
Issue Date: 2024
Abstract: The next generation of industry has depicted a visionary blueprint of human-centricity in futuristic manufacturing systems. In the modern manufacturing sector, a dramatic shift has already begun from the traditional mode of mass production towards mass personalization, driven by the increasing prevalence of personalization culture and customization requirements. The conventional approach to mass production has relied predominantly on automated production lines, along with machines and robots that operate on preprogrammed routines. Although this method has demonstrated effectiveness in the era of mass production, its lack of intelligence and flexibility largely restricts its capacity to adjust dynamically to the frequently changing production schedules and specifications typical of mass personalization scenarios. To mitigate these limitations, human-robot collaboration (HRC) has emerged as an advanced manufacturing paradigm and is gaining traction as a promising solution to mass personalization, since it can simultaneously leverage the consistent strength and repetitive precision of robots and the flexibility, creativity, and versatility of humans.
Over the past decade, considerable research effort has been dedicated to HRC, addressing issues such as system architecture, collaboration strategy planning, and safety considerations. Among these topics, context awareness has drawn significant attention, as it forms the bedrock of critical functionalities such as collision avoidance and robot motion planning. Existing research on context awareness has concentrated extensively on certain aspects of human recognition, such as activity recognition and intention prediction, owing to the paramount importance of human safety in HRC systems. Nevertheless, other vital components of the HRC scene, which can also substantially influence the collaborative working process, remain noticeably under-addressed. To fill this gap, this thesis aims to provide a systematic vision-based methodology for holistic scene understanding in HRC, which takes into account the cognition of HRC scene elements including 1) objects, 2) humans, and 3) environments, coupled with 4) visual reasoning to gather and compile visual information into semantic knowledge for subsequent robot decision-making and proactive collaboration. The four aspects are examined and potential solutions are explored to demonstrate the applicability of the vision-based holistic scene understanding scheme in HRC settings.
Firstly, a high-resolution network-based two-stage 6-DoF (degrees of freedom) pose estimation model is constructed to enhance object perception for subsequent robotic manipulation and collaboration strategy planning. Given a visual observation of an industrial workpiece, the first stage makes a coarse estimate of the 6-DoF pose to narrow down the solution space, and the second stage takes the coarse result along with the original image to refine the pose parameters and produce a finer estimate. In HRC scenarios, workpieces are frequently manipulated by human hands, which raises another issue: hand-object occlusion. To address this problem, an integrated hand-object 3D dense pose estimation model is designed with an explicit occlusion-aware training strategy that mitigates occlusion-related accuracy degradation (Chapter 3).
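The abstract does not reproduce the architecture, but the coarse-to-fine idea can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the thesis design: the small stand-in backbones, the quaternion-plus-translation pose parameterization, and the way the coarse pose is concatenated back into the second stage are all placeholders.

```python
# Minimal coarse-to-fine 6-DoF pose estimation sketch (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseHead(nn.Module):
    """Regresses a 7-D pose: unit quaternion (4) + translation (3)."""
    def __init__(self, in_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, 7)

    def forward(self, feat):
        p = self.fc(feat)
        quat = F.normalize(p[:, :4], dim=1)  # keep the rotation a valid unit quaternion
        return torch.cat([quat, p[:, 4:]], dim=1)

def small_backbone():
    # Tiny stand-in for the high-resolution backbone named in the thesis.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten())

class CoarseToFinePose(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1, self.stage2 = small_backbone(), small_backbone()
        self.coarse_head = PoseHead(64)
        self.refine_head = PoseHead(64 + 7)  # image features + coarse pose

    def forward(self, img):
        coarse = self.coarse_head(self.stage1(img))            # narrow the solution space
        feat = self.stage2(img)                                # re-encode the original image
        fine = self.refine_head(torch.cat([feat, coarse], 1))  # refined estimate
        return coarse, fine

coarse, fine = CoarseToFinePose()(torch.randn(2, 3, 128, 128))
print(coarse.shape, fine.shape)  # torch.Size([2, 7]) torch.Size([2, 7])
```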
Secondly, a vision-based human digital twin (HDT) modelling approach is explored in HRC scenarios, intended to serve as a holistic and centralized digital representation of human operator status for seamless integration into the cyber-physical production system (Chapter 4). The proposed HDT model is primarily composed of a convolutional neural network designed to concurrently monitor multiple aspects of hierarchical human status, including 3D human posture, action intention, and ergonomic risk assessment. Subsequently, based on the HDT information, a novel robotic motion planning strategy is introduced that adaptively optimizes the robotic motion trajectory, aiming to enhance the effectiveness and efficiency of robotic movements in complex environments. The proposed HDT modelling scheme provides an exemplary solution for modelling various human states from vision data with a unified deep learning model in an end-to-end manner.
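As a rough illustration of what "one network, several hierarchical human-status outputs" can look like, the sketch below attaches three task heads to a shared encoder. The joint count, action classes, and scalar risk score are assumed placeholders, not the thesis configuration.

```python
# Multi-task human digital twin sketch: shared encoder, three heads (assumed design).
import torch
import torch.nn as nn

class HDTNet(nn.Module):
    """Jointly predicts 3D posture, action intention, and an ergonomic risk score."""
    def __init__(self, n_joints=17, n_actions=10):
        super().__init__()
        self.n_joints = n_joints
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.pose_head = nn.Linear(64, n_joints * 3)  # 3D joint coordinates
        self.intent_head = nn.Linear(64, n_actions)   # action-intention logits
        self.risk_head = nn.Linear(64, 1)             # ergonomic risk score

    def forward(self, img):
        f = self.encoder(img)
        return {
            "posture": self.pose_head(f).view(-1, self.n_joints, 3),
            "intention": self.intent_head(f),
            "risk": torch.sigmoid(self.risk_head(f)),  # squashed to 0..1
        }

out = HDTNet()(torch.randn(1, 3, 128, 128))
print({k: tuple(v.shape) for k, v in out.items()})
```

A shared encoder with lightweight heads is one common way to obtain the end-to-end, unified treatment of multiple human states that the chapter describes, since all tasks are trained from the same visual features.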
Thirdly, a research endeavour is devoted to the perception of the HRC environment, for which a multi-granularity HRC scene segmentation scheme is proposed, along with a specifically devised semantic segmentation network incorporating several advanced architectural designs (Chapter 5). Traditional semantic segmentation models mostly operate at a single semantic granularity. This formulation cannot adapt to HRC situations with diverse granularity requirements, such as a close-range collaborative assembly task versus a robotic workspace navigation case. To address this issue, the proposed model provides a hierarchical representation of the HRC scene that can dynamically switch between semantic levels to flexibly accommodate the constantly changing needs of various HRC tasks.
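One simple way to realize a switchable-granularity segmenter, offered purely as an assumed sketch, is a shared feature extractor with one classifier per semantic level; the class counts below are invented for illustration.

```python
# Multi-granularity segmentation sketch: shared features, per-level classifiers.
import torch
import torch.nn as nn

class MultiGranularitySeg(nn.Module):
    """One 1x1 classifier per semantic granularity level over shared features."""
    def __init__(self, level_classes=(3, 8, 20)):  # coarse -> fine label sets (assumed)
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.heads = nn.ModuleList(nn.Conv2d(64, c, 1) for c in level_classes)

    def forward(self, img, level):
        return self.heads[level](self.encoder(img))  # per-pixel logits at that level

seg = MultiGranularitySeg()
x = torch.randn(1, 3, 64, 64)
print(seg(x, level=0).shape)  # coarse labels, e.g. workspace navigation
print(seg(x, level=2).shape)  # fine labels, e.g. close-range assembly
```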
Lastly, a vision-language reasoning approach is investigated to take a step further from visual perception towards human-like reasoning and understanding of the HRC situation (Chapter 6). To address the inherent ambiguity of purely vision-based human-robot communication, such as ambiguous references to target objects or action intentions, linguistic data is introduced to complement visual data in the form of a vision-language guided referred object retrieval model. Based on the retrieved target object location, a large language model-based robotic action planning strategy is devised to adaptively generate executable robotic action code via natural-language interaction with the human operator. The incorporation of vision-language data demonstrates a viable pathway to complex reasoning that enhances embodied robotic intelligence and maximizes HRC working efficiency.
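To make this final step concrete, here is a hypothetical sketch of prompting a language model to emit robot action code once the referred object has been localized. The query_llm callable and the robot API in the prompt are stand-ins invented for illustration; the thesis's actual prompting and code-execution scheme may differ.

```python
# LLM-based action planning sketch (prompt template and API are hypothetical).
PROMPT = """You control a robot arm with this Python API:
  robot.move_to(x, y, z)   # Cartesian move, metres
  robot.grasp()
  robot.release()
The referred object "{obj}" was located at {loc}.
Instruction: {instruction}
Reply with Python code only."""

def plan_action(instruction, obj, loc, query_llm):
    """Turn one natural-language instruction into executable robot code."""
    prompt = PROMPT.format(obj=obj, loc=loc, instruction=instruction)
    return query_llm(prompt)  # run later after validation, e.g. exec(code, {"robot": robot})

# Canned response standing in for a real language model call:
fake_llm = lambda _: "robot.move_to(0.42, 0.10, 0.05)\nrobot.grasp()"
print(plan_action("Hand me the hex bolt", "hex bolt",
                  (0.42, 0.10, 0.05), fake_llm))
```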
Subjects: Human-robot interaction -- Industrial applications
Human engineering
Hong Kong Polytechnic University -- Dissertations
Pages: xxiii, 180 pages : color illustrations
Appears in Collections: Thesis
