Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/104941
Title: A study on semantic scene understanding with multi-modal fusion for autonomous driving
Authors: Feng, Zhen
Degree: Ph.D.
Issue Date: 2024
Abstract: Traffic scene understanding is the basis for the safe driving of autonomous vehicles. Semantic segmentation assigns a class label to each pixel in an image, which makes it one of the most important methods for traffic scene understanding. Because traffic scenes are complex and variable, single-modal data often cannot cope with every scene. Semantic segmentation algorithms with multi-modal fusion can address the performance degradation that environmental noise causes in single-modal data. Traffic scene understanding based on multi-modal fusion has therefore received increasing attention, for example the fusion of Red-Green-Blue (RGB) images with thermal images and the fusion of RGB images with depth images. The aim of this study is to investigate the segmentation of negative obstacles in traffic scenes and the segmentation of all-day traffic scenes by fusing multi-modal data.
Although current multi-modal fusion networks for negative obstacle segmentation achieve acceptable results, their encoders use a single structure to extract a single kind of feature, such as local features. Because of the limited receptive field, the local features extracted by a convolutional network cannot fully represent the global information in an image, while the global features extracted by a self-attention module cannot capture local detail as well as convolutional features. To address this issue, we propose the Multi-modal Attention Fusion Network (MAFNet) for the segmentation of road potholes by fusing RGB images and disparity images. Specifically, we combine a convolutional network and a Transformer network as the encoder to extract features from images. In addition, we design attention-based fusion modules to fuse the features of RGB images and disparity images. Experiments show that MAFNet achieves better results than existing state-of-the-art networks.
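As a rough illustration of this idea, the PyTorch sketch below pairs a convolutional branch (local detail) with a multi-head self-attention branch (global context) and fuses RGB and disparity features through a channel-attention gate. The module names, channel widths, and gating design are illustrative assumptions, not the MAFNet implementation.

# Minimal sketch of the dual-branch encoder and attention fusion described
# above. Hyperparameters and module structure are assumptions for illustration.
import torch
import torch.nn as nn

class LocalGlobalEncoder(nn.Module):
    """Combine a conv block (local features) with self-attention (global context)."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.conv(x)                          # local detail features
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)         # (B, H*W, C) token sequence
        glob, _ = self.attn(tokens, tokens, tokens)   # global context
        glob = self.norm(glob).transpose(1, 2).reshape(b, c, h, w)
        return local + glob                           # merge both kinds of features

class AttentionFusion(nn.Module):
    """Channel-attention fusion of RGB and disparity feature maps."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([rgb, disp], dim=1))  # per-channel fusion weights
        return w * rgb + (1 - w) * disp               # weighted modality mix

if __name__ == "__main__":
    enc, fuse = LocalGlobalEncoder(64), AttentionFusion(64)
    rgb, disp = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
    print(fuse(enc(rgb), enc(disp)).shape)  # torch.Size([1, 64, 32, 32])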
Large-scale datasets are necessary for training high-quality networks. To address the scarcity of datasets for negative obstacle segmentation with multi-modal fusion, we build and release a dataset for the segmentation of negative obstacles with RGB images and depth images. To reduce the workload of manual labeling, we manually labelled 745 images and generated coarse labels for the remaining 3,000 images using an existing dataset and the labelled images. Current multi-modal fusion networks also suffer from slow inference on large input data. To address this issue, we propose the Channel and Position-wise Knowledge Distillation (CPKD) framework. Specifically, we replace the heavyweight encoder of the teacher network with a lightweight one and introduce a downsampling layer at the beginning of the student network to reduce the amount of data. We design Channel and Position-wise Distillation (CPD) modules to transfer knowledge from the teacher network to the student network. The experimental results show that the CPKD framework greatly improves the inference speed of the network while enabling the student network to achieve satisfactory performance.
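The following sketch illustrates what channel- and position-wise feature distillation can look like: the same student and teacher feature maps are aligned once along the spatial axis per channel and once along the channel axis per pixel, using KL-divergence losses. The temperatures and loss weights are assumptions for illustration, not the CPD modules from the thesis.

# Hedged sketch of channel- and position-wise distillation losses.
import torch
import torch.nn.functional as F

def channel_wise_distillation(student, teacher, tau: float = 1.0):
    """KL divergence between per-channel spatial distributions."""
    b, c, h, w = student.shape
    s = F.log_softmax(student.view(b, c, -1) / tau, dim=2)
    t = F.softmax(teacher.view(b, c, -1) / tau, dim=2)
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

def position_wise_distillation(student, teacher, tau: float = 1.0):
    """KL divergence between per-pixel channel distributions."""
    b, c, h, w = student.shape
    s = F.log_softmax(student.view(b, c, -1) / tau, dim=1)
    t = F.softmax(teacher.view(b, c, -1) / tau, dim=1)
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

def cpd_loss(student_feat, teacher_feat, alpha: float = 0.5):
    # Weighted sum of the two distillation terms (alpha is an assumption).
    return (alpha * channel_wise_distillation(student_feat, teacher_feat)
            + (1 - alpha) * position_wise_distillation(student_feat, teacher_feat))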
To address the blurred edges of thermal images and the issue that the performance of RGB-thermal fusion networks is easily affected by changes in the alignment between the two modalities, we propose the Cross-modal Edge-privileged Knowledge Distillation (CEKD) framework for segmentation. The framework transfers the edge detection capability of a multi-modal teacher network to a thermal-only student network via knowledge distillation, with the main aim of improving the segmentation accuracy of the student network. We introduce an edge detection module into the teacher network and use edge labels as privileged information to train it. We also design a Thermal Enhancement (TE) module for the student network to improve the contrast between high-temperature objects and the low-temperature background. The experimental results show that the thermal-only student network trained with the CEKD framework learns edge detection capability from the teacher network, and that it achieves better performance than a single-modal network for the segmentation of traffic scenes with only thermal images.
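As an illustration of the thermal-enhancement idea, the sketch below stretches the per-image intensity range of a thermal input and adds a small learnable residual refinement, so hot objects stand out from the cooler background. This is an assumed design in the spirit of a TE module, not the module from the thesis.

# Illustrative thermal-enhancement step: contrast stretch plus learnable refinement.
import torch
import torch.nn as nn

class ThermalEnhancement(nn.Module):
    def __init__(self, channels: int = 1):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, channels, 3, padding=1),
        )

    def forward(self, thermal: torch.Tensor) -> torch.Tensor:
        # Min-max stretch per image to widen the hot/cold separation.
        flat = thermal.flatten(1)
        lo = flat.min(dim=1).values.view(-1, 1, 1, 1)
        hi = flat.max(dim=1).values.view(-1, 1, 1, 1)
        stretched = (thermal - lo) / (hi - lo + 1e-6)
        return stretched + self.refine(stretched)  # learnable residual contrast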
Subjects: Image segmentation
Image analysis
Image processing -- Digital techniques
Automated vehicles
Hong Kong Polytechnic University -- Dissertations
Pages: xx, 114 pages : color illustrations
Appears in Collections: Thesis
