Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/108002
DC Field | Value | Language
dc.contributor | Department of Building Environment and Energy Engineering | -
dc.creator | Zheng, H | -
dc.creator | Ding, Y | -
dc.creator | Wang, Z | -
dc.creator | Huang, X | -
dc.date.accessioned | 2024-07-23T01:36:13Z | -
dc.date.available | 2024-07-23T01:36:13Z | -
dc.identifier.uri | http://hdl.handle.net/10397/108002 | -
dc.language.iso | en | en_US
dc.publisher | Elsevier | en_US
dc.subject | Contrastive loss | en_US
dc.subject | Latent diffusion process | en_US
dc.subject | Multimodal fusion | en_US
dc.subject | Open-vocabulary | en_US
dc.subject | Universal | en_US
dc.title | SegLD: achieving universal, zero-shot and open-vocabulary segmentation through multimodal fusion via latent diffusion processes | en_US
dc.type | Journal/Magazine Article | en_US
dc.identifier.volume | 111 | -
dc.identifier.doi | 10.1016/j.inffus.2024.102509 | -
dcterms.abstract | Open-vocabulary learning can identify categories annotated during training (seen categories) and generalize to categories not annotated in the training set (unseen categories), which could, in principle, extend segmentation systems to more universal applications. However, current open-vocabulary segmentation frameworks are primarily suited to specific tasks or require retraining for each task, and they significantly underperform fully supervised frameworks on seen categories. We therefore introduce SegLD, a universal open-vocabulary segmentation framework based on the latent diffusion process, which requires only a single training session on a panoptic dataset to perform inference across all open-vocabulary segmentation tasks and reaches SOTA segmentation performance for both seen and unseen categories in every task. SegLD comprises two stages. In the first stage, two parallel latent diffusion processes deeply fuse the text (image caption or category labels) and image information, and the multi-scale features output by both processes are aggregated scale by scale. In the second stage, we introduce text queries, text list queries, and task queries, and compute contrastive losses between them to learn inter-category and inter-task differences. The text queries are then fed into a Transformer Decoder to obtain category-agnostic segmentation masks. We further define classification losses, conditioned on the type of text input used during training (image captions or category labels), that help assign a category label from the open vocabulary to each predicted binary mask. Experimental results show that, with just a single training session, SegLD significantly outperforms contemporary SOTA fully supervised and open-vocabulary segmentation frameworks across almost all evaluation metrics, for both seen and unseen categories, on the ADE20K, Cityscapes, and COCO datasets. This highlights SegLD's potential as a universal segmentation framework that can replace task-specific frameworks and adapt to various segmentation domains. The project link for SegLD is https://zht-segld.github.io/. | -
dcterms.accessRights | embargoed access | en_US
dcterms.bibliographicCitation | Information fusion, Nov. 2024, v. 111, 102509 | -
dcterms.isPartOf | Information fusion | -
dcterms.issued | 2024-11 | -
dc.identifier.scopus | 2-s2.0-85196418989 | -
dc.identifier.eissn | 1566-2535 | -
dc.identifier.artn | 102509 | -
dc.description.validate | 202407 bcwh | -
dc.identifier.FolderNumber | a3082b | en_US
dc.identifier.SubFormID | 49416 | en_US
dc.description.fundingSource | RGC | en_US
dc.description.pubStatus | Published | en_US
dc.date.embargo | 2026-11-30 | en_US
dc.description.oaCategory | Green (AAM) | en_US
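
The abstract above describes computing contrastive losses between text queries, text list queries, and task queries to learn inter-category and inter-task differences. The sketch below shows one plausible form such a loss could take: a symmetric InfoNCE objective between two query sets. The function name, the index-based positive pairing, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the SegLD code) of an InfoNCE-style contrastive
# loss between two sets of query embeddings, e.g. text queries vs. task queries.
import torch
import torch.nn.functional as F


def query_contrastive_loss(text_queries: torch.Tensor,
                           task_queries: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss; queries at the same index are treated as positives.

    text_queries, task_queries: (N, D) tensors of query embeddings.
    """
    # L2-normalise so dot products become cosine similarities.
    text_q = F.normalize(text_queries, dim=-1)
    task_q = F.normalize(task_queries, dim=-1)

    # (N, N) similarity matrix scaled by the temperature.
    logits = text_q @ task_q.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions (text -> task and task -> text) and average.
    loss_t2k = F.cross_entropy(logits, targets)
    loss_k2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2k + loss_k2t)


if __name__ == "__main__":
    # Toy usage: 8 queries with 256-dimensional embeddings.
    text_q = torch.randn(8, 256)
    task_q = torch.randn(8, 256)
    print(query_contrastive_loss(text_q, task_q).item())
```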
Appears in Collections: Journal/Magazine Article
Open Access Information
Status: embargoed access
Embargo End Date: 2026-11-30

Page views: 80 (as of Nov 10, 2025)

Scopus citations: 5 (as of Dec 19, 2025)

Web of Science citations: 4 (as of Dec 18, 2025)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.