Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/114608
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | -
dc.creator | Huang, Z | -
dc.creator | Mak, MW | -
dc.creator | Lee, KA | -
dc.date.accessioned | 2025-08-18T03:02:10Z | -
dc.date.available | 2025-08-18T03:02:10Z | -
dc.identifier.uri | http://hdl.handle.net/10397/114608 | -
dc.description | Interspeech 2024, 1-5 September 2024, Kos, Greece | en_US
dc.language.iso | en | en_US
dc.publisher | International Speech Communication Association | en_US
dc.rights | The following publication Huang, Z., Mak, M.-W., Lee, K.A. (2024) MM-NodeFormer: Node Transformer Multimodal Fusion for Emotion Recognition in Conversation. Proc. Interspeech 2024, 4069-4073 is available at https://doi.org/10.21437/Interspeech.2024-538. | en_US
dc.subject | Emotion recognition in conversation | en_US
dc.subject | Feature fusion | en_US
dc.subject | Multimodal network | en_US
dc.title | MM-NodeFormer: Node transformer multimodal fusion for emotion recognition in conversation | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 4069 | -
dc.identifier.epage | 4073 | -
dc.identifier.doi | 10.21437/Interspeech.2024-538 | -
dcterms.abstract | Emotion Recognition in Conversation (ERC) has great prospects in human-computer interaction and medical consultation. Existing ERC approaches mainly focus on information in the text and speech modalities and often concatenate multimodal features without considering the richness of emotional information in individual modalities. We propose a multimodal network called MM-NodeFormer for ERC to address this issue. The network leverages the characteristics of different Transformer encoding stages to fuse the emotional features from the text, audio, and visual modalities according to their emotional richness. The module considers text as the main modality and audio and visual as auxiliary modalities, leveraging the complementarity between the main and auxiliary modalities. We conducted extensive experiments on two public benchmark datasets, IEMOCAP and MELD, achieving an accuracy of 74.24% and 67.86%, respectively, significantly higher than many state-of-the-art approaches. | -
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2024, p. 4069-4073 | -
dcterms.issued | 2024 | -
dc.identifier.scopus | 2-s2.0-85214797293 | -
dc.description.validate | 202508 bcch | -
dc.description.oaVersion | Version of Record | en_US
dc.identifier.FolderNumber | OA_Others | en_US
dc.description.fundingSource | RGC | en_US
dc.description.pubStatus | Published | en_US
dc.description.oaCategory | VoR allowed | en_US
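The abstract above describes a fusion scheme in which text acts as the main modality and audio and visual features serve as auxiliary modalities fused at Transformer encoding stages. Below is a minimal, hypothetical PyTorch sketch of such a main/auxiliary cross-attention fusion for utterance-level emotion classification. It is not the authors' MM-NodeFormer implementation; the class name, layer choices, and dimensions (MainAuxFusion, d_model=256, two encoder layers, seven emotion classes) are illustrative assumptions only.

```python
# Illustrative sketch (NOT the authors' MM-NodeFormer): text is the main
# modality, while audio and visual features are injected as auxiliary cues
# via cross-attention before classification. All sizes are placeholders.
import torch
import torch.nn as nn


class MainAuxFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_classes=7):
        super().__init__()
        # Self-attention encoder for the main (text) modality.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Cross-attention lets the text representation attend to the
        # auxiliary audio and visual feature sequences.
        self.audio_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text, audio, visual):
        # text/audio/visual: (batch, seq_len, d_model) pre-extracted features.
        h = self.text_encoder(text)               # encode main modality
        a, _ = self.audio_xattn(h, audio, audio)  # fuse auxiliary audio
        v, _ = self.visual_xattn(h, visual, visual)  # fuse auxiliary visual
        fused = h + a + v                         # residual-style fusion
        return self.classifier(fused.mean(dim=1))  # emotion logits


# Usage with random tensors standing in for real modality embeddings.
model = MainAuxFusion()
logits = model(torch.randn(2, 10, 256),
               torch.randn(2, 10, 256),
               torch.randn(2, 10, 256))
print(logits.shape)  # torch.Size([2, 7])
```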
Appears in Collections: Conference Paper
Files in This Item:
File | Description | Size | Format
huang24b_interspeech.pdf | | 912.04 kB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record