李丽红, 李志勋, 刘威伟, 等. 跨模态时空注意力与上下文门控的情感分析[J]. 郑州大学学报(工学版), 2026, 47(XX): 1-8. [doi:10.13705/j.issn.1671-6833.2026.04.002]
LI Lihong, LI Zhixun, LIU Weiwei, et al. Sentiment Analysis with Cross-modal Spatio-Temporal Attention and Contextual Gating[J]. Journal of Zhengzhou University (Engineering Science), 2026, 47(XX): 1-8. [doi:10.13705/j.issn.1671-6833.2026.04.002]

Sentiment Analysis with Cross-modal Spatio-Temporal Attention and Contextual Gating (跨模态时空注意力与上下文门控的情感分析)

Journal of Zhengzhou University (Engineering Science) [ISSN:1671-6833/CN:41-1339/T]

Volume:
47
Issue:
2026, No. XX
Pages:
1-8
Column:
Publication date:
2026-09-10

Article Info

Title:
Sentiment Analysis with Cross-modal Spatio-Temporal Attention and Contextual Gating
Authors:
LI Lihong1,2 (李丽红), LI Zhixun1,2 (李志勋), LIU Weiwei1,2 (刘威伟), QIN Xiaoyang1,2 (秦肖阳)
1. Department of Science, North China University of Science and Technology, Tangshan 063210, China; 2. Hebei Province Key Laboratory of Data Science and Application, North China University of Science and Technology, Tangshan 063210, China
Keywords:
multimodal sentiment analysis; spatio-temporal attention; contextual gating; Transformer; cross-modal fusion; cross-modal interaction
CLC number:
TP391.1; TN912.3; TP18
DOI:
10.13705/j.issn.1671-6833.2026.04.002
Document code:
A
Abstract:
Multimodal sentiment analysis is limited in deep cross-modal association mining and in sentiment classification performance by the interaction inconsistency caused by modal heterogeneity, by the complexity of language scenarios, and by the inability of static cross-modal attention to capture the temporal dynamics of multimodal data. To address these challenges, a multimodal sentiment analysis framework is proposed. Cross-Modal Spatio-Temporal Attention (CM-STA) captures the spatio-temporal dependencies among text, image, and audio to strengthen cross-modal interaction; Contextual Gating (CG) dynamically filters the features most strongly related to emotional expression, highlighting key affective information; and Transformer Cross-Modal Fusion Interaction (TCMFI) performs deep cross-modal fusion through multi-head self-attention and bilinear pooling, improving fusion efficiency. On the public datasets TESS (audio) and MVSA-Multiple (text and image), the proposed model achieves 81.45% accuracy, an 80.84% F1-score, and 96.40% AUROC, exceeding the best baseline model, MISA, by 0.95, 0.24, and 7.91 percentage points, respectively. Computational complexity experiments show that the model occupies 7.8 GB of GPU memory at 98% GPU utilization, achieving efficient fusion with low space complexity and high GPU utilization while outperforming the comparison baselines. The experimental results confirm that the proposed model delivers strong performance and robustness in complex multimodal sentiment analysis scenarios.
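
The abstract names two reusable mechanisms, a contextual gate that re-weights features against sequence-level context and a Transformer-style fusion step combining multi-head self-attention with bilinear pooling, but gives no implementation details. The PyTorch sketch below is one plausible, minimal reading of those two steps; the class names (ContextualGate, CrossModalFusion), dimensions, and the mean-pooled context vector are illustrative assumptions, not the authors' published code.

```python
# Minimal sketch, assuming PyTorch: a context-conditioned sigmoid gate plus a
# self-attention + bilinear-pooling fusion head. Not the paper's implementation.
import torch
import torch.nn as nn


class ContextualGate(nn.Module):
    """Sigmoid gate conditioned on a pooled context vector; features weakly
    related to the expressed sentiment are scaled toward zero."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        context = x.mean(dim=1, keepdim=True).expand_as(x)   # sequence-level context
        g = torch.sigmoid(self.gate(torch.cat([x, context], dim=-1)))
        return g * x                                          # element-wise gating


class CrossModalFusion(nn.Module):
    """Concatenate gated modality sequences, mix them with multi-head
    self-attention, then fuse the pooled streams with a bilinear layer."""

    def __init__(self, dim: int = 128, heads: int = 4, num_classes: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bilinear = nn.Bilinear(dim, dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text: torch.Tensor, audio_visual: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([text, audio_visual], dim=1)         # (batch, seq_t + seq_av, dim)
        mixed, _ = self.attn(joint, joint, joint)              # attention across both modalities
        t_pooled = mixed[:, : text.size(1)].mean(dim=1)        # pool each modality's span
        av_pooled = mixed[:, text.size(1):].mean(dim=1)
        fused = torch.tanh(self.bilinear(t_pooled, av_pooled)) # bilinear pooling of the streams
        return self.classifier(fused)


if __name__ == "__main__":
    gate = ContextualGate(dim=128)
    fusion = CrossModalFusion(dim=128)
    text = gate(torch.randn(2, 20, 128))          # gated text features
    audio_visual = gate(torch.randn(2, 30, 128))  # gated audio/image features
    print(fusion(text, audio_visual).shape)       # torch.Size([2, 3])
```

A full CM-STA module would presumably also attend over spatial positions within image features and across time steps in each modality before this fusion stage; that part is omitted here.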

References:

[1] CHANDRASEKARAN G, NGUYEN T N, HEMANTH D J. Multimodal sentimental analysis for social media applications: a comprehensive review[J]. WIREs Data Mining and Knowledge Discovery, 2021, 11(5): e1415.
[2] 吕学强, 田驰, 张乐, 等. 融合多特征和注意力机制的多模态情感分析模型[J]. 数据分析与知识发现, 2024, 8(5): 91-101.
LYU X Q, TIAN C, ZHANG L, et al. Multimodal sentiment analysis model integrating multi-features and attention mechanism[J]. Data Analysis and Knowledge Discovery, 2024, 8(5): 91-101.
[3] MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web[C]//Proceedings of the 13th International Conference on Multimodal Interfaces. New York: ACM, 2011: 169-176.
[4] WANG Y F, HE J H, WANG D, et al. Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis[J]. Neurocomputing, 2024, 572: 127181.
[5] LIU Z Z, ZHOU B, CHU D H, et al. Modality translation-based multimodal sentiment analysis under uncertain missing modalities[J]. Information Fusion, 2024, 101: 101973.
[6] KIM W, SON B, KIM I. ViLT: vision-and-language transformer without convolution or region supervision[EB/OL]. (2021-02-05)[2025-08-09]. https://arxiv.org/abs/2102.03334v2.
[7] LIU Z J, CAI L, YANG W J, et al. Sentiment analysis based on text information enhancement and multimodal feature fusion[J]. Pattern Recognition, 2024, 156: 110847.
[8] DOU Z Y, XU Y C, GAN Z, et al. An empirical study of training end-to-end vision-and-language transformers[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 18145-18155.
[9] ZENG Y, YAN W J, MAI S J, et al. Disentanglement Translation Network for multimodal sentiment analysis[J]. Information Fusion, 2024, 102: 102031.
[10] 陈燕, 赖宇斌, 肖澳, 等. 基于CLIP和交叉注意力的多模态情感分析模型[J]. 郑州大学学报(工学版), 2024, 45(2): 42-50.
CHEN Y, LAI Y B, XIAO A, et al. Multimodal sentiment analysis model based on CLIP and cross-attention[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(2): 42-50.
[11] KHAN M, TRAN P N, PHAM N T, et al. MemoCMT: multimodal emotion recognition using cross-modal transformer-based feature fusion[J]. Scientific Reports, 2025, 15: 5473.
[12] BAEVSKI A, HSU W N, XU Q T, et al. data2vec: a general framework for self-supervised learning in speech, vision and language[EB/OL]. (2022-02-07)[2025-08-09]. https://arxiv.org/abs/2202.03555.
[13] CUI Y M, CHE W X, LIU T, et al. Pre-training with whole word masking for Chinese BERT[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504-3514.
[14] ZHANG H, WU C R, ZHANG Z Y, et al. ResNeSt: split-attention networks[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Piscataway: IEEE, 2022: 2735-2745.
[15] BAEVSKI A, ZHOU H, MOHAMED A, et al. wav2vec 2.0: a framework for self-supervised learning of speech representations[EB/OL]. (2020-06-20)[2025-08-09]. https://arxiv.org/abs/2006.11477.
[16] BHUIYAN A, HUANG J X. STCA: Utilizing a spatio-temporal cross-attention network for enhancing video person re-identification[J]. Image and Vision Computing, 2022, 123: 104474.
[17] BAGHER ZADEH A, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 2236-2246.
[18] SUN H, LIU J Q, CHEN Y W, et al. Modality-invariant temporal representation learning for multimodal sentiment classification[J]. Information Fusion, 2023, 91: 504-514.
[19] GOLAGANA V, ROW S V, RAO P S. Adaptive multimodal sentiment analysis: improving fusion accuracy with dynamic attention for missing modality[J]. Journal of Electrical Systems, 2024, 20(S1): 134-147.
[20] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[EB/OL]. (2017-06-12)[2025-08-09]. https://arxiv.org/abs/1706.03762.
[21] XIAO L W, WU X J, YANG S W, et al. Cross-modal fine-grained alignment and fusion network for multimodal aspect-based sentiment analysis[J]. Information Processing & Management, 2023, 60(6): 103508.
[22] ZHAO F, ZHANG C C, GENG B C. Deep multimodal data fusion[J]. ACM Computing Surveys, 2024, 56(9): 1-36.
[23] LOK E K. Toronto emotional speech set (TESS)[DB/OL].[2025-08-09]. https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess.
[24] DIEM L, ZAHARIEVA M. Video content representation using recurring regions detection[J]. Lecture Notes in Computer Science, 2016, 9516: 16-28.
[25] ZHANG H Y, WANG Y, YIN G H, et al. Learning language-guided adaptive hyper-modality representation for multimodal sentiment analysis[EB/OL]. (2023-10-09)[2025-08-09]. https://arxiv.org/abs/2310.05804.
[26] ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning[EB/OL]. (2018-02-03)[2025-08-09]. https://arxiv.org/abs/1802.00927.
[27] TSAI Y H, BAI S J, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 6558-6569.
[28] WANG D, GUO X T, TIAN Y M, et al. TETFN: a text enhanced transformer fusion network for multimodal sentiment analysis[J]. Pattern Recognition, 2023, 136: 109259.
[29] ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[EB/OL]. (2017-07-23)[2025-08-09]. https://arxiv.org/abs/1707.07250.
[30] LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors[EB/OL]. (2018-05-31)[2025-07-09]. https://arxiv.org/abs/1806.00064.
[31] HU J W, LIU Y C, ZHAO J M, et al. MMGCN: multimodal fusion via deep graph convolution network for emotion recognition in conversation[EB/OL]. (2021-07-14)[2025-08-09]. https://arxiv.org/abs/2107.06779.
[32] GHOSAL D, MAJUMDER N, PORIA S, et al. DialogueGCN: a graph convolutional neural network for emotion recognition in conversation[EB/OL]. (2019-08-30)[2025-08-09]. https://arxiv.org/abs/1908.11540.
[33] HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1122-1131.
[34] XING T, DOU Y T, CHEN X L, et al. An adaptive multi-graph neural network with multimodal feature fusion learning for MDD detection[J]. Scientific Reports, 2024, 14: 28400.

Memo:
Received: 2025-09-01; Revised: 2025-10-03
Funding: Hebei Province Key Laboratory of Data Science and Application Project (10120201)
First author: LI Lihong (1979— ), female, born in Jinzhou, Liaoning; professor at North China University of Science and Technology; her research focuses on data mining and three-way decisions. E-mail: 22687426@qq.com.
Last Update: 2026-01-13