[1] LI Lihong, LI Zhixun, LIU Weiwei, et al. Sentiment Analysis with Cross-modal Spatio-Temporal Attention and Contextual Gating[J]. Journal of Zhengzhou University (Engineering Science), 2026, 47(XX): 1-8. [doi: 10.13705/j.issn.1671-6833.2026.04.002]
Journal of Zhengzhou University (Engineering Science) [ISSN 1671-6833 / CN 41-1339/T]
Volume: 47
Issue: 2026 XX
Pages: 1-8
Column:
Publication date: 2026-09-10
- Title: Sentiment Analysis with Cross-modal Spatio-Temporal Attention and Contextual Gating
- Author(s): LI Lihong1,2; LI Zhixun1,2; LIU Weiwei1,2; QIN Xiaoyang1,2
- Affiliation(s): 1. Department of Science, North China University of Science and Technology, Tangshan 063210, China; 2. Hebei Province Key Laboratory of Data Science and Application, North China University of Science and Technology, Tangshan 063210, China
- Keywords: multimodal sentiment analysis; spatio-temporal attention; contextual gating; Transformer cross-modal fusion; cross-modal interaction
- CLC: TP391.1; TN912.3; TP18
- DOI: 10.13705/j.issn.1671-6833.2026.04.002
- Abstract:
Multimodal sentiment analysis is limited in deep modal association mining and in sentiment classification performance by interaction inconsistency arising from modal heterogeneity, by the complexity of language scenarios, and by the inability of static cross-modal attention to capture the temporal dynamics of multimodal data. To address these challenges, a multimodal sentiment analysis framework is proposed. Cross-Modal Spatio-Temporal Attention (CM-STA) captures the spatio-temporal dependencies among text, image, and audio, enhancing cross-modal interaction; Contextual Gating (CG) dynamically filters the features most strongly related to emotional expression, highlighting key emotional information; and Transformer Cross-Modal Fusion Interaction (TCMFI) achieves deep cross-modal fusion through multi-head self-attention and bilinear pooling, improving fusion efficiency. The proposed model achieves 81.45% accuracy, 80.84% F1-score, and 96.40% AUROC on the public datasets TESS (audio) and MVSA-Multiple (text, image), outperforming the best baseline model, MISA, by 0.95, 0.24, and 7.91 percentage points, respectively. Computational complexity experiments show that the proposed model occupies 7.8 GB of GPU memory at 98% GPU utilization, achieving efficient fusion with low spatial complexity and high GPU utilization while outperforming the comparative baseline models. The experimental results verify that the proposed model delivers excellent performance and robustness in complex multimodal sentiment analysis scenarios.
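The sketch below illustrates, in PyTorch, the general shape of the two mechanisms named in the abstract: a contextual gate that rescales each time step of a modality's feature sequence, and a cross-modal multi-head attention step followed by a bilinear-style pooling of two modalities. It is a minimal illustration under assumed dimensions; the class names, the mean-pooled context, the low-rank bilinear pooling, and all hyperparameters are assumptions for exposition, not the authors' CM-STA/CG/TCMFI implementation.

# Minimal sketch of contextual gating and cross-modal attention fusion.
# All module names, dimensions, and the low-rank bilinear pooling are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class ContextualGating(nn.Module):
    """Scale each time step of a feature sequence by a learned, context-dependent gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); the context here is the sequence-level mean.
        context = x.mean(dim=1, keepdim=True).expand_as(x)
        g = self.gate(torch.cat([x, context], dim=-1))  # gate values in (0, 1)
        return g * x                                    # keep emotion-relevant features


class CrossModalFusion(nn.Module):
    """Cross-modal multi-head attention followed by low-rank bilinear pooling of two streams."""

    def __init__(self, dim: int, heads: int = 4, rank: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj_a = nn.Linear(dim, rank)
        self.proj_b = nn.Linear(dim, rank)
        self.out = nn.Linear(rank, dim)

    def forward(self, query_mod: torch.Tensor, key_mod: torch.Tensor) -> torch.Tensor:
        # query_mod attends over key_mod (e.g. text queries attending to audio keys/values).
        attended, _ = self.attn(query_mod, key_mod, key_mod)
        # Low-rank bilinear pooling of the pooled representations of both streams.
        a = self.proj_a(attended.mean(dim=1))
        b = self.proj_b(query_mod.mean(dim=1))
        return self.out(a * b)                          # (batch, dim) fused vector


if __name__ == "__main__":
    text = torch.randn(2, 30, 128)   # (batch, seq_len, dim) text features
    audio = torch.randn(2, 50, 128)  # audio features
    gated_text = ContextualGating(128)(text)
    fused = CrossModalFusion(128)(gated_text, audio)
    print(fused.shape)               # torch.Size([2, 128])

In this sketch the gate suppresses time steps weakly related to the sequence-level context before fusion, and the fused vector would then feed a sentiment classification head; the actual model additionally models spatio-temporal dependencies across all three modalities.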