References:
[1] CHANDRASEKARAN G, NGUYEN T N, HEMANTH D J. Multimodal sentimental analysis for social media applications: a comprehensive review[J]. WIREs Data Mining and Knowledge Discovery, 2021, 11(5): e1415.
[2] LYU X Q, TIAN C, ZHANG L, et al. Multimodal sentiment analysis model integrating multi-features and attention mechanism[J]. Data Analysis and Knowledge Discovery, 2024, 8(5): 91-101.
[3] MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web[C]//Proceedings of the 13th International Conference on Multimodal Interfaces. New York: ACM, 2011: 169-176.
[4] WANG Y F, HE J H, WANG D, et al. Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis[J]. Neurocomputing, 2024, 572: 127181.
[5] LIU Z Z, ZHOU B, CHU D H, et al. Modality translation-based multimodal sentiment analysis under uncertain missing modalities[J]. Information Fusion, 2024, 101: 101973.
[6] KIM W, SON B, KIM I. ViLT: vision-and-language transformer without convolution or region supervision[EB/OL]. (2021-02-05)[2025-08-09]. https://arxiv.org/abs/2102.03334v2.
[7] LIU Z J, CAI L, YANG W J, et al. Sentiment analysis based on text information enhancement and multimodal feature fusion[J]. Pattern Recognition, 2024, 156: 110847.
[8] DOU Z Y, XU Y C, GAN Z, et al. An empirical study of training end-to-end vision-and-language transformers[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 18145-18155.
[9] ZENG Y, YAN W J, MAI S J, et al. Disentanglement Translation Network for multimodal sentiment analysis[J]. Information Fusion, 2024, 102: 102031.
[10] CHEN Y, LAI Y B, XIAO A, et al. Multimodal sentiment analysis model based on CLIP and cross-attention[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(2): 42-50.
[11] KHAN M, TRAN P N, PHAM N T, et al. MemoCMT: multimodal emotion recognition using cross-modal transformer-based feature fusion[J]. Scientific Reports, 2025, 15: 5473.
[12] BAEVSKI A, HSU W N, XU Q T, et al. data2vec: a general framework for self-supervised learning in speech, vision and language[EB/OL]. (2022-02-07)[2025-08-09]. https://arxiv.org/abs/2202.03555.
[13] CUI Y M, CHE W X, LIU T, et al. Pre-training with whole word masking for Chinese BERT[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504-3514.
[14] ZHANG H, WU C R, ZHANG Z Y, et al. ResNeSt: split-attention networks[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Piscataway: IEEE, 2022: 2735-2745.
[15] BAEVSKI A, ZHOU H, MOHAMED A, et al. wav2vec 2.0: a framework for self-supervised learning of speech representations[EB/OL]. (2020-06-20)[2025-08-09]. https://arxiv.org/abs/2006.11477.
[16] BHUIYAN A, HUANG J X. STCA: utilizing a spatio-temporal cross-attention network for enhancing video person re-identification[J]. Image and Vision Computing, 2022, 123: 104474.
[17] BAGHER ZADEH A, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 2236-2246.
[18] SUN H, LIU J Q, CHEN Y W, et al. Modality-invariant temporal representation learning for multimodal sentiment classification[J]. Information Fusion, 2023, 91: 504-514.
[19] GOLAGANA V, ROW S V, RAO P S. Adaptive multimodal sentiment analysis: improving fusion accuracy with dynamic attention for missing modality[J]. Journal of Electrical Systems, 2024, 20(S1): 134-147.
[20] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[EB/OL]. (2017-06-12)[2025-08-09]. https://arxiv.org/abs/1706.03762.
[21] XIAO L W, WU X J, YANG S W, et al. Cross-modal fine-grained alignment and fusion network for multimodal aspect-based sentiment analysis[J]. Information Processing & Management, 2023, 60(6): 103508.
[22] ZHAO F, ZHANG C C, GENG B C. Deep multimodal data fusion[J]. ACM Computing Surveys, 2024, 56(9): 1-36.
[23] LOK E K. Toronto emotional speech set (TESS)[DB/OL].[2025-08-09]. https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess.
[24] DIEM L, ZAHARIEVA M. Video content representation using recurring regions detection[J]. Lecture Notes in Computer Science, 2016, 9516: 16-28.
[25] ZHANG H Y, WANG Y, YIN G H, et al. Learning language-guided adaptive hyper-modality representation for multimodal sentiment analysis[EB/OL]. (2023-10-09)[2025-08-09]. https://arxiv.org/abs/2310.05804.
[26] ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning[EB/OL]. (2018-02-03)[2025-08-09]. https://arxiv.org/abs/1802.00927.
[27] TSAI Y H, BAI S J, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 6558-6569.
[28] WANG D, GUO X T, TIAN Y M, et al. TETFN: a text enhanced transformer fusion network for multimodal sentiment analysis[J]. Pattern Recognition, 2023, 136: 109259.
[29] ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[EB/OL]. (2017-07-23)[2025-08-09]. https://arxiv.org/abs/1707.07250.
[30] LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors[EB/OL]. (2018-05-31)[2025-07-09]. https://arxiv.org/abs/1806.00064.
[31] HU J W, LIU Y C, ZHAO J M, et al. MMGCN: multimodal fusion via deep graph convolution network for emotion recognition in conversation[EB/OL]. (2021-07-14)[2025-08-09]. https://arxiv.org/abs/2107.06779.
[32] GHOSAL D, MAJUMDER N, PORIA S, et al. DialogueGCN: a graph convolutional neural network for emotion recognition in conversation[EB/OL]. (2019-08-30)[2025-08-09]. https://arxiv.org/abs/1908.11540.
[33] HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1122-1131.
[34] XING T, DOU Y T, CHEN X L, et al. An adaptive multi-graph neural network with multimodal feature fusion learning for MDD detection[J]. Scientific Reports, 2024, 14: 28400.