
Multimodal Sentiment Analysis Model Based on CLIP and Cross-attention
[1] CHEN Yan, LAI Yubin, XIAO Ao, et al. Multimodal Sentiment Analysis Model Based on CLIP and Cross-attention[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(02): 42-50. [doi:10.13705/j.issn.1671-6833.2024.02.003]
References:
[1] PANG B, LEE L, VAITHYANATHAN S. Thumbs up? Sentiment classification using machine learning techniques[C]//Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). Stroudsburg: ACL, 2002: 79-86.
[2] ZHANG L, LIU B. Sentiment analysis and opinion mining[EB/OL]. (2015-12-31)[2023-04-24]. https://doi.org/10.1007/978-1-4899-7502-7_907-2.
[3] LI Y, JIN Q Y, ZHANG Q C. Improved BLSTM food review sentiment analysis with positional attention mechanisms[J]. Journal of Zhengzhou University (Engineering Science), 2020, 41(1): 58-62. (in Chinese)
[4] MUNIKAR M, SHAKYA S, SHRESTHA A. Fine-grained sentiment classification using BERT[EB/OL]. (2019-10-04)[2023-04-24]. https://arxiv.org/abs/1910.03474.
[5] ZHU X G, LI L, ZHANG W, et al. Dependency exploitation: a unified CNN-RNN approach for visual emotion recognition[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence. New York: ACM, 2017: 3595-3601.
[6] YOU Q Z, JIN H L, LUO J B. Visual sentiment analysis by attending on local image regions[C]//Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. New York: ACM, 2017: 231-237.
[7] WANG H H, MEGHAWAT A, MORENCY L P, et al. Select-additive learning: improving generalization in multimodal sentiment analysis[C]//2017 IEEE International Conference on Multimedia and Expo (ICME). Piscataway: IEEE, 2017: 949-954.
[8] WU S S, MA J. Multi-task & multi-modal sentiment analysis model based on aware fusion[J]. Data Analysis and Knowledge Discovery, 2023(10): 74-84. (in Chinese)
[9] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. (2021-02-26)[2023-04-24]. https://arxiv.org/abs/2103.00020.
[10] LAI Y B, CHEN Y, HU X C, et al. Emotional analysis of public health emergency micro-blog based on prompt embedding[J]. Data Analysis and Knowledge Discovery, 2023, 7(11): 46-55. (in Chinese)
[11] YU W M, XU H, MENG F Y, et al. CH-SIMS: a Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 3718-3727.
[12] ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[EB/OL]. (2017-07-23)[2023-04-24]. https://doi.org/10.48550/arXiv.1707.07250.
[13] LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 2247-2256.
[14] TSAI Y H H, BAI S J, LIANG P P, et al. Multimodal Transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 6558-6569.
[15] YU W M, XU H, YUAN Z Q, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2021: 10790-10797.
[16] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2014-09-04)[2023-04-24]. https://arxiv.org/abs/1409.1556.
[17] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2016: 770-778.
[18] LIU Z, MAO H Z, WU C Y, et al. A ConvNet for the 2020s[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 11966-11976.
[19] BALTRUSAITIS T, ZADEH A, LIM Y C, et al. OpenFace 2.0: facial behavior analysis toolkit[C]//2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). Piscataway: IEEE, 2018: 59-66.
[20] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2020-10-22)[2023-04-24]. https://arxiv.org/abs/2010.11929.
[21] DESAI S, RAMASWAMY H G. Ablation-CAM: visual explanations for deep convolutional network via gradient-free localization[C]//2020 IEEE Winter Conference on Applications of Computer Vision (WACV). Piscataway: IEEE, 2020: 972-980.
[22] LAN Z Z, CHEN M D, GOODMAN S, et al. ALBERT: a lite BERT for self-supervised learning of language representations[EB/OL]. (2019-09-26)[2023-04-24]. https://arxiv.org/abs/1909.11942.
[23] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2018-11-11)[2023-04-24]. https://doi.org/10.48550/arXiv.1810.04805.
[24] SUN Y, WANG S H, LI Y K, et al. ERNIE: enhanced representation through knowledge integration[EB/OL]. (2019-04-19)[2023-04-24]. https://doi.org/10.48550/arXiv.1904.09223.
[25] CUI Y M, CHE W X, LIU T, et al. Revisiting pre-trained models for Chinese natural language processing[EB/OL]. (2020-04-29)[2023-04-24]. https://doi.org/10.48550/arXiv.2004.13922.
[26] LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. (2019-07-26)[2023-04-24]. https://doi.org/10.48550/arXiv.1907.11692.
[27] LUO H S, JI L, ZHONG M, et al. CLIP4Clip: an empirical study of CLIP for end to end video clip retrieval and captioning[J]. Neurocomputing, 2022, 508.
