

Multimodal Sentiment Analysis Model Based on CLIP and Cross-attention


Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume: 45
Issue: No. 02, 2024

Pages: 42-50

Publication date: 2024-03-06

Article Info

Title: Multimodal Sentiment Analysis Model Based on CLIP and Cross-attention

Author(s): CHEN Yan¹,²; LAI Yubin¹; XIAO Ao¹; LIAO Yuxiang¹; CHEN Ningjiang¹

Affiliations: 1. School of Computer and Electronic Information Science, Guangxi University, Nanning 530000, China; 2. Guangxi Key Laboratory of Multimedia Communication and Network Technology, Guangxi University, Nanning 530000, China

Keywords: sentiment analysis; multimodal learning; cross-attention; CLIP model; Transformer; feature fusion

CLC number: TP391

DOI: 10.13705/j.issn.1671-6833.2024.02.003

Document code: A

Abstract: To address the limited annotated data, insufficient fusion between modalities, and information redundancy in multimodal sentiment analysis (MSA), this study proposed CLIP-CA-MSA, an MSA model based on contrastive language-image pretraining (CLIP) and cross-attention (CA). First, the model used the CLIP-pretrained BERT model and the PIFT model to extract video feature vectors and text features. Next, a cross-attention mechanism let the image feature vectors and text feature vectors interact, strengthening information exchange between modalities. Finally, the features were fused using an uncertainty loss, and the final sentiment classification result was computed from the fused output. Experimental results showed that, compared with other multimodal models, the model improved accuracy by 5 to 14 percentage points and the F1 value by 3 to 12 percentage points, verifying its superiority; ablation experiments verified the effectiveness of each module. The model could effectively exploit the complementarity and correlation of multimodal data, while the uncertainty loss improved its robustness and generalization ability.
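The cross-attention and uncertainty-loss steps described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the learned query/key/value projections and multi-head structure of a real cross-attention layer are omitted, the toy feature values are hypothetical, and because the abstract does not define the uncertainty loss, the function below assumes one common formulation (homoscedastic uncertainty weighting of several loss terms).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention in which queries come from one modality
    and keys/values come from the other.

    Learned projection matrices are omitted (illustrative simplification).
    """
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        # Similarity of this query to every key of the other modality.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the other modality's value vectors.
        attended.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return attended


def uncertainty_weighted_loss(losses, log_variances):
    """Assumed formulation (not taken from the abstract):
    sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i, with s_i = log(sigma_i^2)."""
    return sum(0.5 * math.exp(-s) * loss + 0.5 * s
               for loss, s in zip(losses, log_variances))


# Hypothetical toy features: 2 text tokens and 3 image frames, dimension 2.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Each modality queries the other, so information flows in both directions.
text_attends_image = cross_attention(text_feats, image_feats, image_feats)
image_attends_text = cross_attention(image_feats, text_feats, text_feats)
```

In this sketch each text token receives a weighted average of the image features (and vice versa), which is the sense in which cross-attention passes information between modalities. In the assumed loss, a larger log-variance down-weights its loss term while the additive `0.5 * s` term penalizes inflating the variance.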



Last Update: 2024-03-08
