Multimodal and Multiscale Facial Expression Recognition Analysis for Learning Emotions

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:: 47
Issue:: 2026, No. 3

Pages:: 126-133

Publication date:: 2026-05-27

Article Info

Title:: Multimodal and Multiscale Facial Expression Recognition Analysis for Learning Emotions

Article ID:: 1671-6833(2026)03-0126-08

Author(s):: JI Lixia, REN Hanliang, WANG Wei, DU Yunlong, ZHOU Hongxin, FU Yuanzhong; School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002, China

Keywords:: learning emotions; facial expression recognition; artificial intelligence; multimodal; multiscale features

CLC number:: TP391:TP181

DOI:: 10.13705/j.issn.1671-6833.2025.06.016

Document code:: A

Abstract:: To address the difficulty of capturing subtle features and the scarcity of data samples in learning emotion recognition, a facial emotion recognition method based on a generative diffusion model and multimodal multiscale visual encoding is proposed. First, a learning emotion dataset that integrates multiscale global features and local detail features is constructed, and a generative diffusion model is used to augment scarce emotion samples, alleviating the data constraints of few-shot learning scenarios. Second, a multimodal multiscale visual encoding mechanism is designed that fuses the global features of the original face image with local detail from salient regions, enabling high-precision modeling and effective fusion of micro-expressions and fine-grained emotional features. Finally, experiments are conducted on several classes of models, including CNNs, Vision Transformers, and hybrid architectures. The results show that the proposed method achieves an overall recognition accuracy of 68.10%, an average improvement of about 2.98% and a maximum improvement of 5.30% over existing baseline methods. Ablation experiments further verify the effectiveness and synergy of the generative diffusion model and the multimodal multiscale fusion module in strengthening the capture of micro-expression detail and improving overall recognition robustness.
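The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of fusing global features with local detail from salient regions. It assumes plain average pooling in place of the learned encoders, and the names `average_pool`, `crop`, and `fuse_features`, the grid sizes, and the region boxes are all hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]          # grayscale image: rows of pixel intensities
Box = Tuple[int, int, int, int]    # (top, left, height, width), hypothetical format


def average_pool(img: Image, grid: int) -> List[float]:
    """Split img into grid x grid cells and return the mean of each cell.

    Assumes the image has at least `grid` pixels per side so no cell is empty.
    """
    h, w = len(img), len(img[0])
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * h // grid, (gy + 1) * h // grid
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            cell = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(sum(cell) / len(cell))
    return feats


def crop(img: Image, box: Box) -> Image:
    """Cut a salient region (for example an eye or mouth area) out of the face."""
    top, left, height, width = box
    return [row[left:left + width] for row in img[top:top + height]]


def fuse_features(face: Image, regions: Sequence[Box],
                  scales: Sequence[int] = (1, 2)) -> List[float]:
    """Concatenate multiscale global features with local region features.

    Stand-in for a learned fusion module: here fusion is plain concatenation.
    """
    fused: List[float] = []
    for s in scales:                      # global branch, coarse to fine
        fused.extend(average_pool(face, s))
    for box in regions:                   # local branch, one entry per region
        fused.extend(average_pool(crop(face, box), 2))
    return fused


# Usage: a 4x4 toy "face" with one 2x2 salient region in the top-left corner.
toy_face = [[float(4 * y + x) for x in range(4)] for y in range(4)]
print(fuse_features(toy_face, [(0, 0, 2, 2)]))
```

In a real system the pooling functions would be replaced by trained CNN or Transformer encoders and the concatenation by a learned fusion layer; the sketch only shows how global and local branches can feed one feature vector.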

Last Update: 2026-05-27