Multimodal and Multiscale Facial Expression Recognition Analysis for Learning Emotions



[1] JI Lixia, REN Hanliang, WANG Wei, et al. Multimodal and Multiscale Facial Expression Recognition Analysis for Learning Emotions[J]. Journal of Zhengzhou University (Engineering Science), 2026, 47(3): 126-133. [doi:10.13705/j.issn.1671-6833.2025.06.016]


Journal of Zhengzhou University (Engineering Science) [ISSN 1671-6833/CN 41-1339/T] Volume: 47 Issue: 3 (2026) Pages: 126-133 Publication date: 2026-05-27

Title: Multimodal and Multiscale Facial Expression Recognition Analysis for Learning Emotions

Author(s): JI Lixia, REN Hanliang, WANG Wei, DU Yunlong, ZHOU Hongxin, FU Yuanzhong; School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002, China

Keywords: learning emotions; facial expression recognition; artificial intelligence; multimodal; multiscale features

CLC: TP391:TP181

DOI: 10.13705/j.issn.1671-6833.2025.06.016

Abstract: To address the difficulty of capturing subtle facial features and the scarcity of data samples in learning emotion recognition, a facial emotion recognition method based on a generative diffusion model and multimodal multiscale visual encoding was proposed. First, a learning emotion dataset integrating multiscale global features and local detail features was constructed, and the generative diffusion model was used to augment scarce emotion samples, thereby alleviating data constraints in few-shot learning scenarios. Second, a multimodal multiscale visual encoding mechanism was designed; by combining global features of facial images with local details from salient regions, it achieved high-precision modeling and effective fusion of micro-expressions and fine-grained emotional features. Finally, experiments were conducted on various models, including CNNs, Vision Transformers, and hybrid architectures. The proposed method achieved an overall recognition accuracy of 68.10%, with an average improvement of 2.98% and a maximum improvement of 5.30% over existing baseline methods. Ablation experiments further verified that the generative diffusion model and the multimodal multiscale fusion module are each effective and contribute synergistically, enhancing the model’s ability to capture micro-expression details and improving overall recognition robustness.
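
The idea of combining global facial features with local details from salient regions can be illustrated with a minimal sketch. This is not the authors' encoder: the abstract does not specify the architecture, so grid average pooling stands in for a learned visual encoder, fusion is plain concatenation, and the names `crop`, `grid_pool`, `fuse_global_local`, and the `(top, left, height, width)` box format are all hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # hypothetical region format: (top, left, height, width)


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by a salient-region box."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def grid_pool(image: Image, grid: int) -> List[float]:
    """Average-pool an image into a grid x grid descriptor.

    Assumption: this is only a stand-in for a learned encoder (CNN or ViT).
    """
    rows, cols = len(image), len(image[0])
    if rows < grid or cols < grid:
        raise ValueError("image is smaller than the pooling grid")
    feats: List[float] = []
    for gi in range(grid):
        r0, r1 = gi * rows // grid, (gi + 1) * rows // grid
        for gj in range(grid):
            c0, c1 = gj * cols // grid, (gj + 1) * cols // grid
            cell = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feats.append(sum(cell) / len(cell))
    return feats


def fuse_global_local(image: Image, boxes: Sequence[Box], grid: int = 2) -> List[float]:
    """Concatenate a whole-face descriptor with one descriptor per salient region."""
    fused = grid_pool(image, grid)                     # global scale
    for box in boxes:
        fused.extend(grid_pool(crop(image, box), grid))  # local scale
    return fused
```

For a 4x4 test image and a single 2x2 region, `fuse_global_local` returns an 8-element vector: four global cells followed by four local cells. In a real system the pooled descriptors would be replaced by learned embeddings, and the concatenation by whatever fusion module the full paper describes.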



Last Update: 2026-05-27