A Method for Personalized Speech-Driven 3D Facial Animation Generation



[1] LI Wei, SONG Yupu, LIU Yazhi, et al. A Method for Personalized Speech-Driven 3D Facial Animation Generation[J]. Journal of Zhengzhou University (Engineering Science), 2027, 48(XX): 1-8. [doi: 10.13705/j.issn.1671-6833.2026.04.023]

Journal of Zhengzhou University (Engineering Science) [ISSN 1671-6833/CN 41-1339/T] Volume: 48 Issue: 2027 XX Pages: 1-8 Column: Publication date: 2027-12-10

Title:: A Method for Personalized Speech-Driven 3D Facial Animation Generation

Author(s):: LI Wei^1,2, SONG Yupu^1,2, LIU Yazhi^1,2, AN Yi^1,2; (1. College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210, China; 2. Hebei Key Laboratory of Industrial Intelligent Perception, North China University of Science and Technology, Tangshan 063210, China)

Keywords:: speech-driven animation; 3D facial animation; deep learning; diffusion model; personalization

CLC:: TP391.41 TN912.3

DOI:: 10.13705/j.issn.1671-6833.2026.04.023

Abstract:: To address the challenges of speech-driven 3D facial animation, including the difficulty of aligning speech with motion, loss of identity features, and limited personalized dynamic expression, a conditional diffusion-based generation framework was proposed. The framework used a dual-path style encoding structure to extract hierarchical identity features and dynamic motion features, and then applied a bidirectional attention mechanism to deeply fuse speech features with noisy motion features. Based on this design, an improved Transformer decoder guided by style conditions was introduced to generate high-quality motion sequences. Experiments on the BIWI, VOCASET, and 3DMEAD datasets showed that the proposed method achieved the best results in mean vertex error (MVE), lip vertex error (LVE), and facial dynamic deviation (FDD). Compared with the best baseline method on each metric, MVE, LVE, and FDD were reduced by 4.8%, 15.4%, and 13.4%, respectively, on BIWI; LVE was reduced by 14.9% on VOCASET; and MVE and FDD were reduced by 10.2% and 13.7%, respectively, on 3DMEAD. Subjective evaluation results further confirmed its advantages in visual naturalness and realism. The proposed method provided a new technical approach for high-fidelity generation, identity preservation, and personalized modeling of 3D facial animation.
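
The abstract reports vertex-error metrics and relative reductions without defining them. As a rough illustration only, the sketch below shows one way such numbers could be computed with NumPy. It assumes a convention seen in prior speech-driven animation work: LVE as the per-frame maximum L2 error over lip vertices, averaged over frames, and MVE as the L2 error averaged over all vertices and frames. The paper's own definitions may differ (some works take a per-frame maximum for MVE as well), and FDD is omitted. The function names, the `lip_idx` index list, and the array layout `(frames, vertices, 3)` are hypothetical.

```python
import numpy as np

def lip_vertex_error(pred, gt, lip_idx):
    """Assumed LVE: max L2 error over lip vertices per frame, mean over frames.

    pred, gt: arrays of shape (T, V, 3); lip_idx: hypothetical lip vertex indices.
    """
    diff = pred[:, lip_idx] - gt[:, lip_idx]
    per_vertex = np.linalg.norm(diff, axis=-1)   # shape (T, L)
    return float(per_vertex.max(axis=1).mean())

def mean_vertex_error(pred, gt):
    """Assumed MVE: L2 error averaged over all vertices and all frames."""
    per_vertex = np.linalg.norm(pred - gt, axis=-1)  # shape (T, V)
    return float(per_vertex.mean())

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Toy data: 2 frames, 4 vertices, one displaced lip vertex in frame 0.
gt = np.zeros((2, 4, 3))
pred = np.zeros((2, 4, 3))
pred[0, 1] = [3.0, 4.0, 0.0]          # L2 distance of 5 from ground truth
lip_idx = [1, 2]

lve = lip_vertex_error(pred, gt, lip_idx)
mve = mean_vertex_error(pred, gt)
```

Under this reading, a statement such as "LVE was reduced by 15.4%" corresponds to `relative_reduction(baseline_lve, proposed_lve)`, where both arguments are metric values measured on the same test set.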


Last Update: 2026-05-15