LI Wei, SONG Yupu, LIU Yazhi, et al. A Method for Personalized Speech-Driven 3D Facial Animation Generation[J]. Journal of Zhengzhou University (Engineering Science), 2027, 48(XX): 1-8. [doi:10.13705/j.issn.1671-6833.2026.04.023]

A Method for Personalized Speech-Driven 3D Facial Animation Generation
Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:
48
Issue:
2027, No. XX
Pages:
1-8
Column:
Publication date:
2027-12-10

Article Info

Title:
A Method for Personalized Speech-Driven 3D Facial Animation Generation
Author(s):
LI Wei (1,2), SONG Yupu (1,2), LIU Yazhi (1,2), AN Yi (1,2)
(1. College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210, China; 2. Hebei Key Laboratory of Industrial Intelligent Perception, North China University of Science and Technology, Tangshan 063210, China)
Keywords:
speech-driven animation; 3D facial animation; deep learning; diffusion model; personalization
CLC number:
TP391.41; TN912.3
DOI:
10.13705/j.issn.1671-6833.2026.04.023
Document code:
A
Abstract:
To address three challenges in speech-driven 3D facial animation, namely difficult alignment between speech and motion, loss of identity features, and limited personalized dynamic expression, a generation method based on a conditional diffusion model was proposed. The method used a dual-path style encoding structure to extract hierarchical identity features and dynamic motion features separately, and applied a bidirectional attention mechanism to deeply fuse speech features with noised motion features. On this basis, an improved Transformer decoder guided by style conditions was introduced to synthesize high-quality motion sequences. Experiments on the BIWI, VOCASET, and 3DMEAD datasets showed that the proposed method achieved the best results in mean vertex error (MVE), lip vertex error (LVE), and facial dynamic deviation (FDD). Compared with the best baseline method on each metric, MVE, LVE, and FDD were reduced by 4.8%, 15.4%, and 13.4%, respectively, on BIWI; LVE was reduced by 14.9% on VOCASET; and MVE and FDD were reduced by 10.2% and 13.7%, respectively, on 3DMEAD. Subjective evaluation results further confirmed the advantages of the proposed method in visual naturalness and realism. The proposed method provided a new technical approach for high-fidelity generation, identity preservation, and personalized modeling of 3D facial animation.
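
The abstract reports results as relative reductions in vertex-based error metrics but does not define them. As a minimal sketch only, the Python below assumes the definition of lip vertex error that is common in prior speech-driven animation work (the maximal L2 error over lip vertices in each frame, averaged over frames); the paper's exact formulation may differ. The function names, the `lip_indices` argument, and all numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float]
Frame = Sequence[Vertex]  # one mesh frame: a list of (x, y, z) vertices


def lip_vertex_error(pred: Sequence[Frame], target: Sequence[Frame],
                     lip_indices: Sequence[int]) -> float:
    """Assumed LVE: per-frame maximal L2 error over lip vertices, averaged over frames."""
    if len(pred) != len(target) or not pred:
        raise ValueError("sequences must be non-empty and of equal length")
    per_frame = [
        max(math.dist(p[i], t[i]) for i in lip_indices)
        for p, t in zip(pred, target)
    ]
    return sum(per_frame) / len(per_frame)


def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Toy data (illustrative only): 2 frames, 3 vertices; vertices 0 and 2 play the lips.
target = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
pred = [
    [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],  # lip errors: 5 and 0
    [(0.0, 0.0, 0.0), (9.0, 9.0, 9.0), (0.0, 1.0, 2.0)],  # lip errors: 0 and 2
]
lve = lip_vertex_error(pred, target, lip_indices=[0, 2])
print(lve)                            # 3.5, the mean of the per-frame maxima 5 and 2
print(relative_reduction(4.0, 3.0))   # 25.0
```

In the toy data, vertex 1 is excluded from `lip_indices`, so its large error in the second frame does not affect the result. A percentage such as "LVE reduced by 15.4%" would correspond to `relative_reduction` applied to the best baseline's score and the proposed method's score on the same dataset.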

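Memo:
Received: 2026-03-17; Revised: 2026-04-12
Funding: Science and Technology Research Project of Higher Education Institutions of Hebei Province (ZD2022102)
About the author: LI Wei (born 1979), male, from Pingshan, Hebei; senior engineer at North China University of Science and Technology; research interests include image processing and computer vision, and computer network technology. E-mail: lw@ncst.edu.cn.
Last Update: 2026-05-15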