CHEN Enqing, LI Jiahui, GUO Xin. Diffusion model and cross-attention mechanism for skeleton-based action recognition[J]. Journal of Zhengzhou University (Engineering Science), 2027, 48(XX): 1-8. [doi:10.13705/j.issn.1671-6833.2026.04.011]

Diffusion Model and Cross-attention Mechanism for Skeleton-based Action Recognition

Journal of Zhengzhou University (Engineering Science) [ISSN:1671-6833/CN:41-1339/T]

Volume: 48
Issue: XX, 2027
Pages: 1-8
Section:
Publication Date: 2027-12-10

Article Info

Title:
Diffusion Model and Cross-attention Mechanism for Skeleton-based Action Recognition
Author(s):
CHEN Enqing, LI Jiahui, GUO Xin
School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
Keywords:
skeleton-based action recognition; self-supervised learning; masked reconstruction; diffusion model; cross-attention mechanism
CLC number:
TP391; TP181
DOI:
10.13705/j.issn.1671-6833.2026.04.011
Document code:
A
Abstract:
To address incomplete motion information caused by occlusion or missing joints in skeleton-based action recognition, as well as the limited generalization ability of models under few-label conditions, a skeleton-based action recognition method, DCMAE, was proposed, which integrated a diffusion model with a cross-attention mechanism. Within a self-supervised learning framework, a spatio-temporal masking strategy was adopted, in which the diffusion model learned the global distribution characteristics of motion sequences during the denoising process, improving classification accuracy under data-missing conditions. In the decoding stage, a cross-attention mechanism introduced encoder features to enable spatio-temporal information interaction and guidance, thereby enhancing the model's generalization ability under few-label conditions. Experiments on the NTU RGB+D 60 and NTU RGB+D 120 datasets showed that the proposed method achieved accuracy improvements of up to 14.9 and 3 percentage points over the SkeletonMAE model under data-missing and few-label conditions, respectively. These results demonstrate that the proposed method effectively enhances the robustness of skeleton-based action recognition models to missing data and scarce labels, providing a new perspective for self-supervised action recognition research.
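The three ingredients named in the abstract — spatio-temporal masking of a skeleton sequence, DDPM-style forward noising, and cross-attention from decoder queries to encoder features — can be illustrated schematically. Everything below is a minimal sketch with made-up shapes, masking ratios, and a linear noise schedule; it is not the paper's actual configuration or the DCMAE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy skeleton sequence: T frames x V joints x C coordinates
T, V, C = 64, 25, 3
x0 = rng.standard_normal((T, V, C))

# --- Spatio-temporal masking (illustrative 40% ratios) ---
frame_mask = rng.random(T) < 0.4                  # temporal: mask whole frames
joint_mask = rng.random(V) < 0.4                  # spatial: mask whole joints
mask = frame_mask[:, None] | joint_mask[None, :]  # (T, V) union of both masks
x_masked = np.where(mask[..., None], 0.0, x0)     # zero out masked positions

# --- DDPM forward (noising) step: x_t = sqrt(a_bar_t)*x0 + sqrt(1-a_bar_t)*eps ---
steps = 1000
betas = np.linspace(1e-4, 0.02, steps)            # linear schedule (assumption)
alpha_bar = np.cumprod(1.0 - betas)               # cumulative signal retention
t = 500
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# --- Single-head cross-attention: decoder queries attend to encoder features ---
def cross_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])       # scaled dot-product scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ v

d = 32
dec_tokens = rng.standard_normal((T, d))          # decoder queries
enc_tokens = rng.standard_normal((T, d))          # encoder features (keys/values)
out = cross_attention(dec_tokens, enc_tokens, enc_tokens)

print(x_masked.shape, x_t.shape, out.shape)
```

In an actual masked-autoencoder pipeline the masked tokens would be replaced by learned embeddings rather than zeros, and the denoiser would be trained to predict `eps` from `x_t`; this sketch only shows the tensor-level operations.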

References:

[1] Xin Wentian, Liu Ruyi, Liu Yi, et al. Transformer for Skeleton-based action recognition: a review of recent advances[J]. Neurocomputing, 2023, 537: 164-186.
[2] Gui Jie, Chen Tuo, Zhang Jing, et al. A survey on self-supervised learning: algorithms, applications, and future trends[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 9052-9071.
[3] Zhang Jiahang, Lin Lilang, Yang Shuai, et al. Self-supervised skeleton-based action representation learning: a benchmark and beyond[PP/OL]. V3. arXiv (2025-12-26)[2025-10-10]. https://arxiv.org/abs/2406.02978.
[4] Gao Lingling, Ji Yanli, Yang Yang, et al. Global-local cross-view fisher discrimination for view-invariant action recognition[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 5255-5264.
[5] Chen Zhan, Liu Hong, Guo Tianyu, et al. Contrastive learning from spatio-temporal mixed skeleton sequences for self-supervised skeleton-based action recognition[PP/OL]. V1. arXiv (2022-07-07)[2025-10-10]. https://arxiv.org/abs/2207.03065.
[6] Mao Yunyao, Deng Jiajun, Zhou Wengang, et al. Masked motion predictors are strong 3D action representation learners[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2023: 10147-10157.
[7] Tomczak J M, Welling M. VAE with a VampPrior[PP/OL]. V5. arXiv (2018-02-26)[2025-10-10]. https://arxiv.org/abs/1705.07120.
[8] Liu Ziyu, Zhang Hongwen, Chen Zhenghao, et al. Disentangling and unifying graph convolutions for skeleton-based action recognition[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 140-149.
[9] Fuest M, Ma P C, Gui Ming, et al. Diffusion models and representation learning: a survey[PP/OL]. V1. arXiv (2024-06-30)[2025-10-10]. https://arxiv.org/abs/2407.00783.
[10] Song Y, Sohl-Dickstein J, Kingma D P, et al. Score-based generative modeling through stochastic differential equations[PP/OL]. V2. arXiv (2021-02-10)[2025-10-10]. https://arxiv.org/abs/2011.13456.
[11] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10674-10685.
[12] Lukoianov A, De Ocariz Borde H S, Greenewald K, et al. Score distillation via reparametrized DDIM[PP/OL]. V3. arXiv (2024-10-10)[2025-10-10]. https://arxiv.org/abs/2405.15891.
[13] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[PP/OL]. V2. arXiv (2020-12-16)[2025-10-10]. https://arxiv.org/abs/2006.11239.
[14] He Kaiming, Chen Xinlei, Xie Saining, et al. Masked autoencoders are scalable vision learners[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 15979-15988.
[15] Wu Wenhan, Hua Yilei, Zheng Ce, et al. Skeleton-MAE: spatial-temporal masked autoencoders for self-supervised skeleton action recognition[C]//Proceedings of the 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). Piscataway: IEEE, 2023: 224-229.
[16] Qiu Helei, Hou Biao, Ren Bo, et al. Spatio-temporal tuples transformer for skeleton-based action recognition[PP/OL]. V1. arXiv (2022-01-08)[2025-10-10]. https://doi.org/10.48550/arXiv.2201.02849.
[17] Chen C R, Fan Quanfu, Panda R. CrossViT: cross-attention multi-scale vision transformer for image classification[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 762-770.
[18] Wei Chen, Mangalam K, Huang Poyao, et al. Diffusion models as masked autoencoders[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2023: 16238-16248.
[19] Zhang Fuqiang, Bai Junyan, Mu Hui. Human-machine interaction oriented gesture recognition method based on improved GAN[J]. Journal of Zhengzhou University (Engineering Science), 2025, 46(2): 43-50.
[20] Yue Rujing, Tian Zhiqiang, Du Shaoyi. Action recognition based on RGB and skeleton data sets: a survey[J]. Neurocomputing, 2022, 512: 287-306.
[21] Liu Jun, Shahroudy A, Perez M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701.
[22] Li Linguo, Wang Minsi, Ni Bingbing, et al. 3D human action representation learning via cross-view consistency pursuit[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 4739-4748.
[23] Guo Tianyu, Liu Hong, Chen Zhan, et al. Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(1): 762-770.
[24] Hua Yilei, Wu Wenhan, Zheng Ce, et al. Part aware contrastive learning for self-supervised action recognition[PP/OL]. V2. arXiv (2023-05-11)[2025-10-10]. https://doi.org/10.48550/arXiv.2305.00666.
[25] Chen Yuxiao, Zhao Long, Yuan Jianbo, et al. Hierarchically self-supervised transformer for human skeleton representation learning[C]//Computer Vision-ECCV 2022. Cham: Springer, 2022: 185-202.
[26] Wang Xueting, Guo Xin, Wang Song, et al. Human skeleton action recognition method based on variational autoencoder masked reconstruction[J]. Journal of Graphics, 2025, 46(2): 270-278.

Memo:
Received: 2026-01-31; Revised: 2026-03-02
Funding: National Natural Science Foundation of China (62101503); Henan Provincial Science and Technology Research Project (242102211017)
About the first author: CHEN Enqing (1977—), male, from Zhengzhou, Henan; professor, Ph.D., Zhengzhou University; research interests: computer vision, pattern recognition, and multimedia information processing; E-mail: ieeqchen@zzu.edu.cn.
Corresponding author: GUO Xin (1988—), female, from Zhengzhou, Henan; associate professor, Ph.D., Zhengzhou University; research interests: machine learning and artificial intelligence; E-mail: iexguo@zzu.edu.cn.
Last Update: 2026-03-13