Diffusion Method and Cross-attention Mechanisms for Skeleton-based Action Recognition Method

[1] CHEN Enqing, LI Jiahui, GUO Xin. Diffusion Method and Cross-attention Mechanisms for Skeleton-based Action Recognition Method[J]. Journal of Zhengzhou University (Engineering Science), 2027, 48(XX): 1-8. [doi:10.13705/j.issn.1671-6833.2026.04.011]

Journal of Zhengzhou University (Engineering Science) [ISSN 1671-6833 / CN 41-1339/T] Volume: 48 Issue: 2027 XX Pages: 1-8 Publication date: 2027-12-10

Title: Diffusion Method and Cross-attention Mechanisms for Skeleton-based Action Recognition Method

Author(s): CHEN Enqing, LI Jiahui, GUO Xin; School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China

Keywords: skeleton-based action recognition; self-supervised learning; masked reconstruction; diffusion model; cross-attention mechanism

CLC: TP391; TP181

DOI: 10.13705/j.issn.1671-6833.2026.04.011

Abstract: To address the incomplete motion information caused by occlusion or missing joints in skeleton-based action recognition, as well as the limited generalization ability of models under few-label conditions, a skeleton-based action recognition method, DCMAE, is proposed that integrates a diffusion model with a cross-attention mechanism. Within a self-supervised learning framework, a spatio-temporal masking strategy is adopted, and the diffusion model learns the global distribution of motion sequences during the denoising process, which improves classification accuracy under data-missing conditions. In the decoding stage, the cross-attention mechanism introduces encoder features to guide reconstruction through spatio-temporal information interaction, which enhances the model’s generalization ability under few-label conditions. Experiments on the NTU RGB+D 60 and NTU RGB+D 120 datasets show that, compared with SkeletonMAE, the proposed method improves accuracy by up to 14.9 percentage points under data-missing conditions and by up to 3 percentage points under few-label conditions. The results demonstrate that the proposed method makes skeleton-based action recognition models more robust to missing data and scarce labels, providing a new perspective for self-supervised action recognition research.
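The three components named in the abstract (spatio-temporal masking, diffusion noising, and cross-attention between decoder tokens and encoder features) can be illustrated with a minimal sketch. This is not the authors' DCMAE implementation: the function names, the 60% mask ratio, the tensor sizes, the fixed noise level `alpha_bar=0.5`, and the use of raw visible joints as a stand-in for encoder features are all assumptions made for illustration, and no network is trained here. The noising step follows the standard DDPM forward process, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.

```python
import math
import random

def spatio_temporal_mask(num_frames, num_joints, mask_ratio, rng):
    """Return a num_frames x num_joints grid of booleans; True marks a masked joint."""
    total = num_frames * num_joints
    masked = set(rng.sample(range(total), int(total * mask_ratio)))
    return [[(t * num_joints + v) in masked for v in range(num_joints)]
            for t in range(num_frames)]

def forward_diffuse(x0, alpha_bar, rng):
    """DDPM forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * rng.gauss(0.0, 1.0) for x in x0]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Queries come from the decoder side; keys and values come from the encoder side.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out

# Hypothetical toy sequence: T frames, V joints, C coordinates per joint.
rng = random.Random(0)
T, V, C = 4, 5, 3
skeleton = [[[rng.uniform(-1, 1) for _ in range(C)] for _ in range(V)] for _ in range(T)]
mask = spatio_temporal_mask(T, V, mask_ratio=0.6, rng=rng)

# Visible joints stand in for encoder features (assumption: identity encoder).
visible = [skeleton[t][v] for t in range(T) for v in range(V) if not mask[t][v]]
# Masked joints are noised, then attend to the visible ones in the "decoder".
masked = [skeleton[t][v] for t in range(T) for v in range(V) if mask[t][v]]
noisy = [forward_diffuse(x, alpha_bar=0.5, rng=rng) for x in masked]
decoded = cross_attention(noisy, visible, visible)
```

In a trained model the queries, keys, and values would pass through learned projections and the decoder output would be supervised against the clean masked joints; the sketch only shows how information flows from visible to masked positions.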


Last Update: 2026-03-13