[1] LI W J, ZHANG X Y, GAO Y X, et al. Video frame prediction model based on gated spatio-temporal attention[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(1): 70-77, 121.
[2] MARTINEZ J, BLACK M J, ROMERO J. On human motion prediction using recurrent neural networks[C]//2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 2891-2900.
[3] CASTREJON L, BALLAS N, COURVILLE A. Improved conditional VRNNs for video prediction[C]//2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7608-7617.
[4] DAI K, LI X T, YE Y M, et al. MSTCGAN: multiscale time conditional generative adversarial network for long-term satellite image sequence prediction[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-16.
[5] SUN F, BAI C, SONG Y, et al. MMINR: multi-frame-to-multi-frame inference with noise resistance for precipitation nowcasting with radar[C]//The 26th International Conference on Pattern Recognition. Piscataway: IEEE, 2022: 97-103.
[6] PAN T, JIANG Z Q, HAN J N, et al. Taylor saves for later: disentanglement for video prediction using Taylor representation[J]. Neurocomputing, 2022, 472: 166-174.
[7] LEE W, JUNG W, ZHANG H, et al. Revisiting hierarchical approach for persistent long-term video prediction[EB/OL]. (2021-04-14)[2024-08-10]. https://doi.org/10.48550/arXiv.2104.06697.
[8] LOCATELLO F, WEISSENBORN D, UNTERTHINER T, et al. Object-centric learning with slot attention[J]. Advances in Neural Information Processing Systems, 2020, 33: 11525-11538.
[9] LIN Z H, LI M M, ZHENG Z B, et al. Self-attention ConvLSTM for spatiotemporal prediction[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11531-11538.
[10] WANG Y B, LONG M S, WANG J M, et al. PredRNN: recurrent neural networks for predictive learning using spatiotemporal LSTMs[J]. Advances in Neural Information Processing Systems, 2017, 30: 879-888.
[11] VILLEGAS R, YANG J M, HONG S, et al. Decomposing motion and content for natural video sequence prediction[EB/OL]. (2017-07-25)[2024-08-10]. https://doi.org/10.48550/arXiv.1706.08033.
[12] VOLETI V S, JOLICOEUR-MARTINEAU A, PAL C. MCVD: masked conditional video diffusion for prediction, generation, and interpolation[J]. Advances in Neural Information Processing Systems, 2022, 35: 23371-23385.
[13] AKAN A K, ERDEM E, ERDEM A, et al. SLAMP: stochastic latent appearance and motion prediction[C]//2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 14708-14717.
[14] WANG T C, LIU M Y, ZHU J Y, et al. Video-to-video synthesis[EB/OL]. (2018-08-20)[2024-08-10]. https://doi.org/10.48550/arXiv.1808.06601.
[15] BEI X Z, YANG Y C, SOATTO S. Learning semantic-aware dynamics for video prediction[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 902-912.
[16] WU Y, GAO R R, PARK J, et al. Future video synthesis with object motion prediction[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 5538-5547.
[17] WU Y F, YOON J, AHN S. Generative video transformer: can objects be the words?[EB/OL]. (2021-07-20)[2024-08-10]. https://doi.org/10.48550/arXiv.2107.09240.
[18] WU Z Y, DVORNIK N, GREFF K, et al. SlotFormer: unsupervised visual dynamics simulation with object-centric models[EB/OL]. (2022-10-12)[2024-08-10]. https://doi.org/10.48550/arXiv.2210.05861.
[19] VILLAR-CORRALES A, WAHDAN I, BEHNKE S. Object-centric video prediction via decoupling of object dynamics and interactions[C]//2023 IEEE International Conference on Image Processing (ICIP). Piscataway: IEEE, 2023: 570-574.
[20] ELSAYED G F, MAHENDRAN A, VAN STEENKISTE S, et al. SAVi++: towards end-to-end object-centric learning from real-world videos[EB/OL]. (2022-06-15)[2024-08-10]. https://doi.org/10.48550/arXiv.2206.07764.
[21] WATTERS N, MATTHEY L, BURGESS C P, et al. Spatial broadcast decoder: a simple architecture for learning disentangled representations in VAEs[EB/OL]. (2019-06-21)[2024-08-10]. https://doi.org/10.48550/arXiv.1901.07017.
[22] ZHONG Y Q, LIANG L M, ZHARKOV I, et al. MMVP: motion-matrix-based video prediction[C]//2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 4250-4260.
[23] LIN Z X, WU Y F, PERI S, et al. Improving generative imagination in object-centric world models[EB/OL]. (2020-10-05)[2024-08-10]. https://doi.org/10.48550/arXiv.2010.02054.
[24] YI K X, GAN C, LI Y Z, et al. CLEVRER: CoLlision events for video REpresentation and reasoning[EB/OL]. (2019-10-03)[2024-08-10]. https://doi.org/10.48550/arXiv.1910.01442.
[25] ZADAIANCHUK A, SEITZER M, MARTIUS G. Object-centric learning for real-world videos by predicting temporal feature similarities[J]. Advances in Neural Information Processing Systems, 2023, 36: 61514-61545.
[26] WANG Z, BOVIK A C, SHEIKH H R, et al. Image quality assessment: from error visibility to structural similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4): 600-612.
[27] ZHANG R, ISOLA P, EFROS A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 586-595.
[28] JIN B B, HU Y, TANG Q K, et al. Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 4553-4562.
[29] GAO Z Y, TAN C, WU L R, et al. SimVP: simpler yet better video prediction[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 3160-3170.