HAN Chenchen, LU Xiankai, WANG Zhicheng, et al. Object-centric video prediction algorithm based on dynamic memory and motion information[J]. Journal of Zhengzhou University (Engineering Science), 2025, 46(05): 51-59. [doi:10.13705/j.issn.1671-6833.2025.02.011]

Object-centric Video Prediction Algorithm Based on Dynamic Memory and Motion Information

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:
46
Issue:
2025, No. 05
Pages:
51-59
Publication Date:
2025-08-10

Article Info

Title:
Object-centric Video Prediction Algorithm Based on Dynamic Memory and Motion Information
Article ID:
1671-6833(2025)05-0051-09
Author(s):
HAN Chenchen (韩晨晨), LU Xiankai (卢宪凯), WANG Zhicheng (王志成), XIONG Xiaozhou (熊筱舟)
School of Software, Shandong University, Jinan 250101, China
Keywords:
video prediction; object-centric learning; scene parsing; unsupervised learning; spatiotemporal prediction
CLC Number:
TP391.4; TP183
DOI:
10.13705/j.issn.1671-6833.2025.02.011
Document Code:
A
Abstract:
To address the challenge of maintaining the spatial and temporal consistency of objects across frames in video prediction tasks, an object-centric video prediction algorithm based on dynamic memory and motion information was proposed. First, an object-centric model was introduced to decouple the objects in a scene, ensuring the consistency and stability of objects under long-term dynamic prediction and effectively maintaining their spatial consistency. Second, an object dynamic memory module was designed to capture long-term dependencies in videos and to model object dynamics precisely, overcoming the weakness of existing video prediction methods in predicting dynamic interactions between objects and improving the temporal consistency of predicted objects. Third, the feature similarity matrix between adjacent frames was used to capture inter-frame motion information and to model the spatiotemporal relationships of the video sequence, further strengthening temporal consistency across frames. Finally, a cross-attention mechanism was used to fuse the temporal and structural information of video objects, further improving video prediction performance. Video prediction experiments were conducted on the Obj3D and CLEVRER datasets, which feature complex object interactions. The results showed that, compared with state-of-the-art object-centric video prediction algorithms, the proposed algorithm improved performance on the PSNR and SSIM metrics by 4.5% and 1.4%, respectively, and reduced the LPIPS metric by 20%.
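The inter-frame similarity and cross-attention steps described in the abstract can be illustrated roughly as follows. This is a minimal NumPy sketch, not the authors' implementation; all shapes, names, and the single-head attention form are assumptions made for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def motion_similarity(feat_prev, feat_next):
    """Cosine-similarity matrix between patch features of two adjacent frames.

    feat_prev, feat_next: (N, D) arrays of N patch features of dimension D.
    Row i of the result scores how well patch i of the previous frame matches
    each patch of the next frame, a simple proxy for inter-frame motion.
    """
    a = feat_prev / np.linalg.norm(feat_prev, axis=1, keepdims=True)
    b = feat_next / np.linalg.norm(feat_next, axis=1, keepdims=True)
    return a @ b.T

def cross_attention(query, key, value):
    """Single-head cross-attention: queries attend over key/value features."""
    d = query.shape[-1]
    attn = softmax(query @ key.T / np.sqrt(d), axis=-1)
    return attn @ value

rng = np.random.default_rng(0)
slots_t = rng.normal(size=(4, 16))   # hypothetical object slots at time t (temporal stream)
feats_t = rng.normal(size=(9, 16))   # hypothetical patch features at time t (structural stream)
feats_t1 = rng.normal(size=(9, 16))  # patch features at time t+1

sim = motion_similarity(feats_t, feats_t1)          # inter-frame similarity matrix
fused = cross_attention(slots_t, feats_t, feats_t)  # fuse structural info into slots
print(sim.shape, fused.shape)  # (9, 9) (4, 16)
```

In the paper's terms, the similarity matrix supplies the motion cue between adjacent frames, while cross-attention merges the slot-level (temporal) and feature-level (structural) representations; the actual modules are learned end-to-end rather than computed from raw features as here.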

References:

[1] LI W J, ZHANG X Y, GAO Y X, et al. Video frame prediction model based on gated spatio-temporal attention[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(1): 70-77, 121.
[2]MARTINEZ J, BLACK M J, ROMERO J. On human motion prediction using recurrent neural networks[C]∥ 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 2891-2900. 
[3] CASTREJON L, BALLAS N, COURVILLE A. Improved conditional VRNNs for video prediction[C]∥2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7608-7617. 
[4] DAI K, LI X T, YE Y M, et al. MSTCGAN: multiscale time conditional generative adversarial network for long-term satellite image sequence prediction[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-16. 
[5] SUN F, BAI C, SONG Y, et al. MMINR: multi-frame-to-multi-frame inference with noise resistance for precipitation nowcasting with radar[C]∥The 26th International Conference on Pattern Recognition. Piscataway: IEEE, 2022: 97-103. 
[6] PAN T, JIANG Z Q, HAN J N, et al. Taylor saves for later: disentanglement for video prediction using Taylor representation[J]. Neurocomputing, 2022, 472: 166-174. 
[7] LEE W, JUNG W, ZHANG H, et al. Revisiting hierarchical approach for persistent long-term video prediction[EB/OL]. (2021-04-14)[2024-08-10]. https://doi.org/10.48550/arXiv.2104.06697. 
[8]LOCATELLO F, WEISSENBORN D, UNTERTHINER T, et al. Object-centric learning with slot attention[J]. Advances in Neural Information Processing Systems, 2020, 33: 11525-11538. 
[9]LIN Z H, LI M M, ZHENG Z B, et al. Self-attention ConvLSTM for spatiotemporal prediction[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11531-11538. 
[10]WANG Y B, LONG M S, WANG J M, et al. PredRNN: recurrent neural networks for predictive learning using spatiotemporal LSTMs[J]. Advances in Neural Information Processing Systems, 2017, 30: 879-888. 
[11] VILLEGAS R, YANG J M, HONG S, et al. Decomposing motion and content for natural video sequence prediction[EB/OL]. (2017-07-25)[2024-08-10]. https://doi.org/10.48550/arXiv.1706.08033. 
[12] VOLETI V S, JOLICOEUR-MARTINEAU A, PAL C. MCVD: masked conditional video diffusion for prediction, generation, and interpolation[J]. Advances in Neural Information Processing Systems, 2022, 36: 23371-23385. 
[13] AKAN A K, ERDEM E, ERDEM A, et al. SLAMP: stochastic latent appearance and motion prediction[C]∥ 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 14708-14717. 
[14] WANG T C, LIU M Y, ZHU J Y, et al. Video-to-video synthesis[EB/OL]. (2018-08-20)[2024-08-10]. https://doi.org/10.48550/arXiv.1808.06601. 
[15] BEI X Z, YANG Y C, SOATTO S. Learning semantic-aware dynamics for video prediction[C]∥2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 902-912. 
[16]WU Y, GAO R R, PARK J, et al. Future video synthesis with object motion prediction[C]∥2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 5538-5547. 
[17] WU Y F, YOON J, AHN S. Generative video transformer: can objects be the words?[EB/OL]. (2021-07-20)[2024-08-10]. https://doi.org/10.48550/arXiv.2107.09240. 
[18] WU Z Y, DVORNIK N, GREFF K, et al. SlotFormer: unsupervised visual dynamics simulation with object-centric models[EB/OL]. (2022-10-12)[2024-08-10]. https://doi.org/10.48550/arXiv.2210.05861. 
[19] VILLAR-CORRALES A, WAHDAN I, BEHNKE S. Object-centric video prediction via decoupling of object dynamics and interactions[C]∥2023 IEEE International Conference on Image Processing (ICIP). Piscataway: IEEE, 2023: 570-574. 
[20] ELSAYED G F, MAHENDRAN A, VAN STEENKISTE S, et al. SAVi++: towards end-to-end object-centric learning from real-world videos[EB/OL]. (2022-06-15)[2024-08-10]. https://doi.org/10.48550/arXiv.2206.07764. 
[21] WATTERS N, MATTHEY L, BURGESS C P, et al. Spatial broadcast decoder: a simple architecture for learning disentangled representations in VAEs[EB/OL]. (2019-06-21)[2024-08-10]. https://doi.org/10.48550/arXiv.1901.07017. 
[22] ZHONG Y Q, LIANG L M, ZHARKOV I, et al. MMVP: motion-matrix-based video prediction[C]∥2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 4250-4260. 
[23] LIN Z X, WU Y F, PERI S, et al. Improving generative imagination in object-centric world models[EB/OL]. (2020-10-05)[2024-08-10]. https://doi.org/10.48550/arXiv.2010.02054. 
[24] YI K X, GAN C, LI Y Z, et al. CLEVRER: CoLlision events for video REpresentation and reasoning[EB/OL]. (2019-10-03)[2024-08-10]. https://doi.org/10.48550/arXiv.1910.01442. 
[25] ZADAIANCHUK A, SEITZER M, MARTIUS G. Object-centric learning for real-world videos by predicting temporal feature similarities[J]. Advances in Neural Information Processing Systems, 2023, 36: 61514-61545. 
[26]WANG Z, BOVIK A C, SHEIKH H R, et al. Image quality assessment: from error visibility to structural similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4): 600-612. 
[27] ZHANG R, ISOLA P, EFROS A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]∥2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 586-595. 
[28] JIN B B, HU Y, TANG Q K, et al. Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction[C]∥2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 4553-4562. 
[29] GAO Z Y, TAN C, WU L R, et al. SimVP: simpler yet better video prediction[C]∥2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 3160-3170.

Last Update: 2025-09-19