[1] LI W J, ZHANG X Y, GAO Y X, et al. Video frame prediction model based on gated spatio-temporal attention[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(1): 70-77, 121.
[2] MARTINEZ J, BLACK M J, ROMERO J. On human motion prediction using recurrent neural networks[C]//2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 2891-2900.
[3] CASTREJON L, BALLAS N, COURVILLE A. Improved conditional VRNNs for video prediction[C]//2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7608-7617.
[4] DAI K, LI X T, YE Y M, et al. MSTCGAN: multiscale time conditional generative adversarial network for long-term satellite image sequence prediction[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-16.
[5] SUN F, BAI C, SONG Y, et al. MMINR: multi-frame-to-multi-frame inference with noise resistance for precipitation nowcasting with radar[C]//The 26th International Conference on Pattern Recognition. Piscataway: IEEE, 2022: 97-103.
[6] PAN T, JIANG Z Q, HAN J N, et al. Taylor saves for later: disentanglement for video prediction using Taylor representation[J]. Neurocomputing, 2022, 472: 166-174.
[7] LEE W, JUNG W, ZHANG H, et al. Revisiting hierarchical approach for persistent long-term video prediction[EB/OL]. (2021-04-14)[2024-08-10]. https://doi.org/10.48550/arXiv.2104.06697.
[8] LOCATELLO F, WEISSENBORN D, UNTERTHINER T, et al. Object-centric learning with slot attention[J]. Advances in Neural Information Processing Systems, 2020, 33: 11525-11538.
[9] LIN Z H, LI M M, ZHENG Z B, et al. Self-attention ConvLSTM for spatiotemporal prediction[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11531-11538.
[10] WANG Y B, LONG M S, WANG J M, et al. PredRNN: recurrent neural networks for predictive learning using spatiotemporal LSTMs[J]. Advances in Neural Information Processing Systems, 2017, 30: 879-888.
[11] VILLEGAS R, YANG J M, HONG S, et al. Decomposing motion and content for natural video sequence prediction[EB/OL]. (2017-07-25)[2024-08-10]. https://doi.org/10.48550/arXiv.1706.08033.
[12] VOLETI V S, JOLICOEUR-MARTINEAU A, PAL C. MCVD: masked conditional video diffusion for prediction, generation, and interpolation[J]. Advances in Neural Information Processing Systems, 2022, 35: 23371-23385.
[13] AKAN A K, ERDEM E, ERDEM A, et al. SLAMP: stochastic latent appearance and motion prediction[C]//2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 14708-14717.
[14] WANG T C, LIU M Y, ZHU J Y, et al. Video-to-video synthesis[EB/OL]. (2018-08-20)[2024-08-10]. https://doi.org/10.48550/arXiv.1808.06601.
[15] BEI X Z, YANG Y C, SOATTO S. Learning semantic-aware dynamics for video prediction[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 902-912.
[16] WU Y, GAO R R, PARK J, et al. Future video synthesis with object motion prediction[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 5538-5547.
[17] WU Y F, YOON J, AHN S. Generative video transformer: can objects be the words?[EB/OL]. (2021-07-20)[2024-08-10]. https://doi.org/10.48550/arXiv.2107.09240.
[18] WU Z Y, DVORNIK N, GREFF K, et al. SlotFormer: unsupervised visual dynamics simulation with object-centric models[EB/OL]. (2022-10-12)[2024-08-10]. https://doi.org/10.48550/arXiv.2210.05861.
[19] VILLAR-CORRALES A, WAHDAN I, BEHNKE S. Object-centric video prediction via decoupling of object dynamics and interactions[C]//2023 IEEE International Conference on Image Processing (ICIP). Piscataway: IEEE, 2023: 570-574.
[20] ELSAYED G F, MAHENDRAN A, VAN STEENKISTE S, et al. SAVi++: towards end-to-end object-centric learning from real-world videos[EB/OL]. (2022-06-15)[2024-08-10]. https://doi.org/10.48550/arXiv.2206.07764.
[21] WATTERS N, MATTHEY L, BURGESS C P, et al. Spatial broadcast decoder: a simple architecture for learning disentangled representations in VAEs[EB/OL]. (2019-06-21)[2024-08-10]. https://doi.org/10.48550/arXiv.1901.07017.
[22] ZHONG Y Q, LIANG L M, ZHARKOV I, et al. MMVP: motion-matrix-based video prediction[C]//2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 4250-4260.
[23] LIN Z X, WU Y F, PERI S, et al. Improving generative imagination in object-centric world models[EB/OL]. (2020-10-05)[2024-08-10]. https://doi.org/10.48550/arXiv.2010.02054.
[24] YI K X, GAN C, LI Y Z, et al. CLEVRER: CoLlision events for video REpresentation and reasoning[EB/OL]. (2019-10-03)[2024-08-10]. https://doi.org/10.48550/arXiv.1910.01442.
[25] ZADAIANCHUK A, SEITZER M, MARTIUS G. Object-centric learning for real-world videos by predicting temporal feature similarities[J]. Advances in Neural Information Processing Systems, 2023, 36: 61514-61545.
[26] WANG Z, BOVIK A C, SHEIKH H R, et al. Image quality assessment: from error visibility to structural similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4): 600-612.
[27] ZHANG R, ISOLA P, EFROS A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 586-595.
[28] JIN B B, HU Y, TANG Q K, et al. Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 4553-4562.
[29] GAO Z Y, TAN C, WU L R, et al. SimVP: simpler yet better video prediction[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 3160-3170.