[1] LI Weijun, ZHANG Xinyong, GAO Yuxiao, et al. Video Frame Prediction Model Based on Gated Spatio-Temporal Attention[J]. Journal of Zhengzhou University (Engineering Science), 2024, (01): 70-77. [doi:10.13705/j.issn.1671-6833.2024.01.017]

Video Frame Prediction Model Based on Gated Spatio-Temporal Attention

Journal of Zhengzhou University (Engineering Science) [ISSN:1671-6833/CN:41-1339/T]

Volume:
Issue:
2024, No. 01
Pages:
70-77
Column:
Publication date:
2024-01-19

Article Info

Title:
Video Frame Prediction Model Based on Gated Spatio-Temporal Attention
Author(s):
LI Weijun, ZHANG Xinyong, GAO Yuxiao, GU Jianlai, LIU Jintong
1. School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China; 2. The Key Laboratory of Images and Graphics Intelligent Processing of State Ethnic Affairs Commission, North Minzu University, Yinchuan 750021, China
Keywords:
video frame prediction; convolutional neural network; attention mechanism; gated convolution; encoder-decoder network
DOI:
10.13705/j.issn.1671-6833.2024.01.017
Document code:
A
Abstract:
To address the low accuracy, slow training, complex structure, and error accumulation of recurrent video frame prediction architectures, a video frame prediction model based on gated spatio-temporal attention is proposed. First, a spatial encoder extracts high-level semantic information from the video frame sequence while preserving background features. Second, a gated spatio-temporal attention mechanism is established: multi-scale depthwise strip convolutions and channel attention learn intra-frame and inter-frame spatio-temporal features, and a gated fusion mechanism balances the feature learning capability of the spatio-temporal attention. Finally, a spatial decoder decodes the high-level features into the predicted frames and supplements background semantics to refine details. On the Moving MNIST, TaxiBJ, WeatherBench, and KITTI datasets, the mean squared error (MSE) is 14.7%, 6.7%, 10.5%, and 18.5% lower, respectively, than that of the multi-input multi-output model SimVP. In ablation and extended experiments, the proposed model achieves good overall performance, with high prediction accuracy, low computational cost, and efficient inference.
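
The abstract reports improvements as percentage reductions in MSE relative to SimVP. The following minimal sketch shows how such a figure is typically computed, assuming the usual definition (baseline minus proposed, divided by baseline). The function names `mse` and `relative_reduction` and all numeric values are hypothetical illustrations, not the authors' code or results.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline * 100.0


# Hypothetical values for illustration only; these are NOT results from the paper.
target = [0.0, 1.0, 0.0, 1.0]
baseline_pred = [0.2, 0.8, 0.2, 0.8]  # every pixel off by 0.2
proposed_pred = [0.1, 0.9, 0.1, 0.9]  # every pixel off by 0.1

baseline_mse = mse(baseline_pred, target)
proposed_mse = mse(proposed_pred, target)
reduction = relative_reduction(baseline_mse, proposed_mse)
print(f"baseline={baseline_mse:.4f} proposed={proposed_mse:.4f} reduction={reduction:.1f}%")
```

Under this definition, a reported reduction of 14.7% would mean the proposed model's MSE is 85.3% of the baseline's on that dataset.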


Last Update: 2024-01-24