ZHI Min, LU Jingfang. A Review of Vision Transformer for Image Classification[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(04): 19-29. [doi:10.13705/j.issn.1671-6833.2024.01.015]

A Review of Vision Transformer for Image Classification

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:
45
Issue:
2024, No. 04
Pages:
19-29
Publication date:
2024-06-16

Article Information

Title:
A Review of Vision Transformer for Image Classification
Article number:
1671-6833(2024)04-0019-11
Author(s):
ZHI Min, LU Jingfang
School of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
Keywords:
ViT model; image classification; multi-head attention; feed-forward network layer; position encoding
CLC number:
TP181; TP391
DOI:
10.13705/j.issn.1671-6833.2024.01.015
Document code:
A
Abstract:
As a model built on the Transformer architecture, ViT has shown good results in image classification tasks. In this study, the applications of ViT to image classification were systematically summarized. First, the ViT framework and the functional characteristics of its four modules (the patch module, position encoding, multi-head attention, and the feed-forward neural network) were briefly introduced. Second, applications of ViT to image classification were reviewed, organized around the improvements made to these four modules. Third, since different model structures and improvement measures have a significant impact on the final classification performance, the ViT variants covered in this paper were compared side by side, with their parameters, classification accuracy, advantages, and disadvantages listed in detail. Finally, the advantages and limitations of ViT in image classification were pointed out, and possible future research directions were proposed to overcome these limitations, to further extend ViT to other computer vision tasks, and to explore its extension to broader areas such as video understanding.
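To make the four modules named in the abstract concrete, the following is a minimal, illustrative single-block ViT classifier written with standard PyTorch components. It is a sketch only: the class name TinyViT and all hyperparameters are placeholders chosen for brevity and are not taken from any model surveyed in this review.

import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Illustrative one-block ViT: patch embedding, position encoding,
    multi-head attention, and a feed-forward network layer."""
    def __init__(self, img_size=224, patch_size=16, dim=192, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # (1) Patch module: cut the image into non-overlapping patches and embed them.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # (2) Position encoding: learnable embeddings added to the token sequence.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # (3) Multi-head self-attention and (4) feed-forward network layer.
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.patch_embed(x)                  # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)         # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]   # attention block + residual
        x = x + self.ffn(self.norm2(x))                      # feed-forward block + residual
        return self.head(x[:, 0])                # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # expected output shape: (2, 1000)

The pre-norm residual layout shown here roughly follows the original ViT description in reference [5]; the improvements surveyed in this review target exactly these four components.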

References:

[1] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2016: 770-778.
[2] TAN M X, LE Q V. EfficientNet: rethinking model scaling for convolutional neural networks[EB/OL]. (2020-09-11)[2023-08-09]. https://arxiv.org/abs/1905.11946.
[3] RADOSAVOVIC I, KOSARAJU R P, GIRSHICK R, et al. Designing network design spaces[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 10425-10433.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[5] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03)[2023-08-09]. https://arxiv.org/abs/2010.11929.
[6] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with Transformers[J]. Lecture Notes in Artificial Intelligence, 2020, 12346: 213-229.
[7] WANG H Y, ZHU Y K, ADAM H, et al. MaX-DeepLab: end-to-end panoptic segmentation with mask Transformers[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 5459-5470.
[8] CHENG B W, SCHWING A G, KIRILLOV A. Per-pixel classification is not all you need for semantic segmentation[EB/OL]. (2021-08-31)[2023-08-09]. https://arxiv.org/abs/2107.06278.
[9] CHEN X, YAN B, ZHU J W, et al. Transformer tracking[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 8122-8131.
[10] JIANG Y F, CHANG S Y, WANG Z Y. TransGAN: two pure Transformers can make one strong GAN, and that can scale up[EB/OL]. (2021-12-09)[2023-08-09]. https://arxiv.org/abs/2102.07074.
[11] CHEN H T, WANG Y H, GUO T Y, et al. Pre-trained image processing Transformer[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 12294-12305.
[12] TAY Y, DEHGHANI M, BAHRI D, et al. Efficient Transformers: a survey[J]. ACM Computing Surveys, 2023, 55(6): 1-28.
[13] KHAN S, NASEER M, HAYAT M, et al. Transformers in vision: a survey[J]. ACM Computing Surveys, 2021, 54(S10): 1-41.
[14] HAN K, WANG Y H, CHEN H T, et al. A survey on Vision Transformer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 87-110.
[15] LIN T Y, WANG Y X, LIU X Y, et al. A survey of Transformers[J]. AI Open, 2022, 3: 111-132.
[16] BI Y, XUE B, ZHANG M J. A survey on genetic programming to image analysis[J]. Journal of Zhengzhou University (Engineering Science), 2018, 39(6): 3-13.
[17] YUAN L, CHEN Y P, WANG T, et al. Tokens-to-token ViT: training Vision Transformers from scratch on ImageNet[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 558-567.
[18] WU H P, XIAO B, CODELLA N, et al. CvT: introducing convolutions to Vision Transformers[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2022: 22-31.
[19] WANG W H, XIE E Z, LI X, et al. Pyramid Vision Transformer: a versatile backbone for dense prediction without convolutions[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 568-578.
[20] WANG W H, XIE E Z, LI X, et al. PVTv2: improved baselines with pyramid Vision Transformer[J]. Computational Visual Media, 2022, 8(3): 415-424.
[21] PAN Z Z, ZHUANG B H, HE H Y, et al. Less is more: pay less attention in Vision Transformers[EB/OL]. (2021-12-23)[2023-08-09]. https://arxiv.org/abs/2105.14217.
[22] SHAW P, USZKOREIT J, VASWANI A. Self-attention with relative position representations[EB/OL]. (2018-04-12)[2023-08-09]. https://arxiv.org/abs/1803.02155.
[23] CHU X X, TIAN Z, ZHANG B, et al. Conditional positional encodings for Vision Transformers[EB/OL]. (2023-02-13)[2023-08-09]. https://arxiv.org/abs/2102.10882.
[24] DONG X Y, BAO J M, CHEN D D, et al. CSWin Transformer: a general Vision Transformer backbone with cross-shaped windows[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12114-12124.
[25] LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: hierarchical Vision Transformer using shifted windows[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 10012-10022.
[26] ZHANG Z M, GONG X. Axially expanded windows for local-global interaction in Vision Transformers[EB/OL]. (2022-11-13)[2023-08-09]. https://arxiv.org/abs/2209.08726.
[27] TU Z Z, TALEBI H, ZHANG H, et al. MaxViT: multi-axis Vision Transformer[C]//European Conference on Computer Vision. Cham: Springer, 2022: 459-479.
[28] FANG J M, XIE L X, WANG X G, et al. MSG-Transformer: exchanging local spatial information by manipulating messenger tokens[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12053-12062.
[29] HAN K, XIAO A, WU E H, et al. Transformer in Transformer[EB/OL]. (2021-08-26)[2023-08-09]. https://arxiv.org/abs/2103.00112.
[30] CHU X X, TIAN Z, WANG Y Q, et al. Twins: revisiting the design of spatial attention in Vision Transformers[EB/OL]. (2021-09-30)[2023-08-09]. https://arxiv.org/abs/2104.13840.
[31] FAN Q H, HUANG H B, GUAN J Y, et al. Rethinking local perception in lightweight Vision Transformer[EB/OL]. (2023-06-01)[2023-08-09]. https://arxiv.org/abs/2303.17803.
[32] GUO J Y, HAN K, WU H, et al. CMT: convolutional neural networks meet Vision Transformers[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12165-12175.
[33] WOO S, DEBNATH S, HU R H, et al. ConvNeXt V2: co-designing and scaling ConvNets with masked autoencoders[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 16133-16142.
[34] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 4510-4520.
[35] LIU Z, MAO H Z, WU C Y, et al. A ConvNet for the 2020s[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 11966-11976.
[36] REN S C, ZHOU D Q, HE S F, et al. Shunted self-attention via multi-scale token aggregation[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10853-10862.
[37] YUAN K, GUO S P, LIU Z W, et al. Incorporating convolution designs into Visual Transformers[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2022: 559-568.
[38] LEE-THORP J, AINSLIE J, ECKSTEIN I, et al. FNet: mixing tokens with Fourier Transforms[EB/OL]. (2022-05-26)[2023-08-09]. https://arxiv.org/abs/2105.03824.
[39] MARTINS A F T, FARINHAS A, TREVISO M, et al. Sparse and continuous attention mechanisms[EB/OL]. (2020-10-29)[2023-08-09]. https://arxiv.org/abs/2006.07214.
[40] MARTINS P H, MARINHO Z, MARTINS A F T. ∞-former: infinite memory Transformer[EB/OL]. (2022-05-25)[2023-08-09]. https://arxiv.org/abs/2109.00301.
[41] RAO Y M, ZHAO W L, ZHU Z, et al. Global filter networks for image classification[EB/OL]. (2021-10-26)[2023-08-09]. https://arxiv.org/abs/2107.00645.
[42] YU W H, LUO M, ZHOU P, et al. MetaFormer is actually what you need for vision[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10819-10829.
[43] BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[EB/OL]. (2021-02-24)[2023-08-09]. https://arxiv.org/abs/2102.05095.

Last update: 2024-06-14