A Review of Vision Transformer for Image Classification
[1] ZHI Min, LU Jingfang. A Review of Vision Transformer for Image Classification[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(04): 19-29. [doi: 10.13705/j.issn.1671-6833.2024.01.015]
References:
[1] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2016: 770-778.
[2] TAN M X, LE Q V. EfficientNet: rethinking model scaling for convolutional neural networks[EB/OL]. (2020-09-11)[2023-08-09]. https://arxiv.org/abs/1905.11946.
[3] RADOSAVOVIC I, KOSARAJU R P, GIRSHICK R, et al. Designing network design spaces[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 10425-10433.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[5] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03)[2023-08-09]. https://arxiv.org/abs/2010.11929.
[6] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with Transformers[J]. Lecture Notes in Computer Science, 2020, 12346: 213-229.
[7] WANG H Y, ZHU Y K, ADAM H, et al. MaX-DeepLab: end-to-end panoptic segmentation with mask Transformers[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 5459-5470.
[8] CHENG B W, SCHWING A G, KIRILLOV A. Per-pixel classification is not all you need for semantic segmentation[EB/OL]. (2021-08-31)[2023-08-09]. https://arxiv.org/abs/2107.06278.
[9] CHEN X, YAN B, ZHU J W, et al. Transformer tracking[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 8122-8131.
[10] JIANG Y F, CHANG S Y, WANG Z Y. TransGAN: two pure Transformers can make one strong GAN, and that can scale up[EB/OL]. (2021-12-09)[2023-08-09]. https://arxiv.org/abs/2102.07074.
[11] CHEN H T, WANG Y H, GUO T Y, et al. Pre-trained image processing Transformer[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 12294-12305.
[12] TAY Y, DEHGHANI M, BAHRI D, et al. Efficient Transformers: a survey[J]. ACM Computing Surveys, 2023, 55(6): 1-28. 
[13] KHAN S, NASEER M, HAYAT M, et al. Transformers in vision: a survey[J]. ACM Computing Surveys, 2022, 54(10s): 1-41.
[14] HAN K, WANG Y H, CHEN H T, et al. A survey on Vision Transformer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 87-110. 
[15] LIN T Y, WANG Y X, LIU X Y, et al. A survey of Transformers[J]. AI Open, 2022, 3: 111-132. 
[16] BI Y, XUE B, ZHANG M J. A survey on genetic programming to image analysis[J]. Journal of Zhengzhou University (Engineering Science), 2018, 39(6): 3-13.
[17] YUAN L, CHEN Y P, WANG T, et al. Tokens-to-token ViT: training Vision Transformers from scratch on ImageNet[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 558-567.
[18] WU H P, XIAO B, CODELLA N, et al. CvT: introducing convolutions to Vision Transformers[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 22-31.
[19] WANG W H, XIE E Z, LI X, et al. Pyramid Vision Transformer: a versatile backbone for dense prediction without convolutions[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 568-578.
[20] WANG W H, XIE E Z, LI X, et al. PVTv2: improved baselines with pyramid Vision Transformer[J]. Computational Visual Media, 2022, 8(3): 415-424.
[21] PAN Z Z, ZHUANG B H, HE H Y, et al. Less is more: pay less attention in Vision Transformers[EB/OL]. (2021-12-23)[2023-08-09]. https://arxiv.org/abs/2105.14217.
[22] SHAW P, USZKOREIT J, VASWANI A. Self-attention with relative position representations[EB/OL]. (2018-04-12)[2023-08-09]. https://arxiv.org/abs/1803.02155.
[23] CHU X X, TIAN Z, ZHANG B, et al. Conditional positional encodings for Vision Transformers[EB/OL]. (2023-02-13)[2023-08-09]. https://arxiv.org/abs/2102.10882.
[24] DONG X Y, BAO J M, CHEN D D, et al. CSWin Transformer: a general Vision Transformer backbone with cross-shaped windows[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12114-12124.
[25] LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: hierarchical Vision Transformer using shifted windows[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 10012-10022.
[26] ZHANG Z M, GONG X. Axially expanded windows for local-global interaction in Vision Transformers[EB/OL]. (2022-11-13)[2023-08-09]. https://arxiv.org/abs/2209.08726.
[27] TU Z Z, TALEBI H, ZHANG H, et al. MaxViT: multi-axis Vision Transformer[C]//European Conference on Computer Vision. Cham: Springer, 2022: 459-479.
[28] FANG J M, XIE L X, WANG X G, et al. MSG-Transformer: exchanging local spatial information by manipulating messenger tokens[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12053-12062.
[29] HAN K, XIAO A, WU E H, et al. Transformer in Transformer[EB/OL]. (2021-08-26)[2023-08-09]. https://arxiv.org/abs/2103.00112.
[30] CHU X X, TIAN Z, WANG Y Q, et al. Twins: revisiting the design of spatial attention in Vision Transformers[EB/OL]. (2021-09-30)[2023-08-09]. https://arxiv.org/abs/2104.13840.
[31] FAN Q H, HUANG H B, GUAN J Y, et al. Rethinking local perception in lightweight Vision Transformer[EB/OL]. (2023-06-01)[2023-08-09]. https://arxiv.org/abs/2303.17803.
[32] GUO J Y, HAN K, WU H, et al. CMT: convolutional neural networks meet Vision Transformers[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12165-12175.
[33] WOO S, DEBNATH S, HU R H, et al. ConvNeXt V2: co-designing and scaling ConvNets with masked autoencoders[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 16133-16142.
[34] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 4510-4520.
[35] LIU Z, MAO H Z, WU C Y, et al. A ConvNet for the 2020s[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 11966-11976.
[36] REN S C, ZHOU D Q, HE S F, et al. Shunted self-attention via multi-scale token aggregation[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10853-10862.
[37] YUAN K, GUO S P, LIU Z W, et al. Incorporating convolution designs into Visual Transformers[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 559-568.
[38] LEE-THORP J, AINSLIE J, ECKSTEIN I, et al. FNet: mixing tokens with Fourier Transforms[EB/OL]. (2022-05-26)[2023-08-09]. https://arxiv.org/abs/2105.03824.
[39] MARTINS A F T, FARINHAS A, TREVISO M, et al. Sparse and continuous attention mechanisms[EB/OL]. (2020-10-29)[2023-08-09]. https://arxiv.org/abs/2006.07214.
[40] MARTINS P H, MARINHO Z, MARTINS A F T. ∞-former: infinite memory Transformer[EB/OL]. (2022-05-25)[2023-08-09]. https://arxiv.org/abs/2109.00301.
[41] RAO Y M, ZHAO W L, ZHU Z, et al. Global filter networks for image classification[EB/OL]. (2021-10-26)[2023-08-09]. https://arxiv.org/abs/2107.00645.
[42] YU W H, LUO M, ZHOU P, et al. MetaFormer is actually what you need for vision[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10819-10829.
[43] BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[EB/OL]. (2021-02-24)[2023-08-09]. https://arxiv.org/abs/2102.05095.