A Review of Vision Transformer for Image Classification
[1] ZHI Min, LU Jingfang. A Review of Vision Transformer for Image Classification[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(04): 19-29. [doi: 10.13705/j.issn.1671-6833.2024.01.015]
References:
[1] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2016: 770-778.
[2] TAN M X, LE Q V. EfficientNet: rethinking model scaling for convolutional neural networks[EB/OL]. (2020-09-11)[2023-08-09]. https://arxiv.org/abs/1905.11946.
[3] RADOSAVOVIC I, KOSARAJU R P, GIRSHICK R, et al. Designing network design spaces[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 10425-10433.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[5] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03)[2023-08-09]. https://arxiv.org/abs/2010.11929.
[6] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with Transformers[J]. Lecture Notes in Computer Science, 2020, 12346: 213-229.
[7] WANG H Y, ZHU Y K, ADAM H, et al. MaX-DeepLab: end-to-end panoptic segmentation with mask Transformers[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 5459-5470.
[8] CHENG B W, SCHWING A G, KIRILLOV A. Per-pixel classification is not all you need for semantic segmentation[EB/OL]. (2021-08-31)[2023-08-09]. https://arxiv.org/abs/2107.06278.
[9] CHEN X, YAN B, ZHU J W, et al. Transformer tracking[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 8122-8131.
[10] JIANG Y F, CHANG S Y, WANG Z Y. TransGAN: two pure Transformers can make one strong GAN, and that can scale up[EB/OL]. (2021-12-09)[2023-08-09]. https://arxiv.org/abs/2102.07074.
[11] CHEN H T, WANG Y H, GUO T Y, et al. Pre-trained image processing Transformer[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 12294-12305.
[12] TAY Y, DEHGHANI M, BAHRI D, et al. Efficient Transformers: a survey[J]. ACM Computing Surveys, 2023, 55(6): 1-28. 
[13] KHAN S, NASEER M, HAYAT M, et al. Transformers in vision: a survey[J]. ACM Computing Surveys, 2022, 54(10s): 1-41.
[14] HAN K, WANG Y H, CHEN H T, et al. A survey on Vision Transformer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 87-110. 
[15] LIN T Y, WANG Y X, LIU X Y, et al. A survey of Transformers[J]. AI Open, 2022, 3: 111-132. 
[16] BI Y, XUE B, ZHANG M J. A survey on genetic programming to image analysis[J]. Journal of Zhengzhou University (Engineering Science), 2018, 39(6): 3-13.
[17] YUAN L, CHEN Y P, WANG T, et al. Tokens-to-token ViT: training Vision Transformers from scratch on ImageNet[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 558-567.
[18] WU H P, XIAO B, CODELLA N, et al. CvT: introducing convolutions to Vision Transformers[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 22-31.
[19] WANG W H, XIE E Z, LI X, et al. Pyramid Vision Transformer: a versatile backbone for dense prediction without convolutions[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 568-578.
[20] WANG W H, XIE E Z, LI X, et al. PVTv2: improved baselines with pyramid Vision Transformer[J]. Computational Visual Media, 2022, 8(3): 415-424.
[21] PAN Z Z, ZHUANG B H, HE H Y, et al. Less is more: pay less attention in Vision Transformers[EB/OL]. (2021-12-23)[2023-08-09]. https://arxiv.org/abs/2105.14217.
[22] SHAW P, USZKOREIT J, VASWANI A. Self-attention with relative position representations[EB/OL]. (2018-04-12)[2023-08-09]. https://arxiv.org/abs/1803.02155.
[23] CHU X X, TIAN Z, ZHANG B, et al. Conditional positional encodings for Vision Transformers[EB/OL]. (2023-02-13)[2023-08-09]. https://arxiv.org/abs/2102.10882.
[24] DONG X Y, BAO J M, CHEN D D, et al. CSWin Transformer: a general Vision Transformer backbone with cross-shaped windows[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12114-12124.
[25] LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: hierarchical Vision Transformer using shifted windows[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 10012-10022.
[26] ZHANG Z M, GONG X. Axially expanded windows for local-global interaction in Vision Transformers[EB/OL]. (2022-11-13)[2023-08-09]. https://arxiv.org/abs/2209.08726.
[27] TU Z Z, TALEBI H, ZHANG H, et al. MaxViT: multi-axis Vision Transformer[C]//European Conference on Computer Vision. Cham: Springer, 2022: 459-479.
[28] FANG J M, XIE L X, WANG X G, et al. MSG-Transformer: exchanging local spatial information by manipulating messenger tokens[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12053-12062.
[29] HAN K, XIAO A, WU E H, et al. Transformer in Transformer[EB/OL]. (2021-08-26)[2023-08-09]. https://arxiv.org/abs/2103.00112.
[30] CHU X X, TIAN Z, WANG Y Q, et al. Twins: revisiting the design of spatial attention in Vision Transformers[EB/OL]. (2021-09-30)[2023-08-09]. https://arxiv.org/abs/2104.13840.
[31] FAN Q H, HUANG H B, GUAN J Y, et al. Rethinking local perception in lightweight Vision Transformer[EB/OL]. (2023-06-01)[2023-08-09]. https://arxiv.org/abs/2303.17803.
[32] GUO J Y, HAN K, WU H, et al. CMT: convolutional neural networks meet Vision Transformers[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12165-12175.
[33] WOO S, DEBNATH S, HU R H, et al. ConvNeXt V2: co-designing and scaling ConvNets with masked autoencoders[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 16133-16142.
[34] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 4510-4520.
[35] LIU Z, MAO H Z, WU C Y, et al. A ConvNet for the 2020s[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 11966-11976.
[36] REN S C, ZHOU D Q, HE S F, et al. Shunted self-attention via multi-scale token aggregation[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10853-10862.
[37] YUAN K, GUO S P, LIU Z W, et al. Incorporating convolution designs into Visual Transformers[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 559-568.
[38] LEE-THORP J, AINSLIE J, ECKSTEIN I, et al. FNet: mixing tokens with Fourier Transforms[EB/OL]. (2022-05-26)[2023-08-09]. https://arxiv.org/abs/2105.03824.
[39] MARTINS A F T, FARINHAS A, TREVISO M, et al. Sparse and continuous attention mechanisms[EB/OL]. (2020-10-29)[2023-08-09]. https://arxiv.org/abs/2006.07214.
[40] MARTINS P H, MARINHO Z, MARTINS A F T. ∞-former: infinite memory Transformer[EB/OL]. (2022-05-25)[2023-08-09]. https://arxiv.org/abs/2109.00301.
[41] RAO Y M, ZHAO W L, ZHU Z, et al. Global filter networks for image classification[EB/OL]. (2021-10-26)[2023-08-09]. https://arxiv.org/abs/2107.00645.
[42] YU W H, LUO M, ZHOU P, et al. MetaFormer is actually what you need for vision[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10819-10829.
[43] BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[EB/OL]. (2021-02-24)[2023-08-09]. https://arxiv.org/abs/2102.05095.