
An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale

백관구 · August 3, 2023, 17:34

 

Paper Title

- An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale

Publication Date

- June 3, 2021 (arXiv v2; first posted October 2020)

URL

- https://arxiv.org/pdf/2010.11929.pdf

 


 


 

Abstract

1. Introduction

2. Related Work

3. Method

    3.1. Vision Transformer (ViT)

    3.2. Fine-Tuning and Higher Resolution

4. Experiments

    4.1. Setup

    4.2. Comparison to State of the Art

    4.3. Pre-Training Data Requirements

    4.4. Scaling Study

    4.5. Inspecting Vision Transformer

    4.6. Self-Supervision

5. Conclusion

Acknowledgements

References

Appendix

    A. Multihead Self-Attention

    B. Experiment Details

        B.1. Training

            B.1.1. Fine-Tuning

            B.1.2. Self-Supervision

    C. Additional Results

    D. Additional Analyses

        D.1. SGD vs. Adam for ResNets

        D.2. Transformer Shape

        D.3. Head Type and CLASS Token

        D.4. Positional Embedding

        D.5. Empirical Computational Costs

        D.6. Axial Attention

        D.7. Attention Distance

        D.8. Attention Maps

        D.9. ObjectNet Results

        D.10. VTAB Breakdown
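
For orientation before diving into the sections above: the core idea of the paper (Section 3.1) is to split an image into fixed-size 16x16 patches, flatten each patch and project it with a shared linear layer, prepend a learnable [class] token, and add positional embeddings; the resulting token sequence goes into a standard Transformer encoder. Below is a minimal PyTorch sketch of that patch-embedding step. It is an illustrative reconstruction, not the authors' code; the module and attribute names (PatchEmbedding, proj, cls_token, pos_embed) are assumptions. Dimensions follow ViT-Base (D = 768), so a 224x224 input yields 14x14 = 196 patches plus the [class] token.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of ViT's patch embedding (Eq. 1 of the paper); names are illustrative."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A convolution with kernel = stride = patch size is equivalent to
        # flattening each 16x16 patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (B, 3, 224, 224)
        B = x.shape[0]
        x = self.proj(x)                        # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, 768)
        cls = self.cls_token.expand(B, -1, -1)  # (B, 1, 768)
        x = torch.cat([cls, x], dim=1)          # prepend [class] token -> (B, 197, 768)
        return x + self.pos_embed               # learnable 1-D positional embeddings


tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```

The encoder that consumes these tokens is unchanged from the NLP Transformer, which is why pre-training scale (Section 4.3) matters so much: the architecture carries far weaker image-specific inductive biases than a CNN.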

 
