Response to: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"

This paper proposes applying Transformers to an image classification task with very little modification of the original Transformer architecture. The authors simply split an image into P x P patches, flatten each patch, and map it to a kind of 'token' with a learnable linear embedding. These tokens are combined with simple learnable 1D positional embeddings and fed into a BERT-like Transformer architecture to predict the classification label. The paper's experiments show that their ViT (Vision Transformer) model outperforms existing CNN-based SOTA models.

The most interesting aspect of this paper, to me, was the idea of 'applying Transformers to computer vision' itself. I am not very familiar with computer vision work, but I have fiddled a lot with Transformer-based NLP models such as BERT and GPT. I was therefore used to the Transformer architecture, but had never thought about applying it to two-dimensional images, mainly because of the computational cost of the architecture. As mentioned in the paper, the Transformer is the de facto architecture in NLP, and is notably better than earlier NLP-targeting models such as RNNs. However, Transformers suffer from quadratic computation (O(n²) for a sequence of length n), because each token must attend to every other token in the sequence. This makes it rather hard to imagine applying Transformers to an image, because it would be extremely time-consuming for each pixel to attend to every other pixel. This paper overcomes that problem by 'grouping' pixels into patches and treating each patch as a single token. Regardless of the accuracy gains, I thought this was a novel idea, and it also made me curious about how well-known computer vision architectures could be applied to NLP tasks. It was also very impressive and surprising to see that this methodology actually works, i.e., shows comparable or better performance than the famous ResNet.

However, while reading the paper, some questions and ideas came to mind. First, despite the novelty of the approach, the paper continuously emphasizes that the ViT model performs well only with large training datasets. The experimental results show that for smaller pre-training datasets such as ImageNet, previous CNN-based models achieve higher accuracy than ViT. Since recent deep learning trends are in search of better performance with less data, the paper's experimental results may appear not so appealing.

Second, I believe a more detailed ablation study on patch size could have given readers better insight into how the choice of patch affects accuracy. The paper mainly focuses on 16 x 16 or 14 x 14 patches, and for those sizes the reported accuracy is above the SOTA level. However, I think the way the image is split up, and what each patch holds, matters a great deal for the classification task. For example, is a patch most useful when it captures small, detailed features, or when it holds larger shapes or backgrounds? Such ablation studies could help find an optimal patch size for different image sizes. Furthermore, patch-size ablations could also be valuable once the model is evaluated on multiple datasets and tasks (apart from classification), since they might reveal an optimal patch size for different computer vision tasks.
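
To make the patch-to-token step described above concrete, here is a minimal sketch of how an image might be split into patches and projected into token embeddings. This is my own illustration, not the authors' code: the use of PyTorch, the variable names, and the specific sizes (224 x 224 images, 16 x 16 patches, 768-dimensional embeddings) are assumptions chosen only to mirror the paper's setup.

```python
# Minimal sketch of ViT-style patch embedding (illustrative, not the paper's implementation).
import torch
import torch.nn as nn

image_size, patch_size, embed_dim = 224, 16, 768          # assumed sizes for illustration
num_patches = (image_size // patch_size) ** 2              # 14 * 14 = 196 patch tokens

# A Conv2d with kernel = stride = patch_size is equivalent to flattening each
# P x P patch and applying a shared learnable linear projection.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

# Learnable [class] token and learnable 1D positional embeddings, BERT-style.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

x = torch.randn(2, 3, image_size, image_size)              # dummy batch of 2 RGB images
tokens = patch_embed(x).flatten(2).transpose(1, 2)          # (2, 196, 768): one token per patch
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], dim=1) + pos_embed

# The resulting (batch, 197, 768) sequence is what a standard Transformer encoder
# attends over: self-attention now scales with 197² tokens rather than the
# 224*224 = 50,176 individual pixels, which is the efficiency point raised above.
print(tokens.shape)                                         # torch.Size([2, 197, 768])
```

The point of the sketch is simply that patching shrinks the sequence length from the number of pixels to the number of patches, which is what makes quadratic self-attention affordable for images.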