Response to: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"

This paper proposes applying Transformers to an image classification task with very little modification of the original Transformer architecture. The authors simply split an image into P x P patches, flatten each patch, and map it to a kind of 'token' with a learnable linear embedding. These tokens are combined with simple learnable 1D positional embeddings and fed into a BERT-like Transformer architecture to predict the classification label. The paper's experiments show that their ViT (Vision Transformer) model outperforms existing CNN-based SOTA models.

The most interesting aspect of this paper, to me, was the idea of 'applying Transformers to computer vision' itself. I am not very familiar with computer vision work, but I have fiddled a lot with Transformer-based NLP models such as BERT and GPT. I was therefore used to the Transformer architecture, but had never thought about applying it to two-dimensional images, mainly because of the computational cost of the architecture. As mentioned in the paper, the Transformer is the de facto architecture in NLP, and is notably better than earlier NLP-targeting models such as RNNs. However, Transformers suffer from quadratic computation (O(n²) for a sequence of length n), because each token must attend to every other token in the sequence. This makes it rather hard to imagine applying Transformers to an image, because it would be extremely time-consuming for each pixel to attend to every other pixel. This paper overcomes that problem by 'grouping' pixels into patches and treating each patch as a single token. Regardless of the accuracy gains, I thought this was a novel idea, and it also made me curious about how well-known computer vision architectures could be applied to NLP tasks. It was also very impressive and surprising to see that this methodology actually works, i.e., shows comparable or better performance than the famous ResNet.

However, while reading the paper, some questions and ideas came to mind. First, despite the novelty of the approach, the paper continuously emphasizes that the ViT model performs well only with large training datasets. The experimental results show that for smaller pre-training datasets such as ImageNet, previous CNN-based models achieve higher accuracy than ViT. Since recent deep learning trends are in search of better performance with less data, the paper's experimental results may appear not so appealing.

Second, I believe a more detailed ablation study on patch size could have given readers better insight into how the choice of patch affects accuracy. The paper mainly focuses on 16 x 16 or 14 x 14 patches, and for those sizes the reported accuracy is above the SOTA level. However, I think the way the image is split up, and what each patch holds, matters a great deal for the classification task. For example, is a patch most useful when it captures small, detailed features, or when it holds larger shapes or backgrounds? Such ablation studies could help find an optimal patch size for different image sizes. Furthermore, patch-size ablations could also be valuable once the model is evaluated on multiple datasets and tasks (apart from classification), since they might reveal an optimal patch size for different computer vision tasks.
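
To make the patch-to-token step described above concrete, here is a minimal sketch of how an image might be split into patches and projected into token embeddings. This is my own illustration, not the authors' code: the use of PyTorch, the variable names, and the specific sizes (224 x 224 images, 16 x 16 patches, 768-dimensional embeddings) are assumptions chosen only to mirror the paper's setup.

```python
# Minimal sketch of ViT-style patch embedding (illustrative, not the paper's implementation).
import torch
import torch.nn as nn

image_size, patch_size, embed_dim = 224, 16, 768          # assumed sizes for illustration
num_patches = (image_size // patch_size) ** 2              # 14 * 14 = 196 patch tokens

# A Conv2d with kernel = stride = patch_size is equivalent to flattening each
# P x P patch and applying a shared learnable linear projection.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

# Learnable [class] token and learnable 1D positional embeddings, BERT-style.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

x = torch.randn(2, 3, image_size, image_size)              # dummy batch of 2 RGB images
tokens = patch_embed(x).flatten(2).transpose(1, 2)          # (2, 196, 768): one token per patch
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], dim=1) + pos_embed

# The resulting (batch, 197, 768) sequence is what a standard Transformer encoder
# attends over: self-attention now scales with 197² tokens rather than the
# 224*224 = 50,176 individual pixels, which is the efficiency point raised above.
print(tokens.shape)                                         # torch.Size([2, 197, 768])
```

The point of the sketch is simply that patching shrinks the sequence length from the number of pixels to the number of patches, which is what makes quadratic self-attention affordable for images.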