Response to "Evaluating Machine Accuracy on ImageNet" The authors scrutinize the accuracy of models trained on ImageNet by comparing them to five trained human labelers. They find that only the latest models are on par with humans. Furthermore, human accuracy on inanimate objects is higher than the best model on ImageNets. They also observe that human labelers are robust to small distribution shift while the machines trained on ImageNet achieve much lower top-1 accuracy on ImageNet-V2 validation set. To that end, they provide three guidelines for more reliable machine evaluation: 1) Measure the multi-label accuracy, 2) Report performance on dogs, other animals, and inanimate objects separately, 3) Evaluate performance to distribution shift. I totally agree author’s motivation that current evaluation protocol based on top-1 and top-5 accuracy may cause severe misunderstanding of machines. I have had a new understanding of why the top-1 and top-5 accuracy may not reflect the performance of machine. Some images have multi objects or allowing five guesses on all images can make certain class distinctions trivial. It is very impressive in this work to report the various types of accuracy including accuracies of all images, images without dogs and object images only, etc. The gaps between accuracies show some problems of evaluation protocol based on ImageNet. I think that showing the existence of machines that are robust to small distribution shifts is also a big contribution of this work. As the author mentioned, further directions of the human generalization such as adversarial robustness or bias can be explored, which is helpful to figure out exactly the current methods. By observing whether robustness is equally present in the case of non-experts, it is possible to determine whether the direction we should study is to increase the robustness of the model or to produce a model with expert-like performance.