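My biggest impression after reading the whole paper is that this is, once again, a very Kaiming-style piece of work: it throws away the cumbersome parts of earlier methods and gets strong performance in a simple, clear way. Simple and it works, which is admirable. This shows mainly in the following points:
Some takeaway messages: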
Information density is different between language and vision. Languages are human-generated signals that are highly semantic and information-dense. When training a model to predict only a few missing words per sentence, this task appears to induce sophisticated language understanding. Images, on the contrary, are natural signals with heavy spatial redundancy -- e.g., a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes
For exactly this reason, the paper chooses to mask a very high proportion of the patches (75%), aiming for the following:
This strategy largely reduces redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics
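To make the masking step concrete, here is a minimal NumPy sketch of random patch masking. It is my own illustration, not the authors' code: the function name `random_masking`, the `(num_patches, dim)` layout, and the 224x224 image with 16x16 patches are all assumptions made for the example, and `mask_ratio=0.75` is the default ratio the paper reports.

```python
import numpy as np


def random_masking(patches, mask_ratio=0.75, rng=None):
    """Randomly hide a fraction of patches (hypothetical helper, not from the paper's code).

    patches: array of shape (num_patches, dim).
    Returns the visible patches, a boolean mask (True = masked), and the kept indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))

    # Shuffle the patch indices and keep the first num_keep of them.
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])

    # True marks a patch that is hidden from the encoder and must be reconstructed.
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], mask, keep_idx


# Assumed setup: a 224x224 RGB image cut into 16x16 patches gives 14 * 14 = 196 patches.
patches = np.zeros((196, 16 * 16 * 3), dtype=np.float32)
visible, mask, keep_idx = random_masking(patches, rng=np.random.default_rng(0))
print(visible.shape, int(mask.sum()))
```

With these assumed sizes only 49 of the 196 patches stay visible, which gives a feel for how little of the image the model actually gets to see.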
Frankly, this is probably closer than convnets to how humans comprehend visual scenes: a very small number of high acuity patches generated by fast microsaccades
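But honestly, I don't agree. Given the kind of sparse patch input shown in the figure above, I certainly couldn't do it myself. In fact, I once saw people discussing why reconstruction is not an effective pretext task, and the argument was: "a person may know a 100-yuan banknote very well (understands its semantics, has a good representation), but if you ask them to draw it from scratch, most people can't." That said, the reconstructions produced by this paper's model are also fairly blurry, so seen from this angle, doesn't that actually support the argument?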
All in all, this is a very Kaiming-style and very good paper. At the end of the Introduction, the paper says:
In these tasks, our pre-training achieves better results than its supervised pre-training counterparts, and more importantly, we observe significant gains by scaling up models. These observations are aligned with those witnessed in self-supervised pre-training in NLP [14, 40, 41, 4] and we hope that they will enable our field to explore a similar trajectory