
How would you evaluate VOLO: Vision Outlooker for Visual Recognition?


The accuracy is indeed very high, but:

1. Outlooker looks far too similar to Dynamic Convolution. Someone has already raised this in the repo's issues: Compare to DynamicConv · Issue #5 · sail-sg/volo

It is packaged nicely, but Unfold + Matrix Multiplication + Fold is equivalent to an ordinary Conv operation. Because the weight here is different at every spatial position, it becomes a Dynamic Conv.

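There is nothing wrong with using DynamicConv. Several recent Conv+Transformer hybrid networks have shown that this kind of mix reaches good accuracy more easily. But the paper insists on claiming that "attention-based models are indeed able to outperform CNNs." That claim may well be true, but it is not what this work demonstrates.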
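To make the Unfold + Matrix Multiplication point concrete, here is a minimal pure-Python sketch. It is not VOLO's code. It assumes a single channel, stride 1, no padding, and one output value per window, so the final Fold step reduces to a reshape (the real Outlooker folds overlapping windows by summation, which this sketch does not model). The function names and the random per-position weights are hypothetical stand-ins for illustration only.

```python
import random


def unfold(x, k):
    """im2col: one flattened k x k patch per output position (stride 1, no padding)."""
    H, W = len(x), len(x[0])
    return [
        [x[i + di][j + dj] for di in range(k) for dj in range(k)]
        for i in range(H - k + 1)
        for j in range(W - k + 1)
    ]


def conv2d(x, w):
    """Ordinary sliding-window convolution (cross-correlation) with one shared kernel."""
    k = len(w)
    H, W = len(x), len(x[0])
    return [
        [
            sum(x[i + di][j + dj] * w[di][dj] for di in range(k) for dj in range(k))
            for j in range(W - k + 1)
        ]
        for i in range(H - k + 1)
    ]


def unfold_matmul(x, weights, k):
    """Unfold, then dot each patch with weights[p], the flattened kernel for position p."""
    patches = unfold(x, k)
    return [sum(a * b for a, b in zip(patch, wt)) for patch, wt in zip(patches, weights)]


random.seed(0)
k = 3
x = [[random.randint(-5, 5) for _ in range(6)] for _ in range(6)]
w = [[random.randint(-2, 2) for _ in range(k)] for _ in range(k)]
n_out = (6 - k + 1) ** 2  # number of output positions

# Case 1: the same flattened kernel at every position, i.e. ordinary convolution.
shared = [[v for row in w for v in row]] * n_out
flat_conv = [v for row in conv2d(x, w) for v in row]
same_as_conv = unfold_matmul(x, shared, k) == flat_conv

# Case 2: a different kernel at every position, i.e. the "dynamic" case.
# Random integers stand in for weights that a real model would predict from the input.
per_pos = [[random.randint(-2, 2) for _ in range(k * k)] for _ in range(n_out)]
dynamic_out = unfold_matmul(x, per_pos, k)

print(same_as_conv, len(dynamic_out))
```

The only difference between the two cases is whether `weights[p]` is the same for every position `p`, which is exactly the distinction between a plain convolution and a dynamic one in this simplified setting.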

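2. Table 3 shows that the network at each scale gets its own drop path rate and crop ratio. All I can say is that having plenty of GPUs lets you do whatever you like... the rich folks who get to use A100s.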

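3. Following from point 2: since GPUs are clearly not the bottleneck, why not run experiments on ImageNet-21K as well? If you are going to advertise SOTA anyway, wouldn't a SOTA that is easier to compare against be more attractive? Why insist on the "no extra data" condition? Next time, could we also define a setting where SOTA is reached without any A100/V100, using only 1080Ti? [doge]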

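With this many resources, rushing out to stake a claim without doing the experiments solidly leaves the majority, who have far fewer resources, with what exactly? [囧] Or should we just write an abstract and draw a figure to stake the claim first? [doge]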

Update:

The paper posted on arXiv today by FAIR (two of the big three) + UC Berkeley (Tete Xiao, Trevor Darrell), Early Convolutions Help Transformers See Better https://arxiv.org/pdf/2106.14881.pdf , is a pleasure to read, especially next to VOLO's overclaim.

First, the analysis, experiments, and writing in this FAIR work are all very clear. It says what it found and does not overclaim. One sentence that stuck with me: "Moreover, under carefully controlled comparisons, we find that ViTs are only able to surpass state-of-the-art CNNs when equipped with a convolutional stem."

There is nothing shameful about Conv+Transformer, yet VOLO forcibly dresses DynamicConv up as Unfold + Matrix Multiplication + Fold and then claims "attention-based models are indeed able to outperform CNNs." [囧]

I wonder whether LeCun will see this, and if he does, whether he will feel like pushing back [doge]
