Dec 6, 2021
Transformers have emerged as a powerful tool for visual recognition. Besides demonstrating competitive performance on a broad range of visual benchmarks, recent works also argue that Transformers are much more robust than Convolutional Neural Networks (CNNs). Surprisingly, however, we find these conclusions are drawn from unfair experimental settings, where Transformers and CNNs are compared at different scales and trained with distinct frameworks. In this paper, we aim to provide the first fair, in-depth comparison between Transformers and CNNs, focusing on robustness evaluations. With our unified training setup, we first challenge the previous belief that Transformers outshine CNNs when measuring adversarial robustness. More surprisingly, we find CNNs can match Transformers' robustness in defending against adversarial attacks, provided they properly adopt Transformers' training recipes. Regarding generalization on out-of-distribution samples, we show that pretraining on (external) large-scale datasets is not a fundamental requirement for Transformers to achieve better performance than CNNs. Moreover, our ablations suggest this stronger generalization largely stems from the self-attention-like architecture itself, rather than from other training setups. We hope this work will help the community better understand and benchmark the robustness comparison between Transformers and CNNs.
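To make the adversarial-robustness comparison concrete, below is a minimal sketch of how a shared evaluation protocol might look: both model families are attacked with the same untargeted L-inf PGD attack and scored on the same data. This is an illustrative assumption, not the paper's exact settings; the epsilon, step size, step count, and the helper names (pgd_attack, robust_accuracy) are all hypothetical, and it assumes PyTorch with inputs normalized to [0, 1].

```python
# Hedged sketch: evaluate adversarial robustness of any classifier (CNN or
# Transformer) under one shared PGD attack. Hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=4/255, alpha=1/255, steps=10):
    """Untargeted L-inf PGD: perturb images within an eps-ball to maximize the loss."""
    adv = images.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()          # ascend the loss
        adv = images + (adv - images).clamp(-eps, eps)    # project back into the eps-ball
        adv = adv.clamp(0, 1)
    return adv.detach()

def robust_accuracy(model, loader, device="cuda"):
    """Accuracy on adversarial examples crafted with the same attack for every model."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)
        with torch.no_grad():
            correct += (model(x_adv).argmax(1) == y).sum().item()
        total += y.numel()
    return correct / total
```

Under a protocol like this, the comparison hinges on keeping the attack, data, and training recipe fixed across architectures, which is the point of the unified setup described above.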
Neural Information Processing Systems (NeurIPS) is a multi-track machine learning and computational neuroscience conference that includes invited talks, demonstrations, symposia and oral and poster presentations of refereed papers. Following the conference, there are workshops which provide a less formal setting.