A ConvNet for the 2020s
This is a brief overview of Facebook AI’s latest paper on ConvNets.
The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks.
What is this paper about?
It is an exploratory journey from basic ConvNets through Vision Transformers to hierarchical Vision Transformers, identifying the key design decisions that contributed to better performance along the way and testing the limits of what a pure ConvNet can achieve.
The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt.
Why are ConvNets still dominant?
- In many applications, the “sliding window strategy” is intrinsic to visual processing.
- They have several built-in inductive biases, such as translation equivariance, that make them well suited to vision tasks.
- Without ConvNets’ inductive biases, a vanilla ViT faces many challenges, chiefly the quadratic complexity of global self-attention with respect to the input size (see the sketch at the end of this section).
- Hierarchical Transformers take a hybrid approach to bridge the gap between ConvNets and ViTs, bringing convolutions back in.
- These attempts to bring back convolutions come at a price: a naive implementation of sliding-window self-attention is expensive. The speed can be optimized with advanced approaches, but the system becomes more sophisticated in design.
So, it’s almost ironic that a ConvNet already satisfies many of those desired properties, albeit in a straightforward, no-frills way.
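To make the complexity point concrete, here is a back-of-the-envelope sketch (my own illustration, not from the paper) comparing how the costs of global self-attention and a depthwise convolution grow with feature-map resolution. The channel width (96) and 7×7 kernel are assumptions chosen to match typical Swin-T/ConvNeXt-T settings.

```python
# Illustrative cost comparison (assumed settings: dim=96, 7x7 kernel).
# Global self-attention scales quadratically with the number of tokens;
# convolution scales linearly with the number of pixels.

def attention_flops(h, w, dim):
    """Rough FLOPs of one global self-attention layer: the QK^T and AV
    matmuls each cost N^2 * dim, with N = h * w tokens."""
    n = h * w
    return 2 * n * n * dim

def depthwise_conv_flops(h, w, dim, kernel=7):
    """Rough FLOPs of one depthwise conv: each of the h*w*dim outputs
    sums over a kernel x kernel window."""
    return h * w * dim * kernel * kernel

for size in (14, 28, 56):  # typical feature-map resolutions
    att = attention_flops(size, size, 96)
    conv = depthwise_conv_flops(size, size, 96)
    print(f"{size}x{size}: attention ~{att:.1e} FLOPs, depthwise conv ~{conv:.1e} FLOPs")
```

Doubling the resolution multiplies the convolution cost by 4 but the attention cost by 16, which is why windowed (sliding-window) attention becomes necessary at high resolutions.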
Steps in this exploratory research
- Start with a standard ResNet (e.g., ResNet-50) trained with improved training techniques.
- Gradually modernize the architecture toward the construction of a hierarchical Vision Transformer (e.g., Swin-T); a concrete example of one such step is sketched after this list.
- The exploration investigates and follows design decisions at different levels of a Swin Transformer while maintaining the network’s simplicity as a standard ConvNet.
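One such modernization step described in the paper replaces the ResNet stem (a 7×7, stride-2 convolution followed by a stride-2 max pool) with a Swin-style “patchify” stem (a non-overlapping 4×4, stride-4 convolution). Here is a minimal PyTorch sketch; the channel widths (64 and 96) follow ResNet-50 and Swin-T/ConvNeXt-T respectively.

```python
import torch
import torch.nn as nn

# ResNet stem: 7x7 conv (stride 2) + 3x3 max pool (stride 2),
# for an overall 4x downsampling.
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: one non-overlapping 4x4 conv with stride 4,
# mirroring how ViT/Swin split the image into patches.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```

Both stems reduce a 224×224 input to a 56×56 feature map, but the patchify stem does it with a single, simpler layer.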
Here is the paper’s figure summarizing the outcome of this journey:
We modernize a standard ConvNet (ResNet) towards the design of a hierarchical vision Transformer (Swin), without introducing any attention-based modules. The foreground bars are model accuracies in the ResNet-50/Swin-T FLOP regime; results for the ResNet-200/Swin-B regime are shown with the gray bars. A hatched bar means the modification is not adopted. Detailed results for both regimes are in the appendix. Many Transformer architectural choices can be incorporated in a ConvNet, and they lead to increasingly better performance. In the end, our pure ConvNet model, named ConvNeXt, can outperform the Swin Transformer.

Result of this research
As a result, the authors propose a family of pure ConvNets dubbed ConvNeXt. The resulting networks compete favorably with state-of-the-art hierarchical vision Transformers, outperforming the Swin Transformer in accuracy, while maintaining the simplicity and efficiency of standard ConvNets.
Here is the architecture of the ConvNeXt block compared to the ResNet and Swin Transformer blocks:
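In code, the ConvNeXt block described in the paper looks roughly like this: a 7×7 depthwise convolution, LayerNorm, a 1×1 convolution expanding the channels 4× (an inverted bottleneck), GELU, and a 1×1 convolution projecting back, all wrapped in a residual connection. A minimal PyTorch sketch follows; it omits details of the official implementation such as layer scale and stochastic depth.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Minimal sketch of a ConvNeXt block, following the paper:
    7x7 depthwise conv -> LayerNorm -> 1x1 conv (4x expansion) ->
    GELU -> 1x1 conv, wrapped in a residual connection.
    (Omits layer scale and stochastic depth from the official code.)"""

    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # normalizes over channels
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv written as a Linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)        # (N, H, W, C) so norm/Linear act on channels
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)        # back to (N, C, H, W)
        return residual + x

block = ConvNeXtBlock(96)
print(block(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```

Note the Transformer-like ordering: one spatial mixing layer (the depthwise conv, playing the role that self-attention plays in Swin) followed by an MLP, with a single normalization and a single activation per block.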

Here are the links to their GitHub implementation and research paper…
That’s all for this one, folks!
If you like my articles, you can help me by following me here on Medium
See you in the neXt one ;) ciao…
References:
[1] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell and Saining Xie, A ConvNet for the 2020s (2022), arXiv preprint arXiv:2201.03545
Email: mkataria920@gmail.com
LinkedIn: https://www.linkedin.com/in/mansirkataria/
Twitter: https://twitter.com/_mansi___
Medium: https://zoomout.medium.com