HyperAI

Meta's ConvNeXt: The CNN Architecture That Surpasses ViT's Performance

4 months ago

Researchers from Meta have developed a convolutional neural network (CNN) called ConvNeXt that challenges the dominance of Vision Transformers (ViTs) in computer vision tasks. Contrary to the belief that ViTs render traditional CNNs obsolete, Meta's team demonstrated that by gradually modernizing the design of the ResNet architecture, they could match or even exceed ViT performance.

The Research and Findings

Macro Design

Meta's researchers began by tuning the macro design of the ResNet model. Inspired by the Swin Transformer, they adjusted the stage ratios to 1:1:3:1, which improved the top-1 accuracy of ResNet-50 on ImageNet from 78.8% to 79.4%. They also replaced the first convolution layer's 7x7 kernel (stride 2) with a 4x4 kernel at stride 4, a "patchify" stem that downsamples the image more aggressively and nudged accuracy up to 79.5%.

ResNeXt-ification

Next, the team incorporated elements of the ResNeXt architecture, which uses grouped convolutions to reduce computational cost. Initially this led to a drop in accuracy to 78.3%, but widening the network to match the Swin Transformer pushed accuracy up to 80.5%.

Inverted Bottleneck

The researchers then adopted the inverted bottleneck structure used in transformer blocks, which follows a narrow → wide → narrow pattern; this raised accuracy to 80.6%. They further refined the block by swapping the order of the depthwise convolution and the first pointwise convolution, which temporarily reduced accuracy to 79.9%, then enlarged the depthwise kernel to 7x7, bringing accuracy back to 80.6%.

Micro Design

The final phase involved micro-level optimizations, such as replacing ReLU with GELU and reducing the number of activation functions. Placing a single GELU between the two pointwise convolutions raised accuracy to 81.3%, matching the Swin-T architecture.
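The macro-design choices above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the ConvNeXt-T variant (base depth 3, so per-stage depths of 3, 3, 9, 3); those specific values come from the ConvNeXt paper, not this article:

```python
# Stage ratio 1:1:3:1, as adopted from the Swin Transformer.
stage_ratio = (1, 1, 3, 1)
base_depth = 3  # assumed ConvNeXt-T base depth
depths = [r * base_depth for r in stage_ratio]
print(depths)  # [3, 3, 9, 3]

# "Patchify" stem: 4x4 kernel with stride 4 splits the image into
# non-overlapping patches, shrinking the spatial size by a factor of 4.
def stem_output_size(image_size, kernel=4, stride=4):
    return (image_size - kernel) // stride + 1

print(stem_output_size(224))  # 56
```

With a 224x224 input, the stem produces a 56x56 feature map, matching the factor-of-4 reduction described above.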
Additional improvements came from reducing normalization: a single normalization layer placed before the first pointwise convolution, with layer normalization substituted for batch normalization, and separate downsampling layers added between stages. Together these changes raised accuracy to 82.0% while keeping computational complexity lower.

ConvNeXt Architecture

ConvNeXt Block

The ConvNeXt block is the fundamental unit of the architecture: a depthwise convolution layer, followed by layer normalization, then a pair of pointwise convolution layers with a GELU activation in between. The key differences from a traditional ResNet block (depthwise convolutions, a large 7x7 kernel, and layer normalization) account for much of the model's efficiency and performance.

ConvNeXt Block Transition

At stage boundaries, a transition block handles changes in spatial resolution and channel count. It includes a projection layer to adjust the dimensions of the residual connection and a downsampling layer that halves the spatial resolution, keeping the hand-off between stages consistent.

Full Implementation

The complete architecture is built by stacking these blocks and transitions. The model starts with a stem stage that reduces the image size by a factor of 4 using a 4x4 kernel with stride 4. Each subsequent stage (res2 through res5) applies a series of ConvNeXt blocks, with transitions between res2 and res3, res3 and res4, and res4 and res5. The network ends with an adaptive average pooling layer and a fully connected output layer.

Industry Implications

The success of ConvNeXt in achieving state-of-the-art results while retaining the simplicity and efficiency of CNNs has significant implications for computer vision: it demonstrates that traditional CNNs, properly modernized, remain competitive with newer transformer-based models.
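The block described above can be sketched in PyTorch (the official repository cited in the references is in PyTorch). This is a simplified sketch, not the exact official implementation; the channel width of 96 and the 4x expansion in the pointwise layers are assumptions taken from the ConvNeXt-T configuration:

```python
import torch
from torch import nn

class ConvNeXtBlock(nn.Module):
    """Sketch of a ConvNeXt block: depthwise 7x7 conv, layer norm,
    then two pointwise layers (narrow -> wide -> narrow) with GELU."""

    def __init__(self, dim):
        super().__init__()
        # groups=dim makes this a depthwise convolution; padding=3
        # preserves the spatial size with a 7x7 kernel.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # single norm, over channels
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise: narrow -> wide
        self.act = nn.GELU()                    # single activation
        self.pwconv2 = nn.Linear(4 * dim, dim)  # pointwise: wide -> narrow

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # NCHW -> NHWC for LayerNorm/Linear
        x = self.norm(x)
        x = self.pwconv2(self.act(self.pwconv1(x)))
        x = x.permute(0, 3, 1, 2)  # back to NCHW
        return x + residual        # residual connection

block = ConvNeXtBlock(dim=96)
x = torch.randn(1, 96, 56, 56)
y = block(x)
print(y.shape)  # torch.Size([1, 96, 56, 56])
```

Because the block changes neither the spatial size nor the channel count, the residual connection needs no projection here; that is handled by the transition blocks between stages.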
This research challenges the prevailing notion that transformers are the only path forward for advanced AI models and may encourage further hybrid approaches that combine the strengths of both architectures.

Company Profiles

Meta: Formerly known as Facebook, Meta is a leading technology company focused on AI research and social media platforms. Its commitment to AI research and development is evident in moves such as its investment in Scale AI.

Scale AI: A prominent data-labeling company, Scale AI provides training data for AI models and has worked with leading AI developers including Meta, OpenAI, and Google.

References

Zhuang Liu et al. "A ConvNet for the 2020s." arXiv. https://arxiv.org/pdf/2201.03545
Facebook Research. "ConvNeXt." GitHub. https://github.com/facebookresearch/ConvNeXt/blob/main/models/convnext.py
Kaiming He et al. "Deep Residual Learning for Image Recognition." arXiv. https://arxiv.org/pdf/1512.03385
Ze Liu et al. "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." arXiv. https://arxiv.org/pdf/2103.14030
Saining Xie et al. "Aggregated Residual Transformations for Deep Neural Networks." arXiv. https://arxiv.org/pdf/1611.05431
Muhammad Ardi. "Paper Walkthrough: Residual Network (ResNet)." Python in Plain English. https://python.plainenglish.io/paper-walkthrough-residual-network-resnet-62af58d1c521
Muhammad Ardi Putra. "The CNN That Challenges ViT — ConvNeXt." GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20CNN%20That%20Challenges%20ViT%20-%20ConvNeXt.ipynb

This research highlights the ongoing importance of CNNs in the evolving landscape of AI, showing that careful optimization can bring significant improvements to existing architectures.
