CMNet: a novel model and design rationale based on comparison studies and synergy of CNN and MetaFormer

Haowen Yu, Liming Chen

Research output: Contribution to journal › Article › peer-review


Abstract

Convolutional and Transformer-based backbone architectures are the two dominant, widely accepted model families in computer vision. Nevertheless, deciding which backbone architecture performs better, and under which circumstances, remains a challenge and thus a focus of research. In this paper, we conduct an in-depth investigation into the differences in the macroscopic backbone design of CNN and Transformer models, with the ultimate purpose of developing new models that combine the strengths of both types of architecture for effective image classification. Specifically, we first analyze the structures of both models and identify four main differences; we then design four sets of ablation experiments on the ImageNet-1K dataset, using image classification as an example problem, to study the impact of these four differences on model performance. Based on the experimental results, we derive four observations as rules of thumb for designing a vision model backbone architecture. Informed by these findings, we then conceive a novel model called CMNet, which marries the experimentally validated best design practices of CNN and Transformer architectures. Finally, we carry out extensive experiments on CMNet using the same dataset against baseline classifiers. Initial results show that CMNet achieves a peak top-1 accuracy of 80.08% on the ImageNet-1K validation set, which is very competitive with previous classical models of similar computational complexity. Details of the implementation, including algorithms and code, are publicly available on GitHub: https://github.com/Arwin-Yu/CMNet.
Original language: English
Article number: 109
Pages (from-to): 1-13
Number of pages: 13
Journal: Machine Vision and Applications
Volume: 34
Issue number: 6
Early online date: 22 Sept 2023
Publication status: Published online - 22 Sept 2023

Bibliographical note

Publisher Copyright:
© 2023, The Author(s).

Keywords

  • Attention mechanism
  • Transformer
  • Convolutional neural network
  • MetaFormer
