Abstract
Automatic segmentation of polyps in endoscopic images is crucial for the early diagnosis and surgical planning of colorectal cancer. However, polyps closely resemble the surrounding mucosal tissue in texture, have indistinct borders, and vary in size, appearance, and location, all of which pose great challenges for polyp segmentation. Although recent attempts have applied Vision Transformers (ViTs) to polyp segmentation with promising performance, their use in clinical scenarios is still limited by high computational complexity, large model size, redundant dependencies, and significant training costs. To address these limitations, we propose a novel ViT-based approach named Polyp-LVT, which strategically replaces the attention layer in the encoder with a global max pooling layer, significantly reducing the model's parameter count and computational cost while leaving performance undegraded. Furthermore, we introduce a network block, named the Inter-block Feature Fusion Module (IFFM), into the decoder to offer streamlined yet highly efficient feature extraction. We conduct extensive experiments on three public polyp image benchmarks to evaluate our method. The experimental results show that, compared with the baseline models, our Polyp-LVT network achieves a nearly 44% reduction in model parameters while delivering comparable segmentation performance.
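The key efficiency idea in the abstract is swapping the encoder's self-attention for global max pooling. A minimal sketch of why this saves computation, assuming a token mixer that takes the element-wise maximum over the sequence dimension (the exact layer design in the paper may differ): pairwise attention over n tokens of dimension d costs on the order of n²·d operations, while a global max pool costs only n·d.

```python
def global_max_pool_mix(tokens):
    """Token mixer: element-wise max over the token (sequence) dimension.

    tokens: list of n feature vectors, each of length d.
    Returns one pooled d-dimensional vector in O(n * d) time,
    versus O(n^2 * d) for pairwise self-attention.
    """
    d = len(tokens[0])
    return [max(tok[i] for tok in tokens) for i in range(d)]


# Hypothetical example: 3 tokens, 4 feature channels.
tokens = [[1.0, 5.0, 2.0, 0.0],
          [3.0, 1.0, 2.0, 4.0],
          [0.0, 2.0, 6.0, 1.0]]
print(global_max_pool_mix(tokens))  # -> [3.0, 5.0, 6.0, 4.0]
```

Because the pooled output has no pairwise token interactions to learn, such a layer also carries far fewer parameters than a multi-head attention block, which is consistent with the reported parameter reduction.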
Original language | English |
---|---|
Article number | 112181 |
Journal | Knowledge-Based Systems |
Volume | 300 |
Early online date | 27 Jun 2024 |
DOIs | |
Publication status | Published (in print/issue) - 27 Sept 2024 |
Bibliographical note
Publisher Copyright: © 2024 Elsevier B.V.
Data Access Statement
Data will be made available on request.

Keywords
- Polyp segmentation
- Lightweight vision transformer
- Pooling layer
- Colorectal cancer