Abstract
Real-time RGB-D semantic segmentation is critical for a wide range of visual scene understanding tasks. Although recent methods have achieved notable progress, they still face significant challenges in fully exploiting depth information and achieving robust RGB-D fusion, particularly under the constraints of real-time performance. Balancing segmentation accuracy and computational efficiency remains a key bottleneck for practical deployment. To address these challenges, we propose ADBNet, an efficient asymmetric dual-branch network for real-time indoor semantic segmentation. ADBNet employs an asymmetric encoder with enhanced depth processing and incorporates a novel Conv-Former Pyramid Vision Transformer (CF-PVT) featuring decomposed convolutional attention to improve depth feature extraction. Furthermore, an Adaptive Feature Recalibration and Fusion (AFRF) module is introduced to enable effective cross-modal alignment and multi-scale feature fusion. Experiments on the NYU-Depth V2 dataset demonstrate that ADBNet achieves an excellent trade-off between accuracy and efficiency, running at 66 FPS with 79.40% pixel accuracy and 56.0% mean IoU.
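The abstract names a "decomposed convolutional attention" mechanism inside CF-PVT for depth feature extraction. The paper's exact formulation is not reproduced here; the following is a minimal sketch assuming the decomposition refers to factorized (1×k followed by k×1) depthwise convolutions used as a spatial attention gate, a common low-cost design for real-time settings. All class and variable names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DecomposedConvAttention(nn.Module):
    """Hypothetical factorized convolutional attention block (illustrative only)."""
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        pad = kernel_size // 2
        # Two depthwise strip convolutions (1xk then kx1) approximate a kxk
        # spatial context at a fraction of the cost, suiting real-time budgets.
        self.conv_h = nn.Conv2d(channels, channels, (1, kernel_size),
                                padding=(0, pad), groups=channels)
        self.conv_v = nn.Conv2d(channels, channels, (kernel_size, 1),
                                padding=(pad, 0), groups=channels)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.proj(self.conv_v(self.conv_h(x)))
        return x * torch.sigmoid(attn)  # gate input features with the attention map

if __name__ == "__main__":
    feats = torch.randn(1, 64, 60, 80)   # e.g. depth-branch feature map
    out = DecomposedConvAttention(64)(feats)
    print(out.shape)  # torch.Size([1, 64, 60, 80])
```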
| Original language | English |
|---|---|
| Article number | 113885 |
| Pages (from-to) | 1-42 |
| Number of pages | 42 |
| Journal | Knowledge-Based Systems |
| Volume | 326 |
| Early online date | 7 Jul 2025 |
| DOIs | |
| Publication status | Published (in print/issue) - 27 Sept 2025 |
Bibliographical note
Publisher Copyright: © 2025 Elsevier B.V.
Data Access Statement
The data used in this manuscript is publicly available and can be accessed via References [1] and [2].
Keywords
- Real-time RGB-D semantic segmentation
- Indoor-scene understanding
- Depth-feature extraction
- Asymmetric dual-branch network
- Cross-modal fusion