ADBNet: Asymmetric dual-branch network for indoor real-time RGB-D semantic segmentation

Cunlu Xu, Gang Ma, Feng Gao, Bin Wang, Jun Liu

Research output: Contribution to journal, Article, peer-review

Abstract

Real-time RGB-D semantic segmentation is critical for a wide range of visual scene understanding tasks. Although recent methods have achieved notable progress, they still face significant challenges in fully exploiting depth information and achieving robust RGB-D fusion, particularly under the constraints of real-time performance. Balancing segmentation accuracy and computational efficiency remains a key bottleneck for practical deployment. To address these challenges, we propose ADBNet, an efficient asymmetric dual-branch network for real-time indoor semantic segmentation. ADBNet employs an asymmetric encoder with enhanced depth processing and incorporates a novel Conv-Former Pyramid Vision Transformer (CF-PVT) featuring decomposed convolutional attention to improve depth feature extraction. Furthermore, an Adaptive Feature Recalibration and Fusion (AFRF) module is introduced to enable effective cross-modal alignment and multi-scale feature fusion. Experiments on the NYU-Depth V2 dataset demonstrate that ADBNet achieves an excellent trade-off between accuracy and efficiency, running at 66 FPS with 79.40% pixel accuracy and 56.0% mean IoU.
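The abstract reports pixel accuracy and mean IoU but does not say how they were computed. The sketch below shows the conventional confusion-matrix definitions of both metrics and is not the authors' evaluation code. The function names and the `ignore_label=255` convention are assumptions made for illustration.

```python
def confusion_matrix(pred, target, num_classes, ignore_label=255):
    """Count (target, pred) label pairs over flat sequences of class indices.

    ignore_label is an assumed convention for unlabeled pixels; it is not
    taken from the paper.
    """
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled pixels do not contribute to any metric
        cm[t][p] += 1  # rows: ground truth, columns: prediction
    return cm


def pixel_accuracy(cm):
    """Fraction of labeled pixels whose predicted class matches the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def mean_iou(cm):
    """Average of per-class IoU = TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, labeled otherwise
        fn = sum(cm[c]) - tp                       # labeled c, predicted otherwise
        denom = tp + fp + fn
        if denom:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example with three classes and six pixels.
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    cm = confusion_matrix(pred, target, num_classes=3)
    print(pixel_accuracy(cm), mean_iou(cm))
```

Pixel accuracy weights every pixel equally, so frequent classes dominate it, whereas mean IoU averages over classes. This is why the two figures in the abstract differ so much.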
Original language: English
Article number: 113885
Pages (from-to): 1-42
Number of pages: 42
Journal: Knowledge-Based Systems
Volume: 326
Early online date: 7 Jul 2025
DOIs
Publication status: Published (in print/issue) - 27 Sept 2025

Bibliographical note

Publisher Copyright:
© 2025 Elsevier B.V.

Data Access Statement

The data used in this manuscript is publicly available and can be accessed via References [1] and [2].

Keywords

  • Real-time RGB-D semantic segmentation
  • Indoor-scene understanding
  • Depth-feature extraction
  • Asymmetric dual-branch network
  • Cross-modal fusion
