Vision-audio multimodal object recognition using hybrid and tensor fusion techniques

Md. Redwan Ahmed, Rezaul Haque, S.M. Arafat Rahman, Ahmed Wasif Reza, Nazmul Siddique, Hui Wang

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Traditional fusion methods often encounter challenges related to temporal misalignment and signal variability, resulting in suboptimal performance. This study proposes a novel hybrid fusion model that integrates early and late fusion strategies to capture low-level feature interactions and high-level modality-specific abstractions. Advanced feature extraction techniques are employed to ensure robust multimodal representation: visual features are extracted using Scale-Invariant Feature Transform (SIFT) and Local Binary Patterns (LBP), while audio features are processed using Spectral Centroid and Pitch-Synchronous Speech Features. Additionally, the Ridgelet Transform enhances spatial–temporal representation. A preprocessing pipeline further reduces data noise, applying Different Resolution Total Variation (DRTV) for visual noise suppression and Mel Frequency Cepstral Coefficients (MFCCs) for audio feature extraction. Furthermore, we incorporated an xLSTM-based hierarchical multi-scale temporal encoder in the audio branch and implemented an attention-based fusion stream with Feature-wise Linear Modulation (FiLM) for dynamic alignment across modalities. Class imbalance is addressed by applying SMOTE in the latent feature space and using class-weighted cross-entropy loss to improve model sensitivity to minority classes. Evaluated on a collected dataset of 22,133 audio-visual samples across 21 object categories, our proposed fusion model achieves an F1 score of 97.89% and a PR AUC of 98.02%. The attention-based fusion variant converged in 14 epochs but required more resources, totaling 19.1M parameters, 9.26G FLOPs, and 14.6 ms of inference latency. In contrast, hybrid fusion with LSTM provided a more efficient option with 12.0M parameters, 4.73G FLOPs, and 9.0 ms latency, making it well suited to low-resource edge applications. These results demonstrate the proposed model's flexibility in real-time multimodal applications such as autonomous systems, surveillance, and recycling automation.
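The FiLM-based conditioning mentioned in the abstract follows a standard pattern: an embedding from one modality predicts per-channel scale and shift parameters that modulate the features of the other modality. The sketch below illustrates that general operation only; the array shapes, weight matrices, and function name are illustrative assumptions, not the authors' implementation (see the linked repository for the actual code).

```python
import numpy as np

def film_modulate(visual_feats, audio_feats, w_gamma, w_beta):
    """Feature-wise Linear Modulation (FiLM): an audio embedding
    conditions visual features via a learned scale (gamma) and
    shift (beta), one pair per visual channel."""
    gamma = audio_feats @ w_gamma   # (batch, channels) scale terms
    beta = audio_feats @ w_beta     # (batch, channels) shift terms
    return gamma * visual_feats + beta

# Toy dimensions for illustration: batch of 4 samples,
# 8-dim audio embedding, 16-channel visual feature vector.
rng = np.random.default_rng(0)
visual = rng.standard_normal((4, 16))
audio = rng.standard_normal((4, 8))
w_gamma = rng.standard_normal((8, 16)) * 0.1  # stand-ins for learned weights
w_beta = rng.standard_normal((8, 16)) * 0.1

fused = film_modulate(visual, audio, w_gamma, w_beta)
print(fused.shape)  # same shape as the visual features: (4, 16)
```

In a trained model, `w_gamma` and `w_beta` would be the weights of small projection layers optimized jointly with the rest of the network, so the audio stream learns to amplify or suppress individual visual channels per sample.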
Original language: English
Article number: 103667
Pages (from-to): 1-24
Number of pages: 24
Journal: Information Fusion
Volume: 126
Early online date: 3 Sept 2025
DOIs
Publication status: Published online - 3 Sept 2025

Bibliographical note

Publisher Copyright:
© 2025 Elsevier B.V.

Data Access Statement

All essential resources, including the dataset, preprocessing scripts, feature extraction modules, fusion model implementations, hyperparameter configurations, and evaluation scripts, are publicly accessible at: https://github.com/rezaul-h/multimodal-fusion.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Keywords

  • Deep learning
  • Multimodal fusion
  • Multimodal learning
  • Object recognition
  • Smart sorting
