MulMoSenT: Multimodal Sentiment Analysis for a Low-Resource Language Using Textual-Visual Cross-Attention and Fusion

Research output: Contribution to journal › Article › peer-review

Abstract

The widespread availability of the Internet and the growing use of smart devices have fueled the rapid expansion of multimodal (image-text) sentiment analysis (MSA), a burgeoning research field driven by the massive volume of image-text data these technologies generate. However, MSA faces significant challenges, notably misalignment between images and text, where an image may carry multiple interpretations or contradict its paired text. In addition, short textual content often lacks sufficient context, complicating sentiment prediction. These issues are particularly acute in low-resource languages, where annotated image-text corpora are scarce and Vision-Language Models (VLMs) and Large Language Models (LLMs) exhibit limited performance. This research introduces MulMoSenT, a multimodal image-text sentiment analysis system tailored to tackle these challenges for low-resource languages. The development of MulMoSenT unfolds across four key phases: corpus development, baseline model evaluation and selection, hyperparameter adaptation, and model fine-tuning and inference. The proposed MulMoSenT model achieves a peak accuracy of 84.90%, surpassing all baseline models and delivering a 37.83% improvement over VLMs, a 35.28% gain over image-only models, and a 0.71% enhancement over text-only models. Both the dataset and the solution are publicly accessible at: https://github.com/sadia-afroze/MulMoSenT.
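The abstract names textual-visual cross-attention as the fusion mechanism. As an illustrative sketch only (the paper's actual architecture, dimensions, and fusion details are not specified here), the following dependency-free Python shows the standard scaled dot-product cross-attention pattern, where text-token queries attend over image-patch keys and values; all variable names and sizes are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each text query attends
    over the image keys and returns a weighted sum of image values."""
    d = len(keys[0])  # key dimensionality, used for score scaling
    fused = []
    for q in queries:
        # Similarity of this text token to every image patch
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # attention distribution over patches
        # Convex combination of image value vectors
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Toy example: 2 text-token queries attend over 3 image-patch keys/values
text_q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
img_k = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
img_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = cross_attention(text_q, img_k, img_v)
```

In a real system the fused representations would typically be concatenated with the text features and passed to a classifier head; this sketch only demonstrates the attention step itself.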
Original language: English
Article number: 104129
Journal: Information Fusion
Early online date: 15 Jan 2026
DOIs
Publication status: Published online - 15 Jan 2026

Bibliographical note

Publisher Copyright:
© 2026 Elsevier B.V.

Data Access Statement

Data will be made available on request.

Keywords

  • Multimodal sentiment analysis
  • Low-resource languages
  • Image-text fusion
  • Cross-attention
  • Ablation studies
  • Large Language Models (LLMs)
