Abstract
The widespread availability of the Internet and the growing use of smart devices have fueled the rapid expansion of multimodal (image-text) sentiment analysis (MSA), driven by the massive volume of image-text data these technologies generate. However, MSA faces significant challenges, notably misalignment between images and text, where an image may carry multiple interpretations or contradict its paired text. In addition, short textual content often lacks sufficient context, complicating sentiment prediction. These issues are particularly acute in low-resource languages, where annotated image-text corpora are scarce and Vision-Language Models (VLMs) and Large Language Models (LLMs) exhibit limited performance. This research introduces MulMoSenT, a multimodal image-text sentiment analysis system tailored to tackle these challenges for low-resource languages. The development of MulMoSenT unfolds across four key phases: corpus development, baseline model evaluation and selection, hyperparameter adaptation, and model fine-tuning and inference. The proposed MulMoSenT model achieves a peak accuracy of 84.90%, surpassing all baseline models: it delivers a 37.83% improvement over VLMs, a 35.28% gain over image-only models, and a 0.71% enhancement over text-only models. Both the dataset and the solution are publicly accessible at: https://github.com/sadia-afroze/MulMoSenT.
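The abstract does not detail the fusion architecture, but the keywords name cross-attention image-text fusion. The sketch below is a minimal, hedged illustration of that general technique in PyTorch, in which text-token features attend to image-patch features before a pooled sentiment classification head; all dimensions, module names, and the classifier are illustrative assumptions, not the authors' MulMoSenT implementation.

```python
# Minimal sketch of cross-attention image-text fusion (illustrative only;
# dimensions and module names are assumptions, not the MulMoSenT implementation).
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, num_classes: int = 3):
        super().__init__()
        # Text tokens act as queries; image patches act as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, text_len, dim)    from a text encoder
        # image_feats: (batch, num_patches, dim) from a vision encoder
        fused, _ = self.cross_attn(query=text_feats, key=image_feats, value=image_feats)
        fused = self.norm(fused + text_feats)   # residual connection over text stream
        pooled = fused.mean(dim=1)              # mean-pool the fused token sequence
        return self.classifier(pooled)          # sentiment logits


# Usage with random tensors standing in for encoder outputs.
model = CrossAttentionFusion()
text = torch.randn(2, 32, 768)
image = torch.randn(2, 49, 768)
logits = model(text, image)
print(logits.shape)  # torch.Size([2, 3])
```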
| Original language | English |
|---|---|
| Article number | 104129 |
| Journal | Information Fusion |
| Early online date | 15 Jan 2026 |
| DOIs | |
| Publication status | Published online - 15 Jan 2026 |
Bibliographical note
Publisher Copyright: © 2026 Elsevier B.V.
Data Access Statement
Data will be made available on request.
Keywords
- Multimodal sentiment analysis
- Low-resource languages
- Image-text fusion
- Cross-attention
- Ablation studies
- Large Language Models (LLMs)