Evaluating Semantic Representation Strategies for Robust Information Retrieval Matching

Research output: Contribution to journal › Article › peer-review

Abstract

Vector Space Models (VSM) and neural word embeddings are core components in recent Machine Learning (ML) and Natural Language Processing (NLP) pipelines. By encoding words, sentences and documents as high-dimensional vectors via distributional semantics, they enable Information Retrieval (IR) systems to capture semantic relatedness between queries and answers. This paper compares different semantic representation strategies for query-statement matching, evaluating paraphrase identification within an IR framework using partial and syntactically varied queries of different lengths. Motivated by the Word Mover’s Distance (WMD) model, similarity is evaluated using the distance between individual words of queries and statements, as opposed to the common similarity measure of centroids of neural word embeddings. Results from ranked query and response statements demonstrate significant gains in accuracy using the combined approach of similarity ranking through WMD with the word embedding techniques. Our top-performing WMD + GloVe system consistently outperformed Doc2Vec and an LSA baseline across three return-rate thresholds, achieving 100% correct matches within the top-3 ranked results and 89.83% top-1 accuracy. Beyond the substantial gains from WMD-based similarity ranking, our results indicate that large, pre-trained word embeddings, trained on vast amounts of data, result in portable, domain-agnostic language processing solutions suitable for diverse business use cases.
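
The contrast drawn in the abstract, between comparing embedding centroids and comparing individual words, can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors in `EMB` are hypothetical toy values standing in for pre-trained embeddings such as GloVe, and `relaxed_wmd` computes the relaxed lower bound on WMD (each word moves to its nearest counterpart) rather than solving the full optimal-transport problem.

```python
import numpy as np

# Hypothetical 2-d toy embeddings (assumption); a real system would load
# pre-trained vectors such as GloVe instead.
EMB = {
    "reset":    np.array([1.0, 0.0]),
    "change":   np.array([0.9, 0.1]),
    "password": np.array([0.0, 1.0]),
    "passcode": np.array([0.1, 0.9]),
    "invoice":  np.array([-1.0, 0.0]),
    "download": np.array([0.0, -1.0]),
}


def centroid_cosine(a, b):
    """Cosine similarity between the mean word vectors of two token lists."""
    ca = np.mean([EMB[w] for w in a], axis=0)
    cb = np.mean([EMB[w] for w in b], axis=0)
    return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))


def relaxed_wmd(a, b):
    """Relaxed Word Mover's Distance with uniform word weights.

    Each word travels to its nearest word in the other text; the larger of
    the two directional averages is a lower bound on the exact WMD.
    """
    A = np.array([EMB[w] for w in a])
    B = np.array([EMB[w] for w in b])
    # Pairwise Euclidean distances between every word in a and every word in b.
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return float(max(dist.min(axis=1).mean(), dist.min(axis=0).mean()))


def rank(query, statements):
    """Order candidate statements from closest to farthest from the query."""
    return sorted(statements, key=lambda s: relaxed_wmd(query, s))


query = ["reset", "password"]
statements = [["download", "invoice"], ["change", "passcode"]]
for s in rank(query, statements):
    print(s, round(relaxed_wmd(query, s), 3), round(centroid_cosine(query, s), 3))
```

Ranking by word-level distance keeps each query word tied to its closest counterpart in the statement, whereas the centroid measure collapses every text to a single averaged vector before comparison.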
Original language: English
Pages (from-to): 51-66
Number of pages: 15
Journal: Digital Technologies Research and Applications
Volume: 4
Issue number: 3
Publication status: Published (in print/issue) - 11 Oct 2025

Data Availability Statement

No new data were generated or analyzed in this study. All data used are from publicly available sources cited within the manuscript.

Funding

This work received no external funding.

Keywords

  • Semantic Information Retrieval
  • Word embeddings
  • Document similarity measure
  • Query-statement Matching
  • GloVe
  • WMD
