Abstract
Medicine is a fast-moving field, and the number of medical publications has increased rapidly in recent years. Finding relevant information in this vast pool of research effectively and efficiently has therefore become highly challenging. Previous studies have demonstrated that data fusion can improve search performance if properly utilized. However, in most cases effectiveness is the only concern and efficiency is not considered. A fusion-based system is by nature more complicated and computationally expensive than other retrieval models such as BM25, because many component retrieval systems and an extra layer of fusion are required. The number of component retrieval systems involved is an important indicator of the complexity of a fusion-based system. We aim to select the optimal k-subset of component retrieval systems for any given number k, so as to both optimize fusion performance and reduce the cost of data fusion. A clustering-based approach is proposed. First, all the candidates are divided into clusters by the Chameleon clustering algorithm; then representatives from every cluster are chosen by Sequential Forward Selection for fusion. Evaluated on two datasets from TREC, the proposed method significantly outperforms the baseline methods, including the state-of-the-art subset selection method. When either of the two typical fusion methods is used, an improvement rate of over 10% is observed for both Mean Average Precision and Recall-level Precision, and an improvement rate of over 5% is observed for both Precision at the 10-document level and Mean Reciprocal Rank. [Abstract copyright: Copyright © 2022. Published by Elsevier Inc.]
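
The two-stage idea in the abstract (cluster the candidate systems, then pick representatives by Sequential Forward Selection) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the clusters are taken as given rather than computed with Chameleon, CombSUM stands in for the unnamed fusion methods, average precision on a single query stands in for the evaluation measure, and the names `comb_sum`, `average_precision` and `select_representatives` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Run = Dict[str, float]  # document id -> normalized score from one system


def comb_sum(runs: List[Run]) -> List[str]:
    """Fuse runs by summing per-document scores (CombSUM); return ranked doc ids."""
    totals: Dict[str, float] = {}
    for run in runs:
        for doc, score in run.items():
            totals[doc] = totals.get(doc, 0.0) + score
    # Descending fused score; ties broken by doc id so the order is deterministic.
    return sorted(totals, key=lambda d: (-totals[d], d))


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant doc ids."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def select_representatives(
    clusters: List[List[str]], runs: Dict[str, Run], relevant: Set[str]
) -> List[str]:
    """Greedy forward selection of one system per cluster (assumed variant).

    At each step, every system in a not-yet-represented cluster is tried as an
    addition to the current selection; the one giving the best fused average
    precision is kept and its cluster is marked as represented.
    """
    selected: List[str] = []
    remaining = [list(c) for c in clusters if c]
    while remaining:
        best: Optional[Tuple[float, str, int]] = None  # (score, system, cluster idx)
        for idx, cluster in enumerate(remaining):
            for system in cluster:
                fused = comb_sum([runs[s] for s in selected + [system]])
                score = average_precision(fused, relevant)
                if best is None or score > best[0]:
                    best = (score, system, idx)
        selected.append(best[1])
        remaining.pop(best[2])
    return selected
```

With k clusters the sketch returns k systems, one per cluster, so the subset size is controlled by the number of clusters requested from the clustering step. A real evaluation would average the measure over many queries instead of using a single relevance set.
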
Original language | English |
---|---|
Article number | 104213 |
Journal | Journal of Biomedical Informatics |
Volume | 135 |
Early online date | 30 Sept 2022 |
Publication status | Published (in print/issue) - 30 Nov 2022 |
Bibliographical note
Publisher Copyright: © 2022
Keywords
- Data fusion
- Subset selection
- Medical information retrieval
- Clustering
- Efficiency and effectiveness