Using natural language processing to facilitate the harmonisation of mental health questionnaires: a validation study using real-world data

Eoin McElroy, Thomas Wood, RR Bond, Maurice Mulvenna, M Shevlin, George B Ploubidis, Mauricio Scopel Hoffmann, Bettina Moltrecht

Research output: Contribution to journal › Article › peer-review


Abstract

Background: Pooling data from different sources will advance mental health research by providing larger sample sizes and allowing cross-study comparisons; however, the heterogeneity in how variables are measured across studies poses a challenge to this process.

Methods: This study explored the potential of natural language processing (NLP) to harmonise different mental health questionnaires by matching individual questions based on their semantic content. Using the Sentence-BERT model, we calculated the semantic similarity (cosine index) between 741 pairs of questions from 5 questionnaires. Drawing on data from a representative UK sample of adults (N=2,058), we calculated a Spearman rank correlation for each of the same pairs of items, and then estimated the correlation between the cosine values and Spearman coefficients. We also used network analysis to explore the model’s ability to uncover structures within the data and metadata.
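The core of the method pairs a text-derived similarity (cosine of embedding vectors) with a response-derived similarity (Spearman rank correlation). A minimal sketch of both quantities for a single item pair is below; the embedding vectors and Likert responses are illustrative stand-ins (in the study the embeddings would come from Sentence-BERT, e.g. `model.encode(question_text)`):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in embeddings for two questionnaire items (illustrative;
# real embeddings would be Sentence-BERT output vectors).
emb_a = np.array([0.2, 0.7, 0.1])
emb_b = np.array([0.25, 0.6, 0.15])
semantic_sim = cosine(emb_a, emb_b)

# Illustrative participant responses to the same two items (1-5 Likert).
responses_a = [1, 2, 3, 4, 5, 3, 2]
responses_b = [1, 3, 3, 4, 5, 2, 2]
rho, p = spearmanr(responses_a, responses_b)
```

Repeating this over all 741 item pairs yields one cosine value and one Spearman coefficient per pair; the validation step then correlates those two vectors of values.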

Results: We found a moderate overall correlation (r = .48, p < .001) between the two indices. In a holdout sample, the cosine scores predicted the real-world correlations with a small degree of error (MAE = 0.05, MedAE = 0.04, RMSE = 0.064), suggesting the utility of NLP in identifying similar items for cross-study data pooling. Our NLP model could also detect more complex patterns in the data, although manual rules were needed to decide which edges to include in the network.
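The three error metrics reported above can be computed directly from the predicted (cosine-based) and observed (Spearman) correlations. A short sketch, with illustrative values in place of the study's holdout data:

```python
import numpy as np

# Illustrative predicted (cosine-derived) and observed (Spearman)
# correlations for a handful of item pairs; not the study's data.
predicted = np.array([0.62, 0.30, 0.45, 0.70])
observed  = np.array([0.58, 0.35, 0.40, 0.75])

errors = predicted - observed
mae   = np.mean(np.abs(errors))        # mean absolute error
medae = np.median(np.abs(errors))      # median absolute error
rmse  = np.sqrt(np.mean(errors ** 2))  # root mean squared error
```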

Conclusions: This research shows that it is possible to quantify the semantic similarity between pairs of questionnaire items from their metadata, and that these similarity indices correlate with how participants answer the same items. This highlights the potential of NLP to facilitate cross-study data pooling in mental health research. Nevertheless, researchers are cautioned to verify the psychometric equivalence of matched items.
Original language: English
Article number: 530
Pages (from-to): 1-9
Number of pages: 9
Journal: BMC Psychiatry
Volume: 24
Issue number: 1
Early online date: 24 Jul 2024
DOIs
Publication status: Published online - 24 Jul 2024

Bibliographical note

Publisher Copyright:
© The Author(s) 2024.

Data Access Statement

The data and metadata from the C19PRC study can be found at https://osf.io/v2zur/. The correlation and cosine values used in the present analyses are available in Supplementary file 2.

Keywords

  • Retrospective data harmonisation
  • Harmonisation
  • Meta-analysis
  • Data pooling
