The Need for Open Banks of High-Quality Scientific Questions to Address Issues with Machine Learning and Open Health Data

Research output: Contribution to conference › Abstract

Abstract

Open health data can be used with machine learning to discover knowledge and to build algorithms. Searching for “health” on Kaggle returns ~31,299 datasets. A search on the UCI ML repository reveals ~109 health-related datasets. The largest health datasets on UCI (by number of instances) are related to heavy drinking, diabetes, drug reviews, gait, activity and sepsis. The datasets with the most views are related to heart disease, breast cancer, diabetes and obesity. As of 23/12/24, the heart disease dataset (donated ~1988) had ~584.73k views. Issues with using open data include data provenance, data quality, the potential to identify individuals and the quality of the metadata. Kaggle does provide a dataset usability score out of 10 to grade ‘completeness’, ‘credibility’ and ‘compatibility’. Selected datasets related to COVID-19, X-rays, diabetes, happiness and breast cancer that have a high number of votes on Kaggle do not have the highest usability score of 10/10. Among the health-related datasets that scored 10/10 for usability, those with the highest number of votes were related to stroke, heart failure and COVID. The availability of domain-specific open data may dictate the number of research hours dedicated to different domains. There is a need to address the issues this raises, including the need to devote research hours to areas that require significant innovation. This talk will cover these issues, including the possibility that using machine learning with open data increases the false discovery rate. I will also discuss the concept of an open bank of research questions that could help address some of these issues.
Original language: English
Publication status: Published (in print/issue) - 21 Jan 2025
Event: Inaugural Open Research Conference - Ulster University, Belfast, Northern Ireland
Duration: 21 Jan 2025 to 21 Jan 2025

Conference

Conference: Inaugural Open Research Conference
Country/Territory: Northern Ireland
City: Belfast
Period: 21/01/25 to 21/01/25

Keywords

  • open health data
  • machine learning
  • open data
