Abstract
Open health data can be used with machine learning to discover knowledge and to build algorithms. Searching for “health” on Kaggle returns ~31,299 datasets. A search of the UCI ML repository reveals ~109 health-related datasets. The largest health datasets (by number of instances) on UCI relate to heavy drinking, diabetes, drug reviews, gait, activity and sepsis. The datasets with the most views relate to heart disease, breast cancer, diabetes and obesity. As of 23/12/24, the heart disease dataset (donated ~1988) had ~584.73k views. Issues with using open data include data provenance, data quality, the risk of identifying people and the quality of the metadata. Kaggle does provide a dataset usability score out of 10 that grades ‘completeness’, ‘credibility’ and ‘compatibility’. Selected datasets related to COVID-19, X-rays, diabetes, happiness and breast cancer that have a high number of votes on Kaggle do not have the highest usability score of 10/10. Among the datasets that scored 10/10 for usability, the health-related datasets with the most votes were related to stroke, heart failure and COVID. The availability of domain-specific open data may dictate the number of research hours dedicated to different domains. There is a need to address the issues this raises, including the need to devote research hours to areas that require significant innovation. This talk will cover these issues, including the possibility of increasing the false discovery rate when using machine learning with open data. I will also discuss the concept of an open bank of research questions that could help address some of these issues.
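The concern about false discoveries can be illustrated with a small sketch. It is not taken from the talk: it assumes that many independent analyses are run on the same open dataset, that every null hypothesis is in fact true, and that each test uses a significance threshold `alpha`. The function names `family_wise_error` and `simulate` are hypothetical. Note that the quantity computed is the chance of at least one false positive (the family-wise error rate), which is related to, but not the same as, the false discovery rate.

```python
import random


def family_wise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent
    tests when every null hypothesis is true (an assumption of this sketch)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate(alpha: float, n_tests: int, n_trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity.

    Under a true null hypothesis a p-value is uniformly distributed on [0, 1),
    so each test is modelled as a single uniform random draw.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # A trial counts as a hit if any of its tests falls below alpha.
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    # Compare the closed-form value with the simulated estimate.
    for m in (1, 10, 50):
        print(m, round(family_wise_error(0.05, m), 3), round(simulate(0.05, m), 3))
```

Under these assumptions, the chance of a spurious finding grows quickly with the number of analyses run on a single popular dataset, which is one way heavily reused open data could inflate false discoveries.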
Original language | English |
---|---|
Publication status | Published (in print/issue) - 21 Jan 2025 |
Event | Inaugural Open Research Conference - Ulster University, Belfast, Northern Ireland; Duration: 21 Jan 2025 → 21 Jan 2025 |
Conference
Conference | Inaugural Open Research Conference |
---|---|
Country/Territory | Northern Ireland |
City | Belfast |
Period | 21/01/25 → 21/01/25 |
Keywords
- open health data
- machine learning
- open data