An approach for curating and sharing open data sets in human activity recognition

  • Gulzar Alam

Student thesis: Doctoral Thesis

Abstract

In a world flooded with ever-growing volumes of data generated by various sensors and technologies, it is more necessary than ever to manage and use this abundance of data effectively for the improvement of society. The availability of open data is growing and is becoming a focus of the human activity recognition (HAR) research community. Common challenges associated with open data sets in HAR include data format, heterogeneity of data sources, data duplication, annotation, public participation during data collection, and data ambiguity. A significant challenge is the lack of proper metadata and of information on the overall protocol used for data collection, which reduces data set understanding, discoverability, and usability. Similarly, the absence of a standardised methodology for curating and sharing these data sets impedes their effective use, confidence in their quality, and collaboration.

This research study aims to motivate and help researchers to construct, share, find and use open data sets in the HAR domain. This will support replication, benchmarking, validation of study methods and analysis techniques, and forecasting in various domains, while ensuring correctness, and will help to explore new hypotheses when data sets are combined with others. The primary objective is to devise a framework to tackle the current challenges faced by HAR researchers in curating and sharing large amounts of data, while also ensuring ongoing collaboration, discoverability, reusability, reproducibility, and interpretation of HAR data sets.

The issues and challenges that HAR researchers face with open data sets were identified and classified into a hierarchical structure. A data-driven approach was used to extract key metadata from large data sets, and a knowledge-driven approach was used to identify common patterns and semantic structures of data sets. Based on the data-driven and knowledge-driven approaches, a collection of 22 metrics has been created for evaluating data set quality. Similarly, a repository framework has been developed for curating and sharing open data sets in the HAR domain, with applications of large language models (LLMs). An international workshop involving experts from the HAR domain was conducted to validate the developed gold standard ontology, data set quality metrics and data set repository framework.

Thesis is embargoed until 31st May 2027

Date of Award: May 2025
Original language: English
Supervisor: Peter Nicholl (Supervisor), Ian Mc Chesney (Supervisor) & Joseph Rafferty (Supervisor)

Keywords

  • data quality
  • ontology
  • metadata metrics
  • metadata management
  • machine learning
  • datasets framework
  • semantic modelling
  • LLMs
  • XAI
  • data challenges
  • data mining
