Balancing Real and Synthetic Data for Enhanced Human Activity Recognition: An Empirical Study

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Accessing data is difficult and time-consuming when developing solutions for human activity recognition (HAR). Additionally, because HAR data is personal, it is challenging to use for research purposes. Nevertheless, recent advances in synthetic data generation techniques offer an opportunity to address these issues. Whilst obtaining HAR data is essential, it is equally crucial that the data meets high standards of quality, utility and fidelity. To date, no research has examined how the proportion of real and synthetic data affects the performance of HAR models. This research presents a comprehensive analysis of the distribution and performance of generated datasets when applied to various machine learning models. We systematically create training datasets with various proportions of real and synthetic data and assess their impact on the performance of HAR systems. Our analysis employs common machine learning models: Decision Tree (DT), Gaussian Naïve Bayes (GNB), Support Vector Machine (SVM), Linear Support Vector Machine (L-SVM), Random Forest (RF), Gradient Boosting (GB) and Shallow Neural Network (SNN). Evaluating the models trained on these proportions, we observed that increasing the proportion of synthetic data in the training set improved the models' performance on unseen instances. Specifically, we achieved 0.970 accuracy with an RF model under 5-fold cross-validation by boosting the real training dataset by 90% using synthetic data. Furthermore, we aim to understand the trade-offs and benefits associated with each approach. This study aims to provide insights into the viability of synthetic data for HAR tasks and establish guidelines for its effective use. Ultimately, our goal is to contribute to developing more effective HAR models by analysing the performance of different machine learning techniques on both real and synthetic data.
In the future, we plan to extend our work to other domains, explore the use of further datasets, and investigate the impact of synthetic data on more complex models, such as deep learning.
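
The mixing procedure described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: that "boosting by 90%" means adding synthetic samples equal to 90% of the real training count, that held-out folds contain only real samples, and that synthetic samples are added to the training folds only. All function names here are hypothetical, and the nearest-centroid classifier is a dependency-free stand-in, not one of the models evaluated in the study.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[List[float], int]  # (feature vector, activity label)


def augment(real: Sequence[Sample], synthetic: Sequence[Sample],
            boost: float, rng: random.Random) -> List[Sample]:
    """Real samples plus round(boost * len(real)) synthetic ones (assumed meaning of 'boost')."""
    n_extra = min(len(synthetic), round(boost * len(real)))
    return list(real) + rng.sample(list(synthetic), n_extra)


def k_fold_indices(n: int, k: int, rng: random.Random) -> List[List[int]]:
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_centroid(train: Sequence[Sample],
                     test_x: Sequence[List[float]]) -> List[int]:
    """Stand-in classifier: predict the label whose mean feature vector is closest."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda y: sq_dist(x, centroids[y])) for x in test_x]


def cross_validate(real: Sequence[Sample], synthetic: Sequence[Sample], boost: float,
                   fit_predict: Callable = nearest_centroid,
                   k: int = 5, seed: int = 0) -> float:
    """Mean accuracy over k folds; synthetic data is added to training folds only."""
    rng = random.Random(seed)
    scores = []
    for held_out in k_fold_indices(len(real), k, rng):
        held = set(held_out)
        train_real = [s for i, s in enumerate(real) if i not in held]
        test = [real[i] for i in held_out]
        train = augment(train_real, synthetic, boost, rng)
        preds = fit_predict(train, [x for x, _ in test])
        scores.append(sum(p == y for p, (_, y) in zip(preds, test)) / len(test))
    return sum(scores) / len(scores)


def make_toy(n: int, seed: int) -> List[Sample]:
    """Two well-separated Gaussian classes; purely illustrative, not HAR data."""
    rng = random.Random(seed)
    return [([rng.gauss(10.0 * (i % 2), 1.0), rng.gauss(10.0 * (i % 2), 1.0)], i % 2)
            for i in range(n)]


if __name__ == "__main__":
    real, synthetic = make_toy(100, seed=1), make_toy(200, seed=2)
    for boost in (0.0, 0.3, 0.6, 0.9):
        print(f"boost={boost:.1f}  accuracy={cross_validate(real, synthetic, boost):.3f}")
```

Sweeping `boost` over a grid, as in the final loop, mirrors the idea of comparing several real-to-synthetic proportions. Swapping `fit_predict` for a wrapper around another model would let the same harness compare classifiers.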

Original language: English
Title of host publication: Proceedings of the International Conference on Ubiquitous Computing and Ambient Intelligence (UCAmI 2024)
Editors: José Bravo, Chris Nugent, Ian Cleland
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 194-204
Number of pages: 11
ISBN (Electronic): 978-3-031-77571-0
ISBN (Print): 978-3-031-77570-3
DOIs
Publication status: Published (in print/issue) - 21 Dec 2024
Event: 16th International Conference on Ubiquitous Computing and Ambient Intelligence, UCAmI 2024 - Belfast, United Kingdom
Duration: 27 Nov 2024 to 29 Nov 2024

Publication series

Name: Lecture Notes in Networks and Systems
Volume: 1212 LNNS
ISSN (Print): 2367-3370
ISSN (Electronic): 2367-3389

Conference

Conference: 16th International Conference on Ubiquitous Computing and Ambient Intelligence, UCAmI 2024
Country/Territory: United Kingdom
City: Belfast
Period: 27/11/24 to 29/11/24

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.

Keywords

  • Data Scarcity
  • Human Activity Recognition
  • Synthetic Data
  • Synthetic Data Training
  • Validation
  • Synthetic Data Generation
