Abstract
Acquiring data for human activity recognition (HAR) is difficult and time-consuming, and the personal nature of such data makes it challenging to use for research purposes. Nevertheless, recent advancements in synthetic data generation techniques offer an opportunity to address these issues. Whilst obtaining HAR data is essential, it is equally crucial that the data meets high standards of quality, utility and fidelity. To date, no research has examined how the proportion of real to synthetic data affects the performance of HAR models. This research analyses the distribution of generated datasets and their performance when used with various machine learning models. We systematically create training datasets with various proportions of real and synthetic data and assess their impact on the performance of HAR systems. Our analysis employs common machine learning models: Decision Tree (DT), Gaussian Naïve Bayes (GNB), Support Vector Machines (SVM), Linear Support Vector Machine (L-SVM), Random Forest (RF), Gradient Boosting (GB) and Shallow Neural Networks (SNN). Evaluating models trained on these mixtures, we observed that increasing the proportion of synthetic data in the training set improved model performance on unseen instances. Specifically, an RF model achieved an accuracy of 0.970 under 5-fold cross-validation when the real training dataset was boosted by 90% using synthetic data. Furthermore, we aim to understand the trade-offs and benefits associated with each approach. This study aims to provide insights into the viability of synthetic data for HAR tasks and to establish guidelines for its effective use. Ultimately, our goal is to contribute to developing more effective HAR models by analysing the performance of different machine learning techniques on both real and synthetic data.
In the future, we plan to extend our work to other domains, explore the use of further datasets, and investigate the impact of synthetic data on more complex models, such as deep learning.
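The experimental design described in the abstract (training on different real-to-synthetic mixtures and scoring with 5-fold cross-validation) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' pipeline: the toy data comes from scikit-learn's `make_classification` instead of a HAR dataset, `make_synthetic` is a hypothetical stand-in generator that simply jitters resampled real rows, and a "synthetic ratio" of 0.9 is interpreted as adding synthetic rows equal to 90% of the real training fold. Synthetic rows are added only to the training folds, so each score is measured on held-out real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold


def make_synthetic(X, y, n_samples, rng, noise_scale=0.05):
    """Hypothetical placeholder generator (assumption, not the paper's method):
    resample real rows and add small per-feature Gaussian noise."""
    idx = rng.integers(0, len(X), size=n_samples)
    noise = rng.normal(0.0, noise_scale * X.std(axis=0), size=(n_samples, X.shape[1]))
    return X[idx] + noise, y[idx]


def evaluate_mix(X, y, synthetic_ratio, n_splits=5, seed=0):
    """Mean accuracy over k folds when each real training fold is augmented
    with synthetic rows numbering synthetic_ratio * len(real training fold)."""
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        X_train, y_train = X[train_idx], y[train_idx]
        n_syn = int(round(synthetic_ratio * len(train_idx)))
        if n_syn > 0:
            # Generate from the training fold only, so no test rows leak in.
            X_syn, y_syn = make_synthetic(X_train, y_train, n_syn, rng)
            X_train = np.vstack([X_train, X_syn])
            y_train = np.concatenate([y_train, y_syn])
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X_train, y_train)
        # Score on the untouched real held-out fold.
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(scores))


# Toy stand-in for a HAR feature matrix (assumption: 3 activity classes, 10 features).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)
results = {ratio: evaluate_mix(X, y, ratio) for ratio in (0.0, 0.5, 0.9)}
for ratio, acc in results.items():
    print(f"synthetic ratio {ratio:.1f}: mean accuracy {acc:.3f}")
```

The accuracies printed by this sketch say nothing about the reported 0.970 result; they only show how such a comparison could be structured. Swapping `RandomForestClassifier` for another scikit-learn estimator would extend the same loop to the other model families listed in the abstract.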
Original language | English |
---|---|
Title of host publication | Proceedings of the International Conference on Ubiquitous Computing and Ambient Intelligence (UCAmI 2024) |
Editors | José Bravo, Chris Nugent, Ian Cleland |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 194-204 |
Number of pages | 11 |
ISBN (Electronic) | 978-3-031-77571-0 |
ISBN (Print) | 9783031775703 |
Publication status | Published (in print/issue) - 21 Dec 2024 |
Event | 16th International Conference on Ubiquitous Computing and Ambient Intelligence, UCAmI 2024 - Belfast, United Kingdom. Duration: 27 Nov 2024 → 29 Nov 2024 |
Publication series
Name | Lecture Notes in Networks and Systems |
---|---|
Volume | 1212 LNNS |
ISSN (Print) | 2367-3370 |
ISSN (Electronic) | 2367-3389 |
Conference
Conference | 16th International Conference on Ubiquitous Computing and Ambient Intelligence, UCAmI 2024 |
---|---|
Country/Territory | United Kingdom |
City | Belfast |
Period | 27/11/24 → 29/11/24 |
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
Keywords
- Data Scarcity
- Human Activity Recognition
- Synthetic Data
- Synthetic Data Training
- Validation
- Synthetic Data Generation