Abstract
Generating Synthetic Data (SD) is a widely used way to address the challenges of collecting real-world data across various domains. Utility, fidelity, and privacy are the key challenges in synthetic tabular data generation, and each offers a unique perspective. In this research, we focus on the fidelity of generated tabular data relative to real data, using four main metrics recommended in previous literature: Hellinger Distance (HD), Pairwise Correlation Differences (PCD), R-squared Depth vs. Depth (R2DD) Plot, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC). We used two Human Activity Recognition (HAR) datasets, Mobile Health (mHealth) and HAR Using Smartphones (HARUS), which differ in the number of activities and in sample size. We generated data using two generative methods: Conditional Tabular Generative Adversarial Networks (CTGAN) and Tabular Variational Autoencoders (TVAE).
Early results indicate that CTGAN achieved a 45% lower HD on mHealth (0.0608 vs. 0.1100), while TVAE achieved a 32% lower HD on HARUS (0.0825 vs. 0.1212). CTGAN excelled on mHealth at 1500/500 epochs (PCD 0.0295, R2DD Plot 0.9855, AUC-ROC 0.5591), whereas TVAE excelled on HARUS at 1800/700 epochs (PCD 0.0639, R2DD Plot 0.5213, AUC-ROC 0.6424). These findings suggest that it is beneficial to choose the generative technique according to dataset characteristics such as sample size and feature complexity. Future work will expand this analysis by integrating additional generative methods and datasets, and by exploring the utility and privacy of synthetic data alongside fidelity.
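As an illustration of two of the fidelity metrics named in the abstract, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes HD is computed per column from histograms on shared bins, and that PCD is the Frobenius norm of the difference between the real and synthetic Pearson correlation matrices (a common definition; the abstract does not specify either). The function names, the bin count, and the toy data are hypothetical. The last helper only reproduces the relative HD reductions quoted above from the reported values.

```python
import numpy as np


def hellinger_distance(real_col, synth_col, bins=20):
    """Hellinger distance between two 1-D samples (assumed: histograms on shared bins)."""
    real_col = np.asarray(real_col, dtype=float)
    synth_col = np.asarray(synth_col, dtype=float)
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    if lo == hi:
        return 0.0  # both samples are the same constant value
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real_col, bins=edges)
    q, _ = np.histogram(synth_col, bins=edges)
    p = p / p.sum()  # normalise counts to probabilities
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def pairwise_correlation_difference(real, synth):
    """PCD (assumed definition): Frobenius norm of the correlation-matrix difference."""
    real_corr = np.corrcoef(real, rowvar=False)    # columns are features
    synth_corr = np.corrcoef(synth, rowvar=False)
    return float(np.linalg.norm(real_corr - synth_corr, ord="fro"))


def relative_reduction(better, worse):
    """Fraction by which `better` is lower than `worse`."""
    return (worse - better) / worse


# Toy data only: random numbers standing in for real and synthetic tables.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
synth = rng.normal(loc=0.1, size=(500, 3))

print("HD, column 0:", hellinger_distance(real[:, 0], synth[:, 0]))
print("PCD:", pairwise_correlation_difference(real, synth))

# Reported HD values from the abstract.
print("mHealth HD reduction:", round(100 * relative_reduction(0.0608, 0.1100)), "%")
print("HARUS HD reduction:", round(100 * relative_reduction(0.0825, 0.1212)), "%")
```

Both metrics are 0 when the real and synthetic samples match exactly; HD is bounded above by 1, which it reaches when the two histograms do not overlap.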
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the International Conference on Ubiquitous Computing and Ambient Intelligence (UCAmI 2025) |
| Publisher | Springer Cham |
| Chapter | 1 |
| Pages | 15-26 |
| Number of pages | 12 |
| Volume | 1 |
| ISBN (Electronic) | 978-3-032-16992-1 |
| ISBN (Print) | 978-3-032-16991-4 |
| DOIs | |
| Publication status | Published (in print/issue) - 1 Apr 2026 |
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs):
- SDG 3 Good Health and Well-being
Keywords
- synthetic data
- Synthetic Data Fidelity
- Synthetic data generation
- Human Activity Recognition
Activities
- 1 Oral presentation
- Evaluating Fidelity in Synthetic Tabular Data Generation: A Comparative Study of CTGAN and TVAE for Human Activity Recognition Datasets
  Majid, .. (Speaker)
  26 Nov 2025
  Activity: Talk or presentation › Oral presentation