Sharing data often carries security and privacy risks, especially when the data are sensitive. Algorithms can generate synthetic data from an original raw dataset, so that the data shared are considered more ‘privacy preserving’ and offer a higher level of anonymity. In this paper, we carry out an experiment to study the validity of conducting machine learning on synthetic data. We compare the evaluation metrics of machine learning models trained on synthetic data with those of models trained on the corresponding real data.
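The comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' method: the toy two-feature dataset, the per-class Gaussian synthesiser (`synthesise`), the nearest-centroid classifier, and the choice to score both models on a held-out portion of the real data. It uses only the Python standard library.

```python
import random
import statistics

def make_real_data(n, rng):
    """Toy stand-in for a real dataset: two features, binary label (hypothetical)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 2.0 if label else 0.0
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows

def group_by_label(rows):
    """Collect feature vectors under their class label."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return by_label

def synthesise(rows, n, rng):
    """Assumed synthesiser: sample from per-class, per-feature Gaussians fitted to `rows`."""
    params = {
        y: [(statistics.mean(col), statistics.stdev(col)) for col in zip(*xs)]
        for y, xs in group_by_label(rows).items()
    }
    labels = [y for _, y in rows]
    out = []
    for _ in range(n):
        y = rng.choice(labels)  # keeps roughly the class balance of the real data
        out.append(([rng.gauss(m, s) for m, s in params[y]], y))
    return out

def fit_centroids(rows):
    """Nearest-centroid 'model': one mean vector per class."""
    return {
        y: [statistics.mean(col) for col in zip(*xs)]
        for y, xs in group_by_label(rows).items()
    }

def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )

def accuracy(centroids, rows):
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)

rng = random.Random(0)
real = make_real_data(600, rng)
train_real, test_real = real[:400], real[400:]
train_synth = synthesise(train_real, len(train_real), rng)

# Train one model per training set, then score both on the same real test set.
acc_real = accuracy(fit_centroids(train_real), test_real)
acc_synth = accuracy(fit_centroids(train_synth), test_real)

print(f"trained on real:      {acc_real:.3f}")
print(f"trained on synthetic: {acc_synth:.3f}")
print(f"absolute difference:  {abs(acc_real - acc_synth):.3f}")
```

A small absolute difference between the two scores would suggest that, for this toy dataset and this model, training on synthetic data yields similar evaluation metrics to training on real data. The same pattern extends to other metrics and algorithms by swapping out `accuracy` and `fit_centroids`.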
|Title of host publication|Data Science and Knowledge Engineering for Sensing Decision Support|
|Publisher|World Scientific Publishing|
|Number of pages|11|
|Publication status|Published - 24 Aug 2018|
Heyburn, R., Bond, R. R., Black, M., Mulvenna, M., Wallace, J. G., Rankin, D., & Cleland, B. (2018). Machine learning using synthetic and real data: Similarity of evaluation metrics for different healthcare datasets and for different algorithms. In Data Science and Knowledge Engineering for Sensing Decision Support (Vol. 11, pp. 1281-1291). World Scientific Publishing. https://doi.org/10.1142/9789813273238_0160