Unlocking the potential of bootstrapping:
A journey towards balanced and reliable synthetic data
A framework for evaluating Bootstrap in the context of synthetic data generation

Caragea, Denisa

Publication date

2024

Summary

Introduction. Synthetic data generation is an essential technique in data analysis and machine learning: it complements existing datasets and addresses challenges that arise in analyzing them. Synthetic data are particularly useful where original datasets are limited, inaccessible or insufficiently diverse. Adding synthetic data increases the size of the dataset, which makes it easier to apply analytics and machine learning algorithms. However, generating synthetic data carries challenges and risks. Among the most significant is class imbalance, where certain classes are underrepresented, which can distort the results and their interpretation. In addition, data confidentiality must be maintained, especially for datasets containing sensitive information.

Method. This research addresses these challenges by evaluating a synthetic data generation method based on Bootstrap resampling. Inspired by Bootstrap, this work proposes the "Fusionstrap" framework, which integrates the sample-stratified Bootstrap method with post-processing techniques to address class imbalances in datasets and enhance the diversity and accuracy of synthetic data while maintaining utility and privacy.

Findings. The approach is evaluated through an experimental case study in which synthetic data are generated and the performance of the proposed framework is compared with the basic CTGAN and Synthpop methods on three datasets. The training data were collected and preprocessed using appropriate tools and techniques.

Discussion. Our evaluation measures capture improvements in synthetic data quality and provide detailed insight into the strengths and weaknesses of the evaluated methods. We conclude that the "Fusionstrap" framework aims to generate accurate, balanced and representative synthetic data. Additionally, it could be used as a data generation aid to improve accuracy on imbalanced datasets.
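
As a rough illustration of the stratified Bootstrap idea mentioned in the summary, the sketch below draws the same number of rows, with replacement, from each class so that the resampled set is balanced. It is not the Fusionstrap implementation: the function name `stratified_bootstrap`, its parameters and the toy rows are assumptions made for this example, and the post-processing step described in the summary is not shown.

```python
import random
from collections import defaultdict


def stratified_bootstrap(rows, label_key, per_class, seed=0):
    """Draw `per_class` rows with replacement from each class stratum.

    Hypothetical helper for illustration only; not taken from the thesis.
    """
    rng = random.Random(seed)

    # Group the rows into strata, one per class label.
    strata = defaultdict(list)
    for row in rows:
        strata[row[label_key]].append(row)

    sample = []
    for label in sorted(strata):
        # random.choices samples with replacement: this is the bootstrap step.
        sample.extend(rng.choices(strata[label], k=per_class))
    return sample


# Toy imbalanced data (made up): nine rows of class "a", one row of class "b".
rows = [{"x": i, "y": "a"} for i in range(9)] + [{"x": 99, "y": "b"}]
balanced = stratified_bootstrap(rows, "y", per_class=5)
```

Because this sketch only repeats existing rows, every resampled record is an exact copy of an original one. That is one reason a plain Bootstrap on its own does not address the diversity and privacy concerns raised in the summary.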

URI

https://studenttheses.uu.nl/handle/20.500.12932/45795

Collections

Theses