Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Wagenaar, Gerard
dc.contributor.author: Caragea, Denisa
dc.date.accessioned: 2024-01-09T00:01:30Z
dc.date.available: 2024-01-09T00:01:30Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/45795
dc.description.abstract: Introduction. Synthetic data generation is an essential technique in data analysis and machine learning: it complements existing datasets and addresses several challenges associated with their analysis. Synthetic data are particularly useful where original datasets are limited, inaccessible or insufficiently diverse. Adding synthetic data increases the size of a dataset, which makes it easier to apply analytics and machine learning algorithms effectively. However, generating synthetic data comes with challenges and risks. Among the most significant are class imbalances, where certain classes are underrepresented, which can distort the results of an analysis and their interpretation. In addition, data confidentiality must be maintained, especially for datasets containing sensitive information. Method. This research addresses these challenges by evaluating a synthetic data generation method based on Bootstrap resampling. Inspired by Bootstrap, this paper proposes the "Fusionstrap" framework. The framework integrates the sample-stratified Bootstrap method with post-processing techniques to address class imbalances in datasets and to enhance the diversity and accuracy of synthetic data, while maintaining levels of utility and privacy. Findings. The effectiveness of this approach is evaluated through an experimental case study in which synthetic data are generated and the performance of the proposed framework is compared with the basic CTGAN and Synthpop methods on three datasets. The training data were collected and preprocessed using appropriate tools and techniques. Discussion. Our evaluation measures capture improvements in synthetic data quality and provide detailed insight into the strengths and weaknesses of the evaluated methods. We conclude that the "Fusionstrap" framework aims to generate accurate, balanced and representative synthetic data. Additionally, it could be used as a data generation aid to improve accuracy when a dataset is imbalanced.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This research addresses data generation challenges by focusing on the evaluation of a synthetic data generation method based on Bootstrap resampling. Inspired by Bootstrap, this paper proposes the "Fusionstrap" framework. This framework integrates the sample-stratified Bootstrap method with post-processing techniques to address class imbalances in datasets, enhance the diversity and accuracy of synthetic data, and at the same time maintain levels of utility and privacy.
dc.title: Unlocking the potential of bootstrapping: A journey towards balanced and reliable synthetic data. A framework for evaluating Bootstrap in the context of synthetic data generation
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Synthetic Data, Preprocessed Techniques, Stratified Bootstrap, Class Imbalances, Post-processing Techniques, Utility, Privacy
dc.subject.courseuu: Business Informatics
dc.thesis.id: 26927

