Unlocking the potential of bootstrapping:
A journey towards balanced and reliable synthetic data
A framework for evaluating Bootstrap in the context of synthetic data generation

Caragea, Denisa

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Wagenaar, Gerard
dc.contributor.author	Caragea, Denisa
dc.date.accessioned	2024-01-09T00:01:30Z
dc.date.available	2024-01-09T00:01:30Z
dc.date.issued	2024
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/45795
dc.description.abstract	Introduction. Synthetic data generation is an essential technique in data analysis and machine learning: it complements existing datasets and addresses challenges associated with their analysis. Synthetic data are particularly useful where original datasets are limited, inaccessible or insufficiently diverse. Adding synthetic data makes it feasible to increase the size of a dataset, which facilitates the use of various analytics and machine learning algorithms. However, generating synthetic data comes with challenges and risks. Among the most significant is class imbalance, where certain classes are underrepresented, which can distort the results of an analysis and their interpretation. In addition, data confidentiality must be maintained, especially for datasets containing sensitive information. Method. This research addresses these challenges by evaluating a synthetic data generation method based on Bootstrap resampling. Inspired by Bootstrap, this paper proposes the "Fusionstrap" framework. This framework integrates the sample-stratified Bootstrap method with post-processing techniques to address class imbalances in datasets, enhance the diversity and accuracy of synthetic data, and at the same time maintain levels of utility and privacy. Findings. The effectiveness of this approach is evaluated through an experimental case study in which synthetic data is generated and the performance of the proposed framework is compared with the basic CTGAN and Synthpop methods on three datasets. The training data was collected and preprocessed using appropriate tools and techniques. Discussion. Our evaluation measures capture improvements in synthetic data quality and provide detailed insight into the strengths and weaknesses of the evaluated methods.
We conclude that the "Fusionstrap" framework aims to generate accurate, balanced and representative synthetic data. Additionally, it could be used as a data generation aid to improve accuracy when a dataset is imbalanced.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	This research addresses data generation challenges by focusing on the evaluation of a synthetic data generation method based on Bootstrap resampling. Inspired by Bootstrap, this paper proposes the "Fusionstrap" framework. This framework integrates the sample-stratified Bootstrap method with post-processing techniques to address class imbalances in datasets, enhance the diversity and accuracy of synthetic data, and at the same time maintain levels of utility and privacy.
dc.title	Unlocking the potential of bootstrapping: A journey towards balanced and reliable synthetic data. A framework for evaluating Bootstrap in the context of synthetic data generation
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Synthetic Data, Preprocessed Techniques, Stratified Bootstrap, Class Imbalances, Post-processing Techniques, Utility, Privacy
dc.subject.courseuu	Business Informatics
dc.thesis.id	26927

Files in this item

Name: Thesis_report_Denisa_Caragea_6 ...
Size: 4.247Mb
Format: PDF

This item appears in the following Collection(s)

Theses

Related items

Bootstrapping the CRISP-DM Process

What are the key bootstrapping practices employed by digital-first businesses?