        Unlocking the potential of bootstrapping: A journey towards balanced and reliable synthetic data. A framework for evaluating Bootstrap in the context of synthetic data generation

        View/Open
        Thesis_report_Denisa_Caragea_6866409.pdf (4.247Mb)
        Publication date
        2024
        Author
        Caragea, Denisa
        Summary
        Introduction. Synthetic data generation is an essential technique in data analysis and machine learning: it complements existing datasets and addresses challenges in their analysis. Synthetic data are especially useful where original datasets are limited, inaccessible or insufficiently diverse. Adding synthetic data makes it feasible to increase the size of a dataset, which facilitates the efficient use of analytics and machine learning algorithms. However, generating synthetic data comes with challenges and risks. Among the most significant is class imbalance, where certain classes are underrepresented, which can distort the results of an analysis and their interpretation. In addition, data confidentiality must be maintained, especially for datasets containing sensitive information.

        Method. This research addresses these challenges by evaluating a synthetic data generation method based on Bootstrap resampling. Inspired by Bootstrap, it proposes the "Fusionstrap" framework, which integrates the sample-stratified Bootstrap method with post-processing techniques to address class imbalances, enhance the diversity and accuracy of synthetic data, and at the same time maintain levels of utility and privacy.

        Findings. The effectiveness of this approach is evaluated through an experimental case study in which synthetic data is generated and the performance of the proposed framework is compared with the basic CTGAN and Synthpop methods on three datasets. The training data was collected and preprocessed using appropriate tools and techniques.

        Discussion. Our evaluation measures capture improvements in synthetic data quality and provide detailed insight into the strengths and weaknesses of the evaluated methods. We conclude that the "Fusionstrap" framework aims to generate accurate, balanced and representative synthetic data. Additionally, it could be used as a data generation aid to improve accuracy in the case of an unbalanced dataset.
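
        To illustrate the general idea behind stratified Bootstrap resampling mentioned in the summary, the sketch below draws the same number of records, with replacement, from each class of a toy imbalanced dataset. It is not the "Fusionstrap" framework and includes none of its post-processing; the function name `stratified_bootstrap`, its parameters, and the toy data are hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


def stratified_bootstrap(records, label_of, n_per_class, seed=0):
    """Draw n_per_class records with replacement from each class (stratum).

    Hypothetical helper for illustration; not taken from the thesis.
    """
    rng = random.Random(seed)

    # Group the records by class label so each stratum is resampled separately.
    strata = defaultdict(list)
    for rec in records:
        strata[label_of(rec)].append(rec)

    # Sampling with replacement lets a small class reach the target size.
    sample = []
    for label in sorted(strata):
        sample.extend(rng.choices(strata[label], k=n_per_class))
    return sample


# Toy imbalanced data: 90 records of class "a" and 10 of class "b".
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
balanced = stratified_bootstrap(data, label_of=lambda r: r[0], n_per_class=50)
```

        Because every resampled record is an exact copy of an original record, a plain Bootstrap like this one balances the classes but adds no new values and offers no privacy protection by itself, which is presumably why the summary pairs resampling with post-processing.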
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/45795
        Collections
        • Theses