dc.description.abstract | In recent years, research on large language models (LLMs) has become an extremely popular and active subdiscipline of artificial intelligence (AI). As LLMs become more capable, they are increasingly used to generate data for further LLM training, complementing or replacing human-written text. However, because synthetic text differs systematically from human-written text, LLMs trained or fine-tuned on such data can begin to behave in unexpected ways: for instance, their output distribution shifts away from the distribution of human-written text, a phenomenon previous research has termed “model collapse”. Research on model collapse has thus far mostly focused on single-source scenarios, that is, the repeated training of LLMs on their own outputs, which has been shown to induce collapse. This thesis investigates the use of multi-source synthetic data, i.e., data generated by multiple source models, as a strategy for mitigating model collapse. The efficacy of this approach is investigated from several angles: Experiment 1 focuses on a diverse range of metrics for measuring model collapse directly, Experiment 2 investigates the impact of different fine-tuning regimes on model safety, and Experiment 3 examines the implications for LLM self-preference bias. We find compelling evidence for the efficacy of multi-source synthetic data in mitigating model collapse. We also describe complex interactions between synthetic data source diversity, the size of the data-generating models, and the size of the fine-tuned models, with varying implications for model safety and self-preference bias. Finally, we demonstrate the importance of metric choice in the study of model collapse, with different measurement approaches yielding different outcomes. | |
dc.subject | This thesis investigates the impact of synthetic data source diversity on model collapse, adversarial robustness, and self-preference bias in large language models. To do this, we generated synthetic data using different source models, fine-tuned open-weight LLMs on this data, and ran experiments on the fine-tuned models and their outputs. | |