
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Nguyen, Dong
dc.contributor.author: Weide, Jens van der
dc.date.accessioned: 2025-09-03T23:02:54Z
dc.date.available: 2025-09-03T23:02:54Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/50325
dc.description.abstract: With Large Language Models (LLMs) increasingly used worldwide, including in the Netherlands, there is a growing need to evaluate them on harmful biases such as stereotyping. While many benchmarks exist for English, non-English bias benchmarks remain scarce. This research introduces DutchCrowS, a Dutch benchmark to evaluate LLMs’ preference for stereotyping versus anti-stereotyping sentences across nine social groups. The benchmark is built by selecting, translating, and adapting data from the original CrowS-Pairs dataset, and extending it with newly crowdsourced data. This approach combines crowdsourcing with manual annotation, using annotation criteria informed by pitfalls of the original dataset identified in the literature. The benchmark is used to evaluate seven autoregressive LLMs of varying Dutch proficiency using a likelihood metric. Consistent with the literature, findings show that models explicitly trained on Dutch data, GEITje-7B-Ultra and EuroLLM-9B-Instruct, exhibit higher stereotyping scores than general multilingual models. Group-level results vary, especially for underrepresented groups such as religion and ethnicity, underlining the importance of data balance. Although the dataset size is moderate (n = 831), aggregate scores converge, supporting the benchmark’s reliability for overall stereotype evaluation. Models also score similarly on the adapted original data and the newly crowdsourced data, validating the combination of the two subsets, built with different development approaches, into one benchmark. Limitations of the likelihood metric are outlined and a prompt-based alternative is explored. While the likelihood and prompt metrics yield similar aggregate stereotype scores for some models, the low agreement at the instance level suggests that these scores are not based on consistent judgments across individual sentence pairs. This is in line with earlier calls in the literature to distinguish between model competence and model performance. The findings highlight not only the importance of language-specific datasets for evaluating social bias in LLMs, but also the need for careful metric design and a benchmark grounded in clear conceptual foundations. (A minimal code sketch of the likelihood metric follows this record.)
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This research introduces DutchCrowS, a Dutch benchmark to evaluate LLMs’ preference for stereotyping versus anti-stereotyping sentences across nine social groups.
dc.title: DutchCrowS: A Benchmark for Measuring Dutch Stereotypes in Large Language Models
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 53532
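
The abstract above describes a likelihood metric that compares a model's preference for the stereotyping versus the anti-stereotyping sentence of each pair. The following is a minimal sketch, assuming a Hugging Face causal language model, of how such a pairwise preference score could be computed. It uses a plain sum of token log-probabilities and a placeholder model name; the thesis's actual implementation (and the original CrowS-Pairs pseudo-likelihood metric) may differ in detail.

# Minimal sketch (illustrative, not the thesis's code): likelihood-based
# stereotype preference for an autoregressive LM via Hugging Face.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the thesis evaluates models such as GEITje-7B-Ultra.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def log_likelihood(sentence: str) -> float:
    """Sum of token log-probabilities of `sentence` under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position t predicts token t+1, so align logits with targets.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs[torch.arange(targets.size(0)), targets].sum().item()

def stereotype_score(pairs) -> float:
    """Fraction of pairs where the stereotyping sentence is more likely.

    `pairs` is a list of (stereotyping, anti_stereotyping) sentence
    strings; a score of 0.5 indicates no systematic preference.
    """
    wins = sum(
        log_likelihood(stereo) > log_likelihood(anti)
        for stereo, anti in pairs
    )
    return wins / len(pairs)

A score near 0.5 over the full benchmark would indicate no systematic preference, while scores well above 0.5 correspond to the higher stereotyping scores reported for the Dutch-trained models. Note that summing raw log-probabilities favors neither sentence only when the pair is closely length- and vocabulary-matched, which is one reason CrowS-Pairs-style metrics condition on the tokens shared between the two sentences.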

