dc.description.abstract | With Large Language Models (LLMs) increasingly used worldwide, including in the Netherlands, there is a growing need to evaluate them on harmful biases such as stereotyping. While many benchmarks exist for English, non-English bias benchmarks remain scarce. This research introduces DutchCrowS, a Dutch benchmark for evaluating LLMs' preference for stereotyping over anti-stereotyping sentences across nine social groups. The benchmark is constructed by selecting, translating, and adapting data from the original CrowS-Pairs dataset and extending it with newly crowdsourced data. This approach combines crowdsourcing with manual annotation, using annotation criteria informed by pitfalls of the original dataset identified in the literature.
The benchmark is used to evaluate seven autoregressive LLMs, varying in Dutch proficiency, using a likelihood metric. Consistent with the literature, findings show that models explicitly trained on Dutch data, GEITje-7B-Ultra and EuroLLM-9B-Instruct, exhibit higher stereotyping scores than general multilingual models. Group-level results vary, especially for underrepresented groups such as religion and ethnicity, underlining the importance of data balance. Although the dataset size is moderate (n = 831), aggregate scores converge, supporting the benchmark's reliability for overall stereotype evaluation. Models also score similarly on the adapted original data and the newly crowdsourced data, validating the combination of the two differently developed subsets into a single benchmark.
Limitations of the likelihood metric are outlined, and a prompt-based alternative is explored. While the likelihood and prompt metrics yield similar aggregate stereotype scores for some models, the low agreement at the instance level suggests that these scores are not based on consistent judgments across individual sentence pairs. This is in line with earlier calls in the literature to distinguish between model competence and performance. The findings highlight not only the importance of language-specific datasets for evaluating social bias in LLMs, but also the need for careful metric design and a benchmark grounded in clear conceptual foundations. | |