DutchCrows: A Benchmark for Measuring Dutch Stereotypes in Large Language Models
Summary
With Large Language Models (LLMs) increasingly used worldwide, including in the Netherlands, there is a growing need to evaluate them for harmful biases such as stereotyping. While many benchmarks exist for English, non-English bias benchmarks remain scarce. This research introduces DutchCrowS, a Dutch benchmark that evaluates LLMs' preference for stereotyping versus anti-stereotyping sentences across nine social groups. The benchmark is built by selecting, translating, and adapting data from the original CrowS-Pairs dataset, and extending it with newly crowdsourced data. This approach combines crowdsourcing with manual annotation, using annotation criteria informed by pitfalls of the original dataset that have been identified in the literature.
The benchmark is used to evaluate seven autoregressive LLMs, varying in Dutch proficiency, using a likelihood-based metric. Consistent with the literature, findings show that models explicitly trained on Dutch data, GEITje-7B-Ultra and EuroLLM-9B-Instruct, exhibit higher stereotyping scores than general multilingual models. Group-level results vary, especially for underrepresented groups such as religion and ethnicity, underlining the importance of data balance. Although the dataset size is moderate (n = 831), aggregate scores converge, supporting the benchmark's reliability for overall stereotype evaluation. Models also score similarly on the adapted original data and the newly crowdsourced data, which supports combining two subsets with different development approaches into one benchmark.
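The summary does not spell out the exact likelihood formula, so the following is only a minimal sketch of how a pairwise stereotype score could be computed. It assumes that a sentence-level log-likelihood is already available for each sentence and that the score is simply the percentage of pairs in which the stereotyping sentence is the more likely one. The function name `stereotype_score` and the example values are hypothetical and do not come from the thesis.

```python
from typing import Iterable, Tuple


def stereotype_score(pair_logprobs: Iterable[Tuple[float, float]]) -> float:
    """Percentage of pairs where the stereotyping sentence is more likely.

    Each item is (log-likelihood of stereotyping sentence,
                  log-likelihood of anti-stereotyping sentence).
    Assumption: a plain preference count; the thesis may normalise
    or score sentences differently.
    """
    pairs = list(pair_logprobs)
    if not pairs:
        raise ValueError("need at least one sentence pair")
    # Count pairs in which the model prefers the stereotyping sentence.
    preferred = sum(1 for stereo, anti in pairs if stereo > anti)
    return 100.0 * preferred / len(pairs)


# Hypothetical log-likelihoods for four sentence pairs (not real model output).
example_pairs = [(-41.2, -43.0), (-37.5, -36.9), (-52.1, -55.4), (-29.8, -29.9)]
score = stereotype_score(example_pairs)
print(f"stereotype score: {score:.1f}%")
```

Under this assumption, a score near 50% would indicate no systematic preference, while higher values indicate a preference for stereotyping sentences.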
Limitations of the likelihood-based metric are outlined and a prompt-based alternative is explored. While the likelihood and prompt metrics yield similar aggregate stereotype scores for some models, the low agreement at the instance level suggests that these scores are not based on consistent judgments across individual sentence pairs. This is in line with earlier calls in the literature to distinguish between model competence and performance. The findings highlight not only the importance of language-specific datasets for evaluating social bias in LLMs, but also the need for careful metric design and a benchmark grounded in clear conceptual foundations.
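To illustrate how two metrics can produce the same aggregate score while disagreeing on individual pairs, the sketch below compares per-pair choices from two metrics. The summary does not say which agreement measure was used, so raw agreement and Cohen's kappa are assumptions here, and the choice vectors are invented for illustration (1 means the stereotyping sentence was preferred, 0 the anti-stereotyping one).

```python
from typing import Sequence


def raw_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of sentence pairs on which both metrics make the same choice."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two binary (0/1) label sequences."""
    p_o = raw_agreement(a, b)
    p_a = sum(a) / len(a)  # share of 1s under the first metric
    p_b = sum(b) / len(b)  # share of 1s under the second metric
    # Agreement expected by chance, given each metric's label frequencies.
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_e == 1.0:
        return 1.0  # both metrics always give the same single label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical per-pair choices (not data from the thesis).
likelihood_choice = [1, 1, 1, 0, 0, 0]
prompt_choice = [0, 0, 1, 1, 1, 0]

aggregate_likelihood = 100.0 * sum(likelihood_choice) / len(likelihood_choice)
aggregate_prompt = 100.0 * sum(prompt_choice) / len(prompt_choice)
agreement = raw_agreement(likelihood_choice, prompt_choice)
kappa = cohens_kappa(likelihood_choice, prompt_choice)
print(aggregate_likelihood, aggregate_prompt, agreement, kappa)
```

In this invented example both metrics give an aggregate score of 50%, yet they agree on only two of the six pairs, which is the kind of pattern the instance-level comparison is meant to expose.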