dc.description.abstract | With Large Language Models (LLMs) increasingly used worldwide, including in the Netherlands, there is a growing need to evaluate them on harmful biases such as stereotyping. While many benchmarks exist for English, non-English bias benchmarks remain scarce. This research introduces DutchCrowS, a Dutch benchmark for evaluating LLMs' preference for stereotyping over anti-stereotyping sentences across nine social groups. The benchmark is constructed by selecting, translating, and adapting data from the original CrowS-Pairs dataset and extending it with newly crowdsourced data. This approach combines crowdsourcing with manual annotation, using annotation criteria informed by pitfalls of the original dataset identified in the literature.
The benchmark is used to evaluate seven autoregressive LLMs, varying in Dutch proficiency, using a likelihood metric. Consistent with the literature, findings show that models explicitly trained on Dutch data, GEITje-7B-Ultra and EuroLLM-9B-Instruct, exhibit higher stereotyping scores than general multilingual models. Group-level results vary, especially for underrepresented groups such as religion and ethnicity, underlining the importance of data balance. Although the dataset size is moderate (n = 831), aggregate scores converge, supporting the benchmark's reliability for overall stereotype evaluation. Models also score similarly on the adapted original data and the newly crowdsourced data, validating the combination of the two differently developed subsets into a single benchmark.
Limitations of the likelihood metric are outlined, and a prompt-based alternative is explored. While the likelihood and prompt metrics yield similar aggregate stereotype scores for some models, the low agreement at the instance level suggests that these scores are not based on consistent judgments across individual sentence pairs. This is in line with earlier calls in the literature to distinguish between model competence and performance. The findings highlight not only the importance of language-specific datasets for evaluating social bias in LLMs, but also the need for careful metric design and a benchmark grounded in clear conceptual foundations. | |