Automatic Annotation of Dutch Educational Assessment Questions using Large Language Models
Summary
This study addresses the automatic evaluation of curriculum alignment, that is, the
extent to which learning objectives, instructional activities, and assessments are
consistent with one another. Traditionally, measuring this alignment is time-consuming
and often subjective, since it typically involves comparing all educational materials
against the learning objectives of the curriculum. To address this,
the research explores the use of large language models (LLMs) to automate the annotation
of Dutch assessment questions with subject-specific concepts. Specifically, it investigates
both generative (GPT-4.1 nano) and non-generative (mBERT) models using a labeled
dataset of Dutch statistics questions. Results indicate that LLMs show strong potential
in this domain: GPT achieved up to 71.1% accuracy and a macro F1 score of 62.2%, while
mBERT reached 91.7% accuracy and a macro F1 score of 83.7%. Additionally, prompt
engineering substantially improved GPT’s performance.
The findings also highlight the importance of careful adaptation and evaluation across
diverse educational contexts and task types, as performance varied depending on question
categories and subject matter. This research contributes to the integration of AI in
education by providing an effective solution for question annotation and offering insights
into which approaches are better suited for different educational scenarios. As a result,
educators can better align assessments with learning objectives and enhance the overall
learning experience.
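
The summary reports both accuracy and macro F1, and the two can diverge when some concept labels are rare. The following is a minimal sketch of how the two metrics are computed, not the study's evaluation code; the label names and predictions are hypothetical and chosen only to show the effect of class imbalance.

```python
def accuracy(y_true, y_pred):
    """Fraction of questions whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical example: 8 questions about "mean", 2 about "variance",
# and a classifier that always predicts the majority label.
gold = ["mean"] * 8 + ["variance"] * 2
pred = ["mean"] * 10

print(accuracy(gold, pred))            # 0.8
print(round(macro_f1(gold, pred), 3))  # 0.444
```

In this toy case the classifier is correct on 80% of the questions but never identifies the rare label, which macro F1 penalizes heavily. This is one reason a macro F1 score can sit well below accuracy on datasets with unevenly distributed labels.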