dc.rights.license | CC-BY-NC-ND | |
dc.contributor.advisor | Keuning, Hieke | |
dc.contributor.author | Tweel, Siem van den | |
dc.date.accessioned | 2025-08-21T00:05:40Z | |
dc.date.available | 2025-08-21T00:05:40Z | |
dc.date.issued | 2025 | |
dc.identifier.uri | https://studenttheses.uu.nl/handle/20.500.12932/49887 | |
dc.description.abstract | Large Language Models (LLMs) show promise for generating programming hints, but current systems largely ignore behavioral data from student programming sessions. This thesis investigates whether integrating contextual metrics such as time spent on tasks, error patterns, and help-seeking behavior can improve hint quality in LLM-based hint systems for introductory Python programming exercises.

We operationalized four contextual metrics and developed seven hint generation approaches, which we applied to the CSEDM 2019 dataset of novice programming sessions from an introductory Python course, generating 273 hints across 39 student sessions. We evaluated the hints from multiple perspectives: LLM assessment of the generation approaches, expert validation with three educators of the two most promising approaches and a baseline, and a user study with 16 novice programmers comparing the approach selected by the prior evaluation against a baseline.

Our findings reveal that the impact of contextual metrics depends on the evaluator's perspective. LLM evaluation showed that contextual approaches modestly improved overall hint quality. Experts, however, showed a modest preference for baseline hints, often penalizing hints generated with contextual metrics for revealing too much information and not letting the student solve the problem themselves. Students demonstrated a slight preference for hints using the time-on-task contextual metric, perceiving them as more useful for overcoming immediate struggles.

These contrasting outcomes highlight a fundamental challenge: hint quality assessment depends heavily on the evaluator's perspective and priorities. Students prioritize actionable guidance, while experts focus on long-term pedagogical goals. Our analysis also revealed the difficulty of using prompt engineering to achieve consistent LLM behavior for subtle, context-dependent guidance requirements. This work demonstrates that simply adding contextual metrics does not guarantee improved perceived quality. | |
dc.description.sponsorship | Utrecht University | |
dc.language.iso | EN | |
dc.subject | This thesis examines whether adding contextual metrics like time on task, error patterns, and help-seeking behavior can improve LLM-generated hints for beginner Python exercises. Using the CSEDM 2019 dataset, we tested several approaches and evaluated them with LLMs, educators, and students. Students preferred direct, actionable help, while educators valued hints that promoted independent problem-solving, showing that perceptions of quality vary by perspective. | |
dc.title | Integrating Contextual Metrics in LLM-Based Hint Generation for Programming Exercises | |
dc.type.content | Master Thesis | |
dc.rights.accessrights | Open Access | |
dc.subject.courseuu | Artificial Intelligence | |
dc.thesis.id | 52019 | |