dc.rights.license | CC-BY-NC-ND | |
dc.contributor.advisor | Keuning, Hieke | |
dc.contributor.author | Tweel, Siem van den | |
dc.date.accessioned | 2025-08-21T00:05:40Z | |
dc.date.available | 2025-08-21T00:05:40Z | |
dc.date.issued | 2025 | |
dc.identifier.uri | https://studenttheses.uu.nl/handle/20.500.12932/49887 | |
dc.description.abstract | Large Language Models (LLMs) show promise for generating programming hints, but current systems largely ignore behavioral data from student programming sessions. This thesis investigates whether integrating contextual metrics such as time spent on tasks, error patterns, and help-seeking behavior can improve hint quality in LLM-based hint systems for introductory Python programming exercises.

We operationalized four contextual metrics and developed seven hint generation approaches, which we applied to the CSEDM 2019 dataset of novice programming sessions from an introductory Python course, generating 273 hints across 39 student sessions. We evaluated the hints from multiple perspectives: LLM assessment of the generation approaches, expert validation with three educators of the two most promising approaches and a baseline, and a user study with 16 novice programmers comparing the approach selected by the prior evaluation against a baseline.

Our findings reveal that the impact of contextual metrics depends on the evaluator's perspective. LLM evaluation showed that contextual approaches modestly improved overall hint quality. Experts, however, showed a modest preference for baseline hints, often penalizing hints generated with contextual metrics for revealing too much information and not letting the student solve the problem themselves. Students demonstrated a slight preference for hints using the time-on-task contextual metric, perceiving them as more useful for overcoming immediate struggles.

These contrasting outcomes highlight a fundamental challenge: hint quality assessment depends heavily on the evaluator's perspective and priorities. Students prioritize actionable guidance, while experts focus on long-term pedagogical goals. Our analysis also revealed the difficulty of using prompt engineering to achieve consistent LLM behavior for subtle, context-dependent guidance requirements. This work demonstrates that simply adding contextual metrics does not guarantee improved perceived quality. | |
dc.description.sponsorship | Utrecht University | |
dc.language.iso | EN | |
dc.subject | This thesis examines whether adding contextual metrics like time on task, error patterns, and help-seeking behavior can improve LLM-generated hints for beginner Python exercises. Using the CSEDM 2019 dataset, we tested several approaches and evaluated them with LLMs, educators, and students. Students preferred direct, actionable help, while educators valued hints that promoted independent problem-solving, showing that perceptions of quality vary by perspective. | |
dc.title | Integrating Contextual Metrics in LLM-Based Hint Generation for Programming Exercises | |
dc.type.content | Master Thesis | |
dc.rights.accessrights | Open Access | |
dc.subject.courseuu | Artificial Intelligence | |
dc.thesis.id | 52019 | |