Show simple item record

dc.rights.licenseCC-BY-NC-ND
dc.contributor.advisorKeuning, Hieke
dc.contributor.authorTweel, Siem van den
dc.date.accessioned2025-08-21T00:05:40Z
dc.date.available2025-08-21T00:05:40Z
dc.date.issued2025
dc.identifier.urihttps://studenttheses.uu.nl/handle/20.500.12932/49887
dc.description.abstractLarge Language Models (LLMs) show promise for generating programming hints, but current systems largely ignore behavioral data from student programming sessions. This thesis investigates whether integrating contextual metrics such as time spent on tasks, error patterns, and help-seeking behavior can improve hint quality in LLM-based hint systems for introductory Python programming exercises. We operationalized four contextual metrics and developed seven hint generation approaches, which we applied to the CSEDM 2019 dataset of novice programming sessions from an introductory Python course, generating 273 hints across 39 student sessions. We then evaluated the hints from multiple perspectives: an LLM assessment of the generation approaches, expert validation with three educators of the two most promising approaches and a baseline, and a user study with 16 novice programmers comparing the final approach selected by the earlier evaluations against a baseline.

Our findings reveal that the impact of contextual metrics depends on the evaluator's perspective. The LLM evaluation showed that contextual approaches improved overall hint quality, though modestly. Experts, however, showed a modest preference for baseline hints, often penalizing hints generated with contextual metrics for revealing too much information and not letting the student solve the problem themselves. Students demonstrated a slight preference for hints using the time-on-task metric, perceiving them as more useful for overcoming immediate struggles.

These contrasting outcomes highlight a fundamental challenge: hint quality assessment depends heavily on the evaluator's perspective and priorities. Students prioritize actionable guidance, while experts focus on long-term pedagogical goals. Our analysis also revealed the difficulty of using prompt engineering to achieve consistent LLM behavior for subtle, context-dependent guidance. This work demonstrates that simply adding contextual metrics does not guarantee improved perceived quality.
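As an illustration of what operationalizing a contextual metric and passing it to an LLM might look like, the sketch below estimates time on task from session log timestamps and includes it in a hint-generation prompt. The log fields, function names, and prompt wording here are illustrative assumptions, not the pipeline described in the thesis.

```python
from datetime import datetime

# Illustrative sketch only: the "timestamp" log field and the prompt wording
# are assumptions, not the thesis's actual implementation.

def time_on_task_minutes(events: list[dict]) -> float:
    """Estimate time on task from the first and last event of a session."""
    times = [datetime.fromisoformat(e["timestamp"]) for e in events]
    return (max(times) - min(times)).total_seconds() / 60.0

def build_hint_prompt(exercise: str, student_code: str, events: list[dict]) -> str:
    """Compose an LLM prompt that adds a contextual metric as extra context."""
    minutes = time_on_task_minutes(events)
    return (
        f"Exercise:\n{exercise}\n\n"
        f"Student code:\n{student_code}\n\n"
        f"Context: the student has spent about {minutes:.0f} minutes on this task.\n"
        "Give a short, encouraging hint that helps them take the next step "
        "without revealing the full solution."
    )
```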
dc.description.sponsorshipUtrecht University
dc.language.isoEN
dc.subjectThis thesis examines whether adding contextual metrics like time on task, error patterns, and help-seeking behavior can improve LLM-generated hints for beginner Python exercises. Using the CSEDM 2019 dataset, we tested several approaches and evaluated them with LLMs, educators, and students. Students preferred direct, actionable help, while educators valued hints that promoted independent problem-solving, showing that perceptions of quality vary by perspective.
dc.titleIntegrating Contextual Metrics in LLM-Based Hint Generation for Programming Exercises
dc.type.contentMaster Thesis
dc.rights.accessrightsOpen Access
dc.subject.courseuuArtificial Intelligence
dc.thesis.id52019

