        Integrating Contextual Metrics in LLM-Based Hint Generation for Programming Exercises

        final_thesis.pdf (1.045 MB)
        Publication date
        2025
        Author
        Tweel, Siem van den
        Summary
        Large Language Models (LLMs) show promise for generating programming hints, but current systems largely ignore behavioral data from student programming sessions. This thesis investigates whether integrating contextual metrics such as time spent on task, error patterns, and help-seeking behavior can improve hint quality in LLM-based hint systems for introductory Python programming exercises. We operationalized four contextual metrics and developed seven hint generation approaches, which we applied to the CSEDM 2019 dataset of novice programming sessions from an introductory Python course, generating 273 hints across 39 student sessions. We evaluated the approaches from multiple perspectives: an LLM assessment of all generation approaches, expert validation with three educators on the two most promising approaches and a baseline, and a user study with 16 novice programmers comparing the final approach selected by the prior evaluations against a baseline.

        Our findings reveal that the impact of contextual metrics depends on the evaluator's perspective. LLM evaluation showed that contextual approaches improved overall hint quality, though modestly. Experts, however, showed a modest preference for baseline hints, often penalizing hints generated with contextual metrics for revealing too much information and not letting the student solve the problem themselves. Students demonstrated a slight preference for hints using the time-on-task metric, perceiving them as more useful for overcoming immediate struggles.

        These contrasting outcomes highlight a fundamental challenge: hint quality assessment depends heavily on the evaluator's perspective and priorities. Students prioritize actionable guidance, while experts focus on long-term pedagogical goals. Our analysis also revealed the difficulty of using prompt engineering to achieve consistent LLM behavior for subtle, context-dependent guidance requirements. This work demonstrates that simply adding contextual metrics does not guarantee improved perceived hint quality.
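
        To make the idea concrete, below is a minimal sketch of how a contextual metric such as time on task might be computed from logged session events and folded into a hint-generation prompt. The event schema, metric definitions, and prompt wording are illustrative assumptions, not the thesis's actual implementation; the seven approaches evaluated in the thesis are described in the PDF.

# Illustrative sketch only: the event schema, metric definitions, and prompt
# wording below are assumptions, not the approaches evaluated in the thesis.
from dataclasses import dataclass


@dataclass
class SessionEvent:
    timestamp: float  # seconds since the session started (assumed log format)
    kind: str         # e.g. "edit", "run", "error", "help_request"


def time_on_task(events: list[SessionEvent]) -> float:
    """Elapsed seconds between the first and last recorded event."""
    if len(events) < 2:
        return 0.0
    return events[-1].timestamp - events[0].timestamp


def build_hint_prompt(exercise: str, code: str, events: list[SessionEvent]) -> str:
    """Embed contextual metrics as plain text in the prompt sent to the LLM."""
    minutes = time_on_task(events) / 60
    errors = sum(1 for e in events if e.kind == "error")
    help_requests = sum(1 for e in events if e.kind == "help_request")
    return (
        f"A novice Python student is working on this exercise: {exercise}\n"
        f"Their current code:\n{code}\n"
        f"Context: {minutes:.1f} minutes on task, {errors} errors, "
        f"{help_requests} prior help requests.\n"
        "Give a short hint that guides the student toward the fix "
        "without revealing the full solution."
    )


# Example with fabricated events:
events = [SessionEvent(0.0, "edit"),
          SessionEvent(310.0, "error"),
          SessionEvent(540.0, "help_request")]
print(build_hint_prompt("sum the numbers in a list",
                        "total = 0\nfor n in nums: total = n", events))

        One design point the sketch surfaces: the metrics enter the LLM purely as prompt text, which is why, as the summary notes, achieving consistent context-dependent behavior becomes a prompt engineering problem.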
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/49887
        Collections
        • Theses