
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Gatt, A.
dc.contributor.author: Brink, Douwe
dc.date.accessioned: 2025-08-21T01:01:35Z
dc.date.available: 2025-08-21T01:01:35Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/49908
dc.description.abstract: This thesis investigates how uncertainty can be integrated into image captioning models, focusing on generating captions that express appropriate levels of confidence. Using the HL Dataset, a human-annotated image-text dataset with axis-specific captions and associated confidence scores, we augment captions with lexical hedging based on the confidence scores and fine-tune a vision-language model (BLIP-2) with supervised learning and reinforcement learning (PPO) to align generated captions with human confidence judgments. Fine-tuning increased the use of hedging expressions, but human evaluation showed that it failed to consistently align the expressed uncertainty with human expectations. PPO training proved unstable because sparse references produced unreliable reward signals. Evaluation with standard semantic similarity metrics revealed biases in current metrics: BLEURT penalized hedging, while BERTScore overstated similarity. SBERT behaved more stably but still lacked sensitivity to epistemic tone. A comparison with LLaVA, a large instruction-tuned model, showed lower misalignment with human ratings on average, although this may be partly due to repeated phrasing from other caption axes, especially the rationale axis. Interestingly, the original human captions from the dataset showed the highest mean misalignment, highlighting gaps between the existing annotations and current human expectations. These results underscore the challenges of uncertainty-aware captioning and point to the need for datasets with more diverse references and better confidence annotations.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: Using Vision and Language models for Uncertainty-Aware Image Captioning
dc.title: Using Vision and Language models for Uncertainty-Aware Image Captioning
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 52005

