Using Vision and Language Models for Uncertainty-Aware Image Captioning
Summary
This thesis investigates how uncertainty can be integrated into image captioning models, focusing on generating captions that express appropriate levels of confidence. Using the HL
Dataset, a human-annotated image-text dataset with axis-specific captions and associated
confidence scores, we augment captions with lexical hedges reflecting those scores (sketched below)
and fine-tune a vision-language model (BLIP-2) using supervised learning and reinforcement
learning (PPO) to align generated captions with human confidence judgments. Fine-tuning
increased the use of hedging expressions, but human evaluation showed that it failed to
align expressed uncertainty consistently with human expectations. PPO training proved unstable because the sparse references yielded unreliable reward signals. Evaluation with standard
semantic similarity metrics revealed biases in the metrics themselves: BLEURT penalized hedging,
while BERTScore overstated similarity. SBERT showed more stable behavior but still lacked
sensitivity to epistemic tone. A comparison with LLaVA, a large instruction-tuned model,
showed lower misalignment with human ratings on average. However, this may be partly due
to repeated phrasing from other caption axes, especially on the rationale axis. Interestingly, the
original human captions from the dataset showed the highest mean misalignment, revealing
gaps between existing annotations and current human expectations. These results highlight
the challenges of uncertainty-aware captioning and point to the need for datasets with more
diverse references and better confidence annotations.
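
The augmentation step referenced above can be pictured with a minimal sketch in Python. The confidence thresholds and hedge templates here are illustrative assumptions, not the ones used in the thesis, which derives them from the HL Dataset's confidence annotations.

```python
# Minimal sketch of confidence-based caption hedging. The thresholds and hedge
# phrases below are illustrative assumptions; the thesis defines its own
# templates over the HL Dataset's per-caption confidence scores.
def hedge_caption(caption: str, confidence: float) -> str:
    """Return the caption rephrased with a lexical hedge matching the confidence."""
    body = caption[0].lower() + caption[1:]   # lowercase so the caption fits the template
    if confidence >= 0.9:
        return caption                        # high confidence: keep the caption as-is
    if confidence >= 0.6:
        return "Most likely, " + body         # medium confidence: mild hedge
    return "It might be that " + body         # low confidence: strong hedge


print(hedge_caption("The man is playing football.", 0.4))
# -> It might be that the man is playing football.
```

Captions augmented in this way are the kind of supervision targets used for the supervised fine-tuning stage.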
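
The metric comparison can likewise be probed by checking how an SBERT-style embedding similarity reacts to a hedge. The checkpoint name and example captions below are assumptions for illustration and may differ from the evaluation setup in the thesis.

```python
# Illustrative probe of embedding-based similarity under hedging, using the
# sentence-transformers library. The checkpoint "all-MiniLM-L6-v2" and the
# example captions are assumptions, not the exact setup from the thesis.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "The man is playing football."
hedged = "The man might be playing football."

embeddings = model.encode([reference, hedged], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {similarity:.3f}")
```

Probes of this kind illustrate the point made above: embedding-based scores change little when a hedge is added, which is convenient for semantic comparison but leaves the metric largely blind to epistemic tone.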