dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Gatt, A.
dc.contributor.author: Passadakis, Admitos
dc.date.accessioned: 2025-03-01T01:02:27Z
dc.date.available: 2025-03-01T01:02:27Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/48588
dc.description.abstract: This thesis explores the intersection of two key tasks in Vision & Language: Image Captioning and Visual Storytelling. While traditionally treated as separate problems, we investigate whether these two tasks can be combined into a single framework by viewing image captioning as a subset of visual storytelling. Our approach is threefold. First, we employ and fine-tune a transformer-based model, Clip-Cap, to generate individual captions for a sequence of images originating from the VIST dataset. Then, these captions are transformed into coherent narratives using text-to-text encoder-decoder models, such as T5 or BART. The aim is to generate stories that capture the dynamic relationships between visual elements in the input image sequence. Finally, we unify the previous steps in an end-to-end architecture capable of producing cohesive storylines from a sequence of correlated input images. The generated stories are evaluated through multiple methods: 1) automatic language metrics, 2) human judgment for a more nuanced assessment, and 3) GPT-4 as an artificial judge, for comparison against human annotators. Our results show that integrating image captioning and storytelling under our novel framework has a positive impact on the quality of the generated stories. In fact, in many cases, our output surpasses even the human level of written narratives, a conclusion supported by all evaluation methods employed. Consequently, the present work could contribute to ongoing research in generative AI, particularly in bridging the gap between textual description and narrative coherence in loosely correlated multi-image sequences.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: Can we use image captions as a basis to produce narrative storylines for a sequence of visual inputs?
dc.title: From Image Captioning to Visual Storytelling
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Image Captioning; Visual Storytelling; Language Models
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 41654
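
Note: the abstract above describes a two-stage pipeline in which per-image captions are first generated and then rewritten into a single narrative by a text-to-text encoder-decoder. The following is a minimal, illustrative sketch of that second (captions-to-story) step only, not code from the thesis: the "facebook/bart-base" checkpoint stands in for the authors' fine-tuned T5/BART models, and the simple concatenation of captions and the decoding settings are assumptions made here for illustration.

# Sketch of the captions-to-story step; model checkpoint and settings are placeholders.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "facebook/bart-base"  # placeholder; the thesis fine-tunes T5/BART on VIST

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def captions_to_story(captions):
    """Turn a sequence of per-image captions into a single narrative."""
    # Concatenate the captions for the whole image sequence into one input string.
    source = " ".join(captions)
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
    # Generate the story with beam search; decoding settings are illustrative only.
    output_ids = model.generate(**inputs, num_beams=4, max_length=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example: captions such as Clip-Cap might produce for a five-image VIST sequence.
captions = [
    "A family arrives at the beach on a sunny morning.",
    "The children build a sandcastle near the water.",
    "A big wave washes part of the castle away.",
    "Everyone laughs and starts building again.",
    "They watch the sunset together before heading home.",
]
print(captions_to_story(captions))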

