dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Gatt, A.
dc.contributor.author: Passadakis, Admitos
dc.date.accessioned: 2025-03-01T01:02:27Z
dc.date.available: 2025-03-01T01:02:27Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/48588
dc.description.abstract: This thesis explores the intersection of two key tasks in Vision & Language: Image Captioning and Visual Storytelling. While traditionally treated as separate problems, we investigate whether these two tasks can be combined into a single framework by viewing image captioning as a subset of visual storytelling. Our approach is threefold. First, we employ and fine-tune a transformer-based model, Clip-Cap, to generate individual captions for a sequence of images originating from the VIST dataset. Then, these captions are transformed into coherent narratives using text-to-text encoder-decoder models, such as T5 or BART. The aim is to generate stories that capture the dynamic relationships between visual elements in the input image sequence. Finally, we unify the previous steps in an end-to-end architecture capable of producing cohesive storylines from a sequence of correlated input images. The generated stories are evaluated through multiple methods: 1) automatic language metrics, 2) human judgment for a more nuanced assessment, and 3) GPT-4 as an artificial judge, for comparison against human annotators. Our results show that integrating image captioning and storytelling under our novel framework has a positive impact on the quality of the generated stories. In fact, in many cases, our output surpasses even the human level of written narratives, a conclusion supported by all evaluation methods employed. Consequently, the present work could contribute to ongoing research in generative AI, particularly in bridging the gap between textual description and narrative coherence in loosely correlated multi-image sequences.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: Can we use image captions as a basis to produce narrative storylines for a sequence of visual inputs?
dc.title: From Image Captioning to Visual Storytelling
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Image Captioning; Visual Storytelling; Language Models
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 41654
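
Note: the abstract above describes a two-stage pipeline in which per-image captions are first generated and then rewritten into a single narrative by a text-to-text encoder-decoder. The following is a minimal, illustrative sketch of that second (captions-to-story) step only, not code from the thesis: the "facebook/bart-base" checkpoint stands in for the authors' fine-tuned T5/BART models, and the simple concatenation of captions and the decoding settings are assumptions made here for illustration.

# Sketch of the captions-to-story step; model checkpoint and settings are placeholders.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "facebook/bart-base"  # placeholder; the thesis fine-tunes T5/BART on VIST

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def captions_to_story(captions):
    """Turn a sequence of per-image captions into a single narrative."""
    # Concatenate the captions for the whole image sequence into one input string.
    source = " ".join(captions)
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
    # Generate the story with beam search; decoding settings are illustrative only.
    output_ids = model.generate(**inputs, num_beams=4, max_length=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example: captions such as Clip-Cap might produce for a five-image VIST sequence.
captions = [
    "A family arrives at the beach on a sunny morning.",
    "The children build a sandcastle near the water.",
    "A big wave washes part of the castle away.",
    "Everyone laughs and starts building again.",
    "They watch the sunset together before heading home.",
]
print(captions_to_story(captions))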

