        Utrecht University Student Theses Repository

        From Image Captioning to Visual Storytelling

        View/Open
        Admitos_Passadakis_MSc_Thesis.pdf (11.95Mb)
        Publication date
        2025
        Author
        Passadakis, Admitos
        Summary
        This thesis explores the intersection of two key tasks in Vision & Language: Image Captioning and Visual Storytelling. While traditionally treated as separate problems, we investigate whether these two tasks can be combined into a single framework by viewing image captioning as a subset of visual storytelling. Our approach is threefold. First, we employ and fine-tune a transformer-based model, Clip-Cap, to generate individual captions for a sequence of images originating from the VIST dataset. These captions are then transformed into coherent narratives using text-to-text encoder-decoder models, such as T5 or BART. The aim is to generate stories that capture the dynamic relationships between visual elements in the input image sequence. Finally, we unify the previous steps into an end-to-end architecture capable of producing cohesive storylines from a sequence of correlated input images. The generated stories are evaluated through multiple methods: 1) automatic language metrics, 2) human judgment for a more nuanced assessment, and 3) GPT-4 as an artificial judge, for comparison against human annotators. Our results show that integrating image captioning and storytelling under our novel framework has a positive impact on the quality of the generated stories. In many cases, our output even surpasses the level of human-written narratives, a conclusion supported by all evaluation methods employed. Consequently, the present work could contribute to ongoing research in generative AI, particularly in bridging the gap between textual description and narrative coherence in loosely correlated multi-image sequences.
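        The three-stage pipeline described in the summary can be sketched as follows. This is a minimal structural outline, not the thesis's implementation: the function bodies are hypothetical stubs standing in for the fine-tuned Clip-Cap captioner (stage 1) and the T5/BART story generator (stage 2).

```python
from typing import List


def caption_image(image_path: str) -> str:
    """Stage 1: per-image captioning.

    Hypothetical stub; the thesis fine-tunes Clip-Cap here to
    produce one caption per image in the sequence.
    """
    return f"a caption describing {image_path}"


def captions_to_story(captions: List[str]) -> str:
    """Stage 2: fuse per-image captions into one coherent narrative.

    Hypothetical stub; the thesis uses a text-to-text encoder-decoder
    model (T5 or BART) for this caption-to-story rewriting step.
    """
    return " ".join(captions)


def images_to_story(image_paths: List[str]) -> str:
    """Stage 3: the unified end-to-end pipeline, mapping an ordered
    image sequence directly to a cohesive storyline."""
    captions = [caption_image(p) for p in image_paths]
    return captions_to_story(captions)
```

        In the unified architecture, stage 3 is what makes image captioning a subset of visual storytelling: the captioner runs once per image, and the narrative model conditions on the whole caption sequence at once.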
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/48588
        Collections
        • Theses