View Item 
        •   Utrecht University Student Theses Repository Home
        • UU Theses Repository
        • Theses
        • View Item


        Contextualized Representation Learning for Robust Multi-Modal Reasoning

        View/Open
        Master_Thesis_Wenkai_Chen.pdf (4.959Mb)
        Publication date
        2025
        Author
        Chen, Wenkai
        Summary
        Recent advances in computing resources and the accumulation of large-scale data have enabled the development of Large Vision-Language Models, which typically consist of a vision encoder that extracts image features and a language model that fuses these features with textual data. However, traditional vision encoders produce static, context-independent visual representations that often fail to capture the dynamic nature of visual inference when paired with specific contextual information. This limitation can lead to weak performance on tasks that require information beyond the static visual representation. In this thesis, we aim to find an effective approach for enabling dynamic visual representations by fusing contextual information into the vision encoder. We introduce the Contextualized Vision Transformer (C-ViT), an early-fusion vision encoder that incorporates contextual information at an early stage, generating dynamic visual representations that are aligned with the given context. Additionally, we propose a finetuning pipeline along with an augmented training dataset. Experimental results demonstrate that our approach not only outperforms existing baselines across various benchmarks but also enhances the model's robustness and generalization ability. Our work provides new insights for future research in multimodal learning.
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/50640
        Collections
        • Theses