Contextualized Representation Learning for Robust Multi-Modal Reasoning
Summary
Recent advances in computing resources and the accumulation of large-scale data have enabled the development of Large Vision-Language Models (LVLMs), which typically consist of a vision encoder that extracts image features and a language model that fuses these features with textual input. However, traditional vision encoders produce static, context-independent visual representations that often fail to capture the dynamic nature of visual inference when paired with specific contextual information. This limitation can lead to weak performance on tasks that require information beyond the static visual representation. In this thesis, we aim to develop an effective approach for enabling dynamic visual representations by fusing the vision encoder with contextual information. We introduce the Contextualized Vision Transformer (C-ViT), an early-fusion vision encoder that incorporates contextual information at an early stage, generating dynamic visual representations aligned with the given context. In addition, we propose a fine-tuning pipeline together with an augmented training dataset. Experimental results demonstrate that our approach not only outperforms existing baselines across various benchmarks but also enhances the model's robustness and generalization ability. Our work provides new insights for future research in multimodal learning.
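
The summary does not specify how the early fusion is realized; the sketch below is only one plausible interpretation, not the thesis's actual C-ViT implementation. It assumes the context is already embedded as text tokens and that fusion is done by letting image patch tokens cross-attend to those context tokens in the first few transformer blocks; the names (ContextualizedViT, ContextFusionBlock) and hyperparameters (fusion_depth, dim, etc.) are illustrative assumptions.

```python
# Illustrative sketch only: the fusion mechanism and all names below are assumptions,
# not the actual C-ViT architecture described in this thesis.
import torch
import torch.nn as nn


class ContextFusionBlock(nn.Module):
    """Transformer block in which image patch tokens attend to context (text) tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, patches: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Standard self-attention over image patch tokens.
        h = self.norm1(patches)
        x = patches + self.self_attn(h, h, h)[0]
        # Early fusion: patch tokens query the contextual (text) embeddings,
        # so the visual representation becomes context-dependent.
        x = x + self.cross_attn(self.norm2(x), context, context)[0]
        return x + self.mlp(self.norm3(x))


class ContextualizedViT(nn.Module):
    """Hypothetical early-fusion vision encoder: context enters in the first few blocks."""

    def __init__(self, dim: int = 768, depth: int = 12, fusion_depth: int = 4,
                 patch_size: int = 16, image_size: int = 224):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Early blocks fuse context; later blocks are plain ViT-style blocks.
        self.fusion_blocks = nn.ModuleList(ContextFusionBlock(dim) for _ in range(fusion_depth))
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth - fusion_depth)
        )

    def forward(self, image: torch.Tensor, context_embeds: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); context_embeds: (B, T, dim) embeddings of the textual context.
        x = self.patch_embed(image).flatten(2).transpose(1, 2) + self.pos_embed
        for blk in self.fusion_blocks:
            x = blk(x, context_embeds)
        for blk in self.blocks:
            x = blk(x)
        return x  # context-conditioned patch features passed to the language model
```

Under these assumptions, the key design choice is that fusion happens inside the vision encoder rather than after it, so the same image yields different visual features depending on the accompanying context.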
