        Multitask Learning for Medical Image Report Generation with Structured Table Integration

        View/Open
        MSc_Thesis___Multitask_Learning_for_Medical_Image_Report_Generation_Submission.pdf (1.871Mb)
        Publication date
        2025
        Author
        Vasilyev, Boris
        Summary
        This thesis investigates whether multitask learning can improve factual grounding in automated radiology report generation using Large Vision-Language Models. We adapt LLaVA-OneVision-7B with QLoRA for chest X-ray report generation, jointly optimizing report generation with auxiliary disease classification to test whether explicit pathology recognition improves factual accuracy. Experiments on MIMIC-CXR reveal a critical disconnect between stylistic competence and clinical accuracy. While models demonstrate fluency in radiological terminology and report structure (ROUGE-L: 0.234-0.255), they exhibit severe factual grounding failures with RadGraph F1 scores below 0.15. Despite reasonable auxiliary classification performance (macro F1: 0.45-0.46), multitask learning provides no meaningful improvement in report generation quality. Qualitative analysis exposes systematic error patterns: only 8% of generated reports are completely accurate, while 29% contain major finding hallucinations and 35% omit critical abnormalities. These failures persist across different configurations, suggesting fundamental architectural limitations rather than parameter or training issues. Our results challenge the common approach of adapting general-purpose vision-language models for medical applications through fine-tuning. The vision-language semantic gap in medical imaging appears more fundamental than previously recognized, requiring architectural innovations beyond incremental improvements. We show that current LVLMs are unsuitable for high-stakes medical applications without substantial modifications and that traditional NLP metrics provide misleadingly optimistic assessments. While our multitask learning hypothesis was not confirmed, the findings establish realistic expectations for vision-language models in medical applications and highlight the need for specialized architectures designed for clinical accuracy.
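        The summary describes jointly optimizing report generation with an auxiliary disease-classification task. The thesis itself is not quoted here, so the following is only a minimal sketch of what such a joint objective could look like in PyTorch: a causal language-modeling loss over report tokens combined with a multi-label binary cross-entropy loss from an assumed auxiliary classification head. The names AuxiliaryClassificationHead, lambda_cls, and num_labels are illustrative assumptions, not the author's implementation.

        import torch
        import torch.nn as nn

        class AuxiliaryClassificationHead(nn.Module):
            """Hypothetical head mapping a pooled hidden state to per-disease logits."""
            def __init__(self, hidden_size: int, num_labels: int):
                super().__init__()
                self.classifier = nn.Linear(hidden_size, num_labels)

            def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
                return self.classifier(pooled_hidden)

        def multitask_loss(
            lm_logits: torch.Tensor,    # (batch, seq_len, vocab) report-decoder logits
            lm_labels: torch.Tensor,    # (batch, seq_len) token ids, -100 = ignored position
            cls_logits: torch.Tensor,   # (batch, num_labels) auxiliary-head logits
            cls_labels: torch.Tensor,   # (batch, num_labels) multi-hot disease labels
            lambda_cls: float = 0.5,    # auxiliary-task weight (assumed value)
        ) -> torch.Tensor:
            # Standard causal LM cross-entropy over report tokens.
            lm_loss = nn.functional.cross_entropy(
                lm_logits.view(-1, lm_logits.size(-1)),
                lm_labels.view(-1),
                ignore_index=-100,
            )
            # Multi-label binary cross-entropy for explicit pathology recognition.
            cls_loss = nn.functional.binary_cross_entropy_with_logits(
                cls_logits, cls_labels.float()
            )
            # Joint objective optimized during fine-tuning.
            return lm_loss + lambda_cls * cls_loss

        In such a setup the auxiliary weight (here lambda_cls) controls how strongly pathology recognition shapes the shared representation; the abstract's finding is that, regardless of configuration, this auxiliary signal did not translate into better factual grounding of the generated reports.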
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/49883
        Collections
        • Theses