
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Gatt, A.
dc.contributor.author: Vasilyev, Boris
dc.date.accessioned: 2025-08-21T00:05:25Z
dc.date.available: 2025-08-21T00:05:25Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/49883
dc.description.abstract: This thesis investigates whether multitask learning can improve factual grounding in automated radiology report generation with Large Vision-Language Models (LVLMs). We adapt LLaVA-OneVision-7B with QLoRA for chest X-ray report generation, jointly optimizing report generation with an auxiliary disease-classification objective to test whether explicit pathology recognition improves factual accuracy. Experiments on MIMIC-CXR reveal a critical disconnect between stylistic competence and clinical accuracy. While the models are fluent in radiological terminology and report structure (ROUGE-L: 0.234-0.255), they exhibit severe factual grounding failures, with RadGraph F1 scores below 0.15. Despite reasonable auxiliary classification performance (macro F1: 0.45-0.46), multitask learning provides no meaningful improvement in report generation quality. Qualitative analysis exposes systematic error patterns: only 8% of generated reports are completely accurate, while 29% hallucinate major findings and 35% omit critical abnormalities. These failures persist across configurations, suggesting fundamental architectural limitations rather than parameter or training issues. Our results challenge the common approach of adapting general-purpose vision-language models to medical applications through fine-tuning. The vision-language semantic gap in medical imaging appears more fundamental than previously recognized, requiring architectural innovations rather than incremental improvements. We show that current LVLMs are unsuitable for high-stakes medical applications without substantial modification, and that traditional NLP metrics provide misleadingly optimistic assessments. While our multitask learning hypothesis was not confirmed, the findings establish realistic expectations for vision-language models in medical applications and highlight the need for specialized architectures designed for clinical accuracy. (A sketch of the joint objective appears after this record.)
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: Multitask Learning for Medical Image Report Generation with Structured Table Integration
dc.title: Multitask Learning for Medical Image Report Generation with Structured Table Integration
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Artificial Intelligence; Computer Vision; Multitask Learning; Medical Image Report Generation
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 52016
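
The abstract above describes the method only at a high level: a general-purpose LVLM (LLaVA-OneVision-7B) adapted with QLoRA and trained jointly on report generation and auxiliary disease classification. Below is a minimal sketch of what such a setup can look like using the Hugging Face transformers and peft libraries; the quantization settings, LoRA rank and target modules, 14-label head, and loss weight alpha are illustrative assumptions, not the thesis's actual configuration.

    import torch
    import torch.nn as nn
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    # 4-bit quantization config for QLoRA-style fine-tuning
    # (hypothetical settings, not taken from the thesis).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # LoRA adapters on the language model's attention projections
    # (hypothetical choice of rank and target modules).
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )

    class DiseaseClassificationHead(nn.Module):
        """Auxiliary multi-label pathology classifier over pooled hidden
        states; 14 labels mirrors the CheXpert-style observations commonly
        used with MIMIC-CXR (an assumption, not the thesis's label set)."""

        def __init__(self, hidden_size: int, num_labels: int = 14):
            super().__init__()
            self.classifier = nn.Linear(hidden_size, num_labels)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            pooled = hidden_states.mean(dim=1)  # mean-pool over sequence
            return self.classifier(pooled)

    def multitask_loss(lm_loss: torch.Tensor,
                       cls_logits: torch.Tensor,
                       cls_targets: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
        """Joint objective: report-generation LM loss plus weighted BCE
        for disease classification; alpha is an illustrative weight."""
        cls_loss = nn.functional.binary_cross_entropy_with_logits(
            cls_logits, cls_targets.float())
        return lm_loss + alpha * cls_loss

The design choice this illustrates is that both tasks share the backbone's representations, so the classification gradient can in principle push the visual features toward explicit pathology evidence; the thesis's finding is that, in practice, this joint signal did not translate into better factual grounding of the generated reports.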

