dc.rights.license | CC-BY-NC-ND | |
dc.contributor.advisor | Gatt, A. | |
dc.contributor.author | Vasilyev, Boris | |
dc.date.accessioned | 2025-08-21T00:05:25Z | |
dc.date.available | 2025-08-21T00:05:25Z | |
dc.date.issued | 2025 | |
dc.identifier.uri | https://studenttheses.uu.nl/handle/20.500.12932/49883 | |
dc.description.abstract | This thesis investigates whether multitask learning can improve factual grounding
in automated radiology report generation using Large Vision-Language Models.
We adapt LLaVA-OneVision-7B with QLoRA for chest X-ray report generation,
jointly optimizing report generation with auxiliary disease classification to test whether
explicit pathology recognition can improve factual accuracy.
Experiments on MIMIC-CXR reveal a critical disconnect between stylistic
competence and clinical accuracy. While models demonstrate fluency in radiological terminology and report structure (ROUGE-L: 0.234-0.255), they exhibit
severe factual grounding failures with RadGraph F1 scores below 0.15. Despite
reasonable auxiliary classification performance (macro F1: 0.45-0.46), multitask
learning provides no meaningful improvement in report generation quality.
Qualitative analysis exposes systematic error patterns: only 8% of generated
reports are completely accurate, while 29% contain major finding hallucinations
and 35% omit critical abnormalities. These failures persist across different configurations, suggesting fundamental architectural limitations rather than parameter
or training issues.
Our results challenge the common approach of adapting general-purpose vision-language models for medical applications through fine-tuning. The vision-language
semantic gap in medical imaging appears more fundamental than previously recognized, requiring architectural innovations beyond incremental improvements.
We show that current LVLMs are unsuitable for high-stakes medical applications
without substantial modifications and that traditional NLP metrics provide misleadingly optimistic assessments.
While our multitask learning hypothesis was not confirmed, the findings establish realistic expectations for vision-language models in medical applications
and highlight the need for specialized architectures designed for clinical accuracy. | |
dc.description.sponsorship | Utrecht University | |
dc.language.iso | EN | |
dc.subject | Multitask Learning for Medical Image Report Generation with Structured Table Integration | |
dc.title | Multitask Learning for Medical Image Report Generation with Structured Table Integration | |
dc.type.content | Master Thesis | |
dc.rights.accessrights | Open Access | |
dc.subject.keywords | Artificial Intelligence; Computer Vision; Multitask Learning; Medical Image Report Generation | |
dc.subject.courseuu | Artificial Intelligence | |
dc.thesis.id | 52016 | |