Multitask Learning for Medical Image Report Generation with Structured Table Integration
Summary
This thesis investigates whether multitask learning can improve factual grounding
in automated radiology report generation with Large Vision-Language Models (LVLMs).
We adapt LLaVA-OneVision-7B with QLoRA for chest X-ray report generation,
jointly optimizing report generation with auxiliary disease classification to test
whether explicit pathology recognition improves factual accuracy.
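Concretely, the joint training objective can be summarized as a weighted sum of the two task losses. A minimal sketch, assuming a standard formulation (the weighting factor \(\lambda\) and the exact loss forms are illustrative assumptions, not specified in this summary):
\[
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{report}} \;+\; \lambda\, \mathcal{L}_{\text{cls}},
\]
where \(\mathcal{L}_{\text{report}}\) is the autoregressive token-level cross-entropy over the generated report and \(\mathcal{L}_{\text{cls}}\) is a multi-label classification loss (e.g., binary cross-entropy) over the auxiliary disease labels.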
Experiments on MIMIC-CXR reveal a critical disconnect between stylistic
competence and clinical accuracy. While models demonstrate fluency in radiological terminology and report structure (ROUGE-L: 0.234-0.255), they exhibit
severe factual grounding failures with RadGraph F1 scores below 0.15. Despite
reasonable auxiliary classification performance (macro F1: 0.45-0.46), multitask
learning provides no meaningful improvement in report generation quality.
Qualitative analysis exposes systematic error patterns: only 8% of generated
reports are completely accurate, while 29% hallucinate major findings
and 35% omit critical abnormalities. These failures persist across configurations,
suggesting fundamental architectural limitations rather than parameter or training issues.
Our results challenge the common approach of adapting general-purpose vision-language
models to medical applications through fine-tuning. The vision-language
semantic gap in medical imaging appears more fundamental than previously recognized, requiring architectural innovations beyond incremental improvements.
We show that current LVLMs are unsuitable for high-stakes medical applications
without substantial modifications and that traditional NLP metrics provide misleadingly optimistic assessments.
While our multitask learning hypothesis was not confirmed, the findings establish realistic expectations for vision-language models in medical applications
and highlight the need for specialized architectures designed for clinical accuracy.