Multitask Learning for Medical Image Report Generation with Structured Table Integration
Summary
This thesis investigates whether multitask learning can improve factual grounding
in automated radiology report generation with Large Vision-Language Models (LVLMs).
We adapt LLaVA-OneVision-7B with QLoRA for chest X-ray report generation,
jointly optimizing report generation with auxiliary disease classification to test
whether explicit pathology recognition improves factual accuracy.
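Concretely, the joint training objective can be summarized as a weighted sum of the two task losses. A minimal sketch, assuming a standard formulation (the weighting factor \(\lambda\) and the exact loss forms are illustrative assumptions, not specified in this summary):
\[
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{report}} \;+\; \lambda\, \mathcal{L}_{\text{cls}},
\]
where \(\mathcal{L}_{\text{report}}\) is the autoregressive token-level cross-entropy over the generated report and \(\mathcal{L}_{\text{cls}}\) is a multi-label classification loss (e.g., binary cross-entropy) over the auxiliary disease labels.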
Experiments on MIMIC-CXR reveal a critical disconnect between stylistic
competence and clinical accuracy. While models demonstrate fluency in radiological terminology and report structure (ROUGE-L: 0.234-0.255), they exhibit
severe factual grounding failures with RadGraph F1 scores below 0.15. Despite
reasonable auxiliary classification performance (macro F1: 0.45-0.46), multitask
learning provides no meaningful improvement in report generation quality.
Qualitative analysis exposes systematic error patterns: only 8% of generated
reports are completely accurate, while 29% hallucinate major findings
and 35% omit critical abnormalities. These failures persist across configurations,
suggesting fundamental architectural limitations rather than parameter or training issues.
Our results challenge the common approach of adapting general-purpose vision-language
models to medical applications through fine-tuning. The vision-language
semantic gap in medical imaging appears more fundamental than previously recognized, requiring architectural innovations beyond incremental improvements.
We show that current LVLMs are unsuitable for high-stakes medical applications
without substantial modifications and that traditional NLP metrics provide misleadingly optimistic assessments.
While our multitask learning hypothesis was not confirmed, the findings establish realistic expectations for vision-language models in medical applications
and highlight the need for specialized architectures designed for clinical accuracy.