dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Doyran, Metehan
dc.contributor.author: Campanella, Riccardo
dc.date.accessioned: 2025-08-29T00:03:30Z
dc.date.available: 2025-08-29T00:03:30Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/50136
dc.description.abstract: In this exploratory work, we investigate the spatial reasoning capabilities of state-of-the-art Multimodal Large Language Models (MLLMs) in the context of building renovation. We evaluate models with different capacities, including GPT-4o, Qwen, Mulberry, and SpatialRGPT, on a few-shot dataset of Dutch residential buildings, assessing their performance in visual reasoning tasks. To this end, we propose DuTCh Space, a novel application of Chain-of-Thought Distillation consisting of a Dual-Teacher Framework that leverages both step-by-step rationales and scene graph description augmentation to guide and assess student models' performance on spatial reasoning. This structured supervision enables iterative refinement of spatial reasoning through domain-task decomposition and scene understanding. Additionally, we integrate Monte Carlo Tree Search (MCTS) at inference time to improve reasoning-path selection under visual uncertainty. By combining distillation and MCTS, we observe a measurable reduction in hallucinations, with models generating more grounded and verifiable predictions. Our findings demonstrate that reasoning-enhanced models can compensate for limited visual grounding even without scene graph augmentation, offering a scalable path toward spatially aware MLLMs in low-resource, domain-specific settings.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: Exploring spatial reasoning capabilities in state-of-the-art Multimodal LLMs for Dutch building renovation, introducing the DuTCh SpaCE framework to enhance reasoning and reduce hallucinations.
dc.title: Spatial Reasoning in Multimodal LLMs via CoT Distillation and Monte Carlo Tree Search for Dutch Facade-Element Detection: An Exploratory Study
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Vision-Language Models; Multimodal Large Language Models (MLLMs); Visual Grounding; Spatial Reasoning; Spatial Reasoning Enhancement; Architectural Feature Recognition; Dutch Building Stock; Domain Adaptation; Scaling Laws; Model Evaluation; Model Evaluation Frameworks; Visual Question Answering (VQA); Zero-Shot Detection; Few-Shot Learning; Synthetic Captioning; Scene Graphs; 3D Scene Understanding; Test-Time Compute (TTC); Chain-of-Thought (CoT) Distillation; Collective Monte Carlo Tree Search (CoMCTS); Hallucination Reduction; Prompt Engineering; Teacher-Student Learning; Parameter Efficient Fine-Tuning (PEFT); LoRA; GPT-4o; SpatialRGPT; Mulberry; Qwen
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 53206

