dc.rights.license | CC-BY-NC-ND | |
dc.contributor.advisor | Doyran, Metehan | |
dc.contributor.author | Campanella, Riccardo | |
dc.date.accessioned | 2025-08-29T00:03:30Z | |
dc.date.available | 2025-08-29T00:03:30Z | |
dc.date.issued | 2025 | |
dc.identifier.uri | https://studenttheses.uu.nl/handle/20.500.12932/50136 | |
dc.description.abstract | In this exploratory work, we investigate the spatial reasoning capabilities of state-of-the-art Multimodal Large Language Models (MLLMs) in the context of building renovation. We evaluate models with different capacities, including GPT-4o, Qwen, Mulberry, and SpatialRGPT, on a few-shot dataset of Dutch residential buildings, assessing their performance in visual reasoning tasks. To this end, we propose DuTCh Space, a novel application of Chain-of-Thought Distillation consisting of a Dual-Teacher Framework that leverages both step-by-step rationales and scene graph description augmentation to guide and assess student models' performance on spatial reasoning. This structured supervision enables iterative refinement of spatial reasoning through domain-task decomposition and scene understanding. Additionally, we integrate Monte Carlo Tree Search at inference time to improve reasoning-path selection under visual uncertainty. By combining distillation and MCTS, we observe a measurable reduction in hallucinations, with models generating more grounded and verifiable predictions. Our findings demonstrate that reasoning-enhanced models can compensate for limited visual grounding even without scene graph augmentation, offering a scalable path toward spatially aware MLLMs in low-resource, domain-specific settings. | |
dc.description.sponsorship | Utrecht University | |
dc.language.iso | EN | |
dc.subject | Exploring spatial reasoning capabilities in state-of-the-art Multimodal LLMs for Dutch building renovation, introducing the DuTCh SpaCE framework to enhance reasoning and reduce hallucinations. | |
dc.title | Spatial Reasoning in Multimodal LLMs via CoT Distillation and Monte Carlo Tree Search for Dutch Facade-Element Detection: An Exploratory Study | |
dc.type.content | Master Thesis | |
dc.rights.accessrights | Open Access | |
dc.subject.keywords | Vision-Language Models; Multimodal Large Language Models (MLLMs); Visual Grounding; Spatial Reasoning; Spatial Reasoning Enhancement; Architectural Feature Recognition; Dutch Building Stock; Domain Adaptation; Scaling Laws; Model Evaluation; Model Evaluation Frameworks; Visual Question Answering (VQA); Zero-Shot Detection; Few-Shot Learning; Synthetic Captioning; Scene Graphs; 3D Scene Understanding; Test-Time Compute (TTC); Chain-of-Thought (CoT) Distillation; Collective Monte Carlo Tree Search (CoMCTS); Hallucination Reduction; Prompt Engineering; Teacher-Student Learning; Parameter-Efficient Fine-Tuning (PEFT); LoRA; GPT-4o; SpatialRGPT; Mulberry; Qwen
dc.subject.courseuu | Artificial Intelligence | |
dc.thesis.id | 53206 | |