Spatial Reasoning in Multimodal LLMs via CoT Distillation and Monte Carlo Tree Search for Dutch Facade-Element Detection: An Exploratory Study

Campanella, Riccardo

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Doyran, Metehan
dc.contributor.author	Campanella, Riccardo
dc.date.accessioned	2025-08-29T00:03:30Z
dc.date.available	2025-08-29T00:03:30Z
dc.date.issued	2025
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/50136
dc.description.abstract	In this exploratory work, we investigate the spatial reasoning capabilities of state-of-the-art Multimodal Large Language Models (MLLMs) in the context of building renovation. We evaluate models with different capacities, including GPT-4o, Qwen, Mulberry, and SpatialRGPT, on a few-shot dataset of Dutch residential buildings, assessing their performance on visual reasoning tasks. To this end, we propose DuTCh Space, a novel application of Chain-of-Thought Distillation consisting of a Dual-Teacher Framework that leverages both step-by-step rationales and scene graph description augmentation to guide and assess student models' performance on spatial reasoning. This structured supervision enables iterative refinement of spatial reasoning through domain-task decomposition and scene understanding. Additionally, we integrate Monte Carlo Tree Search at inference time to improve reasoning-path selection under visual uncertainty. By combining distillation and MCTS, we observe a measurable reduction in hallucinations, with models generating more grounded and verifiable predictions. Our findings demonstrate that reasoning-enhanced models can compensate for limited visual grounding even without scene graph augmentation, offering a scalable path toward spatially-aware MLLMs in low-resource, domain-specific settings.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	Exploring spatial reasoning capabilities in state-of-the-art Multimodal LLMs for Dutch building renovation, introducing the DuTCh SpaCE framework to enhance reasoning and reduce hallucinations.
dc.title	Spatial Reasoning in Multimodal LLMs via CoT Distillation and Monte Carlo Tree Search for Dutch Facade-Element Detection: An Exploratory Study
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Vision-Language Models; Multimodal Large Language Models (MLLMs); Visual Grounding; Spatial Reasoning; Spatial Reasoning Enhancement; Architectural Feature Recognition; Dutch Building Stock; Domain Adaptation; Scaling Laws; Model Evaluation; Model Evaluation Frameworks; Visual Question Answering (VQA); Zero-Shot Detection; Few-Shot Learning; Synthetic Captioning; Scene Graphs; 3D Scene Understanding; Test-Time Compute (TTC); Chain-of-Thought (CoT) Distillation; Collective Monte Carlo Tree Search (CoMCTS); Hallucination Reduction; Prompt Engineering; Teacher-Student Learning; Parameter Efficient Fine-Tuning (PEFT); LoRA; GPT-4o; SpatialRGPT; Mulberry; Qwen
dc.subject.courseuu	Artificial Intelligence
dc.thesis.id	53206

Files in this item

Name: Master_Thesis_Riccardo_Campane ...
Size: 54.80Mb
Format: PDF

This item appears in the following Collection(s)

Theses
