
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Dalpiaz, Fabiano
dc.contributor.author: Chou, Cheng Yi
dc.date.accessioned: 2025-08-28T00:01:30Z
dc.date.available: 2025-08-28T00:01:30Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/50024
dc.description.abstract: Modeling is an important activity in Requirements Engineering (RE), yet performing it manually is time-consuming and tedious. When modeling starts from an existing set of early requirements, researchers have proposed tools that automatically extract domain models. Early rule-based systems eased the burden but relied on experts to craft linguistic rules for processing natural language, an approach that struggles with free-form text and typos. Recent studies show that large language models (LLMs) can automate the modeling task, potentially replacing the heuristic rules; however, their resource-intensive nature limits local deployment. This study therefore asks whether cost-efficient small language models (SLMs) can deliver performance comparable to that of LLMs and a rule-based system, and further explores the types of errors they tend to produce. We evaluate GPT-o1, Llama3-8B, Qwen-14B, and the rule-based system Visual Narrator (VN) on the task of domain-model extraction, measuring model completeness (F2) and validity (F0.5). Experiments use a dataset of nine projects comprising 487 user stories, and apply Friedman and Nemenyi tests to verify statistical significance. The results show that neither large nor small language models achieve a statistically significant improvement over VN in class identification. GPT-o1 outperforms the two SLMs in model validity on both class- and association-identification tasks, whereas Qwen-14B attains model completeness comparable to GPT-o1 on both tasks despite its smaller size. Error analysis reveals distinct profiles: VN tends to mislabel Role entities, whereas language models more frequently introduce Redundant/Derived elements, with SLMs additionally prone to Irrelevant and Attribute errors. These findings indicate that resource-efficient SLMs can rival LLMs in model completeness, shifting the focus from model scale to accessibility. We also introduce an open-source evaluation system that supports reproducible research in automated requirements modeling.
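
Note: the completeness (F2) and validity (F0.5) scores named in the abstract are instances of the standard F-beta measure; the definition below is the general textbook formula, not a detail taken from the thesis itself:

F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), i.e. in LaTeX: $F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}$

where P is precision and R is recall. Setting beta = 2 weights recall more heavily (completeness), while beta = 0.5 weights precision more heavily (validity).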
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This study investigates the performance of one leading large language model (GPT-o1) and two popular small language models (Llama3-8B and Qwen-14B) in the task of domain modeling. It also provides prompts and an automatic evaluation system for future researchers.
dc.title: Visual Narrator 2.0: From User Stories to Domain Models via Large and Small Language Models
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Requirements Engineering; NLP; LLMs; SLMs; Domain Modeling
dc.subject.courseuu: Business Informatics
dc.thesis.id: 52698

