Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Dalpiaz, Fabiano
dc.contributor.author: Cheng, Junxin
dc.date.accessioned: 2025-10-08T23:01:27Z
dc.date.available: 2025-10-08T23:01:27Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/50513
dc.description.abstract: Automated trace link recovery between issues and commits is essential for maintaining requirements traceability, as it reduces the manual effort required in large-scale software projects. This study investigates the effectiveness of large language models (LLMs) in generating semantic representations within a machine learning classification framework for automated trace link recovery. To this end, we formulate three research questions: (1) How effective is feature representation via LLM embeddings compared to information retrieval (IR) methods, static word embeddings, and Bidirectional Encoder Representations from Transformers (BERT)-based models? (2) What is the relative contribution of textual and non-textual features to supervised issue-commit link classification? (3) Which classification algorithm performs best with the engineered features? We construct three categories of feature sets (textual, non-textual, and a combination of both) from data on eight open-source projects, and we apply five models (VSM with TF-IDF, FastText, Word2Vec, Sentence Transformer, and OpenAI's embedding model) to evaluate the effectiveness of the semantic representations. These models are assessed using two classifiers (Random Forest and XGBoost) in two practical scenarios: trace recommendation and trace maintenance. Evaluation metrics include Precision, Recall, F2, and F0.5 scores, supported by statistical significance tests and feature importance analysis. The results show that textual features generated by the VSM with TF-IDF consistently outperform the other semantic and non-textual features, demonstrating both the effectiveness of the domain-specific term distributions captured by traditional IR methods and the importance of high-quality semantic representations.
Nonetheless, LLM-based models achieve comparable performance without domain-specific fine-tuning, suggesting their strong potential for automated trace link recovery. Additionally, Random Forest outperforms XGBoost in both evaluation scenarios. This comparative study provides practical insights into designing robust LLM-enhanced traceability support systems for requirements engineering in modern software development environments. We introduce a hybrid approach that integrates traditional IR models, static and contextual embeddings (including LLM-based representations), and both textual and non-textual features within a supervised classification framework. Future work may focus on fine-tuning LLMs for domain-specific contexts, enriching the feature space with additional development artifacts, and exploring prompt-based or interactive trace inference. Investigating lightweight deployment strategies and alternative classifiers also presents promising directions for practical, scalable use.
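The textual-feature pipeline the abstract describes (TF-IDF vector space representations of issue-commit pairs fed to a Random Forest classifier, evaluated with F2 and F0.5) can be sketched as follows. This is a minimal illustration with invented toy data; the thesis's actual datasets, feature engineering, and hyperparameters are not reproduced here.

```python
# Sketch, under stated assumptions: TF-IDF (VSM) features for
# issue/commit text pairs, a Random Forest classifier, and the
# recall-weighted F2 / precision-weighted F0.5 scores the study uses.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import fbeta_score

# Hypothetical concatenated issue + commit texts; label 1 = true trace link.
docs = [
    "fix null pointer in login handler | patch auth NPE",
    "add dark mode toggle | update css theme variables",
    "fix null pointer in login handler | refactor build script",
    "add dark mode toggle | bump dependency versions",
]
labels = [1, 1, 0, 0]

# Textual features: TF-IDF weights over the combined vocabulary.
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Supervised link classification with Random Forest.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)
pred = clf.predict(X)

# F-beta: beta=2 emphasizes recall, beta=0.5 emphasizes precision.
f2 = fbeta_score(labels, pred, beta=2.0)
f05 = fbeta_score(labels, pred, beta=0.5)
print(f"F2={f2:.2f} F0.5={f05:.2f}")
```

In practice the pairs would come from the eight projects' issue trackers and commit logs, with a held-out split per evaluation scenario rather than scoring on the training data as this toy example does.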
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This thesis investigates LLM-based semantic representations for automated trace link recovery between issues and commits. It compares LLM embeddings with IR methods, static embeddings, and BERT-based models, analyzing textual and non-textual features using Random Forest and XGBoost on eight open-source projects.
dc.title: Exploring LLM-Based Semantic Representations in a Hybrid Approach for Automated Trace Link Recovery
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Requirement Traceability, Trace Link Recovery, Issue-commit Link Recovery, Large Language Models, Machine Learning
dc.subject.courseuu: Business Informatics
dc.thesis.id: 53073

