Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Gatt, A.
dc.contributor.author: Camfferman, Cathinka
dc.date.accessioned: 2024-12-12T00:02:10Z
dc.date.available: 2024-12-12T00:02:10Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/48237
dc.description.abstract: The automatic summarization of texts is time- and cost-efficient compared to manual summarization. However, models used for summary generation often ‘hallucinate’ information, meaning that the generated summary contains information not present in the input document. This greatly diminishes the usefulness and reliability of generated summaries. To combat this issue, various evaluation metrics have been proposed to assess the factuality of generated summaries; they are often based on the summary’s token probabilities or its lexical similarity with a reference summary. Recent work on this topic uses Large Language Models (LLMs) for summary evaluation, given their impressive language capabilities. Several studies have demonstrated that LLMs such as ChatGPT perform comparably to state-of-the-art evaluation methods, even without being fine-tuned for this task. Unfortunately, there is a significant downside to employing an LLM for this task: due to their prohibitively large number of parameters, LLMs are currently too computationally expensive to be used in practice. Moreover, their evaluation capabilities are good, but still not on par with human assessment. This work aims to tackle both problems by improving a recent LLM-based evaluation framework called TrueTeacher, in which the language capabilities of a large LLM are distilled into a significantly smaller and more efficient model that can be used to assess the factuality of a summary. Its performance surpasses the current state of the art, but recent literature suggests that changes to its design may push its performance even further. This work investigates two directions for improvement: prompt engineering and direct preference optimization. In addition, this work aims to provide guidelines for future research on how to push the performance of smaller LLMs, as this is currently an understudied subject.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This work investigates different improvement strategies for smaller LLMs: prompt engineering and direct preference optimization. These strategies are used to improve the LLMs' ability to evaluate a generated summary in terms of factuality.
dc.title: An explorative study of smaller LLMs: Pushing performance in a factuality assessment task using Prompt Engineering and Direct Preference Optimization
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: LLMs; knowledge distillation; energy-efficient; direct preference optimization; post-training; alignment; prompt engineering; automated summarization; factuality; evaluation
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 41655

