Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Gatt, A.
dc.contributor.author: Camfferman, Cathinka
dc.date.accessioned: 2024-12-12T00:02:10Z
dc.date.available: 2024-12-12T00:02:10Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/48237
dc.description.abstract: The automatic summarization of texts is time- and cost-efficient compared to manual summarization. However, models used for summary generation often ‘hallucinate’ information, meaning that the generated summary contains information not present in the input document. This greatly diminishes the usefulness and reliability of generated summaries. To combat this issue, various evaluation metrics have been proposed to assess the factuality of generated summaries; they are often based on the summary’s token probabilities or its lexical similarity with a reference summary. Recent work on this topic uses Large Language Models (LLMs) for summary evaluation, given their impressive language capabilities. Several studies have demonstrated that LLMs such as ChatGPT perform comparably to state-of-the-art evaluation methods, even without being fine-tuned for this task. Unfortunately, there is a significant downside to employing an LLM for this task: due to their prohibitively large number of parameters, LLMs are currently too computationally expensive to be used in practice. Moreover, their evaluation capabilities are good, but still not on par with human assessment. This work aims to tackle both problems by improving a recent LLM-based evaluation framework called TrueTeacher, in which the language capabilities of a large LLM are distilled into a significantly smaller and more efficient model that can be used to assess the factuality of a summary. Its performance surpasses the current state of the art, but recent literature suggests that changes to its design may push its performance even further. This work investigates two directions for improvement: prompt engineering and direct preference optimization. In addition, this work aims to provide guidelines for future research on how to push the performance of smaller LLMs, as this is currently an understudied subject.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This work investigates different improvement strategies for smaller LLMs: prompt engineering and direct preference optimization. These strategies are used to improve the LLMs' ability to evaluate a generated summary in terms of factuality.
dc.title: An explorative study of smaller LLMs: Pushing performance in a factuality assessment task using Prompt Engineering and Direct Preference Optimization
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: LLMs; knowledge distillation; energy-efficient; direct preference optimization; post-training; alignment; prompt engineering; automated summarization; factuality; evaluation
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 41655

