Utrecht University Student Theses Repository

        An explorative study of smaller LLMs: Pushing performance in a factuality assessment task using Prompt Engineering and Direct Preference Optimization

        File
        Thesis_Cathinka_Camfferman.pdf (1.889 MB)
        Publication date
        2024
        Author
        Camfferman, Cathinka
        Summary
        The automatic summarization of texts is time- and cost-efficient compared to manual summarization. However, models used for summary generation often 'hallucinate' information, meaning that the generated summary contains information not present in the input document. This greatly diminishes the usefulness and reliability of generated summaries. To combat this issue, various evaluation metrics have been proposed to assess the factuality of generated summaries. These metrics are often based on the summary's token probabilities or its lexical similarity to a reference summary. Recent works on this topic use Large Language Models (LLMs) for summary evaluation, given their impressive language capabilities. Several studies have demonstrated that LLMs, such as ChatGPT, perform comparably to state-of-the-art evaluation methods, even without being fine-tuned for this task. Unfortunately, there is a significant downside to employing an LLM for this task. Due to a prohibitively large number of parameters, such models are currently too computationally expensive to be used in practice. Moreover, their evaluation capabilities, while good, are still not on par with human assessment. This work aims to tackle both problems by improving a recent LLM-based evaluation framework called TrueTeacher. In this framework, the language capabilities of a large LLM are distilled into a significantly smaller and more efficient model, which can be used to assess the factuality of a summary. Its performance surpasses the current state of the art, but recent literature suggests that changes to its design may push its performance even further. This work investigates two directions for improvement: prompt engineering and direct preference optimization. In addition, this work aims to provide guidelines for future research on how to push the performance of smaller LLMs, as this is currently an understudied subject.
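
        As a rough illustration (not taken from the thesis), direct preference optimization fine-tunes a model on pairs of preferred and dispreferred outputs without a separate reward model; below is a minimal sketch of the standard DPO objective in PyTorch, where all function and variable names are hypothetical and the inputs are assumed to be sequence log-probabilities from the trained policy and a frozen reference model.

        import torch
        import torch.nn.functional as F

        def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
            # Implicit rewards: how much the policy prefers each completion
            # relative to the frozen reference model, scaled by beta.
            chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
            rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
            # Minimizing this loss widens the margin between the preferred
            # ("chosen") and dispreferred ("rejected") completions.
            return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

        In a factuality-assessment setting, the "chosen" output could be the correct factual-consistency judgment for a document-summary pair and the "rejected" output the incorrect one; this pairing is an assumption used here only to make the sketch concrete.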
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/48237
        Collections
        • Theses