An explorative study of smaller LLMs: Pushing performance in a factuality assessment task using Prompt Engineering and Direct Preference Optimization
Summary
The automatic summarization of texts is time- and cost-efficient compared to manual summarization. However, models used for summary generation oftentimes ‘hallucinate’ information, meaning that the generated summary contains information not present in the input document. This greatly diminishes the usefulness and reliability of generated summaries.
To combat this issue, various evaluation metrics have been proposed to assess the factuality of generated summaries. They are often based on the summary’s token probability or on its lexical similarity to a reference summary. Recent works on this topic use Large Language Models (LLMs) for summary evaluation, given their impressive language capabilities. Several studies demonstrated that LLMs, such as ChatGPT, perform comparably to state-of-the-art evaluation methods, even without being fine-tuned for this task.
Unfortunately, there is a significant downside to employing an LLM for this task. Due to their prohibitively large number of parameters, LLMs are currently too computationally expensive to be used in practice. Moreover, while their evaluation capabilities are good, they are still not on par with human assessment.
This work aims to tackle both problems by improving a recent LLM-based evaluation framework called TrueTeacher. In this framework, the language capabilities of a large LLM are distilled into a significantly smaller and more efficient model, which can be used to assess the factuality of a summary. Its performance surpasses the current state of the art, but recent literature suggests that changes to its design may push its performance even further. This work investigates two directions for improvement: prompt engineering and direct preference optimization. In addition, this work aims to provide guidelines for future research on pushing the performance of smaller LLMs, as this subject is currently understudied.