An Assessment of Zero-Shot Open Book Question Answering Using Large Language Models
Summary
In this thesis, we aim to compare the performance of state-of-the-art Language Models in a Zero-Shot Open Domain Question Answering setting for technical topics, specifically cloud technology and containerization. Question Answering has historically been mostly extractive in nature, but in recent years the paradigm of Natural Language Processing has shifted towards the more abstractive Natural Language Generation approach. We propose a two-step architecture in which the system attempts to answer questions from a set of documents with no prior training or fine-tuning. We do not focus solely on Retriever-Reader methods (e.g., BERT, RoBERTa), but also evaluate Retriever-Generator systems (e.g., GPT, FLAN-T5) through Long Form Question Answering. The Amazon Web Services dataset is used as a benchmark for evaluating the performance of the zero-shot Open Book Question Answering system. Empirical results are sometimes obtained by splitting documents into smaller subsections such as paragraphs or passages; we therefore analyse the hyperparameters for document splitting using a sliding window. We show that RoBERTa-large achieves a new state-of-the-art F1 score of 59.19 through proper pre-processing of the documents and careful hyperparameter selection, a respectable gain of 18.66 points over the baseline and 16.99 points over the best results in the original study. We conclude that generative models and Long Form Question Answering demonstrate great potential, but come with their own set of biases and risks. We observe that when the complexity of the model far exceeds what the evaluation metrics can capture, the relevance and meaning of those metrics become questionable. In this context, Semantic Answer Similarity and METEOR prove useful for analysing diverse model outputs, as they do not depend on lexical overlap in the way ROUGE, BLEU, F1 and EM do. Splitting documents into passages offers performance benefits, although it may not be superior for all use cases, and the optimal hyperparameter values are expected to vary with the specific application.
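
As a minimal sketch of the two-step architecture described above, the following Python snippet splits documents into overlapping passages with a sliding window and applies a SQuAD-tuned RoBERTa reader to each passage in a zero-shot fashion. It assumes the Hugging Face transformers library; the checkpoint name, window size, and stride are illustrative assumptions rather than the exact configuration evaluated in the thesis.

# Minimal sketch: sliding-window document splitting + zero-shot extractive reading.
# Assumes the Hugging Face `transformers` library; the checkpoint, window size and
# stride below are illustrative, not the thesis's exact configuration.
from transformers import pipeline


def split_into_passages(document, window=100, stride=50):
    """Split a document into overlapping word-level passages with a sliding window."""
    words = document.split()
    passages = []
    for start in range(0, max(len(words) - window, 0) + 1, stride):
        passages.append(" ".join(words[start:start + window]))
    return passages


# Reader step: a SQuAD-tuned RoBERTa model scores candidate answer spans in every
# passage; no additional training or fine-tuning is performed.
reader = pipeline("question-answering", model="deepset/roberta-large-squad2")


def answer(question, documents):
    passages = [p for doc in documents for p in split_into_passages(doc)]
    predictions = [reader(question=question, context=p) for p in passages]
    return max(predictions, key=lambda pred: pred["score"])  # highest-confidence span

For the Retriever-Generator variant, the reader could instead be swapped for a text-to-text generation pipeline (e.g., a FLAN-T5 checkpoint) prompted with the question and the retrieved passages, subject to the biases and evaluation difficulties noted above.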