Worth more than gold
Summary
Research in Natural Language Generation is evolving rapidly, with new models consistently reported to outperform previous work on popular metrics, whether n-gram-based, such as BLEU and ROUGE, or model-based, such as BLEURT and BERTScore. Nonetheless, it remains unclear why state-of-the-art models make errors, which makes it difficult to identify viable solutions. This project aims to help fill this gap by performing model diagnostics on the mT5 transformer and attempting to uncover the origin of errors within Transformers. The mT5-base, T5-base, Yeb Havinga’s T5-base Dutch-case, and BART models are fine-tuned on CACAPO, an RDF-to-text dataset. The mT5-base model’s generations are then reviewed both manually and automatically, the latter using the evaluation metrics BLEU, METEOR, ROUGE, BERTScore, BARTScore, and PARENT. Two additional experiments are conducted: in one, CACAPO’s training set is augmented; in the other, underspecified inputs are supplemented with additional contextual information.
These experiments led to several insights. First, the reverse-engineered nature of CACAPO highlighted how difficult it is to capture all contextually relevant data in the input. CACAPO often omits relevant information in the reference text from the input because no attribute could be connected to the corresponding value. However, experimentation showed that this approach is too restrictive and that CACAPO could be extended using inter-subject attributes. In addition, data augmentation experiments highlighted the need for a structured augmentation method for multilingual use cases. Experimentation also showed that adding contextual data to underspecified inputs is more beneficial than augmenting the data. This in turn highlights the need for an improved content selection process, so that all contextually relevant information in the reference text is captured in the input data.
Another observation is that a model trained on improved input data performs better at inference, even when the inference-time input is underspecified. This indicates that improving the specificity of the training set alone could lower the number of errors the model makes. Moreover, comparisons between Dutch and English records showed that Dutch records benefited more from the additional input data, which could be caused by the amount of training data the model has seen during pre-training and fine-tuning. This could point to a relationship between improved contextual input data and the required training set size, where smaller datasets might suffice if the input data contains all contextually relevant information. Another possible reason is the difference in sentence complexity: the majority of Dutch records are relatively simple, whereas a large share of English records are complex. This could indicate that better-specified input data has more impact on relatively simple texts, as suggested by the performance increase for Dutch records and the lack of a BARTScore increase for English records. Furthermore, analysis across languages showed no difference in error counts, suggesting that error types are consistent between English and Dutch records. However, the severity of these errors was not captured in this project. Finally, each model had difficulty capturing the correct order of attributes and therefore generated incorrect conclusions. This is likely due to the lack of relational information in the CACAPO dataset for end-to-end models.