Show simple item record

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Gatt, A.
dc.contributor.author	Christopoulos, George
dc.date.accessioned	2024-07-24T23:07:10Z
dc.date.available	2024-07-24T23:07:10Z
dc.date.issued	2024
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/46903
dc.description.abstract	The paper investigates the challenges and methods of generating text from structured data (RDF triples) for under-resourced languages, based on the WebNLG challenge. The main question is to assess how much language families help the model generate text without prior examples (zero-shot) in the WebNLG target languages. We work with limited resources: we use an already pre-trained encoder-decoder LLM, mT5-small, to test the hypothesis, and train until performance plateaus, given hardware limitations. By applying further pre-training and testing different fine-tuning strategies, the aim is to improve text coherence and fluency and to assess how well the model extracts information from the RDF triples. As part of our ablation experiments, we vary the pre-training and fine-tuning steps to assess their impact on the D2T task. The experiments start with the simplest model, pre-trained on the OPUS-100 dataset and fine-tuned on the English WebNLG dataset. The pre-training recipe then remains the same, but for the fine-tuning step the WebNLG dataset is altered to include more linguistically diverse samples. Lastly, we introduce an augmentation technique to alter the WebNLG dataset further and generate samples for all the related languages we target. Finally, the best fine-tuning strategy is applied to a clean mT5 model to assess the influence of the pre-training. Later on, in the meta-experiments, we generate augmented data for the languages we target in WebNLG, which takes the models out of the zero-shot setting. With these extensive experiments, we measure model performance using automatic metrics, complemented by manual analysis and a comparison with other models from the WebNLG 2023 challenge.
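The data-to-text setup described in the abstract feeds RDF triples to a seq2seq model such as mT5 as a single linearized string. A minimal linearization sketch is below; the "translate Graph to Text:" prefix and the "|" / "&&" separator tokens are illustrative assumptions, not the exact format used in the thesis:

```python
def linearize_triples(triples):
    # Flatten (subject, predicate, object) RDF triples into one prompt
    # string for an encoder-decoder model such as mT5-small.
    # Prefix and separators are illustrative choices, not the thesis format.
    parts = [f"{subj} | {pred} | {obj}" for subj, pred, obj in triples]
    return "translate Graph to Text: " + " && ".join(parts)

triples = [
    ("Alan_Bean", "occupation", "Test_pilot"),
    ("Alan_Bean", "birthPlace", "Wheeler,_Texas"),
]
prompt = linearize_triples(triples)
```

The resulting string would then be tokenized and passed to the model's `generate` method; the target side during fine-tuning is the reference verbalization from WebNLG.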
For the assessment, we use automatic metrics such as BLEU, ROUGE, METEOR, TER, chrF++, BERTScore and PARENT to provide a holistic view of our model's capabilities. In short, this study aims to answer the following questions: (a) What is the influence of language families under a zero-shot setting? (b) Is further pre-training necessary, or does it have diminishing returns? (c) Does fine-tuning with noisy data provide any benefit? And (d) how does our model compare with the other models in the WebNLG 2023 challenge?
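Of the metrics listed above, chrF++ is character-based, which makes it comparatively robust for morphologically rich, under-resourced languages. A simplified, single-sentence character n-gram F-score in the spirit of chrF can be sketched as follows; the full chrF++ also averages word n-grams and aggregates at corpus level, and the function name and defaults here are illustrative:

```python
from collections import Counter

def char_ngrams(text, n):
    # Multiset of all character n-grams of length n.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_score(hypothesis, reference, max_n=6, beta=2.0):
    # Simplified character n-gram F-score in the spirit of chrF:
    # average n-gram precision and recall over n = 1..max_n, then
    # combine with an F-beta that weights recall (beta = 2).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

In practice one would use an established implementation (e.g. the sacreBLEU toolkit) rather than this sketch, so that scores are comparable across papers.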
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	The thesis explores the impact of language families when generating text with mT5 in the under-resourced languages of WebNLG
dc.title	The impact of language family on D2T generation in under-resourced languages
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.courseuu	Computing Science
dc.thesis.id	34806

