The impact of language family on D2T generation in under-resourced languages
Summary
The paper investigates the challenges and methods of generating
text from structured data (RDF triples) for under-resourced languages, based
on the WebNLG challenge. Its central question is how much language
family membership helps the model generate text in the WebNLG target
languages without prior examples (zero-shot). We work with limited
resources: we use an already pre-trained encoder-decoder language model,
mT5-small, to test the hypothesis and, given hardware limitations, train
only until performance plateaus. By applying further pre-training and
testing different finetuning strategies, we aim to improve text coherence
and fluency and to assess how well the model extracts information from
the RDF triples.
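To make the triple-to-text setup concrete, the sketch below shows a common way of linearizing RDF triples as input to mT5-small, assuming the HuggingFace transformers library; the <S>/<P>/<O> linearization scheme is one widespread convention and is illustrative, not necessarily the exact one used here.

# A minimal sketch of the triple-to-text setup, assuming the HuggingFace
# transformers library; the <S>/<P>/<O> linearization scheme is a common
# convention and is illustrative, not necessarily the one used here.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

def linearize(triples):
    # Flatten (subject, predicate, object) triples into one input string.
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in triples)

triples = [("Alan_Bean", "occupation", "Test_pilot")]
inputs = tokenizer(linearize(triples), return_tensors="pt")
# Before finetuning, generation is unreliable; after finetuning on WebNLG,
# generate() should verbalize the triples as fluent text.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))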
As part of our ablation experiments, we vary both the pre-training and the
finetuning steps to assess their impact on the D2T task. The experiments
start with the simplest model, further pre-trained on the OPUS-100 dataset
and finetuned on the English WebNLG dataset. The pre-training recipe then
remains the same, but for the finetuning step the WebNLG dataset is altered
to include more linguistically diverse samples. Lastly, we introduce an
augmentation technique that alters the WebNLG dataset further and generates
samples for the languages related to our targets (sketched below). In the
end, the best finetuning strategy is applied to a clean mT5 model to isolate
the influence of the further pre-training. Later on, in the meta-experiments,
we generate augmented data for the languages we target in WebNLG, which
takes the models out of the zero-shot setting.
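The summary does not spell out the augmentation mechanism; one plausible realization is machine-translating the English WebNLG references into related languages while keeping the triples fixed. The sketch below assumes the HuggingFace pipeline API and an OPUS-MT checkpoint; the model name and target language (Irish) are hypothetical choices for illustration, not the paper's exact technique.

# A hypothetical sketch of translation-based augmentation, assuming the
# HuggingFace pipeline API; the OPUS-MT checkpoint and the target language
# (Irish) are illustrative, not the paper's exact augmentation technique.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ga")

def augment(example):
    # Triples are language-agnostic; only the reference text is translated.
    translated = translator(example["target"])[0]["translation_text"]
    return {"triples": example["triples"], "target": translated}

sample = {
    "triples": [("Alan_Bean", "occupation", "Test_pilot")],
    "target": "Alan Bean worked as a test pilot.",
}
print(augment(sample))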
With these extensive experiments, we measure model performance using
automatic metrics, complemented by manual analysis and a comparison with
the other systems in the WebNLG 2023 challenge. For the assessment, we
use automatic metrics such as BLEU, ROUGE, METEOR, TER, chrF++,
BERTScore, and PARENT to provide a more holistic view of our model’s
capabilities.
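As a reference for how such scores are typically computed, the snippet below assumes the sacrebleu and bert_score packages; METEOR, ROUGE, and PARENT come from separate implementations and are omitted here.

# A minimal sketch of the automatic evaluation, assuming the sacrebleu and
# bert_score packages; METEOR, ROUGE, and PARENT require separate
# implementations and are omitted here.
import sacrebleu
from bert_score import score as bert_score

hyps = ["Alan Bean worked as a test pilot."]  # model outputs
refs = [["Alan Bean was a test pilot."]]      # one stream per reference set

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs, word_order=2)  # word_order=2 -> chrF++
ter = sacrebleu.corpus_ter(hyps, refs)
P, R, F1 = bert_score(hyps, [r[0] for r in refs], lang="en")

print(f"BLEU {bleu.score:.1f}  chrF++ {chrf.score:.1f}  "
      f"TER {ter.score:.1f}  BERTScore-F1 {F1.mean().item():.3f}")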
In a few words, this study addresses the following questions: (a) What is
the influence of language families under a zero-shot setting? (b) Is further
pre-training necessary, or does it yield diminishing returns? (c) Does
finetuning with noisy data provide any benefit? And lastly, (d) how does our
model compare with the other models of the WebNLG 2023 challenge?