|dc.description.abstract||Multi-lingual language models based on multi-lingual BERT (mBERT) have made it possible to perform linguistic tasks on virtually any language for which some raw text data is available. Through a particular form of transfer learning, these models exploit the large amount of data available for resource-rich languages like English and apply what they learn when handling low-resource languages. This happens both in the pre-training phase, where the data of low-resource languages alone would not be enough to train the model, and in the fine-tuning phase, where labeled data is often available only for a few languages. Multi-lingual language models, however, perform better on some languages than on others, and many languages do not seem to benefit from multi-lingual sharing at all. The broad goal of this thesis is to improve the performance of multi-lingual models on low-resource languages. We focus on two issues that presumably cause this disparity. First, the models' difficulty in creating a truly multi-lingual representational space, in which text tokens are represented independently of the language to which they belong. Second, the poor multi-lingual segmentation adopted: low-resource languages are segmented with a tokenizer trained mainly on resource-rich language text, which yields sub-optimal segmentation for them. To guarantee a fair segmentation of all languages, we propose to use a dedicated tokenizer and a large subword vocabulary for each language. To make such large monolingual vocabularies practical, and to increase the multi-linguality of the representational space, we cluster the monolingual segments, grouping together tokens that are similar across languages, and train a BERT model on the obtained clusters. Not only does this allow the use of the aforementioned large vocabularies without increasing the capacity of the multi-lingual model, but it also yields a truly interlingual model, which learns to represent language-neutral semantic clusters instead of the language-specific text tokens of traditional BERT-like models. We call this model ICEBERT, for Interlingual-Clusters Enhanced BERT. We show significant improvements over standard multi-lingual segmentation and training on a question answering task covering nine languages, both in a small-model regime and in a BERT-base training regime, demonstrating the effectiveness of our clustering. The proposed approach can easily be extended to more languages or applied to model architectures other than mBERT.||
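For illustration only, the sketch below shows one way the clustering idea described in the abstract could be wired in code: each language keeps its own large subword vocabulary, and every (language, subword) pair is mapped to a shared cluster ID before reaching the BERT embedding layer. The function names and cluster assignments are hypothetical; the abstract does not specify the clustering algorithm or the actual implementation.

```python
# Illustrative sketch, not the thesis implementation: the mapping and the
# cluster assignments below are hypothetical stand-ins for the multilingual
# clusters ICEBERT is trained on.

def build_cluster_vocab(monolingual_vocabs, token_to_cluster, unk_cluster=0):
    """Map each (language, subword) pair to a shared multilingual cluster ID.

    monolingual_vocabs: dict lang -> iterable of subword strings
                        (one large vocabulary per language).
    token_to_cluster:   dict (lang, subword) -> cluster ID, e.g. produced by
                        clustering subword representations across languages.
    """
    return {
        (lang, tok): token_to_cluster.get((lang, tok), unk_cluster)
        for lang, vocab in monolingual_vocabs.items()
        for tok in vocab
    }

def encode(subwords, lang, cluster_vocab, unk_cluster=0):
    """Replace language-specific subwords with cluster IDs before embedding,
    so the model's input vocabulary is the (smaller) set of clusters."""
    return [cluster_vocab.get((lang, tok), unk_cluster) for tok in subwords]

# Toy usage: if English "play" and Spanish "jug" fall into the same cluster,
# the BERT embedding layer sees the same input ID for both.
vocabs = {"en": ["play", "##ing"], "es": ["jug", "##ando"]}
clusters = {("en", "play"): 7, ("es", "jug"): 7,
            ("en", "##ing"): 12, ("es", "##ando"): 12}
cv = build_cluster_vocab(vocabs, clusters)
print(encode(["play", "##ing"], "en", cv))   # [7, 12]
print(encode(["jug", "##ando"], "es", cv))   # [7, 12]
```

Under this scheme the embedding table is indexed by cluster rather than by language-specific token, which is why the per-language vocabularies can grow without enlarging the model.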
|dc.subject||This thesis aimed to improve mBERT's cross-lingual generalization and its performance on low-resource languages by optimizing tokenization and training the model on clusters of monolingual segments.
The proposed approach, applied to nine languages, yielded good-quality clusters and led to a novel model, Interlingual-Clusters Enhanced BERT (ICEBERT), which was shown to outperform standard segmentation and training methods on a question answering task.||