
dc.rights.license CC-BY-NC-ND
dc.contributor.advisor Deoskar, Tejaswini
dc.contributor.author Bassani, Riccardo
dc.date.accessioned 2021-10-29T10:02:10Z
dc.date.available 2021-10-29T10:02:10Z
dc.date.issued 2021
dc.identifier.uri https://studenttheses.uu.nl/handle/20.500.12932/102
dc.description.abstract Multi-lingual language models based on multi-lingual BERT (mBERT) have made it possible to perform linguistic tasks on virtually any language for which some raw text data is available. Through a particular type of transfer learning, these models exploit the large amount of data available for resource-rich languages like English and apply what they learn when handling low-resource languages. This happens both in the pre-training phase, where low-resource languages' data alone would not be enough to train the model, and in the fine-tuning phase, where labeled data is often available only for a few languages. Multi-lingual language models, however, perform better for some languages than for others, and many languages do not seem to benefit from multi-lingual sharing at all. The broad goal of this thesis is to improve the performance of multi-lingual models on low-resource languages. We focus on two issues that presumably cause this gap. First, the models' difficulty in creating a multi-lingual representational space, in which text tokens are represented independently of the language to which they belong. Second, poor multi-lingual segmentation: in current work, low-resource languages are segmented with a tokenizer mainly trained on resource-rich language text, yielding sub-optimal segmentation for low-resource languages. To guarantee a fair segmentation of all languages, we instead use a dedicated tokenizer and a large subword vocabulary for each language. To allow the use of these large monolingual vocabularies, and to increase the multi-linguality of the representational space, we use a clustering algorithm to group together monolingual segments that are similar across languages, and train a BERT model on the obtained clusters. Not only does this allow the use of the aforementioned large vocabularies without increasing the capacity of the multi-lingual model, but it also yields a truly interlingual model, which learns to represent language-neutral semantic clusters instead of language-specific text tokens as in traditional BERT-like models. We call this model ICEBERT, for Interlingual-Clusters Enhanced BERT. We show significant improvements over standard multi-lingual segmentation and training in a question answering task covering nine languages, both in a small model regime and in a BERT-base training regime, demonstrating the effectiveness of our clustering. The proposed approach could easily be extended to more languages or applied to model architectures other than mBERT.
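
The clustering step described in the abstract can be pictured with a minimal, purely illustrative sketch; this is not the thesis code. The idea shown: subwords from several monolingual vocabularies are embedded, clustered into a shared set of interlingual clusters, and token sequences are re-indexed with cluster ids before BERT-style training. The vocabularies, the random stand-in embeddings, and the cluster count below are all hypothetical placeholders.

# Illustrative sketch (assumed setup, not the ICEBERT implementation):
# group subwords from per-language vocabularies into shared clusters,
# then remap segmented text to cluster ids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy monolingual subword vocabularies (hypothetical entries).
vocabs = {
    "en": ["_the", "_house", "ing", "_run"],
    "it": ["_la", "_casa", "are", "_corr"],
}

# Stand-in embedding for every (language, subword) pair; in practice these
# would come from, e.g., aligned monolingual subword embeddings.
dim = 16
entries = [(lang, sw) for lang, vocab in vocabs.items() for sw in vocab]
embeddings = rng.normal(size=(len(entries), dim))

# Cluster all monolingual subwords into a smaller shared cluster vocabulary.
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)

# Map each (language, subword) to its interlingual cluster id.
cluster_id = {entry: int(c) for entry, c in zip(entries, kmeans.labels_)}

# A segmented sentence in each language is re-indexed with cluster ids,
# so a downstream BERT-style model sees language-neutral cluster tokens.
for lang, segments in [("en", ["_the", "_house"]), ("it", ["_la", "_casa"])]:
    ids = [cluster_id[(lang, sw)] for sw in segments]
    print(lang, segments, "->", ids)

Under this sketch, the model's input vocabulary is the set of cluster ids rather than the union of all monolingual subword vocabularies, which is why large per-language vocabularies need not increase model capacity.
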
dc.description.sponsorship Utrecht University
dc.language.iso EN
dc.subject This thesis aimed to improve mBERT's cross-lingual generalization and performance on low-resource languages by optimizing tokenization and training the model on clusters of monolingual segments. The proposed approach, covering nine languages, yielded good-quality clusters and led to a novel model, Interlingual-Clusters Enhanced BERT (ICEBERT), which was shown to outperform standard segmentation and training methods on a question answering task.
dc.title Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization
dc.type.content Master Thesis
dc.rights.accessrights Open Access
dc.subject.keywords cross-lingual LMs; mBERT; ICEBERT; subword tokenization; subword clustering
dc.subject.courseuu Artificial Intelligence
dc.thesis.id 309

