|dc.description.abstract||Multi-lingual language models based on multi-lingual BERT (mBERT) have made it possible to perform linguistic tasks on virtually any language for which some raw text data is available. Through a particular form of transfer learning, these models exploit the large amount of data available for resource-rich languages like English and apply what they learn when handling low-resource languages. This happens both in the pre-training phase, where the data of low-resource languages alone would not be enough to train the model, and in the fine-tuning phase, where labeled data is often available only for a few languages. Multi-lingual language models, however, perform better on some languages than on others, and many languages do not seem to benefit from multi-lingual sharing at all. The broad goal of this thesis is to improve the performance of multi-lingual models on low-resource languages. We focus on two issues that presumably cause this disparity. First, the models' difficulty in creating a truly multi-lingual representational space, in which text tokens are represented independently of the language to which they belong. Second, the poor multi-lingual segmentation adopted: low-resource languages are segmented with a tokenizer trained mainly on resource-rich language text, which yields sub-optimal segmentation for them. To guarantee a fair segmentation of all languages, we propose to use a dedicated tokenizer and a large subword vocabulary for each language. To make such large monolingual vocabularies practical, and to increase the multi-linguality of the representational space, we cluster the monolingual segments, grouping together tokens that are similar across languages, and train a BERT model on the obtained clusters. Not only does this allow the use of the aforementioned large vocabularies without increasing the capacity of the multi-lingual model, but it also yields a truly interlingual model, which learns to represent language-neutral semantic clusters instead of the language-specific text tokens of traditional BERT-like models. We call this model ICEBERT, for Interlingual-Clusters Enhanced BERT. We show significant improvements over standard multi-lingual segmentation and training on a question answering task covering nine languages, both in a small-model regime and in a BERT-base training regime, demonstrating the effectiveness of our clustering. The proposed approach can easily be extended to more languages or applied to model architectures other than mBERT.||
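For illustration only, the sketch below shows one way the clustering idea described in the abstract could be wired in code: each language keeps its own large subword vocabulary, and every (language, subword) pair is mapped to a shared cluster ID before reaching the BERT embedding layer. The function names and cluster assignments are hypothetical; the abstract does not specify the clustering algorithm or the actual implementation.

```python
# Illustrative sketch, not the thesis implementation: the mapping and the
# cluster assignments below are hypothetical stand-ins for the multilingual
# clusters ICEBERT is trained on.

def build_cluster_vocab(monolingual_vocabs, token_to_cluster, unk_cluster=0):
    """Map each (language, subword) pair to a shared multilingual cluster ID.

    monolingual_vocabs: dict lang -> iterable of subword strings
                        (one large vocabulary per language).
    token_to_cluster:   dict (lang, subword) -> cluster ID, e.g. produced by
                        clustering subword representations across languages.
    """
    return {
        (lang, tok): token_to_cluster.get((lang, tok), unk_cluster)
        for lang, vocab in monolingual_vocabs.items()
        for tok in vocab
    }

def encode(subwords, lang, cluster_vocab, unk_cluster=0):
    """Replace language-specific subwords with cluster IDs before embedding,
    so the model's input vocabulary is the (smaller) set of clusters."""
    return [cluster_vocab.get((lang, tok), unk_cluster) for tok in subwords]

# Toy usage: if English "play" and Spanish "jug" fall into the same cluster,
# the BERT embedding layer sees the same input ID for both.
vocabs = {"en": ["play", "##ing"], "es": ["jug", "##ando"]}
clusters = {("en", "play"): 7, ("es", "jug"): 7,
            ("en", "##ing"): 12, ("es", "##ando"): 12}
cv = build_cluster_vocab(vocabs, clusters)
print(encode(["play", "##ing"], "en", cv))   # [7, 12]
print(encode(["jug", "##ando"], "es", cv))   # [7, 12]
```

Under this scheme the embedding table is indexed by cluster rather than by language-specific token, which is why the per-language vocabularies can grow without enlarging the model.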
|dc.subject||This thesis aimed to improve mBERT's cross-lingual generalization and its performance on low-resource languages by optimizing tokenization and training the model on clusters of monolingual segments.
The proposed approach, applied to nine languages, yielded good-quality clusters and led to a novel model, Interlingual-Clusters Enhanced BERT (ICEBERT), which was shown to outperform standard segmentation and training methods on a question answering task.||