
        Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization

        Master_Thesis_Riccardo_Bassani_Clustering_Monolingual_Vocabularies.pdf (1.362Mb)
        Publication date
        2021
        Author
        Bassani, Riccardo
        Summary
Multi-lingual language models based on multi-lingual BERT (mBERT) have made it possible to perform linguistic tasks on virtually any language for which some raw text data is available. Through a particular type of transfer learning, these models exploit the large amount of data available for resource-rich languages like English and apply what they learn when handling low-resource languages. This happens both in the pre-training phase, where low-resource languages' data alone would not be enough to train the model, and in the fine-tuning phase, where labeled data is often available only for a few languages. Multi-lingual language models, however, perform better for some languages than for others, and many languages do not seem to benefit from multi-lingual sharing at all. The broad goal of this thesis is to improve the performance of multi-lingual models on low-resource languages. We focus on two issues that presumably cause this phenomenon. First, the models' difficulty in creating a truly multi-lingual representational space, in which text tokens are represented independently of the language to which they belong. Second, poor multi-lingual segmentation: low-resource languages are segmented with a tokenizer trained mainly on resource-rich-language text, which yields sub-optimal segmentation for those languages. To guarantee fair segmentation of all languages, we propose using a dedicated tokenizer with a large subword vocabulary for each language. To make these large monolingual vocabularies usable, and to increase the multi-linguality of the representational space, we then use a clustering algorithm to group together segments that are similar across languages, and train a BERT model on the resulting clusters. Not only does this allow the use of the aforementioned large vocabularies without increasing the multi-lingual model's capacity, but it also yields a truly interlingual model, which learns to represent language-neutral semantic clusters rather than the language-specific text tokens of traditional BERT-like models. We call this model ICEBERT, standing for Interlingual-Clusters Enhanced BERT. We show significant improvements over standard multi-lingual segmentation and training on a question answering task covering nine languages, both in a small-model regime and in a BERT-base training regime, demonstrating the effectiveness of our clustering. The proposed approach can easily be extended to more languages, or applied to model architectures other than mBERT.
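For readers who want a concrete picture of the pipeline the summary describes (per-language tokenizers with large subword vocabularies, cross-lingual clustering of segments, and training over cluster IDs), here is a minimal sketch. The library choices (sentencepiece, scikit-learn), corpus paths, vocabulary and cluster sizes, and the placeholder random embeddings are all illustrative assumptions, not the thesis's actual implementation.

```python
# Hypothetical sketch of the ICEBERT preprocessing pipeline described in the
# summary. All paths, sizes, and the embedding step are stand-in assumptions.
import numpy as np
import sentencepiece as spm
from sklearn.cluster import MiniBatchKMeans

LANGS = ["en", "sw", "fi"]  # placeholder language codes

# 1. Train a dedicated tokenizer with a large subword vocabulary for each
#    language, so low-resource languages are not segmented by a tokenizer
#    dominated by resource-rich-language text.
for lang in LANGS:
    spm.SentencePieceTrainer.train(
        input=f"corpus.{lang}.txt",   # assumed monolingual corpus file
        model_prefix=f"tok_{lang}",
        vocab_size=100_000,           # large monolingual vocabulary
    )

# 2. Represent every subword of every monolingual vocabulary in a shared
#    vector space. Random vectors are used here as a placeholder; the real
#    pipeline would use cross-lingually comparable segment embeddings.
def embed_subwords(lang: str):
    sp = spm.SentencePieceProcessor(model_file=f"tok_{lang}.model")
    vocab = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
    vectors = np.random.randn(len(vocab), 300)  # placeholder embeddings
    return sp, vocab, vectors

sp_models, all_pieces, all_vecs = {}, [], []
for lang in LANGS:
    sp, vocab, vecs = embed_subwords(lang)
    sp_models[lang] = sp
    all_pieces += [(lang, p) for p in vocab]
    all_vecs.append(vecs)
all_vecs = np.vstack(all_vecs)

# 3. Cluster subwords across languages so that similar segments from
#    different languages share one cluster ID; the model's input vocabulary
#    is the set of clusters, so its capacity does not grow with the
#    combined size of the monolingual vocabularies.
kmeans = MiniBatchKMeans(n_clusters=50_000, random_state=0).fit(all_vecs)
cluster_of = {piece: c for piece, c in zip(all_pieces, kmeans.labels_)}

# 4. At training time, a sentence is segmented with its own language's
#    tokenizer and each subword is replaced by its cluster ID, which serves
#    as the input ID for the BERT model.
def encode(lang: str, text: str) -> list[int]:
    pieces = sp_models[lang].encode(text, out_type=str)
    return [int(cluster_of[(lang, p)]) for p in pieces]
```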
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/102
        Collections
        • Theses