Detecting and analyzing code-switching behaviour in a Moroccan-Dutch dataset using transformer architectures
Summary
The use of linguistic influences from various languages, or the fluid alternation between
languages, is known as code-switching. This phenomenon is particularly prevalent in areas
shaped by diverse cultural influences, such as young urban communities in the Netherlands.
Natural Language Processing (NLP) tools have seen increased use and improved performance over the last decade, but they are typically not trained to handle multilingual, urban youth speech styles. In this thesis, I train models of varying complexity with the objective of recognizing code-switching. For this purpose I use the Moroccorp: an unlabeled
Moroccan-Dutch corpus that consists of chat conversations between Moroccan-Dutch internet
forum users. I annotate this dataset on word-level, with labels that describe the languages and
linguistic varieties that are present, as well as labels for language independent utterances.
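To illustrate the annotation format, a word-level labelling of a short chat message could look as shown below; both the example tokens and the label names are hypothetical, since this summary does not list the exact label set used in the thesis.

    # Hypothetical word-level annotation; the label names are illustrative
    # and do not necessarily match the label set used in the thesis.
    tokens = ["wollah", "ik", "ga", "morgen", "niet", "naar", "school", "lol"]
    labels = ["Arabic", "Dutch", "Dutch", "Dutch", "Dutch", "Dutch", "Dutch",
              "LanguageIndependent"]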
Two deep learning models are fine-tuned on the annotated dataset. For this, I use two
pretrained transformer models and compare their performance to a multinomial logistic
regression baseline model on the language identification task. I use the Dutch-language model RobBERT and Multilingual BERT (M-BERT), which is trained on over a hundred languages, including Dutch, English and Arabic, all of which are present in the Moroccan-Dutch corpus I use.
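A minimal sketch of how the two pretrained transformers can be set up for token-level language identification is given below; the checkpoint names and the number of labels are assumptions, and the actual training configuration is not part of this summary.

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    NUM_LABELS = 5  # hypothetical size of the word-level label set

    # Assumed Hugging Face checkpoints for RobBERT and Multilingual BERT
    for checkpoint in ("pdelobelle/robbert-v2-dutch-base",
                       "bert-base-multilingual-cased"):
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForTokenClassification.from_pretrained(
            checkpoint, num_labels=NUM_LABELS)
        # Fine-tune on the annotated Moroccorp subset (e.g. with the Trainer API)
        # and evaluate per-token precision, recall and F1 against the baseline.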
If informal code-switched texts can be processed as well as more formal, monolingual texts by NLP tools like RobBERT, this could reduce performance bias against non-standard Dutch in technological applications such as voice-activated tools or hate-speech detectors on social media. The annotated subset of the Moroccorp contains about 10% code-switched sentences, with a Code-Mixing Index (CMI = 10.92) similar to that of other datasets used for language identification.
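This summary does not state which formulation of the Code-Mixing Index is used; one commonly used per-utterance definition can be sketched as follows, with the corpus-level value taken as the average over all utterances.

    from collections import Counter

    def utterance_cmi(labels, independent=("LanguageIndependent",)):
        """Code-Mixing Index of a single utterance under one common formulation
        (assumed; the exact variant used in the thesis is not specified here):
        CMI = 100 * (1 - max_i(w_i) / (n - u)), where w_i counts the tokens of
        language i, n is the total number of tokens and u is the number of
        language-independent tokens."""
        n = len(labels)
        u = sum(1 for lab in labels if lab in independent)
        if n == u:  # no language-dependent tokens in this utterance
            return 0.0
        counts = Counter(lab for lab in labels if lab not in independent)
        return 100.0 * (1.0 - max(counts.values()) / (n - u))

Under this definition a monolingual utterance scores 0, so a corpus-level value around 10.9 points to a modest but non-negligible amount of mixing, consistent with the roughly 10% of code-switched sentences reported above.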
The best hyperparameter configurations for both models reach a higher F1-score (F1 = 0.83) than the logistic regression baseline model (F1 = 0.53) on the token classification task. Although the difference in F1-score between the two transformers proved not to be statistically significant (p = 0.07), M-BERT shows higher precision, whereas RobBERT shows higher recall and accuracy.
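How the token-level scores can be computed is sketched below; the evaluation procedure and the statistical test behind the reported p-value are not described in this summary, so the averaging choice and the use of scikit-learn are assumptions.

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    # Flattened gold and predicted word-level labels for a test set (illustrative)
    y_true = ["Dutch", "Dutch", "Arabic", "LanguageIndependent", "Dutch"]
    y_pred = ["Dutch", "Arabic", "Arabic", "LanguageIndependent", "Dutch"]

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    accuracy = accuracy_score(y_true, y_pred)
    print(f"P = {precision:.2f}, R = {recall:.2f}, F1 = {f1:.2f}, acc = {accuracy:.2f}")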
Recognizing code-switching and the Moroccan ethnolect simultaneously proves to be challenging for the models. Higher performance may be achieved by focusing on the two tasks individually and by using a more balanced dataset.