dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Schraagen, Marijn
dc.contributor.author: Kalkman, Tom
dc.date.accessioned: 2024-01-25T00:01:19Z
dc.date.available: 2024-01-25T00:01:19Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/45842
dc.description.abstract: The use of linguistic influences from various languages, or the fluid alternation between languages, is known as code-switching. This phenomenon is particularly prevalent in areas shaped by diverse cultural influences, such as young urban communities in the Netherlands. Tools for Natural Language Processing (NLP) have seen an increase in use and performance quality over the last decade, but they are typically not trained to handle multilingual, urban youth speech styles. In this thesis, I train models of varying complexity to recognize code-switching. For this I use the Moroccorp, an unlabeled Moroccan-Dutch corpus of chat conversations between Moroccan-Dutch internet forum users. I annotate this dataset at word level, with labels describing the languages and linguistic varieties present, as well as labels for language-independent utterances. Two pretrained transformer models are fine-tuned on the annotated dataset: the Dutch model RobBERT and Multilingual BERT (M-BERT), which is trained on over a hundred languages, including Dutch, English and Arabic, all of which occur in the corpus. Their performance on language identification is compared to a multinomial logistic regression baseline model. If NLP tools like RobBERT can process informal code-switched texts as well as they process more formal, monolingual texts, this could reduce performance bias against non-standard Dutch in applications such as voice-activated tools or hate speech detectors on social media. The annotated subset of the Moroccorp contains about 10% code-switched sentences, with a code-mixing index (10.92) similar to those of other datasets used for language identification. The best hyperparameter configurations for both transformers reach a higher F1-score (F1 = 0.83) than the logistic regression baseline (F1 = 0.53) on the token classification task. Although the difference in F1-score between the two transformers proved statistically insignificant (p = 0.07), M-BERT shows higher precision, whereas RobBERT shows higher recall and accuracy. Recognizing code-switching and the Moroccan ethnolect simultaneously proves complex for the models. Higher performance may be achieved by addressing the two tasks individually and by using a more balanced dataset.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: I trained models of varying complexity with the objective of recognizing code-switching. I annotated a Moroccan-Dutch dataset at word level, with labels describing the languages and linguistic varieties present, as well as labels for language-independent utterances. Two deep learning models are fine-tuned on the annotated dataset. I use two pretrained transformer models and compare their performance on language identification to a multinomial logistic regression baseline model.
dc.title: Detecting and analyzing code-switching behaviour in a Moroccan-Dutch dataset using transformer architectures
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: BERT; RobBERT; Multilingual; Code-switching; NLP; Moroccan; Dutch; Moroccorp; Language identification; Transformers; ethnolect; language variety; annotation
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 27318
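
The abstract above frames word-level language identification as a token classification task on which two pretrained transformers are fine-tuned. Below is a minimal sketch of that setup using the Hugging Face transformers library; the label set, hyperparameters, and the train_ds / dev_ds dataset objects are illustrative assumptions rather than the thesis's actual configuration, while the RobBERT and M-BERT checkpoint names are the publicly available ones.

    # Sketch: fine-tune a pretrained transformer for word-level language
    # identification, framed as token classification. The label set and
    # hyperparameters are assumptions; train_ds / dev_ds are assumed to be
    # pre-tokenized, label-aligned datasets prepared elsewhere.
    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              Trainer, TrainingArguments)

    LABELS = ["nl", "ar", "en", "moroccan-dutch", "other"]  # assumed tag set

    model_name = "pdelobelle/robbert-v2-dutch-base"  # or "bert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(LABELS))

    args = TrainingArguments(output_dir="lid-model", learning_rate=5e-5,
                             per_device_train_batch_size=16, num_train_epochs=3)
    Trainer(model=model, args=args,
            train_dataset=train_ds, eval_dataset=dev_ds).train()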
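
The transformers are compared against a multinomial logistic regression baseline. The abstract does not state the baseline's features; a plausible sketch with scikit-learn uses character n-grams, with train_words / train_labels and dev_words / dev_labels standing in for assumed token and label lists:

    # Sketch: multinomial logistic regression baseline for per-word language
    # identification. Character n-gram features are an assumed choice, not
    # the thesis's documented feature set.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    baseline = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
        LogisticRegression(multi_class="multinomial", max_iter=1000),
    )
    baseline.fit(train_words, train_labels)        # assumed: lists of words and tags
    print(baseline.score(dev_words, dev_labels))   # mean per-token accuracy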
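
The code-mixing index of 10.92 quoted in the abstract measures how mixed the annotated utterances are. The record does not spell out the exact definition; a common one is the CMI of Gambäck and Das (2014), sketched here under that assumption, with the label "other" standing in for language-independent tokens:

    # Sketch: per-utterance Code-Mixing Index, assuming the Gambäck & Das (2014)
    # definition: CMI = 100 * (1 - max_lang / (n - u)), where n is the number of
    # tokens, u the number of language-independent tokens, and max_lang the
    # token count of the dominant language.
    from collections import Counter

    def cmi(labels):
        n = len(labels)
        lang_counts = Counter(l for l in labels if l != "other")
        u = n - sum(lang_counts.values())
        if not lang_counts or n == u:
            return 0.0                     # no language-tagged tokens at all
        return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))

    print(cmi(["nl", "nl", "nl", "ar", "other"]))  # 25.0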

