Detecting and analyzing code-switching behaviour in a Moroccan-Dutch dataset using transformer architectures
Summary
The use of linguistic influences from various languages, or the fluid alternation between
languages, is known as code-switching. This phenomenon is particularly prevalent in areas
shaped by diverse cultural influences, such as young urban communities in the Netherlands.
Natural Language Processing (NLP) tools have seen increased use and improved performance over the last decade, but they are typically not trained to handle multilingual, urban youth speech styles. In this thesis, I train models of varying complexity with the objective of recognizing code-switching. For this purpose I use the Moroccorp: an unlabeled
Moroccan-Dutch corpus that consists of chat conversations between Moroccan-Dutch internet
forum users. I annotate this dataset on word-level, with labels that describe the languages and
linguistic varieties that are present, as well as labels for language independent utterances.
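To illustrate the annotation format, a word-level labelling of a short chat message could look as shown below; both the example tokens and the label names are hypothetical, since this summary does not list the exact label set used in the thesis.

    # Hypothetical word-level annotation; the label names are illustrative
    # and do not necessarily match the label set used in the thesis.
    tokens = ["wollah", "ik", "ga", "morgen", "niet", "naar", "school", "lol"]
    labels = ["Arabic", "Dutch", "Dutch", "Dutch", "Dutch", "Dutch", "Dutch",
              "LanguageIndependent"]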
Two deep learning models are fine-tuned on the annotated dataset. For this, I use two
pretrained transformer models and compare their performance to a multinomial logistic
regression baseline model on the language identification task. I use the Dutch-language model RobBERT and Multilingual BERT (M-BERT), which is trained on over a hundred languages, including Dutch, English and Arabic, all of which are present in the Moroccan-Dutch corpus I use.
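A minimal sketch of how the two pretrained transformers can be set up for token-level language identification is given below; the checkpoint names and the number of labels are assumptions, and the actual training configuration is not part of this summary.

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    NUM_LABELS = 5  # hypothetical size of the word-level label set

    # Assumed Hugging Face checkpoints for RobBERT and Multilingual BERT
    for checkpoint in ("pdelobelle/robbert-v2-dutch-base",
                       "bert-base-multilingual-cased"):
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForTokenClassification.from_pretrained(
            checkpoint, num_labels=NUM_LABELS)
        # Fine-tune on the annotated Moroccorp subset (e.g. with the Trainer API)
        # and evaluate per-token precision, recall and F1 against the baseline.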
If informal code-switched texts can be processed as well as more formal, monolingual texts by NLP tools like RobBERT, this could reduce performance bias against non-standard Dutch in technological applications such as voice-activated tools or hate-speech detectors on social media. The annotated subset of the Moroccorp contains about 10% code-switched sentences, with a Code-Mixing Index (CMI = 10.92) similar to that of other datasets used for language identification.
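This summary does not state which formulation of the Code-Mixing Index is used; one commonly used per-utterance definition can be sketched as follows, with the corpus-level value taken as the average over all utterances.

    from collections import Counter

    def utterance_cmi(labels, independent=("LanguageIndependent",)):
        """Code-Mixing Index of a single utterance under one common formulation
        (assumed; the exact variant used in the thesis is not specified here):
        CMI = 100 * (1 - max_i(w_i) / (n - u)), where w_i counts the tokens of
        language i, n is the total number of tokens and u is the number of
        language-independent tokens."""
        n = len(labels)
        u = sum(1 for lab in labels if lab in independent)
        if n == u:  # no language-dependent tokens in this utterance
            return 0.0
        counts = Counter(lab for lab in labels if lab not in independent)
        return 100.0 * (1.0 - max(counts.values()) / (n - u))

Under this definition a monolingual utterance scores 0, so a corpus-level value around 10.9 points to a modest but non-negligible amount of mixing, consistent with the roughly 10% of code-switched sentences reported above.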
The best hyperparameter configurations for both models reach a higher F1-score (F1 = 0.83) than the logistic regression baseline model (F1 = 0.53) on the token classification task. Although the difference in F1-score between the two transformers proved not to be statistically significant (p = 0.07), M-BERT shows higher precision, whereas RobBERT shows higher recall and accuracy.
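How the token-level scores can be computed is sketched below; the evaluation procedure and the statistical test behind the reported p-value are not described in this summary, so the averaging choice and the use of scikit-learn are assumptions.

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    # Flattened gold and predicted word-level labels for a test set (illustrative)
    y_true = ["Dutch", "Dutch", "Arabic", "LanguageIndependent", "Dutch"]
    y_pred = ["Dutch", "Arabic", "Arabic", "LanguageIndependent", "Dutch"]

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    accuracy = accuracy_score(y_true, y_pred)
    print(f"P = {precision:.2f}, R = {recall:.2f}, F1 = {f1:.2f}, acc = {accuracy:.2f}")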
Recognizing code-switching and the Moroccan ethnolect simultaneously proves to be challenging for the models. Higher performance may be achieved by focusing on the two tasks individually and by using a more balanced dataset.