dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Schraagen, Marijn
dc.contributor.author: Kalkman, Tom
dc.date.accessioned: 2024-01-25T00:01:19Z
dc.date.available: 2024-01-25T00:01:19Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/45842
dc.description.abstract: The use of linguistic influences from various languages, or the fluid alternation between languages, is known as code-switching. This phenomenon is particularly prevalent in areas shaped by diverse cultural influences, such as young urban communities in the Netherlands. Tools for Natural Language Processing (NLP) have seen an increase in use and performance quality over the last decade, but they are typically not trained to handle multilingual, urban youth speech styles. In this thesis, I train models of varying complexity to recognize code-switching. For this I use the Moroccorp, an unlabeled Moroccan-Dutch corpus of chat conversations between Moroccan-Dutch internet forum users. I annotate this dataset at word level, with labels describing the languages and linguistic varieties present, as well as labels for language-independent utterances. Two pretrained transformer models are fine-tuned on the annotated dataset: the Dutch model RobBERT and Multilingual BERT (M-BERT), which is trained on over a hundred languages, including Dutch, English and Arabic, all of which occur in the corpus. Their performance on language identification is compared to a multinomial logistic regression baseline model. If NLP tools like RobBERT can process informal code-switched texts as well as they process more formal, monolingual texts, this could reduce performance bias against non-standard Dutch in applications such as voice-activated tools or hate speech detectors on social media. The annotated subset of the Moroccorp contains about 10% code-switched sentences, with a code-mixing index (10.92) similar to those of other datasets used for language identification. The best hyperparameter configurations for both transformers reach a higher F1-score (F1 = 0.83) than the logistic regression baseline (F1 = 0.53) on the token classification task. Although the difference in F1-score between the two transformers proved statistically insignificant (p = 0.07), M-BERT shows higher precision, whereas RobBERT shows higher recall and accuracy. Recognizing code-switching and the Moroccan ethnolect simultaneously proves complex for the models. Higher performance may be achieved by addressing the two tasks individually and by using a more balanced dataset.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: I trained models of varying complexity with the objective of recognizing code-switching. I annotated a Moroccan-Dutch dataset at word level, with labels describing the languages and linguistic varieties present, as well as labels for language-independent utterances. Two deep learning models are fine-tuned on the annotated dataset. I use two pretrained transformer models and compare their performance on language identification to a multinomial logistic regression baseline model.
dc.title: Detecting and analyzing code-switching behaviour in a Moroccan-Dutch dataset using transformer architectures
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: BERT; RobBERT; Multilingual; Code-switching; NLP; Moroccan; Dutch; Moroccorp; Language identification; Transformers; ethnolect; language variety; annotation
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 27318
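
The abstract above frames word-level language identification as a token classification task on which two pretrained transformers are fine-tuned. Below is a minimal sketch of that setup using the Hugging Face transformers library; the label set, hyperparameters, and the train_ds / dev_ds dataset objects are illustrative assumptions rather than the thesis's actual configuration, while the RobBERT and M-BERT checkpoint names are the publicly available ones.

    # Sketch: fine-tune a pretrained transformer for word-level language
    # identification, framed as token classification. The label set and
    # hyperparameters are assumptions; train_ds / dev_ds are assumed to be
    # pre-tokenized, label-aligned datasets prepared elsewhere.
    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              Trainer, TrainingArguments)

    LABELS = ["nl", "ar", "en", "moroccan-dutch", "other"]  # assumed tag set

    model_name = "pdelobelle/robbert-v2-dutch-base"  # or "bert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(LABELS))

    args = TrainingArguments(output_dir="lid-model", learning_rate=5e-5,
                             per_device_train_batch_size=16, num_train_epochs=3)
    Trainer(model=model, args=args,
            train_dataset=train_ds, eval_dataset=dev_ds).train()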
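
The transformers are compared against a multinomial logistic regression baseline. The abstract does not state the baseline's features; a plausible sketch with scikit-learn uses character n-grams, with train_words / train_labels and dev_words / dev_labels standing in for assumed token and label lists:

    # Sketch: multinomial logistic regression baseline for per-word language
    # identification. Character n-gram features are an assumed choice, not
    # the thesis's documented feature set.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    baseline = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
        LogisticRegression(multi_class="multinomial", max_iter=1000),
    )
    baseline.fit(train_words, train_labels)        # assumed: lists of words and tags
    print(baseline.score(dev_words, dev_labels))   # mean per-token accuracy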
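
The code-mixing index of 10.92 quoted in the abstract measures how mixed the annotated utterances are. The record does not spell out the exact definition; a common one is the CMI of Gambäck and Das (2014), sketched here under that assumption, with the label "other" standing in for language-independent tokens:

    # Sketch: per-utterance Code-Mixing Index, assuming the Gambäck & Das (2014)
    # definition: CMI = 100 * (1 - max_lang / (n - u)), where n is the number of
    # tokens, u the number of language-independent tokens, and max_lang the
    # token count of the dominant language.
    from collections import Counter

    def cmi(labels):
        n = len(labels)
        lang_counts = Counter(l for l in labels if l != "other")
        u = n - sum(lang_counts.values())
        if not lang_counts or n == u:
            return 0.0                     # no language-tagged tokens at all
        return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))

    print(cmi(["nl", "nl", "nl", "ar", "other"]))  # 25.0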

