
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Bosch, Antal van den
dc.contributor.author: Berendsen, Kas
dc.date.accessioned: 2024-08-07T23:03:02Z
dc.date.available: 2024-08-07T23:03:02Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/47129
dc.description.abstract: This study investigates the memorization of Dutch-language content in mT5 models, a multilingual variant of the T5 Transformer-based models. A fill-mask evaluation technique is used to assess memorization and how it varies across model sizes. Results show that memorization increases with model size up to a certain point. Significant memorization is observed in the 580M and 1.2B models, while the smallest (300M) and largest (3.7B) models stay close to baseline generalization performance, so their relative memorization effects are minimal. Additionally, data duplication and the masking level both affect memorization: moderately duplicated sequences exhibit the highest memorization, and a masking level similar to pre-training conditions also yields the highest observed memorization, which declines sharply as the masking level increases. These findings have implications for model reliability, as well as ethical and legal implications, particularly regarding the use of copyrighted training data. This research underscores the need to balance training data and adjust model design to promote generalization and minimize memorization in multilingual models.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: Master's thesis about the level of Dutch training data memorization in the mT5 model family.
dc.title: Unmasking Memorization: Assessing Dutch Language Memorization in mT5 Models
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: LLMs; mT5; memorization; copyright; Dutch; NLP
dc.subject.courseuu: Applied Data Science
dc.thesis.id: 36219

