        •   Utrecht University Student Theses Repository Home

        Unmasking Memorization: Assessing Dutch Language Memorization in mT5 Models

        View/Open
        ADS_Thesis_Kas_Berendsen.pdf (2.731Mb)
        Publication date
        2024
        Author
        Berendsen, Kas
        Summary
        This study investigates the memorization of Dutch-language content in mT5 models, a multilingual variant of the Transformer-based T5 models. A fill-mask evaluation technique is used to measure memorization and how it varies with model size. Results show that memorization increases with model size only up to a point: the 580M and 1.2B models show significant memorization, while the smallest (300M) and largest (3.7B) models perform close to the baseline generalization level, so their relative memorization effects are minimal. The findings also show that data duplication and the masking level affect memorization. Moderately duplicated sequences exhibit the highest memorization, and a masking level similar to pre-training conditions yields the highest observed memorization, which declines sharply as the masking level increases. These findings bear on model reliability and carry ethical and legal implications, particularly regarding the use of copyrighted training data. This research underscores the need to balance training data and adjust model design to promote generalization and minimize memorization in multilingual models.
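
        The summary does not spell out the evaluation procedure, so the following is only a minimal sketch of what a fill-mask memorization check could look like, not the thesis's implementation. It assumes T5-style sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...) for masked spans, an exact-match score over the masked spans, and a relative memorization score defined as the gap between sequences seen in training and unseen baseline sequences. The function names `mask_tokens`, `exact_match`, and `relative_memorization` are hypothetical, and the model call that would produce the predicted spans is left out.

```python
import random


def mask_tokens(tokens, mask_level, seed=0):
    """Mask roughly `mask_level` of the tokens with T5-style sentinels.

    Adjacent masked tokens are merged into a single span.
    Returns (masked_input, target_spans).
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_level))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))

    masked_input, target_spans = [], []
    for i, tok in enumerate(tokens):
        if i in masked_idx:
            if i - 1 in masked_idx:
                # Continue the span opened by the previous masked token.
                target_spans[-1].append(tok)
            else:
                # Open a new span and emit its sentinel in the input.
                masked_input.append(f"<extra_id_{len(target_spans)}>")
                target_spans.append([tok])
        else:
            masked_input.append(tok)
    return masked_input, target_spans


def exact_match(predicted_spans, target_spans):
    """Fraction of masked spans that were reproduced exactly."""
    if not target_spans:
        return 0.0
    hits = sum(p == t for p, t in zip(predicted_spans, target_spans))
    return hits / len(target_spans)


def relative_memorization(seen_score, unseen_score):
    """Assumed definition: score on training data minus the unseen baseline."""
    return seen_score - unseen_score


if __name__ == "__main__":
    tokens = "de kat zit op de mat in het huis".split()
    for level in (0.15, 0.5):
        masked_input, target_spans = mask_tokens(tokens, level)
        print(level, " ".join(masked_input), target_spans)
```

        In such a setup, the masked input would be passed to each model size, the generated spans scored with `exact_match`, and the scores on training sequences compared against scores on unseen text at each masking level.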
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/47129
        Collections
        • Theses