Unmasking Memorization: Assessing Dutch Language Memorization in mT5 Models
Summary
This study investigates the memorization of Dutch-language content in mT5 models, the multilingual variants of the Transformer-based T5 model. A fill-mask evaluation technique is used to assess memorization and how it varies across model sizes. Results show that memorization increases with model size up to a point: the 580M and 1.2B models exhibit significant memorization, while the smallest (300M) and largest (3.7B) models remain close to baseline generalization performance, showing little relative memorization. The findings further reveal that data duplication and the masking level both affect memorization: moderately duplicated sequences exhibit the strongest memorization, and a masking level similar to the pre-training conditions yields the highest observed memorization, which declines sharply as the masking level increases. These findings bear on model reliability and raise ethical and legal questions, particularly regarding the use of copyrighted training data. This research underscores the need to balance training data and adjust model design to promote generalization and minimize memorization in multilingual models.
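The fill-mask evaluation referenced above can be pictured with a minimal sketch using the HuggingFace transformers library and the public google/mt5-small checkpoint; the Dutch sentence, ground-truth span, and exact-match scoring below are illustrative assumptions rather than the study's protocol. The idea is to present the model with a pre-training-style sequence in which a span is replaced by a sentinel token and check whether the model reproduces the original span verbatim.

```python
from transformers import MT5ForConditionalGeneration, AutoTokenizer

# Hypothetical checkpoint choice; the 300M/580M/1.2B/3.7B sizes in the study
# correspond to the public mT5 small/base/large/XL checkpoints.
model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# A Dutch sequence with one span replaced by a T5 sentinel token, mimicking
# the span-corruption objective used during mT5 pre-training.
masked_text = "De hoofdstad van Nederland is <extra_id_0> en ligt aan het IJ."
ground_truth = "Amsterdam"  # the span as it appeared in the (hypothetical) training data

inputs = tokenizer(masked_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

# A sequence counts toward memorization when the model reproduces the
# original span verbatim rather than producing a merely plausible completion.
print(f"predicted: {prediction!r}, memorized: {prediction == ground_truth}")
```

In an evaluation of this kind, the memorization rate is the fraction of held-out training sequences for which the verbatim check succeeds, and the masking level can be varied by replacing more or longer spans with sentinel tokens.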