Learnable Fingerprints for Large Language Models
Summary
The rapid advancement of generative artificial intelligence (AI), and of large language models (LLMs) in particular, has produced unprecedented text-generation capabilities and created an urgent need for methods that can identify AI-generated text and prevent misuse. Techniques such as watermarking, which mark text or images as AI-generated, are currently being explored but remain in their infancy and are especially challenging for textual output. This thesis focuses on model fingerprinting, i.e. methods that embed fingerprints into a deep generative model so that the model can be identified via prompting and the origin of its generated text can be authenticated. We propose a fine-tuning-based method to embed learnable fingerprints within LLMs, enabling black-box model authentication without requiring access to model parameters. We evaluate it against several desirable properties of fingerprints, such as preservation of generated-text quality and robustness against attacks. Our experiments show that model quality is maintained, even under quantization, but that fingerprints are susceptible to removal via fine-tuning and are not immune to detection through data leakage. Additionally, we experiment with combining model fingerprints with common watermarking methods that embed signatures into the generated text, and evaluate which watermarking paradigms can be used in combination with model fingerprinting. Our aim is to provide first insights into the potential of combining the strengths of both techniques, with applications to AI regulation, trustworthiness, detection, and authentication.
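To illustrate the black-box verification setting described above, the sketch below checks whether a model reproduces secret trigger-response pairs when prompted. It is a minimal sketch only: the verify_fingerprint helper, the fingerprint pairs, and the detection threshold are hypothetical assumptions for illustration and do not correspond to the exact procedure developed in the thesis.

# Minimal sketch of black-box fingerprint verification via prompting.
# Assumes secret (trigger, target) pairs were embedded into the model during
# fine-tuning; the pairs, model name, and threshold here are illustrative only.
from transformers import pipeline

def verify_fingerprint(model_name, fingerprint_pairs, threshold=0.8):
    """Return True if the model reproduces the secret targets often enough."""
    generator = pipeline("text-generation", model=model_name)
    hits = 0
    for trigger, target in fingerprint_pairs:
        output = generator(trigger, max_new_tokens=20, do_sample=False)[0]["generated_text"]
        # Count a hit when the secret target appears in the model's continuation.
        if target in output[len(trigger):]:
            hits += 1
    return hits / len(fingerprint_pairs) >= threshold

if __name__ == "__main__":
    # Hypothetical fingerprint pairs; in practice these would be kept secret by the model owner.
    pairs = [("zq7#kd trigger:", "OWNER-SIGNATURE-01"),
             ("m2!vx trigger:", "OWNER-SIGNATURE-02")]
    print(verify_fingerprint("gpt2", pairs))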