        •   Utrecht University Student Theses Repository Home
        • UU Theses Repository
        • Theses
        • View Item

        Leveraging Protein Language Models to Characterize T Cell Repertoire Features

        View/Open
        FinalReport_Dylan_MinorResearchProfile.pdf (5.970Mb)
        Publication date
        2025
        Author
        Maassen-Veeters, Dylan
        Summary
        Single-cell immune repertoire sequencing has revolutionized the resolution and quantification of adaptive immune responses in modern research. Likewise, advances in deep learning have provided novel computational approaches for analyzing the complexity underlying T cell receptor (TCR) repertoires. Specifically, protein language models (PLMs) have shown initial success in combining the two, with the goal of quantifying the feature selection underlying adaptive immune repertoires. Examples include differentiation of cellular phenotypes, somatic hypermutations, and prediction of antigen-binding specificities. However, there has not been a comprehensive effort to explore how PLMs can be used to quantify selection in single-cell T cell repertoires under various experimental conditions. Therefore, we established a computational pipeline to interrogate how PLM-derived numerical embeddings can be used to describe and predict T cell selection features such as transcriptional phenotype, antigen specificity, germline gene usage, and alpha-beta chain pairing. This pipeline, known as Platypus Python, performed high-throughput embedding and prediction using three PLMs on four murine and two human datasets, publicly available from 10x Genomics and PlatypusDB. Subsequently, we leveraged Platypus Python to demonstrate PLM embedding usage by comparing CDR3 and full-length sequences, alpha and beta chains, CD4+ and CD8+ T cells, and various transcriptional phenotypes. We compared different repertoires, alpha and beta CDR3s, and full-length chains by labeling UMAP projections of the PLM embeddings by specificity and T cell phenotype. After learning that our PLM UMAPs separated sequences not by these features but by various gene segments within the TCRs, we turned to alternative embedding analyses such as cosine similarity comparisons and feature classification. We found that, despite no UMAP evidence of differentiation, the PLM embeddings could differentiate repertoire specificities using cosine similarity, and CD4+/CD8+ cells using classification models. Together, this work created a novel pipeline to facilitate AI-guided research of immune repertoires and to further benchmark how PLM embeddings can quantify T cell selection.
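        The cosine similarity comparison mentioned in the summary can be illustrated with a minimal sketch. This is not the Platypus Python API: the function names, the embedding dimension, and the randomly generated "embeddings" below are all hypothetical stand-ins for PLM outputs, chosen only to show how within-repertoire and between-repertoire similarity might be compared.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def mean_cross_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Average cosine similarity over all row pairs (one row from a, one from b)."""
    return float(cosine_similarity_matrix(a, b).mean())


# Synthetic stand-ins for PLM embeddings (assumption: one vector per TCR sequence).
rng = np.random.default_rng(0)
dim = 32  # hypothetical embedding size, far smaller than real PLM embeddings
center_a = rng.normal(size=dim)
center_b = rng.normal(size=dim)
repertoire_a = center_a + 0.1 * rng.normal(size=(50, dim))
repertoire_b = center_b + 0.1 * rng.normal(size=(50, dim))

# Sequences from the same synthetic repertoire should look more alike
# than sequences drawn from two different repertoires.
within_a = mean_cross_similarity(repertoire_a, repertoire_a)
between_ab = mean_cross_similarity(repertoire_a, repertoire_b)
print(f"within A: {within_a:.3f}, between A and B: {between_ab:.3f}")
```

        In practice the rows would come from a PLM rather than a random generator, and a higher within-repertoire than between-repertoire similarity would be one possible signal that the embeddings carry repertoire-specific information.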
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/50679
        Collections
        • Theses