Leveraging Protein Language Models to Characterize T Cell Repertoire Features
Summary
Single-cell immune repertoire sequencing has transformed the resolution at which adaptive immune responses can be quantified. Likewise, advances in deep learning have provided novel computational approaches for analyzing the complexity of T cell receptor (TCR) repertoires. In particular, protein language models (PLMs) have shown initial success when applied to repertoire sequencing data, with the goal of quantifying the feature selection underlying adaptive immune repertoires. Examples include differentiating cellular phenotypes, characterizing somatic hypermutation, and predicting antigen-binding specificity. However, there has not been a comprehensive effort to explore how PLMs can be used to quantify selection in single-cell T cell repertoires under various experimental conditions. We therefore established a computational pipeline to interrogate how PLM-derived numerical embeddings can be used to describe and predict T cell selection features such as transcriptional phenotype, antigen specificity, germline gene usage, and alpha-beta chain pairing. This pipeline, Platypus Python, performed high-throughput embedding and prediction using three PLMs on four murine and two human datasets, publicly available from 10x Genomics and PlatypusDB. We then used Platypus Python to demonstrate the use of PLM embeddings by comparing CDR3 and full-length sequences, alpha and beta chains, CD4+ and CD8+ T cells, and various transcriptional phenotypes. We compared repertoires, alpha and beta CDR3s, and full-length chains by projecting PLM embeddings with UMAP and labeling them by specificity and T cell phenotype. Because the UMAP projections separated sequences by TCR gene segments rather than by these features, we turned to alternative embedding analyses such as cosine similarity comparisons and feature classification. Despite the lack of separation in UMAP, the PLM embeddings could differentiate repertoire specificities by cosine similarity and CD4+ from CD8+ cells with classification models. Together, this work provides a novel pipeline that facilitates AI-guided research on immune repertoires and further benchmarks how PLM embeddings can quantify T cell selection.
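
To make the cosine similarity comparison concrete, the following is a minimal sketch of how two repertoires might be compared through their embeddings. It does not reproduce the Platypus Python implementation: the function names are hypothetical, the embedding dimension is arbitrary, and the embeddings are random placeholders standing in for PLM output rather than real data.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def mean_within_similarity(emb):
    """Mean cosine similarity over all distinct sequence pairs in one repertoire."""
    sim = cosine_similarity_matrix(emb, emb)
    upper = np.triu_indices(emb.shape[0], k=1)  # skip the diagonal and duplicate pairs
    return float(sim[upper].mean())


def mean_between_similarity(emb_a, emb_b):
    """Mean cosine similarity over all cross-repertoire sequence pairs."""
    return float(cosine_similarity_matrix(emb_a, emb_b).mean())


# Placeholder embeddings (assumption): each row stands in for one TCR sequence
# embedded by a PLM. Two synthetic "repertoires" are scattered around
# different random centers; no real sequences or models are involved.
rng = np.random.default_rng(0)
dim = 64
center_a = rng.normal(size=dim)
center_b = rng.normal(size=dim)
repertoire_a = center_a + 0.3 * rng.normal(size=(50, dim))
repertoire_b = center_b + 0.3 * rng.normal(size=(40, dim))

within_a = mean_within_similarity(repertoire_a)
within_b = mean_within_similarity(repertoire_b)
between_ab = mean_between_similarity(repertoire_a, repertoire_b)

print(f"within A: {within_a:.3f}, within B: {within_b:.3f}, between: {between_ab:.3f}")
```

In this sketch, a within-repertoire mean that exceeds the between-repertoire mean would indicate that the embeddings carry repertoire-level signal, even when a two-dimensional projection such as UMAP shows no visible separation.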
