Large-scale protein structure prediction methods for enhanced annotation
Summary
The advent of next-generation sequencing led to a boom in the number of available protein sequences, opening a gap between sequence data and structural and functional data. In recent years, deep learning algorithms such as AlphaFold2 have predicted protein structures from sequence with an accuracy approaching that of experimentally determined structures, bridging the gap with structural data. Protein structure alignment tools have also become much faster with Foldseek, enabling large-scale structural comparisons. Because protein structure is more conserved than sequence, these comparisons open the door to structure-based detection of distant homologs. The efficiency of protein structure predictors permits the generation of structures on a large scale; once homologs have been detected from these structures, annotations can be transferred to uncharacterized proteins by inference. This methodology has been applied at scales up to the whole of UniProtKB, yielding promising evolutionary insights and perspectives. The fruitful combination of different methods and large datasets highlights the potential of protein structure-based tools and offers a new approach to the computational study of evolution.