Building a Canine SNP Reference Database: Data Preprocessing and Quality Control Procedures
Summary
Canine genetics plays a pivotal role in unraveling the genetic basis of inherited diseases in both dogs and humans. Single-nucleotide polymorphisms (SNPs) are a robust tool for genetic research, and with the development of commercial SNP arrays and increasing research activity, the amount of available canine SNP data has grown tremendously. Combining and (re)using these data increases sample sizes and, consequently, the power of genetic studies. However, the utility of multi-source SNP data depends on effective data harmonization and quality control (QC) procedures. These include the removal of poor-quality samples based on low sample call rates and excessive heterozygosity, and the detection of duplicates, relationships between dogs, phenotyping errors, and potential sample swaps by checking each dog's identity based on sex, breed, and kinship. In total, data from approximately 19,000 dogs, drawn from five different platforms and whole-genome sequencing datasets, were analyzed and merged. Recognizing the limitations of directly applying QC thresholds from human research to canine SNP data, this project explores the data preprocessing and QC steps essential for ensuring high-quality and accurately phenotyped canine SNP data, with the aim of establishing a SNP reference database to advance genetic research in dogs.
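The two sample-level filters named above (sample call rate and excessive heterozygosity) can be illustrated with a minimal Python sketch. It is not the pipeline used in this project. The genotype coding (0, 1, 2 as alternate-allele counts, `None` for a missing call), the function names, and both thresholds are assumptions made for illustration only. The thresholds in particular are placeholders, since suitable values for canine data are precisely what the project sets out to explore.

```python
from statistics import mean, pstdev

MISSING = None  # assumed encoding for a missing genotype call


def sample_call_rate(genotypes):
    """Fraction of SNPs with a non-missing call for one sample."""
    if not genotypes:
        return 0.0
    called = [g for g in genotypes if g is not MISSING]
    return len(called) / len(genotypes)


def heterozygosity_rate(genotypes):
    """Fraction of called genotypes that are heterozygous (coded 1)."""
    called = [g for g in genotypes if g is not MISSING]
    if not called:
        return 0.0
    return sum(1 for g in called if g == 1) / len(called)


def flag_poor_samples(samples, min_call_rate=0.90, max_het_sd=3.0):
    """Return {sample_id: [reasons]} for samples failing either check.

    Both thresholds are illustrative placeholders, not recommended values.
    A sample is flagged for excess heterozygosity when its rate lies more
    than `max_het_sd` population standard deviations above the cohort mean.
    """
    het = {sid: heterozygosity_rate(g) for sid, g in samples.items()}
    mu = mean(het.values())
    sd = pstdev(het.values())

    flagged = {}
    for sid, genotypes in samples.items():
        reasons = []
        if sample_call_rate(genotypes) < min_call_rate:
            reasons.append("low_call_rate")
        if sd > 0 and het[sid] > mu + max_het_sd * sd:
            reasons.append("excess_heterozygosity")
        if reasons:
            flagged[sid] = reasons
    return flagged
```

Because dog breeds differ widely in their levels of inbreeding, a single cohort-wide heterozygosity cutoff such as the one sketched here may not transfer well across breeds. Computing the mean and standard deviation within each breed would be one possible variation.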