Reconstructing NCBI SRA plasmids for global analysis of antimicrobial resistance genes
Summary
Antimicrobial resistance (AMR) is a critical global health threat, driven by bacteria evolving to withstand existing treatments. Addressing AMR requires understanding how antimicrobial resistance genes (ARGs) spread, which can be studied using the extensive microbial sequencing data that is publicly available. However, metadata inconsistencies in public databases such as the NCBI Sequence Read Archive (SRA) hinder this process: because metadata quality is variable, the data itself is difficult to find and reuse. We developed SRA-Data-Collector, a Snakemake pipeline that streamlines the search and retrieval of NCBI SRA data using taxon IDs, run accessions, and study accessions. The tool simplifies finding samples; collects, cleans, and formats their metadata; and supports parallel sample downloads. We also adapted a reference-free plasmid reconstruction pipeline for use with the collected Illumina short-read data. To demonstrate the tool’s utility, we used it to download samples of Shigella flexneri and Enterobacter cloacae, reconstructed plasmids with the plasmid reconstruction pipeline, and clustered the plasmids with mge-cluster. Preliminary analyses, in which we annotated the clusters with metadata from SRA-Data-Collector, showed that plasmid clusters were not influenced by geolocation, source, or collection year. While we have demonstrated how SRA-Data-Collector and automated plasmid reconstruction enable bulk analysis of public sequencing data, it is important to stress that thorough curation and analysis of the resulting clusters is needed, and that annotating clusters with metadata alone is not sufficient. To conclude, our integrated approach using SRA-Data-Collector and associated tools facilitates ARG analysis with public data, supporting global-scale AMR research.