PlasmidEC: An ensemble of classifiers that improves plasmidome recall from short-read sequencing data in Escherichia coli
Summary
Over the past decades, pathogenic lineages of Escherichia coli have rapidly acquired antibiotic
resistance. Currently, multidrug-resistant E. coli is the most frequent cause of lethal infections
among resistant bacteria in a hospital setting. Antibiotic resistance genes (ARGs) are commonly
spread via plasmids. From a clinical and epidemiological standpoint, it is therefore important to analyse the
plasmid content in E. coli. The rise of Illumina whole-genome sequencing (WGS) has enabled fast
large-scale analysis of the genomic content of bacteria. However, it is usually not possible to
reconstruct plasmids by genome assembly of short-read sequencing data. Therefore, several
bioinformatic tools have been developed to uncover the total plasmid content in a sample, also
referred to as the plasmidome, by classifying genomic sequences as either chromosome- or plasmid-
derived. We benchmarked four of these binary classifiers (mlplasmids, PlaScope, Platon and
RFPlasmid). They form the basis of plasmidEC, an ensemble classifier that combines the output of
three plasmid classifiers using a majority voting system. The combination of
Platon, PlaScope and RFPlasmid gave the best plasmidome predictions (F1-score = 0.904).
Compared to individual classifiers, plasmidEC achieved increased recall (0.885), especially for contigs
derived from ARG-plasmids (recall = 0.941). Moreover, we used plasmidEC to study the plasmidome of
E. coli ST131 and to identify differences between this lineage and other E. coli. Finally, we show
that plasmidEC removes chromosomal contamination from plasmid reconstructions obtained by MOB-
suite.
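
The majority voting and the reported metrics can be illustrated with a minimal sketch. This is not the plasmidEC implementation: the contig names, votes and ground-truth labels below are hypothetical, and the function names majority_vote and recall_precision_f1 are assumptions introduced only for illustration. Each contig receives one label per classifier, the ensemble keeps the label given by at least two of the three, and recall, precision and F1-score are then computed with plasmid as the positive class.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by most classifiers for one contig."""
    if len(labels) % 2 == 0:
        # With two classes, an odd number of voters rules out ties.
        raise ValueError("use an odd number of classifiers to avoid ties")
    return Counter(labels).most_common(1)[0][0]


def recall_precision_f1(truth, predicted, positive="plasmid"):
    """Compute recall, precision and F1-score for the positive class."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


# Hypothetical votes from three binary classifiers (illustrative data only).
votes = {
    "contig_1": ["plasmid", "plasmid", "chromosome"],
    "contig_2": ["chromosome", "chromosome", "plasmid"],
    "contig_3": ["plasmid", "plasmid", "plasmid"],
    "contig_4": ["chromosome", "plasmid", "chromosome"],
    "contig_5": ["plasmid", "chromosome", "plasmid"],
}

# Hypothetical ground truth, e.g. from complete reference genomes.
truth = {
    "contig_1": "plasmid",
    "contig_2": "chromosome",
    "contig_3": "plasmid",
    "contig_4": "plasmid",
    "contig_5": "chromosome",
}

ensemble = {contig: majority_vote(v) for contig, v in votes.items()}
contigs = sorted(votes)
recall, precision, f1 = recall_precision_f1(
    [truth[c] for c in contigs],
    [ensemble[c] for c in contigs],
)
print(ensemble)
print(f"recall={recall:.3f} precision={precision:.3f} F1={f1:.3f}")
```

In this toy example a contig is called plasmid-derived whenever two of the three classifiers agree, so a single classifier that misses a plasmid contig does not by itself cause a false negative. This is one plausible reason why an ensemble might improve recall over its individual members.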