Establishing reference collection
The first step in this process was to establish a reference collection of genomes which, in their entirety, capture the diversity within each individual species. This process entailed gathering genomes for each of our 17 focal species, identifying the representative genomes from each species, and subsequently testing these representatives to ensure that they capture the diversity within their respective species.
The following species of interest were selected due to their reported ability to cause bacterial meningitis: Neisseria meningitidis, Haemophilus influenzae, Streptococcus pneumoniae, Listeria monocytogenes, and Escherichia coli. In addition to these five species, we chose to include several additional species of interest from the genera Neisseria and Haemophilus. These additional species were: Neisseria cinerea, Neisseria elongata, Neisseria lactamica, Neisseria gonorrhoeae, Neisseria subflava, Neisseria mucosa, Neisseria weaveri, Neisseria polysaccharea, Haemophilus haemolyticus, Haemophilus parainfluenzae, Haemophilus parahaemolyticus, and “Neisseria bergeri.” Neisseria bergeri has historically been characterized as a variant of Neisseria polysaccharea, but has since been suggested to be reclassified as a novel species [11]. We followed the proposed reclassification of Neisseria species by Maiden et al. [12], classifying Neisseria flavescens as subspecies of Neisseria subflava, and Neisseria sicca and Neisseria macacae as subspecies of Neisseria mucosa. Notably, the methods described here for generating the reference collection for BMScan are not unique for our chosen focal species, and can be applied to any other bacterial species of interest to the user.
The genomes for each of these species were obtained from three main sources: 1) the Bacterial Meningitis Laboratory (BML) isolate collection, 2) NCBI’s RefSeq [13], and 3) the Minnesota Department of Health Public Health Laboratory (MDH) isolate collection. All of the isolates from the BML collection were tested for their respective species through a combination of biochemical and molecular testing, including the API NH strip system [14] and PCR for species-specific genes [15]. The isolates from the MDH collection were tested for their respective species through a series of biochemical tests, such as rapid sugars, slide agglutination and other classical microbiological methods [16].
All isolates from CDC BML and MDH were sequenced using either Illumina (n = 1782) or Pacific Biosciences (PacBio) technologies (n = 82) as described previously [17]. These Illumina reads were assembled using SPAdes version 3 [18], and the PacBio reads were assembled using PacBio’s Hierarchical Genome Assembly Process version 3 (HGAP) [19]. The collection was supplemented with additional assemblies from NCBI’s Refseq (n = 820) and then processed through the dRep pipeline [20]. The dRep software consists of a set of command-line tools for clustering a given set of genomes and identifying high-quality representative genomes for each cluster. This pipeline analyzes the entire set of genomes, performs a rapid, primary-clustering with Mash using a threshold of 0.90, followed by a slower, secondary-clustering within each primary-cluster with ANI using a threshold of 0.995. After clustering is complete, dRep identifies the representative genome from within each cluster. These representative genomes.
The representative genome is selected through a scoring process, whereby each genome is scored according to a formula which factors several components of genome quality, such as: 1) genome completeness, 2) N50, 3) contamination, 4) genome size and 5) strain heterogeneity. These quality metrics were determined using the CheckM [21] module within dRep. The highest scoring genome for each cluster was selected as the representative of the respective cluster, and these representatives were then compiled to create our reference collection. Each representative genome is either a divergent strain that was selected from a cluster of one by being less than 99.5% similar to any other genome in the collection, or is the best quality genome from a cluster of highly similar genomes which are at least 99.5% similar to other genomes within that cluster. Ultimately, this workflow ensures that each genome in the reference collection both captures the diversity of its respective cluster and is of high quality.
Obtaining threshold values for each species
After finalizing the reference collection, threshold values were established for each of the 17 focal species through all-vs-all pairwise comparisons using both ANI and Mash. The results of these comparisons were parsed, and the Mash distances were converted into Mash scores (1-Mash distance) to be easily comparable with ANI, such that a lower Mash score indicates a greater distance between two genomes. The smallest score between members of the same species was recorded as that given species’ threshold value. The threshold values for both ANI and Mash serve as indicators of confidence for each method that an unknown genome belongs to that given species, as these thresholds mark the greatest distance between two members of the same species within the reference collection.
The final threshold values for each species were stored in a SQL database. In addition to these values, this database also contains meta-data for each genome, such as its source location, file name, ID, genus and species. The database SQL schema is provided as supplement (Additional file 1).
The results of each all-vs-all Mash and ANI comparison were tested for linear correlation using Pearson’s correlation coefficient test. For this analysis, we only included pairwise comparison values of 0.90 or above, as we were primarily interested in the correlation between Mash and ANI for intra-species comparisons and for identifying inter-species boundaries. To assess the agreement between ANI and Mash values above 0.90, we created a Bland Altman plot [22].
Species delineation with Mash
In combination with the species-specific threshold values, we sought to represent Mash’s ability to delineate between bacterial species within the same genus. For this analysis, we ran Mash on the Neisseria and Haemophilus genomes within the reference collection, and used those results to generate neighbor-joining [23] trees. These trees were generated by formatting the Mash results as distance matrices, parsing each matrix and creating the tree with a custom python script built with the BioPython [24] and Phylo [25] packages. The final trees were visualized using the iTOL [26] package.
Development of BMScan
BMScan was developed to serve as a comprehensive program for bacterial meningitis species identification. The tool was written using the Python programming language (https://www.python.org/). The program is a pipeline with Mash at its core, which coordinates with a custom-built SQL threshold database for rapid retrieval of the results. The code for BMScan can be found here https://bitbucket.org/ntopaz/bmscan.
Assessing performance of BMScan
In order to check whether these threshold values confidently capture the diversity within each species, we established a validation set. This process consisted of downloading all available genomes for each species of interest from both RefSeq and Genbank [27], as well as the genomes for additional closely-related and sister species of those included in the reference collection (N = 15,503). These additional species include Streptococcus mitis, Escherichia albertii, Escherichia fergusonii, and Listeria innocua. Several Neisseria species had very few (less than or equal to 10) available genomes on NCBI, so these species were supplemented with genomes from the Neisseria isolate collection on PubMLST [28] (N = 208). Each of the genomes in the set was scanned using BMScan, and the results were analyzed to assess the performance of the tool. As NCBI’s RefSeq is one of the main sources of genomes for the reference collection, any pairwise comparisons between identical genomes were filtered out.