Genomic epidemiology supports multiple introductions and cryptic transmission of Zika virus in Colombia

Background Colombia was the second most affected country during the American Zika virus (ZIKV) epidemic, with over 109,000 reported cases. Despite the scale of the outbreak, limited genomic sequence data were available from Colombia. We sought to sequence additional samples and use genomic epidemiology to describe ZIKV dynamics in Colombia. Methods We sequenced ZIKV genomes directly from clinical diagnostic specimens and infected Aedes aegypti samples selected to cover the temporal and geographic breadth of the Colombian outbreak. We performed phylogeographic analysis of these genomes, along with other publicly-available ZIKV genomes from the Americas, to estimate the frequency and timing of ZIKV introductions to Colombia. Results We attempted PCR amplification on 184 samples; 19 samples amplified sufficiently to perform sequencing. Of these, 8 samples yielded sequences with at least 50% coverage. Our phylogeographic reconstruction indicates two separate introductions of ZIKV to Colombia, one of which was previously unrecognized. We find that ZIKV was first introduced to Colombia in February 2015 (95%CI: Jan 2015 – Apr 2015), corresponding to 5 to 8 months of cryptic ZIKV transmission prior to confirmation in September 2015. Despite the presence of multiple introductions, we find that the majority of Colombian ZIKV diversity descends from a single introduction. We find evidence for movement of ZIKV from Colombia into bordering countries, including Peru, Ecuador, Panama, and Venezuela. Conclusions Similarly to genomic epidemiological studies of ZIKV dynamics in other countries, we find that ZIKV circulated cryptically in Colombia. More accurately dating when ZIKV was circulating refines our definition of the population at risk. Additionally, our finding that the majority of ZIKV transmission within Colombia was attributable to transmission between individuals, rather than repeated travel-related importations, indicates that improved detection and control might have succeeded in limiting the scale of the outbreak within Colombia.


Background
In recent years, countries across the Americas have experienced the emergence and endemic circulation of various mosquito-borne viruses, making this a critical area for public health surveillance and epidemiologic research. Zika virus (ZIKV) caused a particularly widespread epidemic, with over 800,000 suspected or confirmed cases reported [1]. Given estimated seroprevalence rates between 36% and 76% [2][3][4][5], the true number of ZIKV infections in the Americas is likely much higher. With neither a vaccine nor ZIKV-specific treatments available, understanding the epidemiology of ZIKV is our primary tool for controlling disease spread [6]. However, because many infections are asymptomatic [2], the analysis of surveillance data alone yields inaccurate estimates of when ZIKV arrived in a country [7,8]. In such cases, introduction timing and transmission dynamics post-introduction are better inferred from genomic epidemiological studies, which use joint analysis of viral genome sequences and epidemiologic case data. Indeed, such studies have defined our understanding of when ZIKV arrived in Brazil [7], described general patterns of spread from Brazil into other countries in the Americas [7,9,10], and been used to investigate the extent of endemic transmission occurring post-introduction [8]. Genomic epidemiological studies of the spread of ZIKV in the Americas have aided our understanding of the epidemic [7][8][9][10][11][12][13][14][15][16][17][18][19], but generally, ZIKV pathogen sequencing has remained a challenge for the public health community [20].
Colombia has a population of approximately 48 million people. In addition, Colombia has Aedes aegypti and Ae. albopictus mosquitoes, which are commonly found at elevations below 2000m above sea level [21]. Public health surveillance for arboviral diseases, along with other notifiable conditions, is performed by the Instituto Nacional de Salud de Colombia (INS) [21]. While suspected cases from other municipalities were reported earlier [22], the INS first confirmed ZIKV circulation in mid-September 2015, in the Turbaco municipality on the Caribbean coast. ZIKV spread throughout the country, appearing in areas infested with Ae. aegypti that experience endemic dengue transmission and previous circulation of chikungunya virus [23]. Over the entire epidemic Colombia reported 109,265 cases of Zika virus disease [24], making it the second most ZIKV-affected country in the Americas after Brazil. The extent of the epidemic led the INS to start active surveillance for congenital Zika syndrome [25] as well as other neurological syndromes associated with ZIKV infection [26]. While the INS determined that epidemic ZIKV transmission ended in July 2016, they continue to perform surveillance for endemic transmission.
Despite numerous reported cases, only 12 whole ZIKV genomes from Colombian clinical samples were publicly available. These sequences included 1 sample from Barranquilla, Atlántico department, 4 samples from Santander department, and 7 sequences for which departmental or municipal information was unspecified. We sequenced an additional 8 samples from ZIKV-positive human clinical and Ae. aegypti specimens, sampled from previously unrepresented Colombian departments. We describe here the first detailed phylogeographic analysis of Colombian ZIKV to estimate when, and how frequently, ZIKV was introduced into Colombia.

INS sample selection and processing
The INS National Virological Reference Laboratory collected diagnostic specimens from over 32,000 suspected ZIKV cases over the course of the epidemic. Of these, roughly 800 serum specimens were positive for ZIKV by real time RT-PCR (rRT-PCR) and had cycle threshold (Ct) values less than 30. From this set we selected 176 serum specimens that were ZIKV-positive, but negative for dengue and chikungunya viruses, as per results from the Trioplex rRT-PCR assay [27]. Specimens were selected such that each Colombian department was represented over the entire time period in which specimens were submitted. We extracted RNA using the MagNA Pure 96 system (Roche Molecular Diagnostics, Pleasanton, CA, USA) according to manufacturer's instructions. An extraction negative was used for each plate; positive controls were eschewed given the risk of cross-contaminating low titer clinical samples [20]. We attempted reverse transcription and PCR-amplification of ZIKV using the twostep multiplex PCR protocol developed by Quick and colleagues on all 176 extracted samples [20]. Briefly, cDNA was generated using random hexamer priming and the Protoscript II First Strand cDNA Synthesis Kit (New England Biolabs, Ipswich, MA, USA). We amplified cDNA using the Zika Asian V1 ZIKV-specific primer scheme [20], which amplifies 400bp long overlapping amplicons across the ZIKV genome, over 35 cycles of PCR. Amplicons were purified using 1× AMPure XP beads (Beckman Coulter, Brea, CA, USA) and quantified using with the Qubit dsDNA High Sensitivity assay on the Qubit 3.0 instrument (Life Technologies, Carlsbad, CA, USA). Due to long storage periods and variable storage temperature, the vast majority of samples were too degraded to amplify. Of the 176 processed samples, only 15 amplified sufficiently to perform sequencing.

UR sample selection and processing
Universidad del Rosario (UR) collected and performed diagnostic testing on 23 human clinical samples from different geographic regions, and 38 Ae. aegypti samples from the Cordoba department of Colombia. RNA was extracted using the RNeasy kit (Qiagen, Hilden, Germany) and a single TaqMan assay (Applied Biosystems, Foster City, CA, USA) directed to ZIKV was employed [28] to confirm ZIKV presence. Approximately 60% of samples (14 clinical samples and 23 Ae. aegypti samples) were found to be ZIKV-positive by rRT-PCR. From these, we attempted amplification on 8 samples with sufficiently high viral copy numbers that they were likely to amplify. Amplification, purification, and quantification of ZIKV amplicons from UR samples were performed as described above. Of the eight samples that we performed PCR on, four samples amplified sufficiently to conduct sequencing; three samples were from human clinical specimens and one was from an Ae. aegypti sample.

Sequencing protocol
We sequenced amplicons from 4 UR and 15 INS samples using the Oxford Nanopore MinION (Oxford Nanopore Technologies, Oxford, UK) according to the protocol described in Quick et al. [20]. Amplicons were barcoded using the Native Barcoding Kit EXP-NBD103 (Oxford Nanopore Technologies, Oxford, UK) and pooled in equimolar fashion. Sequencing libraries were prepared using the 1D Genomic DNA Sequencing kit SQK-LSK108 (Oxford Nanopore Technologies, Oxford, UK). We used AMPure XP beads (Beckman Coulter, Brea, CA, USA) for all purification steps performed as part of library preparation. Prepared libraries were sequenced on R9.4 flowcells (Oxford Nanopore Technologies, Oxford, UK) at the INS in Bogotá and at the Fred Hutchinson Cancer Research Center in Seattle.

Bioinformatic processing
Raw signal level data from the MinION were basecalled using Albacore version 2.0.2 (Oxford Nanopore Technologies, Oxford, UK) and demultiplexed using Porechop version 0.2.3_seqan2.1.1 (github.com/rrwick/Porechop). Primer binding sites were trimmed from reads using custom scripts, and trimmed reads were mapped to Zika reference strain H/PF/2013 (GenBank Accession KJ776791) using BWA v0.7.17 [29]. We used Nanopolish version 0.9.0 (github.com/jts/nanopolish) to determine single nucleotide variants from the event-level data, and used custom scripts to extract consensus genomes given the variant calls and the reference sequence. Coverage depth of at least 20x was required to call a SNP; sites with insufficient coverage were masked with N, denoting that the exact nucleotide at that site is unknown. After bioinformatic assembly, 8 samples produced sufficiently complete genomes to be informative for phylogenetic analysis.

Dataset curation
All publicly available Asian lineage ZIKV genomes and their associated metadata were downloaded from ViPR [30] and NCBI GenBank. The full download contained both published and unpublished sequences; we sought written permission from submitting authors to include sequences that had not previously been published on. Any sequences for which we did not receive approval were removed. Additionally, we excluded sequences from the analysis if any of the following conditions were met: the sequence had ambiguous base calls at half or more sites in the alignment, the sequence was from a cultured clone for which a sequence from the original isolate was available, the sequence was sampled from countries outside the Americas or Oceania, or geographical sampling information was unknown. Finally, we also excluded viruses whose estimated clock rate was more than 4 times the interquartile distance of clock rates across all sequences in the analysis. Sequences that deviate this greatly from the average evolutionary rate either have far too many or too few mutations than expected given the date they were sampled. This deviation usually occurs if the given sampling date is incorrect, or if the sequence has been affected by contamination, lab adaptation, or sequencing error. After curation, the final dataset consisted of 360 sequences; 352 publicly available ZIKV full genomes from the the Americas (including Colombia) and Oceania, and the 8 Colombian genomes from the present study.

Phylogeographic analysis
Data were cleaned and canonicalized using Nextstrain Fauna (github.com/nextstrain/fauna), a databasing tool that enforces a schema for organizing sequence data and sample metadata, thereby creating datasets compatible with the Nextstrain Augur analytic pipeline (github.com/nextstrain/augur) and the Nextstrain Auspice visualization platform (github.com/nextstrain/auspice). A full description of the Nextstrain pipelines can be found in Hadfield et al. [31]. Briefly, Nextstrain Augur performs a multiple sequence alignment with MAFFT [32], which is then trimmed to the reference sequence. A maximum likelihood phylogeny is inferred using IQ-TREE [33]. Augur then uses TreeTime [34] to estimate a molecular clock; rates inferred by Tree-Time are comparable to BEAST [34], a program that infers temporally-resolved phylogenies in a Bayesian framework [35]. Given the inferred molecular clock, TreeTime then creates a temporally-resolved phylogeny, infers sequence states at internal nodes, and estimates the geographic migration history across the tree. These data are exported as JSON files that can be interactively visualized on the web using Nextstrain Auspice.

Rarefaction analysis
We generated a series of datasets in which ZIKV genomes from either Colombia or Mexico were subsampled. For each value from 1 to n, where n is the total number of available sequences for the country of interest, we generated 30 subsampled datasets in which n sequences were randomly sampled without replacement. Each subsampled set was then combined with all ZIKV genomes from other countries included in the main phylogeographic analysis described above, and the phylogeographic inference was rerun. For the Colombian rarefaction analysis, we have n = 20 high quality whole genomes, so we inferred 20 × 30 = 600 phylogeographically labelled trees. For the Mexican rarefaction analysis, we have n = 51 high quality whole genomes in total, and inferred 51×30 = 1530 phylogeographically labelled trees. For each labelled tree, we used custom scripts to traverse the resulting phylogeny and count the number of ZIKV introductions into the country of interest. We then plotted the inferred introduction count as a function of the number of sequences sampled.

Sequencing and sampling characteristics of reported ZIKV genomes
In total, we attempted to amplify ZIKV nucleic acid from 184 samples collected by the Instituto Nacional de Salud de Colombia (INS) and Universidad del Rosario (UR). Given the low viral titers associated with most ZIKV infections, as well as long storage times, most samples did not amplify well. We attempted sequencing on 19 samples that amplified sufficiently to generate sequencing libraries (Table 1). Sequencing efforts yielded eight ZIKV sequences with at least 50% coverage across the genome with unambiguous base calls ( Table 1). Seven of these viruses came from humans; one virus came from an Ae. aegypti pool. Three sequences came from samples collected from infected individuals in Cali, department of Valle del Cauca, two sequences came from Montería, department of Córdoba, and one sequence each came from Ibagué, department of Tolima, Belén de Umbría, department of Risaralda, and Pitalito, department of Huila ( Fig. 1). Colombian viruses are sampled across the period of peak ZIKV incidence in Colombia (Fig. 2).

General patterns of ZIKV transmission in the Americas
We conducted a maximum likelihood phylogeographic analysis of 360 Asian lineage ZIKV genomes; 18 sequences are sampled from Oceania, and 342 sequences are sampled from the Americas. We estimate the ZIKV evolutionary rate to be 8.65 × 10 −4 substitutions per site per year, in agreement with other estimates of the ZIKV molecular clock [7,10,11,36] made with BEAST [35], a Bayesian method for estimating rates of evolution and inferring  Fig. 3a).

Multiple introductions of ZIKV to Colombia
We infer patterns of ZIKV introduction and spread in Colombia from the phylogenetic placement of 20 Colombian sequences. Previously, 12 of these sequences were publicly available. We add an additional 8 Colombian sequences sampled over a broad geographic and temporal range. Colombian ZIKV sequences clustered into two distinct clades (Fig. 3b). Both clades are descended from viruses inferred to be from Brazil (Fig. 3a). Viruses immediately ancestral to both Colombian clades are estimated by the phylogeographic model to have 100% model support for a Brazilian origin. However, lack of genomic sampling from many ZIKV-affected countries in the Americas may limit our ability to infer direct introduction from Brazil or transmission through unsampled countries prior to arrival in Colombia.
Clade 1 is comprised of 28 viruses, 17 of which are from Colombia, and is characterized by nucleotide mutations T738C, C858T, G864T, C3442T, A3894G, C5991T, C9279T, and A10147G. This clade contains all previously reported Colombian genomes, as well as five of the genomes generated during this study (Figs. 3 and 4). The phylogeographic model places the root of this clade in Colombia with 99% model support. The most parsimonious reading is that this clade of viruses resulted from a single introduction event from Brazil into Colombia.
Clade 2 contains 5 viruses, 3 of which are from Colombia (Figs. 3 and 5). All three were sequenced during this study, and thus this clade was not previously recognized in Colombia. This clade is characterized by mutations T1858C, A3780G, G4971T, C5532T, G5751A, A6873G, T8553C, and C10098T. The phylogeographic model places the root of this clade in Colombia with 98% model support and also suggests a Brazil to Colombia transmission route that is likely, but not certainly, direct. Two additional genomes with less that 50% genomic coverage also group within these clades; COL/FH14/2016 within clade 1 and COL/FH15/2016 within clade 2 (data not shown).
To examine how our estimate of two introductions to Colombia might be affected by the number of sequences available, we conducted rarefaction analyses. Under the assumption that sequenced viruses are sampled at random, these curves show how much genomic sampling is required to fully sample the circulating viral diversity. Figure 6 shows the number of introductions to Colombia or Mexico inferred under the phylogeographic model as a function of the number of sequences sampled from that country. For ZIKV in Mexico, we see that the rarefaction curve begins to flatten once roughly 25 to 30 Mexican viruses have been sampled (Fig. 6). In contrast, for Colombia we do not observe further ZIKV introductions once we have sampled around 4 genomes (Fig. 6). Thus, while there are only 20 ZIKV whole genomes available from Colombia, we think it is unlikely that we would observe more ZIKV introductions given more sequence data.

Transmission within Colombia
We estimate that clade 1 was introduced to Colombia around late February of 2015 (95% CI: January 2015 to April 2015) (Figs. 3b and 4), and that clade 2 was introduced in mid September of 2015 (95% CI: July 2015 to November 2015) (Figs. 3b and 5). Our estimate of clade 1 introduction timing supports between five and eight months of cryptic ZIKV transmission within Colombia prior to initial case detection in September 2015, a finding that is consistent with other genomic epidemiological studies of ZIKV [7,8,10].
Genomes were sampled from 7 of 32 departments within Colombia (Fig. 7a). Four genomes were sampled from Santander, three viruses were sampled from Valle del Cauca, and two viruses came from Córdoba. One virus each was sampled from Risaralda, Tolima, Huila, and Atlántico departments respectively (Fig. 7b and c). Seven of the publicly-available Colombian ZIKV genomes lacked department-level information (Fig. 7b). Given that many ZIKV-affected departments within Colombia have only minimal genomic sampling, or lack it all together, we have refrained from using phylogeographic methods to reconstruct the direction of transmission between Colombian

Transmission from Colombia to other countries
We find evidence for onward transmission from Colombia into other countries in the Americas. Clade 1 shows movement of viruses into countries that share a border with Colombia, namely Panama, Venezuela, and Peru, as well as into the Dominican Republic and Martinique (Fig. 4). Clade 2 indicates movement of Colombian ZIKV into neighboring Ecuador (Fig. 5). Transmission from Colombia into bordering countries seems reasonable, and these patterns agree with previously documented trends of ZIKV expansion in the Americas [7,9,10], but provide more detail due to the greater amounts of sequence data now available. For instance, analysis by Metsky et al. [10] also supports movement of ZIKV from Colombia to Martinique and the Dominican Republic. However, without sequence data from Panama, Peru, or Venezuela, they were unable to capture spread from Colombia into these countries.

Discussion
Despite the scale of the Colombian epidemic, publicly available sequence data were limited, and no detailed genomic epidemiological analysis of ZIKV dynamics had been performed. We sought to improve genomic sampling for Colombia, and to perform a detailed genomic analysis of the Colombian epidemic. Only 12 Colombian genomes were available prior to this study. To these data we added 8 new sequences sampled broadly across Colombia, and  performed a phylogeographic analysis of American ZIKV. We describe general transmission patterns across the Americas and present estimates of ZIKV introduction timing and frequency specific to Colombia. We find evidence of at least two introductions of ZIKV to Colombia, yet remarkably the majority of Colombian viruses cluster within a single clade, indicating that a single introduction event caused the majority of ZIKV cases in Colombia. Under the assumption that viruses sequenced from Colombia are random samples of Colombian ZIKV cases, we find that Colombian ZIKV diversity is well represented by 20 Colombian genomes. It is therefore unlikely that further genomic sampling would reveal more introductions of ZIKV into Colombia. ZIKV dispersal out of Colombia appears widespread, with movement to bordering countries (Panama, Venezuela, Ecuador, and Peru) as well as more distal countries in the Caribbean.
While it may be tempting to read the inferred phylogeographic migration history as a complete record of transmission between countries, we caution against doing this for analyses of ZIKV. In contrast to other large outbreaks, such as the Ebola epidemic in West Africa, genomic sampling of the American ZIKV epidemic is sparse. Many ZIKV-affected countries have minimal genomic sampling; others have none at all. Thus while the phylogeographic model will correctly infer the geographic location of internal nodes given the dataset at hand, adding sequences from previously unsampled countries may alter migration histories such that apparent direct transmission from country A to country C instead becomes transmission from country A to country B to country C.
Consistent with other studies, our estimates of when introductions to Colombia occurred support cryptic ZIKV transmission prior to initial case confirmation. Perhaps more surprisingly, our estimate of the age of clade 1 indicates that ZIKV likely spread to Colombia even before official confirmation of ZIKV circulation in Brazil [37]. These findings underscore the utility of genomic epidemiology to date introduction events and describe transmission patterns that are difficult to detect using traditional surveillance methods, thereby providing more accurate definitions of the population at risk and a better understanding of how importation and within-country transmission shape epidemics.

Conclusions
We found evidence for two separate introductions of ZIKV to Colombia, one of which occurred five to eight months prior to the official confirmation of ZIKV in Colombia. Refining our estimates of when ZIKV circulated in Colombia improves our definition of when individuals were at risk for ZIKV infection. Accurately defining this exposure period increases the ability of population-level observational studies to properly measure associations between ZIKV infection and outcomes of interest. In addition, we found that the majority of ZIKV diversity in Colombia descends from one of these two introductions, and rarefaction analyses suggest that we would not identify more introductions with greater genomic sampling. Taken together, these findings suggest that most cases of ZIKV infection were attributable to ZIKV transmission within Colombia after a single introduction event, and that cases of ZIKV infection acquired in other countries and brought back to Colombia were rare. As the majority of Colombian ZIKV infections were locally-acquired, infection prevention and control measures targeting local spread might have limited the scale of the outbreak within Colombia.

Availability of data and materials
All priming schemes, kit specifications, laboratory protocols, bioinformatic pipelines, and sequence validation information are openly available at github.com/blab/zika-seq. Nextstrain analytic and visualization code is available at github.com/nextstrain. Code and builds specific to the analysis presented here, as well as all genomes sequenced for this study, are openly available at github.com/blab/zika-colombia. Consensus sequences are also available on NCBI GenBank (accessions MK049245 through MK049252). The phylogeny and all inferences reported here can be interactively explored at nextstrain.org/community/blab/zika-colombia.

Ethics approval and consent to participate
In Colombia, most ZIKV diagnostic testing occurred at the INS. However, Colombian academics were also involved in ZIKV surveillance, sampling Ae. aegypti and limited human cases. Our study includes samples collected by both the INS and the Universidad del Rosario (UR). Sample collection by UR was approved by the Universidad del Rosario Ethics Committee, approval number 617-2017. Written consent was obtained for human specimen collection. Sample collection performed by the INS was approved by the Comité de Ética y Metodologías de la Investigación (CEMIN) -Instituto Nacional de Salud, approval number CEMIN 39-2017. Consent for specimen collection was waived as specimens were collected as part of the Colombian public health response. All sample selection for inclusion in the genomic analysis, sequencing library preparation, and data analysis were performed on anonymized samples and data. Administrative permission to access data collected by the INS was required. This permission was granted in Collaboration Agreement Fred-Hutch-Ref-#COL171204.