A systematic review of reported reassortant viral lineages of influenza A

Background Most previous evolutionary studies of influenza A have focussed on genetic drift, or on reassortment of specific gene segments, hosts or subtypes. We conducted a systematic literature review to identify reported reassortant influenza A lineages with genomic data available in GenBank, obtaining 646 unique first-report isolates out of a possible 20,781 open-access genomes. Results After adjusting for correlations, only swine as host, China, Europe, Japan and the years between 1997 and 2002 remained as significant risk factors for the reporting of reassortant viral lineages. For swine H1, more reassortants were observed in the North American H1 clade than in the Eurasian avian-like H1N1 clade. Conversely, for avian H5 isolates, a higher number of reported reassortants was observed in the European H5N2/H3N2 clade than in the North American H5N2 clade. Conclusions Despite unavoidable biases (publication, database choice and upload propensity), these results synthesize a large majority of the current literature on novel reported influenza A reassortants and are a potentially useful prerequisite to inform further algorithmic studies. Electronic supplementary material The online version of this article (doi:10.1186/s12879-015-1298-9) contains supplementary material, which is available to authorized users.

availability of whole-genome sequencing (4) has permitted a rapid expansion (Figure S1) of high-quality descriptive studies that rely on genomic data. Pathogen-dynamic studies of reassortment have previously focussed on specific influenza subtypes (5)(6)(7), hosts (4,8,9) or evolutionary events (3,10,11). More recently, work has estimated rates of reassortment within a particular viral lineage (12) and identified high-risk areas in which reassortment may occur (13). However, broader descriptions of patterns of observed reassortment remain lacking. We examined the extent to which theories of geographical (1) and host (14,15) drivers of pandemic emergence are reflected in the frequency with which novel reassortants have been identified.

Search strategy and study selection
We conducted electronic searches in PubMed (MEDLINE) and Web of Knowledge (all databases) to identify relevant articles. There was no restriction regarding language, publication period or study design.

Search terms:
-PubMed and Web of Knowledge: "influenza AND reassortment" or "influenza AND reassortant"

All titles and abstracts were scanned; if the title was not rejected, the full text of the article was obtained and carefully reviewed for inclusion by AP. Inclusion or exclusion of a study was evaluated against the inclusion/exclusion criteria, with specific attention to clear phylogenetic evidence for the occurrence of reassortment. If the inclusion of an article was uncertain, SR was consulted.

Inclusion criteria
-First report of a novel reassortant viral lineage of influenza A. This could be any subtype and any host.
-Presentation of clear phylogenetic evidence and indication that more than 2 sequences had been deposited in GenBank. The presence of all 8 genes in GenBank was later checked.

Exclusion criteria
-Sequences not deposited in GenBank, i.e. published in GISAID or not at all.
-At least 2 genes sequenced shown in the article.
-Repeat isolate from a previous paper, e.g. novel 2013 H7N9 or H1N1 2009 pandemic isolates.
-No explicit identification of isolates that had undergone reassortment.
-Laboratory studies of reassortant isolates, i.e. non-natural strains or ones that were being developed as vaccine strains.
-Articles which highlighted the occurrence of co-infection or interspecies transmission of virus that did not result in reassortment.
-Articles that focused on the algorithmic classification of viruses as reassortant.

Data extraction
Data extraction was performed by AP. Data were recorded in an Excel file. All whole genomes identified are in Database S1.

Selection of data for the meta-analysis
-We defined reassortment as the exchange of at least one whole gene segment between two different viral lineages, such that at least one gene from the viral isolate was located discordantly to the other 7 genes on a phylogenetic tree. This criterion ensured clear detection of reassortant strains.
-After a strain de-duplication algorithm was applied to the isolates for which data on all genes were available, the resulting set of isolates was used in the meta-analysis.
-We downloaded all the available whole genomes from GenBank and used these data as the denominator for the generalized additive model.
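The definition of reassortment above can be illustrated with a minimal sketch. It assumes that each of the 8 segments of an isolate has already been given a lineage label by inspecting its phylogenetic tree; the segment names, the lineage labels and the function names are hypothetical and are not part of the review, in which classification relied on the trees presented in each article rather than on an algorithm.

```python
from collections import Counter

# Conventional names for the 8 influenza A gene segments (labels are illustrative).
SEGMENTS = ("PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS")


def discordant_segments(lineage_by_segment):
    """Return the segments whose lineage differs from the most common lineage.

    lineage_by_segment maps each segment name to a lineage label that was
    assigned beforehand (an assumption of this sketch).
    """
    missing = [s for s in SEGMENTS if s not in lineage_by_segment]
    if missing:
        raise ValueError(f"whole genome required, missing segments: {missing}")
    counts = Counter(lineage_by_segment[s] for s in SEGMENTS)
    majority_lineage, _ = counts.most_common(1)[0]
    return [s for s in SEGMENTS if lineage_by_segment[s] != majority_lineage]


def is_reassortant(lineage_by_segment):
    """An isolate is flagged when at least one segment is discordant."""
    return len(discordant_segments(lineage_by_segment)) >= 1
```

Requiring all 8 segments mirrors the restriction of the analysis to whole genomes, so that a discordant segment cannot be missed simply because it was never sequenced.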

Selection of data for the comparative trees and hamming distance distributions
-The denominator gathered from GenBank was used to draw random samples of matching size for each host/subtype combination. Data were subset according to host and subtype.
-From the subset for a specific host/subtype combination, the same number of isolates was drawn at random as there were for that host/subtype combination in the final reassorted data set. Ten random samples were drawn for each host/subtype combination.
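The sampling scheme and the distance measure can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the data layout (dictionaries keyed by host and subtype), the function names and the fixed seed are hypothetical, and the Hamming distance is computed on sequences assumed to be aligned to equal length.

```python
import random


def hamming(seq_a, seq_b):
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b))


def matched_random_samples(denominator, reassortant_counts, n_draws=10, seed=0):
    """Draw n_draws random samples per host/subtype combination.

    denominator: dict mapping (host, subtype) -> list of isolate identifiers
    reassortant_counts: dict mapping (host, subtype) -> number of reassortant
        isolates for that combination; each sample has this size.
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    samples = {}
    for key, size in reassortant_counts.items():
        pool = denominator.get(key, [])
        if size > len(pool):
            raise ValueError(f"not enough genomes for {key}")
        # Sampling is without replacement within each draw.
        samples[key] = [rng.sample(pool, size) for _ in range(n_draws)]
    return samples
```

Matching the sample size to the number of reassortants in each host/subtype combination keeps the comparison trees and distance distributions on the same footing.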

Risk of bias
Genetic similarity between isolates for which whole-genome data were available was assessed. A filtering algorithm was applied to remove duplicate isolates based on gene-specific sequence homology. If isolates were very genetically similar, we checked whether they were reported in the same article as the same reassortant. If they were, the more recent isolate was removed; if they were reported in different articles, an alternative route was taken. See Figure 1 for further detail.
We used the same criteria to classify reassortant viruses across all papers examined, ensuring consistency. By subsequently restricting the analysis to whole genomes, we reduced the chance that reassortment had occurred on other genes but gone unnoticed by the author. Neither the quality of the sequences themselves nor the robustness of the trees on which the reassortant report was made was assessed, as we were interested in capturing as many reports as possible.
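A minimal sketch of such a filter is given here. Several details are assumptions made for illustration: the 0.99 identity threshold, the record fields (`id`, `article`, `year`, `genes`), and the use of a shared article as a proxy for "reported as the same reassortant". The route taken for near-duplicates from different articles is described only in Figure 1, so the sketch merely flags those pairs for manual review.

```python
from itertools import combinations


def identity(seq_a, seq_b):
    """Fraction of matching positions between two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def near_duplicates(iso_a, iso_b, threshold=0.99):
    """True when every gene reaches the (hypothetical) identity threshold."""
    return all(
        identity(iso_a["genes"][g], iso_b["genes"][g]) >= threshold
        for g in iso_a["genes"]
    )


def filter_duplicates(isolates, threshold=0.99):
    """Return (kept isolates, pairs flagged for manual review)."""
    removed, flagged = set(), []
    for a, b in combinations(isolates, 2):
        if a["id"] in removed or b["id"] in removed:
            continue
        if not near_duplicates(a, b, threshold):
            continue
        if a["article"] == b["article"]:
            # Same article: drop the more recent isolate (b on a tie).
            later = a if a["year"] > b["year"] else b
            removed.add(later["id"])
        else:
            # Different articles: left to manual review in this sketch.
            flagged.append((a["id"], b["id"]))
    kept = [iso for iso in isolates if iso["id"] not in removed]
    return kept, flagged
```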

Statistical model for the probability of reassortment
Regression analysis was performed for each of the 3 covariates available, with all the full genomes available in GenBank for all years, host types and regions as the denominator data.
Univariate logistic regression analysis was performed in R (16), and the odds ratio (OR) for the identification of a reassortant isolate was calculated for each covariate.
The effect of a covariate was considered significant when the p-value was <0.05 or its 95% CI did not overlap with the original one.
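For a single binary covariate, the odds ratio from a univariate logistic regression equals the cross-product ratio of the corresponding 2x2 table, and a Wald 95% CI follows from the standard error of the log odds ratio. The sketch below uses that equivalence as a stand-in for the R model fit; the function name and the counts in the usage note are hypothetical and are not data from the review.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: reassortant genomes with the covariate (e.g. swine host)
    b: non-reassortant genomes with the covariate
    c: reassortant genomes without the covariate
    d: non-reassortant genomes without the covariate
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

As a usage example, `odds_ratio_ci(20, 80, 10, 90)` gives an odds ratio of 2.25 with an interval around it; an interval that excludes 1 is the usual indication of a significant association.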
To examine the relationship between host, geographic region and year of isolation, and the probability that a given publicly available genome was a first-report isolate (FRI), we used a multivariate generalized additive model. To compare between models we used the Akaike Information Criterion (AIC), which penalizes the maximized log-likelihood of a model by its number of parameters (AIC = 2k - 2 ln L, where k is the number of parameters and L the maximized likelihood; lower values indicate a better trade-off between fit and complexity). The addition of each of the covariates significantly improved the AIC score. We fitted a smoothing spline to year as a covariate. This model was developed using the mgcv package in R (16,17).
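The AIC comparison between candidate models can be sketched as below. The log-likelihoods and parameter counts in the usage note are hypothetical placeholders, not values from the fitted models; for an additive model with smooth terms, the parameter count would typically be an effective degrees of freedom rather than an integer.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_model(models):
    """Return the name of the model with the lowest AIC.

    models: dict mapping a model name to (log_likelihood, n_params).
    """
    scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
    return min(scores, key=scores.get)
```

For example, with placeholder fits `{"host": (-520.0, 4), "host + region": (-495.0, 9)}`, the second model has the lower AIC (1008 versus 1048), so the added covariate would be retained.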