A genome epidemiological study of mycobacterium tuberculosis in subpopulations with high and low incidence rate in Guangxi, South China

Background Tuberculosis (TB) is caused by a bacterium called Mycobacterium tuberculosis (Mtb). China is the third in top 8 high TB burden countries and Guangxi is one of the high incidence areas in South China. Determine bacterial factors that affected TB incidence rate is a step toward Ending the TB epidemic. Results Genomes of M. tuberculosis cultures from a relatively high and low incidence region in Guangxi have been sequenced. 347 of 358(96.9%) were identified as M. tuberculosis. All the strains belong to Lineage 2 and Lineage 4, except for one in Lineage 1. We found that the genetic structure of the M. tuberculosis population in each county varies enormously. Low incidence rate regions have a lower prevalence of Beijing genotypes than other regions. Four isolates which harbored mutT4-48 also had mutT2-58 mutations. It is suggested that strains from the ancestors of modern Beijing lineage is circulating in Guangxi. Strains of modern Beijing lineage (OR=2.04) were more likely to acquire drug resistances than Lineage 4. Most of the lineage differentiation SNPs are related to cell wall biosynthetic pathways. Conclusions These results provided a higher resolution to better understand the history of transmission of M. tuberculosis from/to South China. And the incidence rate of tuberculosis might be affected by bacterial population structure shaped by demographic history. Our findings also support the hypothesis that Modern Beijing lineage originated in South China.

pathogenicity of M. tuberculosis strains in the hot and cold spot is different. we try to characterize the genetic background of the pathogen and figure out some potential associating factors.

Epidemiological characteristics of study participants
The people in the cold-spot mainly speak Cantonese as their first language. In the hot spots south-western Mandarin is generally adopted. Language differences reflect the distinct demographic histories of TB human hosts. Genome sequences of 358 bacterial cultures have been determined with WGS. The minimal sequencing depth of each sample is 100 × of H37Rv genome size. Eleven (11/358 [3.0%]) patient samples were identified as Nontuberculous Mycobacteria (NTM) or Mycobacterium tuberculosis contaminated with other respiratory pathogens. The remaining 357 samples were identified as M. tuberculosis strains. One from central Guangxi was identified as Lineage1.1.1.1(L1.1.1.1). 215 samples were recognized as L2 (East Asia) and 130 samples were recognized as L4 (Euro-American).

Cluster analysis by genome-wide SNPs
Maximum likelihood phylogeny was used to determine the population structure of 347 M. tuberculosis cultures. A single DR-TB case of L1.1.1.1 was detected in central Guangxi which was considered an endemic lineage in Vietnam only [2]. M. tuberculosis L2 in the Guangxi area is composed of L2.1(proto-Beijing) and L2.2(modern-and ancient-Beijing sub-lineage). Euro-American lineage consists of sub-lineages L4.2, L4.4, and L4.5. Some of the M. tuberculosis cultures appeared to be mixed infections (indicated by lineage specific SNPs, genotype heterozygosity and distance from leaf node to lineage common ancestor). At the very least, 4 Modern Beijing strains in L2.2 mixed with L4 or ancient Beijing lineage can be identified (Fig. 2). Ancestral Beijing strains in this study consist of Asia ancestral 1, 2 and 3. There are four sublineages of known classification in Modern Beijing strains, Asian African 1, 2, 3 and Pacific RD150. Only a single sample belongs to the Asian African 1 group. According to our study, over 30 strains assigned to the Modern Beijing lineage (harbored mutT2-58 mutations) cannot be subdivided into unified classifications [3]. We have developed two SNP markers to describe these strains. One SNP is 3943858 A/G which Pacific RD150 strains shared, the other one is 425871 C/G (Fig. 2). Interestingly, several strains are close to the ancient Beijing genotype and are considered as Modern Beijing lineage without mutT2-58 mutations. All these genomes share the mutation 3048912 C/G which is a common ancestral genetic marker of Modern Beijing lineage and group Bmyc26. The phylogeny shows that these M. tuberculosis strains' evolutionary position is between recognized Modern and Ancient Beijing sub-lineages. Both mutT2-58(1286766 G/C, codon 58) and ogt-12 (1477596 C/T, codon 12) mutations cannot be detected in these strains, except sample 103239 which has an ogt-12 SNV (Fig. 3). It is almost impossible to obtain perfect data in clinical cultures [4]. Cross-contamination and/or mixed infection of multiple lineages of tuberculosis can affect the topology of the tree by Pseudo-homoplasy phylogenetic signal. To help with data interpretation, we used the heterozygosity of SNPs to estimate and label the problematic leafs.

M. tuberculosis genetic differences and lineages expansion in geographic spaces
The genetic population structure of M. tuberculosis in high and low incidence rate areas in Guangxi is quite different. The most prevalent M. tuberculosis lineage in this study is L2 which contributed 62% (215/346) of TB cases. On the other hand, the proportion of L4 in the low burden region is 47.8% (68/142). Our result shows that almost half of the patients in the southeast of Guangxi infected by strains in Europe-originated lineage. This result is quite different from the high incidence rate area of Guangxi which has a L4 proportion of 30.6% (45/147) and previous surveillance in China [5]. We combined the geographic areas (administrative area with several counties) of M. tuberculosis cultures origin and phylogeny. Two monophyletic groups with geographic links were observed which suggest that clonal expansion and circulation of Beijing lineage occurred within these counties. And three clusters are endemic in hot spot or cold spot regions only (Fig. 2). There is no recent transmission event that can be observed in this study by using a cgMLST scheme consisting of 2891 core M. tuberculosis gene [6] with 12 SNPs distance cut-off [7], also see ( Fig. 2).  Red and brown round dots denote the counties in hot spot, green dots indicate cold-spot. Hot spot is a term in geology that referred to M. tuberculosis high incidence rate area and vice versa. The black dot is a county that did not identify as a hot or cold spot in our previous study

NTM and mixed infection of M. tuberculosis/NTM
Eleven samples were considered contaminated. We used ANIm and AF metrics [8] to identify these genomes. ANIm greater than 0.95 (or known as sequence similarity of 95%) and aligned fragment length (AF) over 60% of the reference genome were considered as the same bacterial species with reference. We have confirmed that 5 of 359 samples were clinically identified as M. tuberculosis but was actually NTM. And the rest of the samples were contaminated with bacteria from the respiratory tract and oral cavity Table 3 shows a list of NTM and mixed infection of M. tuberculosis/NTM.
The bacterial population structure in Yulin is different from inland China even the hot spot in central-west Guangxi. Almost half of the strains in Yulin(cold spot) belong to M. tuberculosis L4 which originated in Europe Heatmap employing a cool-to-warm (value from 0 to 1) color scheme to represent the heterozygosity of every haploid genome. Heterozygosity of problematic leaf-nodes that have branch length and topological variation is high. Outer circles of heatmap marked drug resistance detected by in silico method. Lineage-specific genetic markers labeled in the inner circles used color squares. Dots that denote the leaf-nodes are corresponding to color dots in Fig. 1. And potential local circling clusters are highlighted in yellow [9]. A large number of migrants from south-east Guangxi moved to the Southeast Asian colonies of western powers, for example as economic migrants or victims of human traffickers, since the Ming dynasty (1368 -1644 AD) this continued up to as recently as the 1950's .Some of the Chinese laborers returned to their hometown in differ-ent phases of history. These Chinese expatriates might be carriers of M. tuberculosis L4 strains of Southeast Asia.
Mainly, the Ancient Beijing lineage in Guangxi consists of three sub-lineages including the Asian Ancestral 1, 2 and 3. And the Modern Beijing lineage in Guangxi consists of Asia Africa 2, Asia Africa 3 and two sub- lineages which is unclassified in the unified schema [3]. A few strains without sub-lineage specific markers might be attached to the root of sub-trees. Sub-trees of two unnamed Modern Beijing sub-lineages can be defined using SNPs 425871 C/G and 3943858 A/G as a specific marker, respectively. Pacific RD150 is a sub-group of the clade of 425871 C/G which is observed in Yulin only and that is specific for the Pacific region [10]. We also found some strains of intermediate genotype between the Modern and Ancient Beijing genotype (Fig. 2). These findings support a hypothesis that the Modern Beijing lineage of M. tuberculosis originated in South China [11].
Beijing genotypes could be considered as 'native strains' and dominated in mainland China, hence Lineage 4 of M. tuberculosis is an 'exotic species' . Even though the transmission and pathogenesis of M. tuberculosis strains in Modern Beijing lineage are considered higher than L4 strains, the prevalence of M. tuberculosis L4 is almost the same as the Beijing genotype in the communities of returned Chinese expatriates in South China. Theoretically, L2 and L4 have the same ecological niche. M. tuberculosis strains in L4 might have stronger colonization ability in the newborn's respiratory tract and it might be transmitted from close relatives. Previous study has inferred that M. tuberculosis genes in cell wall envelop bio-genesis under strong diversifying selection and might be related to M. tuberculosis lineages specialization [12]. Our population study revealed that lineages differentiation of M. tuberculosis L2 and L4 in the molecular level are cell wall bio-synthesis related genes (Table 4). In other words, the composition of mycobacterial cell envelope or cell wall is a big biological difference between M. tuberculosis lineages. For instance, one of the SNPs we have observed in accD4 gene between L2 and L4 is 4254431 C/T. M.tuberculosis contains six ACCase (AccD1-6). Previous studies indicate that AccD4, AccD5, and AccD6 are important for cell envelope lipid biosynthesis and that its disruption leads to pathogen death. A synonymous mutation will not change the structure of the protein but would affect the amount of accD4 protein at the transcriptional level [16]. AccD4 was proposed to form Acyl-CoA carboxylases complex with AccD5 and accept propionyl-CoA as a substrate to produce methylmalonyl-CoA, which is the building block of mycocerosic acid [17].
The role of the cell wall of M. tuberculosis has been involved in pathogenesis [18]. Molecular structural diversity on the surface of bacteria may affect interaction with human epithelial cells or macrophages. Another role of M. tuberculosis cell wall is conferring to resistance to many anti-tuberculosis agents.
Some of the previous studies inferred that the Beijing genotype of M. tuberculosis is less associated with drug resistance in South China [19,20]. Meanwhile, surveillance overseas purposed that the Beijing genotype of M. tuberculosis is a risk factor for drug resistance [21]. Interestingly, our calculation suggested that strains in the Modern Beijing sub-lineage have more drug resistance mutations than Lineage4 in South China other than strains of the Ancient Beijing genotype. Conflicts of previous studies might be due to the ignorance of genetic segregation. Clonal expansion is a common phenomenon in M. tuberculosis transmission. In our preliminary surveillance, we observed several clusters circulating in counties or prefecture cities of Guangxi. From a public health aspect, although financial and technical support for counties in the hot spot and cold spot from provincial CDC are almost the same, TB burden is quite different. Our study purposed that the related high proportion of Beijing lineage(L2) in hot spot area might be one reason for the higher incidence rate.
Another clinically relevant issue is the misdiagnosis of Nontuberculous Mycobacteria (NTM) infections. About 3% [11/358] of the culture-based laboratory M. tuberculosis diagnostics is wrong. The probable causes of misdiagnosis might be mixed infection/contamination and similarity of culture-based phenotypes determination.
In recent years, WGS-based approaches are more and more applied in M. tuberculosis research, especially in surveillance projects. It has become the standard technology for M. tuberculosis molecular epidemiology and the application in clinical settings is gaining acceptance (32). One of the considerations of infectious disease surveillance is the detection of recent transmission events [22]. There is no unified clustering method for genome-wide SNPs based typing yet [22]. In this work, we did not find  any recent transmission event in this study by using a cgMLST scheme consisting of 2891 core M. tuberculosis genes and 20 SNPs cut-off [7]. The small sample size might be the reason for the failure to detect recent transmission clusters. The topic of recent transmission of M. tuberculosis in Guangxi should be investigated in further studies.
Another problem in whole-genome wide SNPs-based phylogenic analysis of M. tuberculosis is laboratory contamination and/or a mixture of M. tuberculosis lineages in clinical samples. For rapid outbreak tracking, public health laboratories often direct use materials submitted by the health care provider. These materials might be patient specimens or initial growth in Broth which would submit to the sequencing process without further isolation of bacteria strains. As a result, contaminations in the sequencing data can not be avoided. When the sample size is increased, the number of shared homogeneous SNP sites across the samples is decreased. If heterozygous SNPs are used, some conflicting phylogenetic signals would affect the phylogenetic tree, and pseudo-homoplastic clades would be introduced. These clades make data interpretation more difficult. To facilitate data interpretation, we used heterozygosity as a metric to indicate the problematic leaf-nodes in the tree, which leads to a more robust and solid conclusion. We noticed that some recent studies [23,24] mentioned that they used heterozygous sites as high-confidence variations to perform tree construction. That may be a neglected problem in research laboratories because alternate data-set is available. It should be conside-red that the structure of the phylogenetic tree could be influenced by heterozygous sites.
The rate of resistance acquisition is determined by several factors including the population size, mutation rate and mutational target size [25]. L2 has been associated with greater drug resistance compared to L4 and the other ancient lineages [25]. The influence of strain's genetic background on the rate of resistance acquisition has been much debated. Our results suggested that Modern-Beijing strains have greater antibiotic-resistance mutations than Ancient-Bejing strains and strains in L4. Theoretically, there are two possible explanations, one is the rate of resistance acquisition of Modern-Beijing lineage higher than others, another one is bacteria under the selection pressure of drugs and expanded via patientto-patient transmission. We have no determine contribution of each of these two factors. But to prevent drug-resistant TB develop during treatment, it is reasonable that a high-resolution genotype of M. tuberculosis should be considered in the clinical setting when it is available.

Conclusions
We provided a higher resolution to better understand the history of the transmission of Mycobacterium tuberculosis in South China. Genetic structures of M. tuberculosis is different in hot and cold spot. Genetic background of M. tuberculosis may influence the TB burden. Geographical pathogen population structure which is shaped by demographic history and local transmission events should be a concern in public health policymaking. Genome coordinates correspond to reference genome sequence AL123456.3. Subcellular location and CDS product annotation based-on UniProt [14] and TubercuList [15].

Case inclusion and epidemiology data collection
We conducted a retrospective study to identify the bacterial genetic factors affecting TB incidence.The sample size of this study was determined based on a previous epidemiological survey. Patients above 5-years-old and diagnosed with pulmonary tuberculosis between January to June in 2018 from hot and cold spots were down-sampled. All the data entries of selected patients with culture confirmed pulmonary tuberculosis were retrieved from the National Notifiable Disease Reported System of China by a criterion as follows: cultured positive M. tuberculosis isolates from sputum samples corresponding to patients. with the parameter "-very-sensitive". SNPs were identified using VarScan [28] v2.3.8 and SAMtools [29] v1.3.1, variations with coverage under 30× or the minimal frequency of homozygote below 0.99 were ignored. Further filter was applied to all the VCF files using bcf tools v1.9. In our filter criteria, heterozygous SNP or SNP within 5 bp of an indel was filtered. SNPs with genotype quality below 255 or below a minimum of 90% of the median coverage in each bam file were discarded [30]. Variations in x the coding sequence of proteins that contain PE or PPE motifs [31] were characterized by Genebank's annotation and known drug resistance genes were discarded. Anti-TB drug resistance SNPs were identified using TB-Profiler [32] and an additional database which was inhouse curated. Finally, 26869 SNPs remained after filters. VCF-kit [33] was used to concentrate all the remaining SNPs into pseudo-sequences for phylogeny analysis. All of these SNPs were loaded into a MongoDB database and performed queries using Perl script with MongoDB driver. A maximum-likelihood phylogeny was reconstructed using RAxML [34] v8.1.9 with a general time reversible (GTR) nucleotide substitution model and 100 bootstrap replicates. De novo assembly of reads used SPAdes [35] v3.13.0, and the binning used MyCC.py [36]. Bacterial species of binned contigs were identified by ANIm with a cut-off 0.95 using in-house scripts. All statistical analyses were performed in R-language v3.4.4 unless otherwise stated. M. tuberculosis lineages were assigned and Beijing strains sub-lineages used Coll's 62 SNPs schema [37]. In silico spoligotyping used SpoTyping-v2.0 [38]. Data visualization used the online tool iTOL [39].