Comparing tuberculosis gene signatures in malnourished individuals using the TBSignatureProfiler

Background: Gene expression signatures have been used as biomarkers of tuberculosis (TB) risk and outcomes. Platforms are needed to simplify access to these signatures and determine their validity in the setting of comorbidities. We developed a computational profiling platform of TB signature gene sets and characterized the diagnostic ability of existing signature gene sets to differentiate active TB from LTBI in the setting of malnutrition.
Methods: We curated 45 existing TB-related signature gene sets and developed our TBSignatureProfiler software toolkit that estimates gene set activity using multiple enrichment methods and allows visualization of single- and multi-pathway results. The TBSignatureProfiler software is available through Bioconductor and on GitHub. For evaluation in malnutrition, we used whole blood gene expression profiling from 23 severely malnourished Indian individuals with TB and 15 severely malnourished household contacts with latent TB infection (LTBI). Severe malnutrition was defined as body mass index (BMI) < 16 kg/m2 in adults and based on weight-for-height Z scores in children < 18 years. Gene expression was measured using RNA-sequencing.
Results: The comparison and visualization functions from the TBSignatureProfiler showed that TB gene sets performed well in malnourished individuals; 40 gene sets had statistically significant discriminative power for differentiating TB from LTBI, with area under the curve ranging from 0.662–0.989. Three gene sets were not significantly predictive.
Conclusion: Our TBSignatureProfiler is a highly effective and user-friendly platform for applying and comparing published TB signature gene sets. Using this platform, we found that existing gene sets for TB function effectively in the setting of malnutrition, although differences in gene set applicability exist. RNA-sequencing gene sets should consider comorbidities and potential effects on diagnostic performance.
Supplementary Information The online version contains supplementary material available at 10.1186/s12879-020-05598-z.


Background
Tuberculosis (TB) is the leading cause of death due to an infectious disease worldwide, killing 1.6 million people in 2017 [1]. The EndTB strategy aims to reduce TB deaths by 95% and to cut new cases by 90% between 2015 and 2035 [2]. A critical component of this strategy is early identification of individuals with TB and prevention of transmission. Although the roll-out of GeneXpert has facilitated rapid TB diagnosis, the test has limitations (e.g., lower sensitivity if low bacillary burden, in children, and in extra-pulmonary disease) [3][4][5]. Furthermore, not all individuals with possible pulmonary TB are able to produce sputum [6]. Newer blood-based diagnostics using gene expression profiles have the potential to address the limitations of GeneXpert and other sputum-based tests [7].
Over the past several years, researchers have identified nearly four dozen gene expression signatures that distinguish TB disease from latent TB infection (LTBI) [8,9], TB from other infections [10][11][12], incipient pre-symptomatic TB disease and/or the future development of TB disease in those with LTBI [13][14][15], and response to therapy [16,17]. Signatures can be used to understand the heterogeneous response to TB and help identify the pathways and underlying biology of TB disease progression. These signatures have been developed using multiple profiling technologies (microarray, RNA-sequencing, rt-PCR) and a diverse set of computational and machine learning prediction algorithms. Some of these signatures were developed using direct training or cross-validation approaches on a single study, while others were developed using a meta-analytical approach [17,18]. Furthermore, several of these gene signatures have been validated by independent research teams on diverse cohorts in different settings and using multiple computational algorithms [19][20][21]. Importantly, recent studies have systematically compared the performance of TB signatures, and their associated gene sets and original predictive models, across multiple TB datasets to assess how well they predict TB outcomes [20,21]. However, despite this work, there is no single resource of compiled signature gene lists, methods, or biomarkers to apply to new datasets, and most gene sets have not been independently validated using alternative computational methodologies.
Existing studies of blood-based TB diagnostics have another important limitation: most have not evaluated the impact of comorbidities on the modulation of the TB signature. In high-TB burden settings, much of the population has comorbidities that affect host immune response and likely alter gene signatures of TB disease. Some of these have been directly studied (e.g., diabetes, HIV) [22][23][24] and others have not (e.g., malnutrition, pregnancy, parasites). In particular, the role of malnutrition, which is known to modulate the innate and adaptive immune responses, has not been explored [25,26]. Malnutrition affects much of the population in TB endemic countries, including one-third of the adult population in India, the country with 27% of the world's TB cases [1]. It is the most common secondary immunodeficiency and has been termed nutritional acquired immunodeficiency syndrome [27,28]. Undernutrition appears to impact both the innate and adaptive immune systems [29], and so can conceivably alter gene expression in these patients in significant ways. For example, undernourished individuals have been noted to have decreased expression of Th1 cytokines and increased concentrations of Th2 cytokines, which hobbles the Th1 response against Mtb [30,31]. Prior research has also suggested that undernutrition may diminish the effectiveness of TB vaccines. Furthermore, a study over two decades in the United States found that a BMI < 18.5 kg/m2 was associated with an adjusted hazard ratio of 12.43 (95% CI: 5.75, 26.95) for developing TB disease as compared to those with BMI greater than 18.5. In India, more than 50% of TB cases are attributable to undernutrition in most states [32]. Because of the significant TB risk malnutrition poses and the gap in current knowledge, we sought to determine whether the published gene lists indicating TB disease accurately discriminate TB from LTBI in the setting of malnutrition in India.
In this work, we curated almost four dozen existing TB-related signature gene sets and developed our TBSignatureProfiler software toolkit. We also added two single-gene biomarkers to this comparison that were compared in a previous meta-analysis [21]. This platform was used to evaluate the function of these signatures for distinguishing between TB and LTBI in severely malnourished individuals. We applied the TBSignatureProfiler to this condition to determine whether existing TB gene sets work in a severely malnourished population. While it is unlikely that these signatures will be implemented in clinical practice for detecting TB disease, we do note that many existing signatures were developed for this purpose. Thus, a comparison between prevalent and latent TB is the logical first step in evaluating and validating these signature gene sets in the setting of malnutrition. Once these signatures are established and validated, they can be used for more innovative and useful applications, such as predicting risk of progression or worsening disease, monitoring treatment efficacy, or diagnosing extrapulmonary disease.

Collection of published TB signature gene sets
Our goal was to compile a comprehensive set of multigene signature gene lists and make them available through our TBSignatureProfiler platform. The only criteria for inclusion in this study were that the signature gene set consisted of at least two genes and was used and presented as a biomarker of a TB outcome (disease, risk, treatment, etc.) in a peer-reviewed publication. We collected a set of 45 previously published gene sets in total (Table 1). References for these gene sets are available in the Supplementary Materials and from the TBSignatureProfiler software documentation. These gene sets were derived from multiple studies, using several transcriptional profiling platforms and disparate predictive methods and algorithms. As such, we define the term "gene set" or "signature" as the collection of genes that were used in the predictive model in its original study. We then define the "gene set/signature strength" or "gene set/signature score" as the single sample gene set enrichment score for that set. For presentation, gene signatures are labeled using the first author's last name and the number of genes in the signature (e.g., Berry_393). Gene sets that focus on the presence of comorbidity with TB and another disease have additional labels. Details for these naming conventions are available in the Supplementary Methods. We also included two previously proposed single gene biomarkers, NPC2 [33] and BATF2 [34,35], using their gene expression counts in our comparison.

TBSignatureProfiler platform
The 45 previously published gene sets of TB outcomes are included in our TBSignatureProfiler, a novel R package that allows users to quickly and easily perform single sample pathway enrichment analysis using our set of TB signature gene sets and multiple scoring methods, including ssGSEA, GSVA, PLAGE, combining Z-scores, ASSIGN and singscore [36][37][38][39][40][41] (these methods are detailed in the Supplementary Methods). The resulting scores can then be profiled and visualized using the plotting functions in our TBSignatureProfiler R package. The R package is available on GitHub (https://github.com/compbiomed/TBSignatureProfiler) and through Bioconductor (https://bioconductor.org/packages/release/bioc/html/TBSignatureProfiler.html). Additional details for the scoring, visualization, and comparison functions are provided in the Supplementary Methods and in the software package vignette.

Malnourished individuals from RePORT-India
Our malnourished samples came from the Regional Prospective Observational Research in TB (RePORT)-India cohort based at Jawaharlal Institute of Postgraduate Medical Education and Research (JIPMER). The study is conducted in collaboration with Boston Medical Center and Rutgers-New Jersey Medical School. Ethical approval was obtained from the JSAC and IEC committees of JIPMER and the institutional review boards of Boston Medical Center and Rutgers University. This household contact study enrolls newly-diagnosed smear-positive, culture-confirmed pulmonary TB cases identified at Revised National TB Control Programme clinics as well as their household contacts; additional study details have been previously reported [19][42][43][44]. In brief, index cases are visited at enrollment, 1, 2, 6 and 12 months and household contacts at enrollment, 12 and 24 months. Blood is collected in PaxGene RNA sequencing tubes at each time point. Household contacts undergo tuberculin skin testing (TST) for LTBI and are monitored for symptoms of active TB; sputum testing is done on symptomatic individuals.
In addition to demographic characteristics, questionnaires address relevant comorbidities that affect host response and TB risk including HIV, diabetes, renal failure, other immunosuppressive conditions, alcohol use (and at-risk alcohol use based on the Alcohol Use Disorders Identification Test [45]), tobacco use, and other parameters. These values are summarized in Table 2. Participant BMI is measured at baseline and categorized into severe malnutrition (BMI < 16 kg/m2), malnutrition (16-18.4), and normal/overweight (> 18.4), henceforth referred to as well-nourished. In individuals less than 18 years of age, BMI was categorized based on standard deviations relative to the World Health Organization median: children whose BMI was more than two standard deviations away from the median for their age were categorized as malnourished [46]. In index cases, blood samples are taken to diagnose diabetes mellitus (random blood sugar > 200 mg/dL) and HIV.
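The adult BMI categories above can be expressed as a small helper. This is only a sketch of the thresholds stated in the text; the function names are hypothetical, and the pediatric rule (based on WHO standard deviations) is deliberately not implemented because it requires age-specific reference tables.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def adult_nutrition_category(bmi_value: float) -> str:
    """Adult categories as stated in the text:
    < 16 severe malnutrition, 16-18.4 malnutrition, > 18.4 well-nourished."""
    if bmi_value < 16.0:
        return "severe malnutrition"
    if bmi_value <= 18.4:
        return "malnutrition"
    return "well-nourished"


if __name__ == "__main__":
    # Hypothetical participant: 40 kg, 1.65 m tall.
    value = bmi(40.0, 1.65)
    print(round(value, 1), adult_nutrition_category(value))
```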

RNA-sequencing data generation and processing
We analyzed RNA-seq data from enrollment PaxGene tubes from a subset of 23 severely malnourished individuals with TB and 15 severely malnourished tuberculin skin test positive (TST ≥5 mm) household contacts as previously described [19]. The data were batch corrected using ComBat-Seq [47,48] (Supplementary Figure 1). Differential expression analysis between TB and LTBI samples produced 6706 differentially expressed features using an adjusted p-value (FDR) cutoff of 0.01, including 4913 protein coding genes, 1052 lncRNAs, 135 T cell receptor elements, 19 immunoglobulin genes, and 13 miRNAs. The list of protein coding genes was used to develop a list of differentially expressed genes and pathways of TB vs LTBI. Detailed methods for the processing of the PaxGene tubes, RNA-sequencing, and data analysis are available in the Supplementary Methods.
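To make the "adjusted p-value (FDR) cutoff" concrete, the sketch below implements the Benjamini-Hochberg adjustment on toy p-values. The choice of Benjamini-Hochberg is an assumption: the text says only that an FDR-adjusted p-value was used, and the p-values here are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    toy_p = [0.001, 0.008, 0.039, 0.041, 0.2]  # toy values, not study data
    adj = benjamini_hochberg(toy_p)
    significant = [p for p in adj if p < 0.01]
    print(adj, len(significant))
```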

Analysis and overlap of existing TB signature gene sets
The 45 TB signature gene sets described in Table 1 consist of between 2 and 700 unique UCSC gene symbol annotations. Overall, these gene sets include 1513 unique TB associated genes. Most genes (965, 63.8%) are listed in a single gene set and 96.8% (1465 genes) are listed in four or fewer gene sets; only 48 (3.2%) genes are listed in five or more gene sets (Fig. 1). Individual genes that occur frequently include FCGR1A, GBP5, GBP6, C1QB, FCGR1B, SEPT4, and ANKRD22 (in 14, 14, 13, 12, 11,
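The overlap summary described in this section (how many gene sets list each gene) amounts to a simple membership count. The sketch below shows the idea on toy data; the set names and memberships are hypothetical and do not reproduce the published signatures.

```python
from collections import Counter


def gene_set_membership_counts(gene_sets):
    """Map each gene symbol to the number of gene sets that list it."""
    counts = Counter()
    for genes in gene_sets.values():
        counts.update(set(genes))  # set() guards against repeats within one list
    return counts


def summarize(counts, threshold=5):
    """Count unique genes, genes in exactly one set, and genes in >= threshold sets."""
    return {
        "unique_genes": len(counts),
        "in_one_set": sum(1 for c in counts.values() if c == 1),
        "in_threshold_or_more": sum(1 for c in counts.values() if c >= threshold),
    }


if __name__ == "__main__":
    # Toy memberships for illustration only.
    toy_sets = {
        "SetA_3": ["GBP5", "FCGR1A", "C1QB"],
        "SetB_2": ["GBP5", "FCGR1A"],
        "SetC_2": ["GBP5", "DUSP3"],
    }
    counts = gene_set_membership_counts(toy_sets)
    print(counts.most_common(2))
    print(summarize(counts, threshold=2))
```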

Analysis of TB signatures in malnourished individuals
Differential gene/pathway expression
We found 4913 significant differentially expressed protein coding genes using an FDR threshold of 0.01; 56.9% of the genes from the 45 TB signature gene sets were present in that list. A pathway enrichment analysis using the 1000 most significant genes resulted in multiple relevant enriched pathways, including the NF-kappa B signaling pathway, cytokine-cytokine receptor interaction, and multiple infection response pathways (including response to TB). We used the 500 most differentially expressed genes to generate a heatmap with unsupervised clustering (Supplementary Figure 2), which separated the majority of those with TB from those without. We applied the TBSignatureProfiler to evaluate the performance of existing TB signature gene sets on our data. We used the SignatureHeatmap() function with ssGSEA scoring to evaluate the scores for all 45 gene sets simultaneously (Fig. 2). Similar plots using the GSVA and PLAGE scores are available in Supplementary Figures 3 and 4. The heatmap illustrates that the scores are highly concordant across samples and that the signature gene sets are able to classify TB from LTBI. Specifically, the top four clusters segregated by the dendrogram consisted of one cluster with generally low gene set scores comprising only LTBI samples (n = 9), two clusters with the highest gene set scores for most pathways consisting of only TB samples (n = 11 and 8), and a fourth cluster with mild to moderate scores for most pathways that consisted of both LTBI (n = 7) and TB samples (n = 3). Despite general agreement, it is clear that the signature gene sets are not completely concordant, and that all the signature gene sets together provide more classification accuracy than any individual signature gene set.

Evaluation of individual signature gene set performance
The performance of signature gene sets can be evaluated using boxplots of individual gene set scores. We used the SignatureBoxplot() and AUCBoxplot() functions to generate these boxplots (Fig. 3) and the table_AUC() function to tabulate the results (Supplementary Table 1). These boxplots and table values are generated by bootstrapping gene set scores and calculating the AUC of the ROC plot. The ROC curves for these were also generated using the signatureROCplot_CI() function (Supplementary Figure 7). The bootstrapped confidence intervals were supplemented with more direct intervals using AUC variance and estimation procedures defined previously [49,50]. We note that for most of the signatures in our malnutrition data, the bootstrapped confidence intervals were more conservative than those from the more direct approach. These results show that most gene sets are able to distinguish between the TB and LTBI samples. [...] value = 0.83), and Lee_4 (AUC = 0.511; p-value = 0.30) had t-test p-values above 0.1 using ssGSEA scores.

Fig. 2 Heatmap of the ssGSEA scores for the signature gene sets (rows) in the malnourished TB and LTBI individuals (columns). Higher scores trend towards yellow-green and lower scores trend towards blue-purple. The color bar at the top designates whether the sample is from an LTBI individual (blue) or an individual with active TB (red). These signatures are able to separate most of the TB samples from the LTBI samples. The pathway signature scores are largely concordant. This heatmap was generated using the SignatureHeatmap() function from the TBSignatureProfiler
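As a rough illustration of bootstrapping gene set scores to obtain a distribution of AUCs, the sketch below resamples TB and LTBI scores with replacement and reports a percentile interval. It is a minimal stand-in, not the package's implementation: resampling within each group, the percentile interval, and all function names and toy scores are assumptions made for the example.

```python
import random


def auc(pos, neg):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_aucs(pos, neg, n_boot=1000, seed=0):
    """Resample each group with replacement and recompute the AUC each time."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_boot):
        boot_pos = [rng.choice(pos) for _ in pos]
        boot_neg = [rng.choice(neg) for _ in neg]
        results.append(auc(boot_pos, boot_neg))
    return results


def percentile_interval(values, lower=0.025, upper=0.975):
    """Simple percentile interval from the sorted bootstrap values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


if __name__ == "__main__":
    # Toy gene set scores, not study data.
    tb_scores = [0.42, 0.35, 0.51, 0.18, 0.47, 0.39]
    ltbi_scores = [0.12, 0.22, 0.05, 0.31, 0.20]
    boot = bootstrap_aucs(tb_scores, ltbi_scores, n_boot=500, seed=1)
    print(auc(tb_scores, ltbi_scores), percentile_interval(boot))
```

An interval whose lower end falls near or below 0.5 indicates that the score does not reliably separate the two groups at this sample size.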

Evaluation of gene set enrichment scoring methods
We used the signatureGeneHeatmap() function to evaluate the gene-level performance of a few pathways one at a time (Fig. 4). The genes in Sambarey_HIV_10 and Thompson_9 segregate malnourished TB and LTBI. The Lee_4 gene set showed poor performance using this metric, as there is no clear clustering of genes and the four genes are upregulated in both TB and LTBI. The Maertzdorf_4 gene set showed better performance in clustering and visual analysis than the ssGSEA and AUC analyses showed. This gene set performed better with GSVA (AUC = 0.764) and PLAGE (AUC = 0.932). This is likely because ssGSEA scoring (and GSVA to a lesser extent) relies heavily on concordance of genes (i.e., all are either up-regulated or down-regulated), and thus a gene set such as Maertzdorf_4, whose genes are negatively correlated or pick out different features from samples, does not score well with this algorithm. Conversely, we observed that Thompson_9, which consists of highly concordant/redundant genes, had among the best AUCs with ssGSEA (AUC = 0.983), GSVA (AUC = 0.980), and PLAGE (AUC = 0.983).
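The effect of gene concordance on scoring can be seen in a toy example. The sketch below uses an unweighted mean of per-gene z-scores as a simple stand-in for a concordance-dependent score; it is not ssGSEA, GSVA, or PLAGE, and the gene names, expression values, and the optional `signs` argument are all hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one gene's expression across samples."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def mean_z_score(expression, signs=None):
    """Per-sample mean of gene z-scores; signs can flip down-regulated genes."""
    genes = list(expression)
    signs = signs or {g: 1 for g in genes}
    z = {g: zscores(expression[g]) for g in genes}
    n_samples = len(expression[genes[0]])
    return [sum(signs[g] * z[g][i] for g in genes) / len(genes) for i in range(n_samples)]


def auc(pos, neg):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def score_auc(expression, is_tb, signs=None):
    scores = mean_z_score(expression, signs)
    tb = [s for s, flag in zip(scores, is_tb) if flag]
    ltbi = [s for s, flag in zip(scores, is_tb) if not flag]
    return auc(tb, ltbi)


if __name__ == "__main__":
    is_tb = [True, True, True, False, False, False]
    concordant = {"geneA": [8, 9, 10, 2, 3, 4], "geneC": [7, 9, 11, 1, 3, 5]}
    discordant = {"geneA": [8, 9, 10, 2, 3, 4], "geneB": [2, 3, 4, 8, 9, 10]}
    print(score_auc(concordant, is_tb))   # 1.0: both genes are up in TB
    print(score_auc(discordant, is_tb))   # 0.5: opposite directions cancel out
    print(score_auc(discordant, is_tb, signs={"geneA": 1, "geneB": -1}))  # 1.0
```

In this toy case each gene separates the groups perfectly on its own, yet the unsigned average carries no signal once the genes move in opposite directions, which is the kind of behavior the paragraph above attributes to concordance-dependent scoring.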

Discussion
In this study, we present our set of 45 curated TB signature gene sets along with our TBSignatureProfiler software and use it to assess the impact of malnutrition on discriminative ability of a large number of signature gene sets.
The TBSignatureProfiler is an important contribution that provides the first comprehensive, open-source evaluation tool to compare TB signature gene sets in a direct and reproducible way. This automated platform enables investigators to apply nearly four dozen TB gene sets directly to their datasets using multiple scoring methods, with tools to visualize signature gene set strength. Future analyses performed using these same gene sets on additional datasets can be directly compared with past results using the same scoring methods and analytic approach. In addition, new or future signature gene sets can be added and evaluated in a simple and straightforward way, by merely adding them to the TB signature gene sets collection in the software. This functionality has not previously been available in the TB research field, despite the publication of many dozens of gene expression studies, signatures, evaluations and meta-analyses [17,19,20]. Ultimately, the TBSignatureProfiler will enable investigations into whether signature gene sets work in different geographic settings and in the context of different social conditions, contexts, or co-morbidities (e.g., high alcohol use), and will allow new signature gene sets to be efficiently evaluated and compared in these populations as they are developed.
Overall, there were very few genes that overlapped between the signature gene sets. There were, however, many common functional families that are represented across the gene sets. For example, guanylate-binding proteins (GBPs) are IFN-induced GTPases and contribute to an inflammatory response by activating the NLRP3 and AIM2 inflammasome assembly [51][52][53]. Interferons are produced during Mtb infection, which could lead to activation of GBP5 and GBP6. These GBPs then further enhance the inflammatory response via inflammasome activation. FcGR1 (CD64) is the high affinity receptor for IgG and is expressed on most myeloid cells. In humans, FcGR1 is encoded by three highly homologous genes, FcGR1A, FcGR1B and FcGR1C. Interaction of IgG and FcGR1 results in cellular activation, including phagocytosis, generation of reactive oxygen species, antigen-presentation, release of inflammatory cytokines, and antibody-mediated cellular cytotoxicity [54]. FcGR1 expression on neutrophils has been proposed as a biomarker of infection and sepsis [55]. Neutrophils in Juvenile Idiopathic Arthritis, an inflammatory disease, express higher levels of FCGR1B compared to controls [56]. It is therefore not surprising that many signature gene sets encompassed either FcGR1A or FcGR1B.

Fig. 3 a Boxplots of the ssGSEA scores for each signature individually show that some of the signatures are highly predictive of TB compared to LTBI in malnourished individuals. b Boxplots for the AUCs (y-axis) from bootstrapped samples for each pathway (x-axis) demonstrate that most of the signatures were able to classify TB from LTBI, although some of the signatures, including Maertzdorf_4, Lee_4, and Sloot_HIV_2, had boxplot arms below the 0.5 mark. These figures were generated using the SignatureBoxplot() and AUCBoxplot() functions of the TBSignatureProfiler
Kinase activation and phosphorylation cascades induced following immune cell activation are regulated by dual-specificity phosphatases (DUSPs) [57]. Since active TB is associated with increased inflammatory response, the presence of DUSP3 in several signature gene sets is expected. Another gene found in many signature gene sets is ANKRD22, an ankyrin repeat protein with four copies of the ankyrin motif. The motif interacts with an array of unrelated proteins to affect many cellular processes [58,59], and it is likely that ANKRD22 expression is upregulated because of the enhanced inflammatory response in TB. Basic leucine zipper transcription factor ATF-like 2 (BATF2) is a transcription factor that belongs to the activator protein 1 family of transcription factors and contains the basic leucine zipper domain. BATF2 dominance in the TB signature gene sets is consistent with its upregulation by type I IFNs [60], and by IFNγ and Mtb in macrophages [61].
The single gene biomarkers NPC2 and BATF2 were very effective in distinguishing between TB and LTBI in malnutrition. Although these single gene biomarkers are highly effective, activation of these genes is not specific to TB infection, but is associated with common inflammatory pathways (this may also be the case for some of the multi-gene "Disease" signatures). We note that NPC2 plays a key role in lysosomal cholesterol egress [62,63] and the expression of NPC2 is directly regulated by the nuclear factor kappa B subunit 2 (NF-κB2) protein [64]. In addition, NPC2 plays a significant role in other infectious diseases; for example, upregulation of NPC2 is crucial for viral replication in Chikungunya, Zika, West Nile and Dengue infections [65]. BATF has been shown to directly control TH17 differentiation [66], and transcriptomic analysis has established that upregulation of BATF2 in HIV-specific CD8+ T cells leads to the inhibition of T cell function [67]. Thus, although these genes are sensitive biomarkers for separating TB from LTBI, they lack specificity to TB, as their expression is associated with common processes involved in host immune responses to multiple infectious agents. We would therefore recommend using more specific, multi-gene signatures if specificity is needed for the context.

Fig. 4 Gene-level heatmaps for the Sambarey_HIV_10, Thompson_9, Lee_4, and Maertzdorf_4 signature gene sets. The Sambarey_HIV_10 and Thompson_9 signatures scored well with the ssGSEA algorithm scoring, whereas the Lee_4 and Maertzdorf_4 gene sets scored poorly. These heatmaps were generated using the signatureGeneHeatmap() function from the TBSignatureProfiler
The TBSignatureProfiler was applied to samples from severely undernourished individuals with TB and LTBI in India. This analysis found that existing blood RNA signature gene sets of TB generally work in the setting of severe undernutrition, although some differences in performance do exist. Differences seen in the application of the signature gene sets may reflect the size of the gene sets (i.e., smaller gene sets may not perform as well) and/or the settings in which those gene sets were trained. A few selected signature gene sets do not perform optimally in the setting of severe undernutrition. These findings suggest that most TB signature gene sets are robust and could work in many different settings and with different comorbidities, but some gene sets perform slightly better in different contexts. This finding has important implications in India and many high TB-burden countries.
We had hypothesized that malnutrition might modulate the transcriptional profiles in different ways and using different mechanisms than in well-nourished individuals, but this was generally not the case. Malnutrition clearly affects the immune response, with effects on macrophage activity and phagocytosis, antigen presentation, and induction of the Th1 immune response among other sequelae [29]. It is plausible that these effects were not detected because the dominant immunomodulatory effects of TB that are common between well-nourished and malnourished individuals outweigh the more specific transcriptional impacts induced by changes in nutritional status. It is also likely that some of the signature gene sets themselves were developed in settings with high rates of malnutrition, so the effect of malnutrition on TB signature gene sets was already incorporated. For example, the Sambarey_HIV_10 signature was trained on data obtained from participants in Chennai and Bengaluru, India, where malnutrition is highly prevalent. Further investigation is needed to understand the role of inflammation and immune response in the setting of malnutrition, although we show here that most existing TB signature gene sets work well in the setting of malnutrition.
Malnutrition is not the only comorbidity that is associated with TB incidence. Endemic countries have high rates of alcohol use, diabetes, HIV and other immunomodulatory conditions [68][69][70]. Little has been done to explore whether blood-based transcriptional TB signatures may be altered in the setting of such comorbidities. Such studies are needed before these signatures can be accepted as validated diagnostic modalities. For example, it has been shown that the Zak_RISK_16 signature has a lower AUC in the setting of HIV infection [13]. Furthermore, transcriptional profiling of individuals with diabetes and TB demonstrates activation of pathways associated with diabetes complications [24]. It is possible that signature performance in other TB-endemic settings may also be affected by genetic or Mtb strain differences. Additional work is needed to determine the impact of other common comorbidities. The TBSignatureProfiler can play an important role in facilitating future analyses in these different settings.
This work demonstrates that existing signature gene sets can be effectively used on samples from comorbid TB contexts, although the efficacy of the gene sets may vary. While it is unlikely that these gene signatures will be used in clinical practice to distinguish pulmonary TB from LTBI controls, our work does suggest that existing gene sets can be used to detect TB in circumstances where existing diagnostics are less effective, e.g. distinguishing extrapulmonary, paucibacillary, and pediatric TB from controls in malnourished individuals. In addition, evaluation of the subtle differences between signature gene set performance, combined with the dissection of the gene set content, may provide insight on potential mechanisms specific to demographics, comorbidities, or other context-related specifics for each patient group under consideration.
We recognize that this study has several limitations. While the study has a large enough sample size to determine the significance of the signature gene sets' abilities to distinguish between TB and LTBI, the sample size was not large enough to clearly distinguish between the performance of the top-scoring gene sets. Therefore, we can only conclude that many of the gene sets work well, but we cannot determine which is the best gene set in this context. It is possible that our results do not reflect the full spectrum of gene sets in severely malnourished individuals with LTBI, as severe malnutrition may blunt the TST response; however, our previous analyses suggest this is not universally true [71]. In addition, the characteristics of the participants with TB and LTBI differed with regard to demographics (e.g. age) and risk factors (e.g. smoking and alcohol), and we do not have power to control for these differences in our analysis. While this may lead to the confounding of signature gene set strength differences between TB and LTBI, we point out that differences in demographics and comorbidities are quite common among the TB and LTBI populations; these data represent the population dynamics of these groups. In addition, several of our signature gene sets were trained in pediatric cohorts [13,72], but we see no difference in performance between these child/adolescent gene sets and those trained on adults.
One final limitation of our TBSignatureProfiler platform is that many existing signature gene sets were trained on different transcriptional profiling platforms (microarrays, RNA-seq) using different machine learning and predictive modeling tools. Gene set scoring methods may not perform as well with a signature gene set as the original platform or method did; this is an area of further development for the package that is beyond the scope of this paper. However, here we evaluate existing signature gene sets across multiple scoring methods to highlight which TB signature gene sets are the most robust across platforms and methods, and thus should work well across a variety of predictive modeling approaches and contexts. This approach may also have the benefit of reducing the likelihood of model overfitting for individual signatures trained on specific datasets.

Conclusion
In conclusion, we have developed the TBSignatureProfiler platform that enables the application of several dozen TB signature gene sets to new datasets. The TBSignatureProfiler allows multiple scoring options and innovative graphical outputs. Using the TBSignatureProfiler, we demonstrate that severe malnutrition does not significantly alter the predictive performance of most TB gene sets. As we move toward expanded use of signature gene sets, these findings will have relevance in India and other settings with a high TB and malnutrition burden.
Additional file 1. Supplementary signature list and references. This file contains the TBSignatureProfiler software documentation, list of signatures and references for all the signatures in the TBSignatureProfiler.
Additional file 2:
Supplementary Table 1. Output of the table_AUC() function for the ssGSEA scored signatures and single genes.
Supplementary Figure 1. PCA plots before and after batch correction. (A) Illustrates principal components colored by batch, which has a significant batch effect in the first principal component, and very little separation on the first two components by TB status. The two plots to the right show the ComBat-Seq corrected data colored by batch (B) and by TB status (C). These plots clearly demonstrate the reduction of batch effects and the magnification of signal due to TB status.
Supplementary Figure 2. A heatmap and unsupervised clustering of the 500 most differentially expressed genes clearly separates the malnourished TB individuals from the malnourished LTBI.
Supplementary Figure 3. A heatmap displaying the scaled GSVA scores for all 47 signatures (rows) for the malnourished TB and LTBI individuals (columns). The color bar at the top designates whether the sample is from an LTBI individual (red) or an individual with active TB (green). These signatures are able to separate most (all but five) of the TB samples from the LTBI samples. The pathway signature scores are largely concordant for most of the signatures. This heatmap was generated using the SignatureHeatmap() function from the TBSignatureProfiler.
Supplementary Figure 4. A heatmap displaying the scaled PLAGE scores for all 47 signatures (rows) for the malnourished TB and LTBI individuals (columns). The color bar at the top designates whether the sample is from an LTBI individual (red) or an individual with active TB (green). These signatures are able to separate most (all but five) of the TB samples from the LTBI samples. The pathway signature scores are largely concordant for most of the signatures. This heatmap was generated using the SignatureHeatmap() function from the TBSignatureProfiler.
Supplementary Figure 5. (A) Boxplots of the GSVA scores for each signature individually further show that some of the signatures are highly predictive of TB compared to LTBI in malnourished individuals. (B) Boxplots for the AUCs (y-axis) from bootstrapped samples for each pathway (x-axis) using the GSVA algorithm. The GSVA scores were more variable overall compared to the ssGSEA scores (Figure 2), and several of the signature AUCs had tails below 0.5 (Lee_4, Anderson_42, Thompson_RES_5, Sloot_HIV_2, Maertzdorf_OD_100, among others).
Supplementary Figure 6. (A) Boxplots of the PLAGE scores for each signature individually further show that some of the signatures are highly predictive of TB compared to LTBI in malnourished individuals. (B) Boxplots for the AUCs (y-axis) from bootstrapped samples for each pathway (x-axis) using the PLAGE algorithm. The PLAGE scores were more variable overall compared to the ssGSEA scores (Figure 2), although a smaller number had tails around 0.5 (Lee_4, Thompson_FAIL_13, Sloot_HIV_2).

Availability of data and materials
The raw and processed sequencing data from this study are available in the GEO repository, under accession numbers GSE101705 and GSE152218. Furthermore, processed sequencing data and R code used for analysis and figure generation are available in the following GitHub repository: https://github.com/wevanjohnson/tbsp_malnutrition. The TBSignatureProfiler software is available through Bioconductor (https://bioconductor.org/packages/release/bioc/html/TBSignatureProfiler.html) and GitHub (https://github.com/compbiomed/TBSignatureProfiler).

Ethics approval and consent to participate
The associated protocol, informed consents, and data collection forms are approved for scientific content and compliance with human subjects' research regulations by the Institutional Review Boards of BUMC, Rutgers New Jersey, and JIPMER. Written informed consent was obtained from all study participants. All investigators and study staff are trained in research ethics and human subjects' research.