Projecting COVID-19 disease severity in cancer patients using purposefully-designed machine learning

Background Accurately predicting outcomes for cancer patients with COVID-19 has been clinically challenging. Numerous clinical variables have been retrospectively associated with disease severity, but the predictive value of these variables, and how multiple variables interact to increase risk, remains unclear. Methods We used machine learning algorithms to predict COVID-19 severity in 348 cancer patients at Memorial Sloan Kettering Cancer Center in New York City. Using only clinical variables collected on or before a patient’s COVID-19 positive date (time zero), we sought to classify patients into one of three possible future outcomes: Severe-early (the patient required high levels of oxygen support within 3 days of being tested positive for COVID-19), Severe-late (the patient required high levels of oxygen after 3 days), and Non-severe (the patient never required oxygen support). Results Our algorithm classified patients into these classes with an area under the receiver operating characteristic curve (AUROC) ranging from 70 to 85%, significantly outperforming prior methods and univariate analyses. Critically, classification accuracy is highest when using a potpourri of clinical variables — including basic patient information, pre-existing diagnoses, laboratory and radiological work, and underlying cancer type — suggesting that COVID-19 in cancer patients comes with numerous, combinatorial risk factors. Conclusions Overall, we provide a computational tool that can identify high-risk patients early in their disease progression, which could aid in clinical decision-making and selecting treatment options. Supplementary Information The online version contains supplementary material available at 10.1186/s12879-021-06038-2.


Background
At the time of this writing, SARS-CoV-2 infection (COVID- 19) continues to exact a substantial toll across a wide range of individuals. Although previous studies have uncovered factors that increase risk of severe COVID-19 infection --e.g., older age, obesity, or preexisting heart or lung disease [1][2][3][4] --the clinical course and outcome of patients with COVID-19 illness remains variable and difficult for clinicians to predict. In cancer patients, projecting outcomes can be more complex due to uncertainty regarding cancer-specific risk factors; further, physicians must balance the risk of an untreated malignancy with the risk of severe infection due to specific anti-neoplastic therapies.
To help clinicians predict COVID-19 severity [5,6], we turned to robust machine learning methods to identify high-risk cancer patients based on their pre-existing conditions and initial clinical manifestations. Prior work using machine learning [7,8] or other analytic techniques has focused on non-cancer patients primarily from China or Italy [9][10][11][12][13][14][15]. In this study, we developed a model to predict clinical outcomes (levels of oxygen support needed) in cancer patients, using only clinical variables that were available on or before COVID-19 diagnosis (called "time zero"). Importantly, these variables were selected purposefully, combining both datadriven approaches and expert clinical opinion, and were designed to minimize over-fitting of the model and to increase clinical credibility. We gauged the prospective of this approach to accurately identify cancer patients at the greatest risk for impending severe COVID-19 illness, in the hopes of improving outcomes through timely and appropriate interventions.

Study population and clinical variables collected
We analyzed patients admitted to Memorial Sloan Kettering Cancer Center with laboratory-confirmed SARS-CoV-2 (COVID- 19) infection during the first 2 months of the pandemic, from March 10, 2020 (when testing first became available at our institution) to May 1, 2020. New York City was the first major metropolitan area in the United States that experienced widespread COVID-19 infections. Clinical treatment and risk-stratification strategies at this time were far from established, particularly in cancer patients, who may have a number of underlying conditions that place them at greater risk of severe outcomes. During this time, 40% of symptomatic individuals in our hospital were hospitalized for COVID-19, 20% developed severe respiratory illnesses, and 12% died within 30 days of infection [6]. The Memorial Sloan Kettering Cancer Center Institutional Review Board granted a Health Insurance Portability and Accountability Act (HIPAA) waiver of authorization to conduct this study.
We aimed to study patients specifically hospitalized for COVID-19 illness by including all patients admitted between 5 days prior, to 14 days after, diagnosis of SARS-CoV-2 infection. COVID-19 patients who were not hospitalized, or who were admitted outside of this window were not included. Analysis of disease severity of this patient cohort was previously reported by Robilotti et al. [6] and Jee et al. [16]; however, these studies did not develop nor apply machine learning predictive models to forecast future outcomes.
Importantly, we only used clinical variables that were collected on or before a patient's COVID-19 diagnosis date. For clinical laboratory values, only the most recent value was used. To reduce redundancy, groups of highly correlated variables (Pearson r > 0.90) were removed, and one random variable from the group was kept. Variables could be either mutually exclusive (e.g., indicator variables for a patient having an abnormal vs. a normal X-ray), or overlapping (e.g., having a hematologic cancer and leukemia). Overlapping (hierarchical) variables were included to provide the algorithm with multiple resolutions to find discriminating risk factors.

Defining patient outcomes
Patients were grouped into three possible outcomes based on whether and when they required high levels of oxygenation support, which we defined as oxygen delivered via a non-rebreather mask, high flow nasal cannula (HFNC), bilevel positive airway pressure (BiPAP), or mechanical ventilator. When the oxygen content in a patient's blood falls below normal limits, it puts the patient at risk of organ failure and death. COVID-19 can cause significant lung injury, which impairs the ability of Fig. 1 Overview of the study. a Data for 348 inpatients at Memorial Sloan Kettering Cancer Center were analyzed. For each patient, up to 267 clinical variables were collected, including basic patient information, cancer history, ICD medical history, laboratory work, and radiology work. Variables were only collected up to the patient's COVID-19+ date (time zero). b Variables are inputted into a machine learning algorithm (a random forest classifier), which learns to predict patient outcomes based on interactions between multiple variables. c Three possible patient outcomes. Of the 348 patients, 206 did not require high levels of oxygen support, 71 required oxygen support within 3 days of being tested positive for COVID-19, and 71 patients required oxygen support after 3 days oxygen to enter the circulatory system. In those instances, supplemental oxygen is administered through a variety of delivery methods. Patients requiring high oxygen support within 3 days (0 to 3 days relative to  were deemed "severe-early". Patients requiring high oxygen support after 3 days (4 days after COVID-19 or later) were deemed "severe-late". Patients not requiring high oxygen (i.e., patients who remained on room air and/or standard nasal cannula) for at least 30 days after COVID-19 were deemed "non-severe".

Machine learning algorithms and validation
To predict patient outcomes, we employed a random forest ensemble machine learning algorithm, consisting of multiple independent classifiers, each trained on different subsets of training variables [17]. These classifiers collectively estimate the patient's most likely outcome. Our random forest model consisted of 500 decision trees, trained using the information gain criterion, and each with a maximum depth of 10 decision nodes and a minimum of 1 sample per leaf. These parameters were selected after performing a standard grid search with the number of trees = {100,500,1000}, max-depth = {10,20, None}, and minimum samples per leaf = {1,2,5}. Parameter optimization improved AUROC by only~4% compared to a model trained using default scikit-learn parameters. Thus, our reported performance is unlikely a result of overfitting model parameters.
The model was evaluated using 10-fold stratified cross-validation, in which 90% of the dataset (approximately, 313 patients) were used to train the model, and the remaining 10% of the dataset (35 patients) were used to test the model. This process was repeated 10 times, such that each subject was assigned to the test set exactly once. This procedure also ensured that each fold had a class (outcome) distribution that approximately matched that of the complete dataset. We report area under the receiver operating characteristic (AUROC) and average precision scores for each class separately using a one-vs.-rest classification scheme [18].
The importance of each clinical variable towards performance was assessed using permutation testing [17], in which values for each variable (column) were randomly permuted over the observations and then model performance was reassessed using cross-validation; the drop in performance was used as a measure of the variable's importance. The machine learning algorithms, statistical analyses, and visualization procedures were implemented in python (v3.6.12) using the scikit-learn (v0.22.2) and matplotlib (v3.3.3) packages.

Comparison of performance to prior work
Previous machine learning studies have reported impressive performance predicting COVID-19 outcomes for non-cancer patients using only a few clinical variables. For example, Yan et al. [7] (Nature Mach. Intell., 2020) report 90 + % performance using just three variables (lactate dehydrogenase, C-reactive protein, and absolute lymphocyte count). Huang et al. (Lancet, 2020) reported statistical significance for 10 clinical variables (white blood cell count, absolute neutrophil count, absolute lymphocyte count, prothrombin time, D-dimer, albumin, total bilirubin, lactate dehydrogenase, troponin I, and procalcitonin). Other studies also used many of the same clinical variables [10,13,14]. For a fair comparison, and to test whether variables previously identified as important could also well-predict outcomes for cancer patients, we trained random forest classifiers on our dataset using only the variables used by Yan et al. and Huang et al., respectively.

Experimental setup and rationale
Wynants et al. [8] recently reviewed 16 prognostic models for predicting COVID-19 severity and concluded that every study had a high or unclear risk of bias. To try and minimize bias in our analytic approach, we followed three guidelines suggested by the authors: (a) Practices to reduce model over-fitting. We used stratified cross-validation, a standard practice in machine learning, to test how well a trained model can predict outcomes on patients it has never seen before. Evaluating models in this way helps to ensure that predictive patterns learned by the model can generalize to new patients whose outcomes are unknown. (b) Using a hybrid of expert clinical opinion and datadriven approaches to select variables. The authors of our study include both clinicians and computer scientists, who collaborated closely to home-in on a set of relevant clinical variables. As an example, using a completely data-driven approach, we found that a class of medications, atypical antipsychotics, correlated highly with disease severity; in fact, including these medications in our model would have increased our reported results by~4-5%. However, these medications are frequently given to elderly patients with dementia, and we felt these medications were very unlikely to directly cause severe COVID-19, and far more likely to be confounded by functional status. So, we removed this variable. Thus, we began with a purely data-driven approach to identify candidate variables, and then iteratively eliminated those that seemed tenuous from a clinical perspective. Our final model was trained using only 55 of the 267 variables (Table S1). (c) Only including patients who had sufficient time to experience their outcome by the end of the study.
We evaluated hospitalized patients diagnosed with COVID-19 from March 10 to May 1, 2020, and evaluated outcomes from March 10 until May 15, 2020, to ensure at least 2 weeks of follow-up for all patients.

Results
From March 10, 2020 to May 1, 2020, there were 348 inpatients at Memorial Sloan Kettering Cancer Center in New York City. Below, we test several models for predicting disease severity in this cancer patient cohort.

Univariates and bivariates weakly correlate with COVID-19 patient outcomes
Figure 2a-f shows that neither of six clinical variables commonly associated with COVID-19 severity (age, Creactive protein, D-dimer, albumin, lactate dehydrogenase, BMI) are by themselves able to discriminate the three patient outcomes. Some laboratory variables can only stratify between non-severe and severe-early patients (e.g., Fig. 2b, C-reactive protein), indicating that these labs may only be valuable for prognosing immediate risk as opposed to future risk. Others laboratory variables may be more discriminative but were only available for a fraction of patients at time zero (e.g., Fig.  2c, D-dimer). Overall, none of the variables we tested were significantly different between all three outcome groups (non-severe, severe-early, severe-late). We next tested whether interactions between two variables could be used to increase prediction accuracy. While there are hundreds of pairs of variables to test, Fig. 2g-i shows three representative plots using pairs of commonly used labs, none of which show any clear clustering of patients by outcome (i.e., clustering of the same-colored dots together).

Improved prediction using machine learning
To test if a combinatorial approach, which takes interactions between numerous risk factors into account, may improve projections of COVID-19 severity, we trained an ensemble machine learning algorithm using a wide range of clinical variables (Methods). Clinical variables included those related to the patient's underlying cancer diagnosis and treatment, laboratory work, radiological work, pre-existing diagnoses (ICD code history), and other basic patient information. We validated our model using stratified cross-validation: a portion of the patients were used to train the model, and then the model was evaluated on the remaining or left-out patients, whose outcomes are known but are never provided to the model.
Our model accurately predicted outcomes for COVID-19 cancer patients who required high levels of oxygen support within 3 days of COVID-19 diagnosis (AUC = 0.829 for severe-early patients; Fig. 3a). The model achieved fair accuracy in the more challenging instances of predicting severity that occurs after 3 days (AUC = 0.704 for severe-late patients) or that never occurs during the length of the patient's disease (AUC = 0.710 for non-severe patients). The model maintains an AUC of greater than 0.8 if "severe-early" was defined as all patients that required oxygen support within 4 days of diagnosis (instead of 3 days), but performance then begins to drop at longer time horizons: AUC = 0.823 for ≤4 days (81 patients); AUC = 0.790 for ≤5 days (88 patients); and AUC = 0.727 for ≤6 days (99 patients). These  results suggest that prediction is only reliable within a 3-4 day window from the time of diagnosis.
Prior work has reported that a small set of clinical variables can serve as a robust "signature" of COVID-19 disease severity [1,7] (Methods). However, we found significantly worse performance using these variables (Fig.  3b-c). For example, for severe-early patients, Yan et al. Similarly, for non-severe patients, the two studies achieved AUCs of 0.499 and 0.604, compared to 0.710 for our method. AUROC scores can be unreliable when used on datasets, such as ours, with imbalanced class sizes. We thus also computed average precision scores (a summary statistic of the precision-recall curve) and found similar gains for our method compared to prior works (Table 1).
Identifying multi-variable interactions that are useful for predicting patient outcomes Figure 4a shows the top 30 variables that were most discriminative in classifying patient outcomes. These were variables which, if effectively removed from the analysis, would result in a drop in performance (Methods). For example, ferritin and interleukin 6 were the two most important individual labs. Because we used dozens of variables, and many variable combinations may be correlated, we do not expect the loss of one or a few variables to make a significant difference in performance. Nonetheless, many of these variables have been previously identified in the COVID-19 literature (e.g., interleukin 6 [19,20], C-reactive protein [21]). Interestingly, there are also variables the model used that are less discussed in the literature, including ferritin [12,22]. Our study also highlights the importance of variables related to cancer diagnoses and treatments on COVID-19 severity; for example, whether the patient had leukemia or lung cancer was particularly discriminative.
Variables from all five categories (cancer-related, basic patient information, ICD codes, laboratory work, radiological work) are represented in Fig. 4a, highlighting how each clinical category contributes complementary information towards projecting COVID-19 severity. Indeed, classifying patient outcomes using variables from each category individually reduces accuracy compared to when using all variables together (Fig. 4b). For example, training the model using only cancer variables produced an average AUROC of only 55.2%. On the other hand, using all variables except cancer-related variables dropped performance by 5.7%. The former means that the underlying cancer type, by itself, is not a very valuable predictor, but the latter suggests that when the cancer type is combined with clinical variables from other categories, its contribution becomes more pronounced and is unique. Similarly, using only radiology variables produced an average AUROC of 60.6%, and using all variables except for radiology variables dropped performance by 6.0%. The performance of the classifier (y-axis) when trained using variables from each category separately. For example, using only radiology variables, the random forest classifier achieved an AUROC, averaged over all three classes, of 60.6%. "All" shows the combination of all variables, achieving an average AUROC of 74.7%

Discussion
We used machine learning algorithms to identify clinical variables predictive of severe COVID-19 illness in cancer patients at time zero. We achieved an AUC ranging from 70 to 85%, with high performance for classifying patients with an immediate risk of decompensation (severe-early, ≤ 3 days), and fair performance for patients with less immediate risks (severe-late, > 3 days) or no risk at all (not-severe). Our tool is designed to complement (not replace) a clinician's experience and judgement and may be most helpful to untangle complex interactions among multiple risk factors.
Following the guidelines of Wynant et al. [8], we combined data-driven variable selection with expert clinical opinion to reduce overfitting and minimize bias in the model. Had we included all variables, our model's performance would increase by at least 5%, but we deliberately did not report these results and instead opted to build a model with more clinical credibility. In addition, our study was meant to tackle two real-world challenges in treating COVID-19 patients. First, we used the time of COVID-19 diagnosis (time zero) as a landmark; we only provided to our model data available on or before time zero in order to represent the information available to providers at the time of presentation and diagnosis. As a result, there may be a lack of consistency in what clinical variables are available for the model to use. For example, even though D-dimer are commonly associated with COVID-19 severity [23], very few of our patients (16.1%, 56/348) had available D-dimer labs on the date of their COVID-19 diagnosis. Second, patients enter the hospital at different points in their disease progression, and we did not attempt to correct for these differences. A useful model, we reasoned, needs to deal with this lack of synchronicity to be practical.
There are several advantages and disadvantages to the machine learning approach taken here. On the plus side, automated models can help evaluate a large pool of clinical variables as risk factors for disease severity, and has potential to go beyond conventional modelling approaches, which are generally limited to evaluation of only a handful of variables. Further, evaluating the model using cross-validation reduces the probability of overfitting and highlights a model's prognostic ability. On the downside, the model seeks variables that are correlated with patient outcomes, and these variables are not necessarily causal drivers of the disease. For example, corticosteroids given to severe COVID-19 patients are known to affect blood glucose levels, and our model makes no attempt to distinguish the directionality of the interaction between the two. We attempted to overcome this by using a hybrid of expert clinical opinion and data-driven approaches to select variables in a purposeful manner, though it remains a challenge to differentially weigh the importance of clinical experience versus data.

Conclusions
Moving forward, several challenges remain in bringing clinical machine learning to the bedside for COVID-19 treatment. First, we analyzed a modestly-sized dataset of 348 cancer patients; larger, more comprehensive datasets of cancer patients are needed to test the true generality of our approach. Second, better algorithms are needed to forecast future outcomes (severe-late and non-severe); e.g., time-series analyses of how clinical variables change over time may provide one avenue forward. Third, models should aid clinicians in the real-time process of deciding which diagnostic tests to order on a patient based on the putative discriminative power of the test results. Ideally, models would interact with clinicians in a back-and-forth manner to home-in on the clinical variables most critical for accurate forecasting [24].
To better prepare us for the next outbreak --be it a second wave of COVID-19 or something else altogether --we hope that physicians, epidemiologists, and computer scientists will continue working together to understand and build useful models to predict an individual's susceptibility to disease.