A machine learning model to predict death outcome in severe COVID-19 patients

Background The novel coronavirus disease 2019 (COVID-19) spreads rapidly among people and has caused a global pandemic. It is of great clinical significance to identify COVID-19 patients at high risk of death. Of the 2,169 COVID-19 patients, the median age was 61 years and male patients accounted for 48%. A total of 646 patients were diagnosed with severe illness, and 75 patients died. Clear differences in demographics, clinical characteristics and laboratory examinations were found between survivors and non-survivors. A decision tree classifier including three biomarkers, neutrophil-to-lymphocyte ratio (NLR), C-reactive protein (CRP) and lactic dehydrogenase (LDH), was developed to predict death outcome in severe patients. This model performed well in both the train dataset and the test dataset, with an accuracy of 0.98 in each. The machine learning model was robust and effective in predicting the death outcome in severe COVID-19 patients.
In the train dataset, only 4 out of 378 patients with LDH < 330 IU/L died, and only 5 out of 326 patients with CRP < 27 mg/L died. In the test dataset, 31 patients had LDH ≥ 330 IU/L, and 17 out of 53 patients with CRP ≥ 27 mg/L died.


Background
The novel coronavirus disease 2019 (COVID-19) has become a global pandemic. The most common symptom of COVID-19 patients was fever, followed by dry cough, fatigue and dyspnea, among others [1,2]. A small proportion of patients had digestive symptoms, such as nausea, vomiting and diarrhoea [3,4]. A study from the Chinese Center for Disease Control and Prevention reported that about 81% of COVID-19 patients were considered mild, and the proportions of severe and critical patients were 14% and 5%, respectively [5]. The mortality in the overall population was 3.2%, but it increased to 49% in the critical population [5]. Hence, how to use effective biomarkers to identify patients with poor clinical outcomes has attracted extensive attention.
COVID-19 patients with comorbidities were considered to be prone to having poor clinical outcomes. A study reported that COVID-19 patients with chronic obstructive pulmonary disease, diabetes, hypertension and malignancy had a higher risk of admission to an intensive care unit (ICU), invasive ventilation or death [6]. Another study demonstrated that the risk factors included older age, high Sequential Organ Failure Assessment score, and higher D-dimer expression on admission [7]. To help physicians identify potential patients with critical illness, Liang and colleagues developed a risk score model based on ten characteristics of COVID-19 patients on admission to calculate the probability of developing critical illness [3].
It is of great clinical significance to identify COVID-19 patients at high risk of death. In this study, we aimed to develop a machine learning model to distinguish COVID-19 patients at high risk of death from those without.

Data sources
A total of 2,169 adult patients (aged ≥ 18 years) were enrolled from Wuhan, China between February 10th and April 15th, 2020. All patients were confirmed to have COVID-19 infection by real-time reverse-transcription polymerase-chain-reaction (RT-PCR) assay. In addition, medical records of all patients, including demographics, clinical characteristics and laboratory examinations on admission, were collected. This study was approved by the Ethics Committee of the Taikang Hospital (TKTJLL-007), and performed in accordance with the Declaration of Helsinki. The Ethics Committee of the Taikang Hospital waived the need for informed consent from each patient.

Study design
First, we performed a difference analysis of medical records between the severe group and the non-severe group. All patients meeting the severity diagnosis criteria during hospitalization were assigned to the severe group. We defined disease severity according to the Seventh Revised Trial Version of the COVID-19 Diagnosis and Treatment Guidance (2020) of China [8]. Next, we performed difference analyses of medical records between survivors and non-survivors. Survivors were defined as patients who had been discharged from hospital or were still in hospital at the end of the study. Last, we developed a decision tree classifier to identify risk factors for death outcome.

Statistical analysis
Continuous variables were described as median and interquartile range (IQR) and compared by the Mann-Whitney U test. Categorical variables were represented as frequencies and compared by Pearson's χ² test. All statistical analyses were performed and the decision tree classifier was developed using R software (version 3.5.2). The following R packages were used: CBCgrps, rpart, rpart.plot and pROC.
A two-sided P value < 0.05 was considered statistically significant.
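As an illustration of the Pearson's χ² comparison described above, the following minimal sketch computes the statistic and its P value for a 2×2 table. The analysis in this study was run in R; the Python code, the function name `pearson_chi2_2x2` and the choice of example table (the train-dataset death counts by NLR group reported in the Discussion) are assumptions made for illustration only, and no continuity correction is applied.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 df) for the table [[a, b], [c, d]].

    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Illustrative table: rows are NLR >= 6.9 and NLR < 6.9 in the train dataset,
# columns are died and survived (54 of 113 and 3 of 339 died).
chi2, p_value = pearson_chi2_2x2(54, 113 - 54, 3, 339 - 3)
print(f"chi2 = {chi2:.1f}, significant at 0.05: {p_value < 0.05}")
```

The closed form used here holds only for 2×2 tables; larger contingency tables would need the general sum over observed and expected counts.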

Results
On admission, 117 (5%) patients had high body temperature, 270 (12%) had low oxygen saturation, 359 (17%) had abnormal heart rates and 596 had fast respiratory rates. In total, 1134 (52%) patients had at least one comorbidity, and the common comorbidities were hypertension, diabetes and coronary heart disease. In addition, 728 (34%) patients had symptoms in one system, 1130 (52%) patients had symptoms in two systems and 218 (10%) patients had symptoms in three or more systems. The most common symptoms were respiratory, systemic and digestive symptoms (Table 1).
A total of 646 (29.8%) patients were diagnosed with severe illness during hospitalization. Compared with the non-severe group, the severe group had a significantly higher median age (68 vs. 58 years, p<0.001) and a higher proportion of male patients. On admission, higher proportions of high body temperature, low oxygen saturation, abnormal heart rate and fast respiratory rate were found in the severe group.
Moreover, patients in the severe group had a higher proportion of comorbidities and symptoms. No difference was found in smoking history (Table 1). When comparing laboratory examinations between the two groups, we found that the severe group had significantly higher white blood cell (WBC) count, neutrophil count and neutrophil-to-lymphocyte ratio (NLR), and higher C-reactive protein (CRP), lactic dehydrogenase (LDH), interleukin-6 (IL-6), procalcitonin and D-dimer levels, but lower lymphocyte count, eosinophil count, basophil count, red blood cell (RBC) count, hemoglobin and platelet count. No difference was found in monocyte count (Table 1).
As of April 15th, 2020, 75 patients had died. Differences in demographics and clinical characteristics between survivors and non-survivors were similar to the differences between the severe and non-severe groups. In the comparison of laboratory examinations, much higher WBC count, neutrophil count and NLR, and higher CRP, LDH, IL-6, procalcitonin and D-dimer levels were found in non-survivors (Table 2).
RBC count and hemoglobin level showed no difference between the two groups. Other laboratory indexes were lower in non-survivors (Table 2).
To explore crucial predictive biomarkers of disease mortality in severe patients, we used a machine learning model, decision tree, to identify related biomarkers. Laboratory indexes with missing values over 20% were excluded, including IL-6, procalcitonin and D-dimer.
Then all remaining missing values were imputed with the mean value of each laboratory index. All severe patients were randomly split into a train dataset and a test dataset in a ratio of 7:3. A total of 452 patients were included in the train dataset, including 57 non-survivors. In this step, a decision tree classifier was developed to differentiate non-survivors from survivors. As shown in Figure 1, three biomarkers were included in the decision tree classifier: LDH, NLR and CRP. The threshold of each biomarker helped to classify each patient into the survivor group or the non-survivor group. The area under the curve (AUC) of the receiver operating characteristic of this model was 0.96, which was higher than that of each single biomarker (Figure 2). The associated confusion matrix of the train dataset is presented in Supplementary Table 1. The accuracy of this model was 0.98. The precision, recall and F1 score for survivor prediction were 0.97, 1.00 and 0.98, respectively. For non-survivors, the precision, recall and F1 score were 1.00, 0.81 and 0.90, respectively (Table 3).
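The preprocessing steps above (mean imputation followed by a random 7:3 split) can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function names, the fixed seed and the LDH readings are hypothetical and are not study data.

```python
import random

def impute_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def split_indices(n, train_fraction=0.7, seed=0):
    """Shuffle 0..n-1 and split into train and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_fraction)
    return indices[:n_train], indices[n_train:]

# Hypothetical LDH readings with one missing value (not study data).
ldh = impute_with_mean([200.0, None, 400.0])

# 646 severe patients split 7:3, matching the 452 train patients in the text.
train_idx, test_idx = split_indices(646)
print(ldh, len(train_idx), len(test_idx))
```

The sketch follows the order described in the text, imputing before splitting. Computing the imputation mean on the train portion only would be an alternative that keeps test information out of the training step.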
To validate the performance of the decision tree classifier, we applied it to the test dataset. The associated confusion matrix of the test dataset is presented in Supplementary Table 1. The accuracy in the test dataset was also 0.98. The precision, recall and F1 score for survivor prediction in the test dataset were 0.98, 0.99 and 0.98, respectively. For non-survivor prediction in the test dataset, the precision, recall and F1 score were 0.94, 0.83 and 0.88, respectively (Table 3).
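The F1 score is the harmonic mean of precision and recall, so the values reported above can be checked against one another. The short sketch below recomputes F1 from the reported precision and recall for each dataset and class; the function name and dictionary layout are illustrative choices, and small discrepancies are expected because the inputs are already rounded to two decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as given in the text for each dataset and class.
reported = {
    ("train", "survivor"): (0.97, 1.00, 0.98),
    ("train", "non-survivor"): (1.00, 0.81, 0.90),
    ("test", "survivor"): (0.98, 0.99, 0.98),
    ("test", "non-survivor"): (0.94, 0.83, 0.88),
}

for key, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(key, round(f1, 2), f1_reported)
```

In all four cases the recomputed value agrees with the reported F1 to within rounding.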

Discussion
In this study, we found that COVID-19 patients in the severe group or the non-survivor group had a higher median age. These patients also had a higher proportion of comorbidities and symptoms than their counterparts. Most importantly, we proposed a machine learning model to quickly quantify the risk of death based on three biomarkers (LDH, NLR and CRP), which can be easily obtained on admission. This model had a high AUC of 0.96, and performed well in both the train dataset and the test dataset.
Zhang and colleagues [9] reported that the median age in a small cohort of COVID-19 non-survivors was 72.5 years, similar to our findings. In addition, both their study and ours found that male COVID-19 patients accounted for the majority of the non-survivors. They also reported that approximately 76.8% of the non-survivors had comorbidities, including hypertension, heart disease, diabetes, cerebrovascular disease and cancer [9].
NLR is one of the research hotspots among inflammatory biomarkers in infectious diseases. It can comprehensively reflect the inflammatory response and immune status in patients with infectious diseases [10-12]. In COVID-19 patients, elevated NLR on admission was reported to be significantly associated with disease severity [13,14]. Liu and colleagues proposed a simple model based on NLR and age to stratify COVID-19 patients into four groups [15]. COVID-19 patients aged < 50 years with NLR < 3.13 or NLR ≥ 3.13 had no risk of severity, and these patients should be treated in a community hospital, home isolation or a general isolation ward. In contrast, COVID-19 patients aged ≥ 50 years with NLR < 3.13 or NLR ≥ 3.13 had a higher risk of severity, and these patients should be admitted to an isolation ward or ICU with active treatment and care. In addition, Yang and coworkers found that approximately 46.1% of mild COVID-19 patients could become severely ill if they were aged ≥ 49.5 years and had NLR ≥ 3.3 [14]. The dynamic change of NLR could also be used to distinguish severe patients from mild/moderate patients. A study demonstrated that NLR in the severe group remained higher on days 1, 4 and 14 compared with the mild/moderate group [16]. In this study, we found that NLR ≥ 6.9 was a high-risk factor for death in severe COVID-19 patients. In the train dataset, 54 out of 113 patients with NLR ≥ 6.9 died, while only 3 out of 339 patients with NLR < 6.9 died. In the test dataset, 18 out of 43 patients with NLR ≥ 6.9 died, while no patients with NLR < 6.9 died.
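To make the contrast between the NLR groups concrete, the death counts above can be converted to mortality rates. The following is a minimal sketch of that arithmetic using only the counts reported in the text; the function and variable names are illustrative.

```python
def mortality_rate(deaths, patients):
    """Fraction of patients in a group who died."""
    return deaths / patients

# (deaths, patients) for each NLR group, as reported in the text.
groups = {
    "train, NLR >= 6.9": (54, 113),
    "train, NLR < 6.9": (3, 339),
    "test, NLR >= 6.9": (18, 43),
}

rates = {name: mortality_rate(d, n) for name, (d, n) in groups.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

These work out to roughly 48% and 1% in the two train groups and roughly 42% in the high-NLR test group.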
CRP reflects a persistent inflammatory activity state, and helps in assessing the severity of infectious patients [17]. A few studies have demonstrated that a higher CRP expression on admission was observed in severe COVID-19 patients compared with non-severe COVID-19 patients [17,18]. LDH was a biomarker of severe illness and poor prognosis in COVID-19 patients [19]. Zeng et al. found that LDH decreased within 10 days after admission in non-critical COVID-19 patients, but did not decrease obviously in critical patients or non-survivors [20]. These findings indicate that CRP and LDH play important roles in COVID-19 patients. In this study, LDH and CRP were also

Conclusion
In summary, this study found that male COVID-19 patients were more prone to experience severe illness and death. Clinical characteristics and laboratory examinations were significantly different between severe and non-severe groups, as well as between survivors and non-survivors. Most importantly, we identified three biomarkers (LDH, NLR and CRP) on admission as risk factors for death, and developed a simple decision tree classifier to help clinicians rapidly identify patients at high risk of death and to give priority treatment and intensive care.

Declarations
Ethics approval and consent to participate: This study was approved by the Ethics Committee of the Taikang Hospital (TKTJLL-007).

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.
All authors have approved the submission of this study and of any substantially modified version that involves their contribution to the study. All authors have also agreed both to be personally accountable for their own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated and resolved, and the resolution documented in the literature.
Table 1 notes: Data are n (%), n/N (%), or median (IQR), unless specified otherwise. Temperature, oxygen saturation, heart rate and respiratory rate were measured at rest when patients were admitted to hospital. a P values indicate differences between severe and non-severe patients.
Table 2 notes: Data are n (%), n/N (%), or median (IQR), unless specified otherwise. Temperature, oxygen saturation, heart rate and respiratory rate were measured at rest when patients were admitted to hospital. a P values indicate differences between survivors and non-survivors.
