Presentation of the selected models of the systematic review
Thirteen articles were included in our systematic review “A systematic review of prediction models to diagnose COVID-19 in adults admitted to healthcare centers” [10], and all were conducted in 2020. Each study proposed diagnostic models for COVID-19 based on socio-demographics, clinical symptoms, blood tests, or other characteristics, which were compared with the qRT-PCR test. The number of variables included in a model varied from 4 to 15. The presence of fever appeared in 7 models, the blood value of eosinophils in 6 models, and C-reactive protein (CRP) in 5 models. Four studies included comorbidities, gender (male) or chest X-ray as a predictor in their models. Finally, age, cough and white blood cells (WBC) were significant predictors in three of the 13 studies, and lymphocytes were present in two of the 13 studies. Some variables can be collected directly, while others take more time to obtain. Sample sizes varied from 100 to 172 754 subjects, and most studies were conducted at a single site or institution. Most of the models were developed using logistic regression; from these regressions, some authors derived a score and cut-off values. Machine learning methods such as XGBoost and random forest were also applied. All studies presented classification measures, with a wide range of sensitivity and specificity values depending on the model, and 12 presented a discrimination measure. All models performed well in identifying patients at risk of COVID-19, but only one study proceeded to external validation. The risk of bias was estimated as low for all models using the PROBAST tool [11].
Among these 13 articles, six were kept in this study to calculate scores, cut-off values and fit models: Vieceli et al. [12], Tordjman et al. [13], Kurstjens et al. [14], Aldobyany et al. [15], Nakakubo et al. [16] and Fink et al. [17]. They are presented in detail in Additional file 1: Appendix A1. The other articles were discarded because more than 20% of the required information was missing and/or because the score could not be calculated or the model could not be fitted, owing to the methodology used and/or a lack of information despite contact with the authors, as explained in detail in the following sections. As mentioned in [10], the collected variables were sometimes country-specific and cannot be obtained if the model is to be put into use in a setting other than the research context. For most of the six retained studies, a score and cut-off values could be obtained, but a binary logistic regression was only available for three studies [12, 13, 17]. One score and its cut-off value had to be refitted due to missing information, and another missing variable was replaced by its median value in order to fit the logistic regression model. For the score derived from Nakakubo et al., the two categories “moderate and high risk” were combined because few subjects in the sample fell in the last category.
Study population
Data for the present study were extracted from the Medical and Economic Information Service (SIME) of the University Hospital Center of Liège (CHU Liège) and included patients presenting with suspicion of COVID-19 at the two ED triage centers [18] of the CHU (Sart Tilman and Notre-Dame des Bruyères). Data were collected from March 2, 2020, to January 31, 2021, and covered 8033 patients. This period primarily covered two complete waves of cases and patient admissions in Belgium [3]: from March 2020 to June 2020 (wave 1) and from September 2020 to January 2021 (wave 2).
Socio-demographic information (age and gender), comorbidities (cardiac disease, immunosuppression, renal failure), symptoms (fever, dry or wet cough, dyspnea, diarrhea), blood parameters (lactate dehydrogenase (LDH), CRP, procalcitonin, lymphocytes or absolute lymphocyte count (ALC), basophils, ferritin, leukocytes, neutrophils or absolute neutrophil count (ANC)) and radiology exams, particularly chest X-ray results, were collected in the database. Socio-demographic information and clinical symptoms were easily available at ED admission, whereas hospital diagnostic resources required a longer time-to-results. In addition, radiological exams were not recommended for all patients, as their clinical presentation might not require this type of work-up. The outcome was a confirmed or unconfirmed COVID-19 case according to qRT-PCR. Two different qRT-PCR tests were used during these periods: one adapted from the protocol described by Corman et al. [19], and a commercial assay using the cobas® 6800 platform (Roche) [18]. Patients for whom no qRT-PCR test was performed, who were aged < 18 years, or for whom no biological parameters were available were excluded from the analysis; they represented 80% of the original dataset.
Ultimately, 1618 patients (20% of the original database) were included in this study, none of whom were pregnant women, and 32.1% tested positive by qRT-PCR.
Statistical analysis
Results were expressed as numbers and frequencies for qualitative parameters and as mean and standard deviation (SD), median (P50) and interquartile range (IQR, P25-P75) and range (Min–Max) for quantitative parameters, globally and by groups, namely positive and negative confirmed COVID-19 patients. The normality of the distribution of the quantitative parameters was investigated using the mean-median comparison, the histogram and Quantile–Quantile plot and tested with the Shapiro–Wilk hypothesis test.
For all models and scores, discrimination was assessed by the area under the receiver operating characteristic curve (AUROC). Values range from 0 to 1, where an AUROC of 0.5 suggests no discrimination, values from 0.7 to 0.8 are considered acceptable, values from 0.8 to 0.9 excellent, and values above 0.9 outstanding [20]. For models that provided a cut-off value, sensitivity (Se), specificity (Sp), and positive and negative predictive values (PPV and NPV, respectively) were also calculated with 95% confidence intervals (95CI%).
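As an illustration of the measures described above, the following minimal sketch computes the AUROC as the proportion of positive/negative pairs in which the positive patient has the higher score (ties counted as one half), and Se, Sp, PPV and NPV at a given cut-off. The scores, labels, cut-off and function names are hypothetical and are not taken from the study; the analyses themselves were run in SAS and R, and confidence intervals are omitted here.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each positive/negative pair counts 1 if concordant, 0.5 if tied, 0 otherwise.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def diagnostic_values(scores, labels, cutoff):
    """Se, Sp, PPV and NPV when a score >= cutoff is read as a positive prediction."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return se, sp, ppv, npv


# Hypothetical toy data: one score per patient and the qRT-PCR result (1 = positive).
scores = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

print(auroc(scores, labels))                        # 0.8125
print(diagnostic_values(scores, labels, cutoff=5))  # (0.75, 0.75, 0.75, 0.75)
```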
For models where information was available to calculate outcome probabilities, model calibration was assessed by means of the Brier score, a measure of accuracy whose values can range from 0 for a perfect model to 0.25 for a non-informative model [21, 22], and by plotting predicted against observed probabilities using a LOESS-smoothed curve. Results were reported with calibration slopes and intercepts (calibration-in-the-large). A perfect calibration slope is equal to 1, while slopes < 1 indicate an underestimation of low risk and overestimation of high risk, and slopes > 1 indicate an underestimation of high risk and overestimation of low risk. The estimated regression intercept represents the overall miscalibration, where 0 indicates good calibration, > 0 denotes an average underestimation, and < 0 denotes an average overestimation [23]. For models where information on the intercept was missing, we calculated the intercept using the model linear predictors as an offset term, as suggested by Gupta et al. [9].
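The calibration measures described above can be sketched as follows. This is a minimal standard-library illustration, not the R code used for the study: the Brier score is the mean squared difference between predicted probability and outcome, calibration-in-the-large is the intercept of a logistic model with the logit of the predicted probability as an offset, and the calibration slope is the coefficient of that logit in a two-parameter logistic model. Both models are fitted here by plain Newton-Raphson iterations, and the toy probabilities and outcomes are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(z):
    return 1 / (1 + math.exp(-z))


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_in_the_large(probs, outcomes, n_iter=25):
    """Intercept a of logit P(y=1) = a + offset(logit(p)); the slope is fixed at 1."""
    lp = [logit(p) for p in probs]
    a = 0.0
    for _ in range(n_iter):
        mu = [expit(a + z) for z in lp]
        grad = sum(y - m for y, m in zip(outcomes, mu))
        hess = sum(m * (1 - m) for m in mu)
        a += grad / hess
    return a


def calibration_slope(probs, outcomes, n_iter=25):
    """Intercept a and slope b of logit P(y=1) = a + b * logit(p)."""
    lp = [logit(p) for p in probs]
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        mu = [expit(a + b * z) for z in lp]
        w = [m * (1 - m) for m in mu]
        g0 = sum(y - m for y, m in zip(outcomes, mu))
        g1 = sum((y - m) * z for y, m, z in zip(outcomes, mu, lp))
        h00 = sum(w)
        h01 = sum(wi * z for wi, z in zip(w, lp))
        h11 = sum(wi * z * z for wi, z in zip(w, lp))
        det = h00 * h11 - h01 ** 2
        # Newton step: solve the 2x2 system H * delta = gradient.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical, perfectly calibrated toy data: 1 of 4 events at p = 0.25, 3 of 4 at p = 0.75.
probs = [0.25] * 4 + [0.75] * 4
outcomes = [1, 0, 0, 0, 1, 1, 1, 0]

print(brier_score(probs, outcomes))               # 0.1875
print(calibration_in_the_large(probs, outcomes))  # close to 0
print(calibration_slope(probs, outcomes))         # intercept close to 0, slope close to 1
```

The iterations assume that the outcomes are not perfectly separated by the predicted probabilities; with separated data the slope estimate would diverge.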
A sensitivity analysis was also conducted to compare Se, Sp, PPV and NPV, as well as discrimination and calibration measures, for each selected model between the complete data set and data sets from which patients were excluded who presented (1) during wave 1 and between the two waves or (2) during wave 2 and between the two waves.
Agreement between models was assessed by means of absolute and relative measures. For continuous scores, pairwise comparisons were performed using Bland–Altman (BA) analysis with limits of agreement (LOA) and the two-way fixed IntraClass Correlation Coefficient (ICC (A, 1) [24]) with 95CI%. ICC values less than 0.5 are indicative of poor reliability, values between 0.5 and 0.75 indicate moderate reliability, values between 0.75 and 0.9 indicate good reliability, and values greater than 0.90 indicate excellent reliability [25]. As the scores had different value ranges, they were rescaled (mean/standard deviation) for these calculations. For binary or categorical scores, Cohen’s Kappa was computed [26]. Values > 0.6 indicate substantial agreement [27].
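Two of the agreement measures described above can be sketched in a few lines. The example below assumes that the rescaling is a standardization (subtract the mean, divide by the standard deviation), uses the usual mean difference ± 1.96 SD for the Bland–Altman limits of agreement, and computes Cohen’s Kappa from observed and chance-expected agreement. The ICC is not reproduced here, and all data and function names are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale a score to mean 0 and SD 1 (assumed reading of the rescaling step)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def bland_altman(a, b):
    """Bias (mean difference) and 95% limits of agreement between two paired scores."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def cohen_kappa(r1, r2):
    """Cohen's Kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(r1)
    categories = set(r1) | set(r2)
    p_obs = sum(1 for x, y in zip(r1, r2) if x == y) / n
    p_exp = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical continuous scores from two models, on different scales.
score_a = [2, 4, 5, 7, 9, 3]
score_b = [10, 22, 24, 30, 41, 18]
bias, loa_low, loa_high = bland_altman(standardize(score_a), standardize(score_b))
print(bias, loa_low, loa_high)

# Hypothetical binary classifications (1 = predicted COVID-19) from two scores.
class_a = [1, 1, 0, 0, 1, 0, 1, 0]
class_b = [1, 1, 0, 0, 1, 0, 0, 1]
print(cohen_kappa(class_a, class_b))  # 0.5
```

After standardization both scores have mean 0, so the bias is zero by construction and the width of the limits of agreement carries the information.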
If at most 20% of the information needed to calculate a score or to fit a model was unobtainable from the data, the calculation was based on the available variables [28]. Scores and possible cut-off values were refitted to the actual number of variables. To fit models, missing variables were replaced by the mean or median value given in the original article. Where more than 20% of the variables were missing, the score/model was discarded from this study.
The amount of missing data varied from 0.2% to 63%. Multiple imputation using the Fully Conditional Specification (FCS) method [29] was applied, and all statistical analyses (diagnostic values, discrimination, calibration and agreement) were performed on the 60 generated data sets. Rubin’s rules [30] were applied to pool the results obtained.
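The pooling step can be illustrated with a short sketch of Rubin’s rules: the pooled estimate is the mean of the per-data-set estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance, where m is the number of imputed data sets. The three estimates and variances below are hypothetical (the study used 60 imputed data sets), and the text does not state whether quantities were pooled on their original or on a transformed scale.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)       # pooled point estimate
    within = mean(variances)      # average within-imputation variance
    between = variance(estimates) # between-imputation variance (denominator m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical estimates of one quantity from three imputed data sets.
estimates = [0.70, 0.72, 0.74]
variances = [0.0004, 0.0004, 0.0004]

pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var)
```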
Results were considered significant at the 5% critical level (p < 0.05). The statistical analyses were carried out using the SAS statistical package (version 9.4 for Windows) and R (version 4.0), in particular with the packages rms [31], CalibrationCurves [32], BlandAltmanLeh [33] and multiagree [34], and the more common irr [35] and psych [36].