General methods and patient population
This study was performed as a retrospective review after approval from the Office of Institutional Research's institutional review board (IRB). The need for informed consent was waived owing to the retrospective nature of the study. Inclusion criteria were age > 18 years, presentation to the emergency department, and a documented real-time SARS-CoV-2 reverse transcriptase polymerase chain reaction (RT-PCR) test within 14 days of admission between January 1st, 2020, and March 15th, 2021. Exclusion criteria were age < 18 years, a pediatric-view CXR, lack of an RT-PCR test within 14 days, or insufficient follow-up time for outcomes analysis (defined as < 1 month after admission). Variables collected included basic demographic information (age, sex, ethnicity, body mass index (BMI)), relevant clinical history (history of hypertension (HTN), diabetes, chronic obstructive pulmonary disease (COPD), etc.), imaging and laboratory identification (exam codes, imaging date, RT-PCR date, image impression), AI results (ASOS), and outcomes data (hospitalization, intensive care unit (ICU) admission, intubation, and all-cause mortality, with the duration and date of each event).
Figure 1 contains a flow diagram describing inclusion of patients for the COVID-19 training and test datasets. A total of 23,785 CXRs were queried, of which 2456 ultimately met the criterion of a documented COVID-19 RT-PCR test within 14 days of an eligible PA or AP CXR. A total of 2488 patients were initially enrolled in this study; 32 patients were excluded for missing data, defined as images that failed AI segmentation because of poor image quality. The validation cohort consisted of 1000 RT-PCR-positive patients and 1000 RT-PCR-negative patients. Validation indices included mortality and COVID-19 diagnostic prediction. The test cohort of 456 patients was obtained as a randomized 1:1 internal holdout from the original 2456 patients. Additional file 1: Table S1 contains demographic information for the 2000 training patients.
Image acquisition and expert evaluation
One-view chest X-rays were obtained according to institutional protocol. Posteroanterior (PA) and anteroposterior (AP) views, but not lateral views, were included in this study. A master list of CXRs for patients admitted through the emergency department was obtained via billing code. Images were subsequently exported from the picture archiving and communication system without patient identifiers and manually uploaded to Siemens AI-Radiology Companion for evaluation. A total of 2456 images were used in this study. Categorical airspace opacity was defined as the presence of airspace disease regardless of severity.
A panel of three fellowship-trained cardiothoracic radiologists independently quantified the airspace opacity severity score for all 2456 images (~ 800 randomized chest radiographs each) to serve as the ground truth for this study. Briefly, each CXR was evaluated for the presence of pulmonary opacification according to the following [20]:
“The presence of patchy and/or confluent airspace opacity or consolidation in a peripheral and mid to lower lung zone distribution on a chest radiograph obtained in the setting of pandemic COVID-19 was highly suggestive of severe acute respiratory syndrome coronavirus 2 infection…” Airspace opacity severity (ASOS) was determined by visually estimating the percentage of each lung involved with airspace opacification. The percentage of lung involvement was then converted into a whole-number score of 0–5 per lung. For example, if 40% of the right lung (score = 2) and 60% of the left lung (score = 3) contained airspace opacities, the ASOS would be 5 (2 + 3). ASOS therefore ranged from 0 to 10 for each CXR. Equivalently, the score can be calculated by summing the percentages of airspace opacity in the two lungs and dividing by 20.
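The scoring arithmetic can be sketched in Python (an illustrative sketch; rounding each lung's percentage to the nearest whole-number score is our assumption for percentages that fall between score boundaries):

```python
def lung_score(percent_opacity: float) -> int:
    """Convert a visually estimated percentage of one lung's involvement
    into a whole-number score from 0 to 5 (20% of the lung per point)."""
    if not 0 <= percent_opacity <= 100:
        raise ValueError("percentage must be in [0, 100]")
    return round(percent_opacity / 20)

def asos(right_percent: float, left_percent: float) -> int:
    """Airspace opacity severity score: sum of per-lung scores, range 0-10."""
    return lung_score(right_percent) + lung_score(left_percent)

# Worked example from the text: 40% right (score 2) + 60% left (score 3)
print(asos(40, 60))  # 5
```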
Deep convolutional neural network algorithm
The CNN was previously trained on 11,622 cases, 5653 of which were positive for airspace opacities. An additional set of 540 cases (261 positive for airspace opacities) was previously used for validation and initial model selection. This cohort consisted of adult patients with a mix of typical and atypical infectious pneumonia, and the network was trained to recognize airspace opacities. The predictive models were then trained on 2000 patients (1000 RT-PCR positive and 1000 RT-PCR negative) from this study's CXR dataset. Analysis of these 2000 patients, performed before evaluation on the test dataset, can be found in the supplemental material. The following description is designed to fulfill the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) criteria for reproducibility in machine learning and to avoid common pitfalls in COVID-19 machine learning studies [14, 21].
The architecture of the proposed dCNN model comprises an early feature extractor acting as a candidate generator in an abstract feature space, followed by a discriminator sub-network used to compute the probability that an abnormality is present (in an image sub-region of interest), following the fully convolutional one-stage (FCOS) detection framework. The architecture is fully convolutional and processes the entire image in a single pass while analyzing its content at multiple scales. As such, it implicitly captures both global and local abnormalities present in the image. The severity score was based on a summation of the geographic extent (as represented by the bounding boxes) of airspace opacities in both lungs, converted into a whole number ranging from 0 to 10. Figure 2A shows an example CXR with a low-moderate airspace opacity severity score of 4/10 (~ 40%); EKG leads overlie the chest. Figure 2B shows an example CXR with large-volume bilateral airspace opacities; the AI severity score in this case was 8/10 (~ 80%). A dual-chamber pacemaker with atrial and ventricular leads overlies the left chest, highlighting the robustness of the algorithm for patients with overlying chest hardware. Figure 2C describes the dCNN architecture used in this study. For full details of the neural network architecture, please see Appendix E of Homayounieh et al. 2021, from which the architecture is sourced [22].
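As an illustration of how bounding-box output could be converted into the 0–10 score, the sketch below measures, per lung, the fraction of the lung region covered by detected boxes and scores each lung 0–5. The function names, the boolean-mask representation of lung regions, and the per-lung rounding are our assumptions for illustration, not the vendor's implementation:

```python
import numpy as np

def severity_from_boxes(boxes, lung_masks):
    """Convert detected opacity boxes into a 0-10 severity score.

    boxes: list of (x0, y0, x1, y1) pixel coordinates of detected opacities.
    lung_masks: dict of boolean arrays, e.g. {"right": ..., "left": ...},
                marking each lung's region on the image grid.
    """
    covered = np.zeros_like(next(iter(lung_masks.values())), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        covered[y0:y1, x0:x1] = True  # union of all detections
    score = 0
    for mask in lung_masks.values():
        frac = (covered & mask).sum() / max(mask.sum(), 1)
        score += round(frac * 100 / 20)  # percentage involved -> 0-5 per lung
    return score
```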
Model input and output at inference
The input to the model presented in Fig. 2C was an image rescaled, as a pre-processing step, to an isotropic resolution of 1025 × 1025 pixels using letterboxing. Bilinear interpolation was used for resampling, followed by a robust brightness/contrast normalization based on a linear remapping of the pixel values. The output was a set of boxes indicating the locations of the abnormalities (airspace disease), each associated with a label and a probability.
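The pre-processing steps can be sketched as follows (a minimal single-channel sketch; nearest-neighbour resampling is used here for brevity where the paper used bilinear interpolation, and the percentile-based remapping is our assumption for the "robust" normalization):

```python
import numpy as np

def letterbox(img: np.ndarray, size: int = 1025) -> np.ndarray:
    """Rescale a grayscale image to size x size, preserving aspect ratio
    and zero-padding the shorter dimension (letterboxing)."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # nearest-neighbour index maps (bilinear was used in the study)
    ys = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs]
    out = np.zeros((size, size), dtype=img.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized
    return out

def normalize(img: np.ndarray, p_lo: float = 1, p_hi: float = 99) -> np.ndarray:
    """Robust brightness/contrast normalization: linear remapping of pixel
    values between low/high percentiles (percentile choice is illustrative)."""
    lo, hi = np.percentile(img, [p_lo, p_hi])
    return np.clip((img - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
```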
Training was conducted in an end-to-end manner. The loss function is the sum of three terms: (1) a classification loss based on the focal loss described in detail by Lin et al. [23]; (2) a bounding-box coordinate regression loss based on an intersection-over-union metric; and (3) a center-ness loss, designed to reduce outlier detections, based on a weighted binary cross-entropy loss. A batch size of 8 was used for training. A separate, independent validation set was used for model selection and to perform early stopping when necessary. For augmentation, various intensity and geometric transformations were used [23, 24].
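The first two loss terms can be illustrated with a minimal numpy sketch (the alpha/gamma values are the defaults from the focal-loss paper, not values reported for this study, and the weighted-BCE center-ness term is omitted for brevity):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross entropy down-weighted for easy examples,
    so training focuses on hard, misclassified candidates."""
    p = np.clip(np.asarray(p, float), 1e-7, 1 - 1e-7)
    y = np.asarray(y, float)
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float((-alpha_t * (1 - p_t) ** gamma * np.log(p_t)).mean())

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    iy = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def box_loss(pred, gt):
    """IoU-based regression loss: 0 for a perfect box, 1 for no overlap."""
    return 1.0 - iou(pred, gt)
```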
Statistical analysis
An a priori power calculation was performed for the prediction of outcomes; assuming a 1:10 ratio of events in a 1:1 case:control split, 429 patients were required for a power of 0.9. Additional file 1: Fig. S1 and Tables S2 and S3 provide the power calculation materials. Prediction of a positive SARS-CoV-2 RT-PCR result was established using simple logistic regression. All simple logistic regression variables were constrained by an alpha of 0.05, and measures of model performance included the Akaike information criterion (AIC) and McFadden's pseudo-R2. All models were evaluated using receiver operating characteristic (ROC) curves, with the area under the curve (AUC) and its 95% confidence interval as the primary measure of prediction. DeLong's test for two correlated ROC curves was used for statistical comparison. Logistic probabilities were extracted from the simple logistic regression models. For multivariable analysis, demographic and clinical variables known from the literature to be associated with poor outcomes in COVID-19 were loaded into the initial regression model. A stepwise backwards logistic regression was then applied until all remaining variables were significant in the model (P < 0.05). Competing models were compared using AIC. Optimal threshold values were determined empirically using bootstrapping: 400 bootstrapped 1:1 COVID+/COVID− samples were drawn, and the most accurate values were selected. All statistical analysis was performed in R, version 3.6.3.
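The analysis itself was performed in R; purely for illustration, the ROC AUC computation and the bootstrapped threshold search can be sketched in Python as follows (the function names and the accuracy criterion over candidate thresholds are our assumptions):

```python
import numpy as np

def auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs ranked correctly, with ties counting half."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def best_threshold(labels, scores, n_boot=400, rng=None):
    """Empirical threshold selection: over n_boot bootstrapped 1:1
    positive/negative resamples, pick the most accurate cut-point each
    time and return the median pick (mirrors the 400-sample scheme)."""
    rng = np.random.default_rng(rng)
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos_idx, neg_idx = np.where(labels == 1)[0], np.where(labels == 0)[0]
    n = min(len(pos_idx), len(neg_idx))
    candidates = np.unique(scores)
    picks = []
    for _ in range(n_boot):
        idx = np.concatenate([rng.choice(pos_idx, n), rng.choice(neg_idx, n)])
        accs = [((scores[idx] >= t) == labels[idx]).mean() for t in candidates]
        picks.append(candidates[int(np.argmax(accs))])
    return float(np.median(picks))
```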