First wave of COVID-19 in New Zealand
The study period is from the 26th February 2020 to the 9th June 2020, during which time New Zealand recorded 1153 confirmed and 350 probable cases of COVID-19 . Diagnosis of confirmed cases was by real time PCR using WHO recommended primers and probes targeting the E and N genes . During this first wave of COVID-19 in New Zealand, the decision to test individuals presenting to health services was based on a health practitioner-led risk assessment in the context of the Ministry of Health’s COVID-19 case definitions. The clinical criteria included: “any acute respiratory infection with at least one of the following symptoms: new or worsening cough, sore throat, shortness of breath, coryza, anosmia, with or without fever”. Significant changes to the clinical features in the case definition included: the removal of fever as a requirement on 14 March 2020, and the addition of anosmia or coryza on 01 April 2020.
All suspect cases of COVID-19 who presented to health services during the study period and were notified to Public Health officials via EpiSurv, the national notifiable disease database, were considered in this analysis (N = 7549). Of these individuals, 7036 were tested by PCR and reported as either positive or negative for SARS-CoV-2. Those with a confirmatory PCR test were classified as confirmed cases. Probable cases were close contacts of confirmed cases with compatible clinical illness where PCR testing results were negative or inconclusive. This analysis includes only the 6224 cases who were classified as symptomatic, and of these 1125 were classified as ‘confirmed’ cases, 349 were classified as ‘probable’ cases and 4750 were classified as ‘not a case’. Patient symptom data were collected at notification along with diagnostic, demographic, outcome and risk factor details using a standardised case report form on EpiSurv,Footnote 1 with most symptoms included as specific yes/no fields. ‘Loss of sense of smell’ (i.e. anosmia) was added as a specific field in April, but was reported under ‘other symptoms’ prior to this. Ageusia was only reported under ‘other symptoms’.
Variables included the presence or absence of symptoms and epidemiological variables, including age, sex and prioritised ethnicity. Age and sex were determined from the standardised case report form, and prioritised ethnicity was self-determined and obtained by linkage to the national patient demographics dataset (National Health Index, NHI) and grouped in order of prioritization to Māori, Pacific, Asian then European and Other [14 Ethnicity from the NHI should be self-identified according to health sector protocols, but may be by proxy or assigned .
Symptom variables considered in this analysis (and their symbols) are those that occurred in at least 100 individuals: Ageusia (loss or reduction of sense of taste) (G), Anosmia (loss of sense of smell) (A), Coryza (runny nose) (Z), Cough (C), Diarrhoea (D), Fever (F), Headache (H), Nausea/Vomiting (V), Pain:Abdomen (P), Pain:Chest (S), Pain:Joint (J), Pain:Muscle (M), Shortness of Breath (B), Sore Throat (T), General Weakness (W). All symptoms were specific fields in the standard case report form (https://surv.esr.cri.nz/episurv/CaseReportForms/Coronavirus_Sep2020.pdf) with the exception of ageusia which was noted in the free text ‘other symptom’ field.
A group of methods was used to explore the datasets with increasing levels of complexity, starting with a simple visualisation of frequency distributions, extending to multivariate analyses and machine learning models.
Frequency distributions using UpSet plots
The frequencies of different symptoms and symptom combinations for those classified as ‘Not a case’, ‘Confirmed case’ and ‘Probable case’ were displayed as ‘UpSet’ plots using the R package ComplexHeatmap (https://github.com/jokergoo/ComplexHeatmap).
Multiple correspondence analysis
Patterns of symptoms were explored using Multiple Correspondence Analysis (MCA) using the R packages FactoMineR and factoextra . MCA is a factor analysis related to correspondence analysis (CA) and principal component analysis (PCA) . It calculates the chi-square distance between variables from an indicator matrix of: individuals × variables. Variables are factors or dummy variables with value 1/0 (present/absent). Here the matrix columns were the active variables consisting of the 15 most common symptoms, and supplementary variables (status, ethnicity, age group and sex). The calculated distances can be used to represent each variable as a point in space, and this is then mapped into the two dimensions that capture the most variation. The further away from the origin, the greater the contribution to the overall variance; and the closer variable categories are located in the graph, the more likely they are to occur in the same individual .
Supplementary variables are not used to calculate the pairwise distances between individuals in the MCA but their coordinates are predicted and plotted on the MCA plot.
The supplementary variables were considered as follows:
Status: “Not a case”, “Confirmed case” and “Probable case”
Prioritised Ethnicity: “Māori”, “Pacific Peoples”, “Asian”, “Middle Eastern/Latin American/African”, “European or Other”, “Unknown”.
Age Group: “ < 1”, “1–4”, “5–9”, “10–14”, “15–19”, “20–29”, “30–39”, “40–49”, “50–59”, “60–69”, “70 + ”.
Sex: “Male”, “Female”, “Unknown”
Month of year: “2”, “3”, “4”, “5” and “6”
Informed by the MCA, groups of variables linked by Boolean ‘AND” and ‘OR’ statements were assessed against ‘status’ as a quasi-gold standard, after omitting probable cases, to determine the sensitivity and specificity of each set of symptoms. This was augmented by the application of an automated search algorithm (see Additional file 1: Additional material).
Decision tree analysis and machine learning/random forests
Decision tree analysis (DTA) partitions a dataset recursively using binary splits based on the predictor variables, at each stage seeking to maximize the ‘purity’ of the components of the partition in terms of the target variable . Here the target was the Status variable, restricted to “Not a case” or “Confirmed case”, and the predictors were the symptoms and supplementary variables used in the MCA analysis. The mean decrease in ‘impurity’ (using the GINI index) is an evaluation of how important removal of a particular variable is on the purity of nodes, where high purity is where each node contains predominantly one outcome (i.e. ‘Confirmed case’ or ‘Not a case’). The maximum purity would be a situation where each node contains only one status outcome (see Additional file 1: Additional material for the results of a simple decision tree analysis).
One of the limitations of simple decision tree analysis is that, unless the dataset is very large, the tree topology can vary markedly, depending on the random selection of the training and test subsets of the full data. An extension of this method is the machine learning method using ‘Random Forests’. This method generates a large number of decision trees, each constructed using a different subset of the full dataset for training. The subsets are selected by sampling at random and with replacement from the full data set. These multiple decision trees (i.e. a forest of trees) are used to determine a classification consensus. One of the outputs of random forests includes a determination of the importance of each variable in terms of its impact on accuracy and ‘impurity’ (using the GINI index). The mean decrease in accuracy for a given variable is an estimate of the loss of prediction performance if that variable is removed from the training dataset. Random forest analysis was performed using the R packages ‘randomForest’ and ‘caret’ [17, 18]. For the random forest analysis both symptoms and supplementary variables were considered. Each model randomly sampled multiple variables as candidates for each split, and grew 500 trees, resampled using tenfold cross validation. The number of variables sampled was determined by an optimisation procedure using Cohen’s Kappa as the evaluation metric.
Both the decision tree analysis and the random forest analysis randomly selected 60% of the 4750 symptomatic ‘not cases’ and 1125 ‘cases’ for training, and 40% for testing. Given the ‘unbalanced’ nature of the dataset a number of approaches were used to optimise the performance of the models in terms of their ability to predict confirmed cases (i.e. sensitivity) and individuals that were not a case (i.e. specificity). These included sampling from the two groups, for example up-sampling the minority class “Confirmed case”, or down-sampling of the majority class “Not a case”, and the use of other similar methods such as the ‘rose’ and ‘smote’ algorithms  and varying prediction cut-off values, informed by Receiver Operator Curves (ROCs), using the R package pROC .
Other analytical techniques
Methods and results of other analyses are included in the Additional file 1: Additional material – an automated exploration of the sensitivity and specificity of symptom combinations using ‘status’ as a quasi-gold standard; Bayesian Latent Class Analysis (BLCA) ; and single decision tree analysis – are shown in the Additional file 1: Additional material.
Further details of the implementation of the statistical methods, including code, are available from the authors on request.