Characterization of clinical patterns of dengue patients using an unsupervised machine learning approach

Background Despite the greater sensitivity of the new dengue clinical classification proposed by the World Health Organization (WHO) in 2009, there is a need for a better definition of warning signs and clinical progression of dengue cases. Classic statistical methods have been used to evaluate risk criteria in dengue patients, however they usually cannot access the complexity of dengue clinical profiles. We propose the use of machine learning as an alternative tool to identify the possible characteristics that could be used to develop a risk criterion for severity in dengue patients. Method In this study, we analyzed the clinical profiles of 523 confirmed dengue cases using self-organizing maps (SOM) and random forest algorithms to identify clusters of patients with similar patterns. Results We identified four natural clusters, two with features of dengue without warning signs or mild disease, one that comprises the severe dengue cases and high frequency of warning signs, and another with intermediate characteristics. Age appeared as the key variable for splitting the data into these four clusters although warning signs such as abdominal pain or tenderness, clinical fluid accumulation, mucosal bleeding, lethargy, restlessness, liver enlargement and increased hematocrit associated with a decrease in platelet counts should also be considered to evaluate severity in dengue patients. Conclusions These findings suggest that age must be the first characteristic to be considered in places where dengue is hyperendemic. Our results show that warning signs should be closely monitored, mainly in children. Further studies exploring these results in a longitudinal approach may help to understand the full spectrum of dengue clinical manifestations.


Background
Dengue is an acute and systemic disease caused by the dengue virus (DENV), with a broad clinical spectrum ranging from asymptomatic to severe infections. Most infections by DENV result in a mild disease known as dengue without warning signs, but a small proportion of patients develop the severe form [1].
Although the World Health Organization (WHO) issued the revised dengue guideline with a new clinical classification [2], there is still debate regarding its specificity [3,4]. This new classification grouped the patients according to the presence or absence of warnings signs and severe dengue. Studies evaluating the new classification demonstrated a greater sensitivity when applied in endemic regions in both prospective and retrospective data, but the authors highlighted the need for a better definition of warning signs, mainly in the absence of laboratory tests [5,6].
Classic statistical methods have been used to evaluate warning signs and determine risk criteria for severity in dengue patients [7][8][9], however, the complexity of the clinical profiles and the many overlapping levels of severity makes the disease prognosis nearly impossible to predict. The challenge in modeling this type of clinical data in a multifactorial disease such as dengue relies on the choice of an appropriate modeling approach that instead of providing a single predictive attribute, considers a combination of variables regardless of the data structure.
The use of unsupervised machine learning techniques is becoming popular in the medical field to reduce the dimensionality of the data and to help visualize possible patterns. The self-organizing map (SOM) is especially suitable for this task because it projects the high dimension data into a low dimension without losing the data structure [10]. Due to the increase in volume and complexity of data over the last decade, especially in the medical field, several variants of the SOM algorithm were introduced to generalize the original algorithm to handle both numerical and categorical attributes. When data are described by categorical variables or by relations between objects, a common solution is to use a measure of resemblance (i.e., a similarity or a dissimilarity measure). A general extension of this idea is a stochastic version of SOM that can be used to analyze dissimilarity data [11].
Another machine learning technique that has been successfully used in the medical field is the random forest [12]. It provides a proximity scores matrix that assesses the number of samples and the similarity/ dissimilarity matrices between them. Both matrices (similarity and dissimilarity) can be used to perform a powerful unsupervised analysis to identify patterns in the data structure.
In this study we combined random forest followed by stochastic SOM to visualize possible natural clusters associated to clinical natural patterns in dengue confirmed cases. These clusters were then reviewed for possible characteristics that could be used as risk criteria in dengue patients.

Study population and eligibility criteria
In this descriptive cross-sectional study, we analyzed retrospective data of patients with suspicion of dengue infection, assisted at the hospital of the Instituto Nacional de Infectologia Evandro Chagas/FIOCRUZ, Brazil between January 2007 and December 2013. Following the age shifting in Brazil in 2007, the Instituto Nacional de Infectologia Evandro Chagas started a project to study dengue infection in children in collaboration with three pediatric hospitals in the city. These hospitals also serve as primary care and tertiary care for dengue, therefore patients with suspicion of dengue infection and admitted into these pediatric hospitals in Rio de Janeiro RJ were also included in this analysis.
The inclusion criteria were laboratory-confirmed dengue virus (DENV) cases enrolled up to 7 days from the onset of the symptoms and followed until the outcome (cure or death), which encompassed the acute and critical phases of the disease. Patients with comorbidities and cases with more than 7 days after the onset of the symptoms at the time of admission were excluded from this analysis. Only subjects with complete data for all variables including laboratorial and clinical dengue classification were included in our analysis. All dengue cases included in this study were confirmed by at least two of the following criteria: (i) positive DENV-specific realtime reverse transcription polymerase chain reaction (RT-PCR) for any serotype (I-IV) (QIAamp Viral RNA Mini Kit, Qiagen, Hilden, Germany, following the protocol described in Lanciotti et al. [13]), (ii) at least one positive DENV-specific immunoglobulin M (IgM) antibody in the convalescent serum compared to in the acute-phase serum or positive for qualitative IgM with dengue clinical profile during epidemic periods. Tests for detection of anti-dengue IgM were conducted using an antibody-capture enzyme-linked immunosorbent assay (PanBio, Brisbane, Australia), and/or (iii) NS1 antigen capture by using the Platelia™ Dengue NS1 Ag-ELISA Kit (Bio-Rad Laboratories, Marnes-La-Coquette, France) in the acute-phase serum (up to 3 days after the onset of fever), (iv) clinical-epidemiological diagnosis during epidemic periods .

Data preprocessing and clinical classification
Data were obtained from each patient's medical records. We created a categorical variable called "age group" (≤ 18 years old and > 18 years old) to better define the distribution of the variables among children and adults in the exploratory analysis. The variables age and days were normalized to reduce its variability before using to define the similarity matrix. Variables such as hematocrit and platelet increase/decrease as well as imaging data to define cavities fluid accumulation were based on at least two tests of complete blood count analyses and X-ray images, respectively. These variables were used to define some warning signs, but they were not included in this analysis.
The clinical classification was performed by trained clinicians based on the WHO guideline. It was then used to compare to the natural clusters defined by the unsupervised neural network. The classification divided the patients into three groups: dengue without warning signs, dengue with warning signs and severe dengue. Warning signs included: abdominal pain or tenderness; persistent vomiting (more than 5 times in 6 h or more than 3 times in 1 h); clinical fluid accumulation including pleural effusion and ascites identified as a reduction of vesicular murmur or reduction of thoracic-vocal trill; abdominal distention or dullness decubitus, confirmed by abnormal imaging findings; mucosal hemorrhage (gastrointestinal hemorrhage and/or metrorrhagia); lethargy (alteration of consciousness and/or Glasgow score < 15) or irritability; and liver enlargement (> 2 cm below the costal margin). Laboratory findings were defined as follows: thrombocytopenia (platelet count, 50, 000/mm 3 ) and hematocrit change of 20%, either raised or decreased by 20% from the baseline value during the convalescent period. Severe dengue was defined by the following characteristics: (i) Plasma leakage resulting in shock or fluid accumulation with respiratory distress. Shock was defined as the presence of at least 2 of the clinical signs of hypoperfusion, with or without an associated weak pulse pressure (≤20 mmHg) or hypotension for the specified age (decrease in blood arterial systolic pressure 5th percentile for age [<PAS5], calculated as age [years] 2 + 70) [14]; or (ii) severe bleeding, or (iii) severe organ involvement, e.g., severe hepatitis (aspartate aminotransferase/alanine aminotransferase levels > 1000 IU/L); Multiple-organ dysfunction syndrome was considered when dysfunction involved 2 or more organs. Definition and clinical criteria of these signs and symptoms are better described elsewhere [15,16].

Unsupervised machine learning techniques Random forest
Random forest is an algorithm based on constructing a binary tree using recursive partitioning. Each binary split recursively divides the parent branch into homogeneous or near homogeneous daughter nodes (the ends of the tree). The trees are built using a two-stage randomization procedure. The first stage introduces the randomization using a bootstrap sample of the original data, and in the second, the randomization is introduced at the node level, by selecting a random subset of variables, and only those variables that keep the purity (homogeneity) of the node are kept during the split. This homogeneity is calculated by the Gini Impurity Index (Eq. 1) and it determines the purity of each node based on the relative frequency of the class in the node being evaluated.
where S is the node being evaluated and π i is the frequency of the class k in the node S.
The advantage of this two-step randomization relies on the generation of decorrelate trees besides the guarantee that even those weak features are considered in the analysis [12]. An "out-of-bag" (OOB) error rate for each observation is calculated using the samples not included in the bootstrap and it is determined by majority vote across trees. Each tree is unpruned to obtain lowbias trees while bagging and random variable selection results in low correlation of the individual trees. Thus, the algorithm yields an ensemble that can achieve low bias and low variance [12].
Random forests can be used as an unsupervised technique. This approach involves the generation of a synthetic dataset to represent data without dependence. The synthetic dataset is appended to the original one, and a two-level classification variable ("original" and "synthetic") is created. Then a supervised random forest predictor is constructed to classify original from synthetic data [17]. One important output information provided by random forest predictor is a measure of the internal structure of the data (the proximity between data points). This proximity can be determined by examining the node membership of the data. Once this process is done for all trees, the proximities are normalized by dividing them by the total number of t-trees. These scores are then stored in a proximity matrix. This matrix can be used to calculate a dissimilarity matrix by subtracting one from each of the elements, allowing a direct comparison for clustering and visualization approaches to detect data structures in high-dimensional space [17].
The Random forest algorithm was applied in an unsupervised setting to calculate the dissimilarity matrix (SOM input). All analyses were performed by using ran-domForest package in R [18].

Stochastic self-organizing maps (SOM)
SOM is a neural network that uses unsupervised competitive learning to map nonlinear statistical relationships between high-dimensional data into low dimensional grids while maintaining its original topology. Its architecture usually consists of a two-dimensional grid (input and output) with each cell in the array having a processing unit called "neuron". The neurons are connected to adjacent ones by a neighborhood function and the data points closest to each other in the input space are mapped into nearby neurons on the grid [19].
The SOM training is iterative, and it uses competitive learning where the neurons of the output layer compete to be updated. First, there is a competitive learning phase where a sample xi is randomly chosen from the input data and the distance (Euclidean Distance) between the sample and all prototypes p are computed [19].
The closest neuron to the input is declared the winning neuron or Best Matching Unit (BMU) p u , u ϵ {1, ...,U} , and it can be calculated by Eq. 2.
where ||.|| is the Euclidean distance in ℜ d . The next step consists of a cooperative phase identifying BMU's neighboring neurons using a neighborhood Gaussian function. Finally, all the prototype vectors are updated. It can be performed either by updating all prototypes via a weighted average (batch SOM) or in a stochastic version where the prototypes are updated mimicking a stochastic gradient descent scheme as described by the Eq. 3. The resulting grid shows the relationship between the neurons displaying the distances between input data [19].
where t means time, α(t) learning rate and h is a neighborhood kernel function centered on the winner unit.
The best grid is usually chosen by checking two error measures: (1) quantization error and (2) topographic error. The topographic error or Te (Eq. 4) quantifies the continuity of the map with respect to the input space metric by counting the number of times the second-best matching (BMU2) unit of a given observation belongs to the direct neighborhood of the BMU for this observation, whereas the quantization error or Qe (Eq. 5) provides the mean distance between each vector and the cluster prototypes for k clusters [10].
where u(x k ) is 1 if the winning neuron (BMU1) and the second neuron (BMU2) are neighbors and 0 if they are not neighbors.
where the mean error corresponding to the difference between the input vector (x k ) and vector weight (ωBMU).
Several variants of the SOM algorithm have been introduced to overcome its limitations such as the inability of distinct types of variables and/or structure [11,20,21]. One of the extensions relies on the computation of a measure of similarity or dissimilarity as input data where a natural Euclidean structure is not necessarily existent, instead, the dissimilarity between the observations can be described by a dissimilarity measure Δ, where Δ = (δ ij ) i,j = 1,...,n, such that Δ is non negative (δ ij ≥ 0), symmetric, (δ ij = δ ji ) and null on the diagonal (δ ii = 0).. In this case, the Eq. 2 cannot be carried out straightforwardly since the distances between the input data and the prototypes are not be directly computable. The solution is based on the pseudo-Euclidean framework which considers the prototypes as symbolic convex combinations of the original data [11]. In the stochastic version, it is calculated as described in the Eqs. 6 and 7 which are modification of the Eqs. 2 and 3.
where γ u, given u = 1, …, U are the convex combinations of the input data and 1 i is a vector with a single non-null coefficient at the i-th position, equal to one.
Here we apply this technique to reduce the high-dimensionality and find natural patterns in the data by using the package SOMbrero in R [22]. This package provides the use of dissimilarity measures as input data and an ascending hierarchical clustering algorithm on the prototypes of the trained grid (superClass) for visualization of the natural clusters. The workflow of the methodologies applied is summarized in the Fig. 1.

Statistical analysis
For descriptive analyses, frequency and percentages were used for categorical variables. For continuous variables, median, range and IQR were used. Categorical variables were compared by using Chi-square test, whereas Fisher's exact test was performed when the expected table values were smaller than 5. The difference by age and days after the onset of the symptoms among the natural clusters were compared by a one-way ANOVA analysis. Significant differences considered p-value < 0.05. Statistical analyses were performed using the R statistical software R 3.5.1 [23].

Results
From 710 cases reported as dengue during the studied period, 93 were excluded due the lack of laboratorial data and/or missing information and/or had comorbidity, 48 had more than 7 days after the onset of the symptoms, 46 were not able to be classified due to missing information for one or more clinical variables. It resulted in 523 confirmed dengue cases that were used to identify natural patterns in the clinical profiles of patients up to 7 days of disease based on the presence of 30 different variables used for dengue clinical classification. Overall, the average age was 31 years and 49.1% were female. One hundred sixty-two (31%) were children (≤ 18 years old) and 361 (69%) were adults (> 18 years old). Dengue serotypes were identified in 183 (35%) of the patients. The profile of the 523 patients included in this study and serotypes identified are shown in Table 1.
The resulting dissimilarity matrix generated by unsupervised random forest was then used as input to perform the stochastic SOM algorithm. One resulting grid was selected after several trainings based on the final energy and the topographic and quantization errors. The grid with the lowest topographic error (0.05) and quantization error (0.43) was chosen ( Table 3). The lowest topographic error provides the best representation of the data structure on the grid; therefore, it was prioritized.
A hierarchical clustering was then applied on the SOM prototypes to better understand and visualize the structure in the grid resulting in a dendrogram (Fig. 2), from which 4 clusters were considered as the best division. The best division was selected based on a nonparametric MANOVA described by Anderson [24]. This test is a multivariate analogue to Fisher's F-ratio and is calculated directly from any symmetric distance or dissimilarity matrix. The P-values are then obtained using permutations of the observations to obtain a probability associated with the null hypothesis of no differences among clusters. For 4 clusters, the results were F: 6·57 and p-value < 0·001.
The numbers in each node in Fig. 2 correspond to one neuron of the SOM grid and they were grouped according to the clusters defined by the MANOVA analysis (rectangles). These clusters were then all labeled with the patient's number to obtain the classification and clinical profile from the original data (Table 4).
From 523 patients analyzed, 61 were grouped in cluster 1, 124 in cluster 2, 129 in cluster 3 and 209 in cluster 4. According to the specialist's classification (Table 4), 78·7% and 75·2% of the patients in clusters 1 and 3 were classified as Dengue without Warning Signs only (p-value: 5·11e-14). Clusters 2 and 4 had 30·7% and 39·2% of the patients classified as Dengue with warning signs (p-value: 0·001) respectively, however, cluster 4 had the highest percentage of patients classified as severe dengue, which characterize this group as more severe than the others (p-value:5·55e-09). The distribution of the WHO classification by clusters is shown in the Fig. 3.
The distribution of days after the onset of symptoms between clusters was also analyzed (Fig. 4). The patients in cluster 1 were between the 1st and 3rd day of disease. Cluster 3 showed higher frequency of patients between the 2nd and 4th days. Clusters 2 and 4 showed both a higher frequency of patients between 4th -6th and 5th-7th days of disease respectively (p-value < 0.05).
The age distribution between the clusters showed that clusters 1 and 2 concentrate the older patients (age range between 40 and 80 years old), cluster 3 had a higher frequency of young adults (20-40 years old) and cluster 4 consisted mainly of children (5-15 years old) (Fig. 5).
A pairwise analysis was then applied to compare the proportions of each variable between the clusters. The adjusted p-values are shown in the Table 5.
The variables rash, tourniquet positive test, narrow pulse pressure, cold skin, impaired consciousness and AST or ALT> = 1000 did not have any influence on the cluster's division (p-values > 0·05). Alternatively, the variables nausea/vomit and persistent vomiting differentiated all clusters (p-values < 0·001). These variables were also more frequent in children ≤18 years old than in adults ( Table 2).
The patient's status (outpatient/inpatient) had also showed to be significant in the cluster's division except for clusters 1 and 3 that disclosed no difference in the distribution of this variable (p-value: 0·6296).
The variables history of abdominal pain, leukopenia, and warning signs such as abdominal pain or tenderness,  clinical fluid accumulation, mucosal bleeding, lethargy, restlessness, liver enlargement and increase hematocrit associated with a decrease in platelets count were responsible for the cluster 4 definition. The variables that define shock (respiratory distress, rapid and weak pulse and slow capillary filling) were also more significant in distinguishing cluster 4 from the others. The only exception was the variable cold clammy skin/cyanosis that did not show any difference between the clusters. Other variables such as edema and hypotension were responsible only for distinguishing clusters 1 and 3 from cluster 4 but did not show any difference between clusters 1, 2 and 3. Whereas mucosal bleeding distinguished clusters 1 and 3 from 2 and 4, it also differentiated clusters 2 and 4. Cluster 2 shared characteristics with cluster 4 (petechiae) and clusters 1 and 3 (myalgia, arthralgia and pain behind the eyes) which define this cluster as an intermediary profile. Dehydration was the only variable that differentiated cluster 1 from the others including cluster 3 which shares similar characteristics with cluster 1.

Discussion
Combined unsupervised machine learning methodologies were useful to identify natural patterns in clinical dengue data, leading to the identification of four well defined clusters profiles. Two clusters (1 and 3) had more than 70% of the patients classified as Dengue without Warning Signs only (p-value: 5·11e-14), which could be denominated as low-risk patients. Cluster 4 had the highest percentage of patients classified as severe dengue (p-value:5·55e-09) which could be labeled as high-risk group (Table 4). By using similar methodology, Faisal et al. [10] found five natural clusters that could also be clustered in two major clusters as lower risk and higher risk. However, these authors considered only laboratory data (numerical data) of patients in critical phase whereas here we included continuous and categorical variables characterized by the WHO guideline [2] to classify dengue patients in all phases (acute, critical and recovery).
The analysis of all phases of the disease suggests that, besides the rapid evolution of dengue, the transition from dengue without warning signs to dengue with warning and severe dengue happens gradually and it may be linked to the age of the patients. Patients classified as dengue without warning signs (Cluster 1) showed a higher percentage of patients in the acute phase (up to 3 days after the onset of symptoms) whereas cluster 4, with the highest percentage of severity, had a higher percentage of patients between the critical and recovery phases (5-7 days). The same was not observed for clusters 2 and 3. These clusters grouped the patients in the  end of the acute phase (Cluster 3) and in the beginning of the critical phase (Cluster 2) (Fig. 4). Nevertheless, these results should be further explored in future studies including age-dependence of infection and clinical presentation in a longitudinal approach. Age appeared as the most remarkable variable in the cluster's division showing statistical significance between clusters (Fig. 5). While the low-risk cluster 1 showed a higher concentration of older patients (40-80 years old), the high-risk cluster 4 was characterized mostly by children (5-15 years old). This can be explained by dengue's age shifting in 2007 in Brazil, when there was an increase of 53% of severe cases occurring in children under 15 years old, and the association of occurrence of severe dengue and hospitalization in younger patients [26]. By simulating the force of infection of dengue based on an age stratified seroprevalence dataset, Rodriguez-Barraquer et al. [27] proposed that the conditions for the age shifting in Brazil were being set gradually and that they represent the transition from re-emergence to hyperendemicity. Our study suggests a similar association because more than 60% of patients in the high-risk cluster were younger than 10 years old (Fig.  5), so that age must be the first characteristic to be considered in dengue hyperendemic areas such as Brazil. The high-risk cluster was mainly characterized by young inpatients showing signs or symptoms of shock (Table 5), justifying the high rate of hospitalization in this cluster. Warning signs such as clinical fluid accumulation, abdominal pain, leukopenia, mucosal bleeding, lethargy, restlessness, liver enlargement and increase hematocrit (hemoconcentration) associated with the decrease of platelets (thrombocytopenia) were also crucial to discriminate this cluster from the others ( Table 5). In our exploratory analysis we also found some of these symptoms more frequent in children ≤18 years old than in adults ( Table 2). Wakimoto et al. [28] confirmed that abdominal pain, bleeding, lethargy, liver enlargement, hemoconcentration with thrombocytopenia were independently associated with severe dengue in children. Our results confirm these findings and the need for monitoring these parameters in children with dengue.
There was divergent clinical presentation among the low-risk clusters. Besides these clusters sharing a higher frequency of Dengue without warning signs symptoms such as arthralgia, myalgia, and retroorbital pain, the variable dehydration was the only variable discriminating  (Table 5). In a prospective observational study in adults with median 35 years old, Thomas et al. [29] observed that dehydration and electrolyte loss was associated with severe patients with symptoms of presyncope, intense weakness, prolonged gastrointestinal symptoms, hypotension and no evidence of plasma leakage. Indeed, cluster 1 presented the lowest percentage of patients with dehydration and highest percentage of patients classified as Dengue without warning signs, indicating a good prognosis.
Cluster 2 showed intermediary characteristics, holding the second highest percentage of patients with warning signs and severe dengue but also sharing characteristics with the low-risk clusters (myalgia, arthralgia and pain behind the eyes) ( Table 5). Kuo et al. [30] showed that elderly patients with dengue had significantly higher frequencies of vomiting, mucosal bleeding; higher WBC count, AST and ALT levels, and lower platelet count; when compared with their younger counterparts in critical phase. As cluster 2 was characterized by patients varying in age including children and the elderly, this cluster characteristics could represent the extreme age group of patients presenting warning signs. However, the higher percentage of children with warning signs and the lower number of elderlies included in this study makes this assumption not conclusive. Further studies are needed to characterize differences between the clinical profile in these ages.
The major limitation of this study was its cross-sectional design with retrospective data collection. Although patients were prospectively followed, clinical manifestations could have been incompletely recorded, especially among less severe cases. Therefore, the variables tourniquet positive test, narrow pulse pressure, impaired consciousness and AST/ALT levels that showed to not have any influence on the division of the clusters, need to be evaluated in more detail since the frequency of these variables were lower when compared with the others. Alternatively, the diversity of clinical profiles included in this study (ambulatory/hospitalized patients, adults/children) was an advantage, as it allowed visualization of the categorization of the full spectrum of dengue clinical manifestations. The co-circulation of several arboviruses in an endemic area such as Rio de Janeiro is a drawback that should also be considered. This study had the advantage of being conducted before the emergence of Zika and Chikungunya in the country, as they can present similar clinical manifestations.
Lastly, besides the descriptive design, this study was able to identify natural patterns in dengue clinical profile, giving insights of which clinical factors should be carefully considered in a hyperendemic area.

Conclusions
Dengue has a wide profile of clinical manifestations and the complexity of the cases with many overlapping levels of its severity has created many difficulties for the physician to predict the disease progression. Our study showed that computational techniques can be useful to identify patterns in the clinical profile of patients with dengue. Our findings suggest that age must be the first characteristic to be considered. Warning signs such as abdominal pain or tenderness, clinical fluid accumulation, mucosal bleeding, lethargy, restlessness, liver enlargement and increase hematocrit should be closely monitored, mainly in children. Further studies exploring these results in a longitudinal approach could be useful to create models to help clinicians and pediatricians to predict severity in dengue infection, mainly in areas where others arbovirus also circulates.  -5). This study was also financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior -Brasil (CAPES) -Finance Code 001F. The funders had no role in this study design, data collection, analysis, interpretation, or writing of the report. The corresponding author had full access to all data in the study and had final responsibility for the decision to submit for publication.

Availability of data and materials
The datasets generated and/or analyzed during the current study are not publicly available due sensitivity of the data, but the scripts used for the analysis are available from the corresponding author on reasonable request.
Ethics approval and consent to participate