A Dirichlet process model for classifying and forecasting epidemic curves
© Nsoesie et al.; licensee BioMed Central Ltd. 2014
Received: 10 July 2013
Accepted: 27 December 2013
Published: 9 January 2014
Skip to main content
© Nsoesie et al.; licensee BioMed Central Ltd. 2014
Received: 10 July 2013
Accepted: 27 December 2013
Published: 9 January 2014
A forecast can be defined as an endeavor to quantitatively estimate a future event or probabilities assigned to a future occurrence. Forecasting stochastic processes such as epidemics is challenging since there are several biological, behavioral, and environmental factors that influence the number of cases observed at each point during an epidemic. However, accurate forecasts of epidemics would impact timely and effective implementation of public health interventions. In this study, we introduce a Dirichlet process (DP) model for classifying and forecasting influenza epidemic curves.
The DP model is a nonparametric Bayesian approach that enables the matching of current influenza activity to simulated and historical patterns, identifies epidemic curves different from those observed in the past and enables prediction of the expected epidemic peak time. The method was validated using simulated influenza epidemics from an individual-based model and the accuracy was compared to that of the tree-based classification technique, Random Forest (RF), which has been shown to achieve high accuracy in the early prediction of epidemic curves using a classification approach. We also applied the method to forecasting influenza outbreaks in the United States from 1997–2013 using influenza-like illness (ILI) data from the Centers for Disease Control and Prevention (CDC).
We made the following observations. First, the DP model performed as well as RF in identifying several of the simulated epidemics. Second, the DP model correctly forecasted the peak time several days in advance for most of the simulated epidemics. Third, the accuracy of identifying epidemics different from those already observed improved with additional data, as expected. Fourth, both methods correctly classified epidemics with higher reproduction numbers (R) with a higher accuracy compared to epidemics with lower R values. Lastly, in the classification of seasonal influenza epidemics based on ILI data from the CDC, the methods’ performance was comparable.
Although RF requires less computational time compared to the DP model, the algorithm is fully supervised implying that epidemic curves different from those previously observed will always be misclassified. In contrast, the DP model can be unsupervised, semi-supervised or fully supervised. Since both methods have their relative merits, an approach that uses both RF and the DP model could be beneficial.
Influenza pandemics result from influenza viruses capable of human-to-human transmission and for which a large global population has little or no pre-existing immunity . A well known example is the 1918 pandemic, which resulted in an estimated 20–50 million deaths worldwide . A similar pandemic today would likely result in higher morbidity and mortality due to increased travel within and between countries, increased urbanization and a growing aging and immunosuppressed population [3, 4]. Reliable forecasts of public health measures (such as peak time, peak height and attack rate) could impact timely and effective implementation of interventions to limit the effect of a pandemic [5, 6].
There were several endeavors towards real-time prediction of the expected peak time, peak height and attack rate during the 2009 pandemic. Examples include studies by Nishiura  and Ong et al. . Nishiura  employed a discrete time stochastic likelihood-based model for forecasting epidemic dynamics in Japan, whereas, Ong et al.  proposed a Bayesian and stochastic compartmental model for real-time epidemic monitoring and forecasting in Singapore. In contrast to the likelihood and Bayesian methods, Nsoesie et al.  recently presented a supervised classification approach for predicting the epidemic curve (defined as the daily/weekly counts of infected persons or influenza-like illness (ILI) cases for the duration of an outbreak) based on the idea of matching ongoing influenza activity to historical and simulated influenza epidemic curves. The supervised classification approach presented by Nsoesie et al.  involved training the model on pre-defined groups of curves and all new curves were assigned to one of the pre-defined groups, which implies curves that are significantly different from those in the training set will always be misclassified.
The individual-based model used in simulating the data in this study is a computational epidemiology model consisting of a disease model and a dynamical network with detailed representation of synthetic populations [9–11]. Disease is transmitted through contacts between susceptible and infectious individuals. Individuals move through four disease states (Figure 1): susceptible, exposed, infectious and recovered. Recovered individuals remain in the population but can no longer spread the disease due to immunity. To classify epidemics stochastically simulated using the individual-based model, we explore the grouping properties of a semi-supervised Dirichlet process model. The Dirichlet process is a nonparametric Bayesian procedure that presents a good solution to this problem since curves different from those in the library can be identified. Since Ferguson  formalized the Dirichlet process as a prior over distributions, there have been several extensions in terms of inference and applications . The Dirichlet process has been proposed as a solution to finding the number of spatial activation patterns in fMRI images , the modeling of unknown number of topics across several corpora of documents , grouping population genetics data , detecting positive selection in protein-coding DNA sequences  etc. Although the Dirichlet process has been used in several studies, to our knowledge it has not been used in a procedure aimed at classifying and forecasting epidemic curves.
The premise of this paper is to validate the functionality of the Dirichlet process model for classifying and forecasting epidemic curves using model-generated epidemic data. The main aims of the study are therefore to: (i) compare the accuracy of the Dirichlet process model to that of random forest, which has been shown to achieve high accuracy in correctly classifying simulated epidemics based on the partial epidemic curve; (ii) forecast the epidemic peak time before the peak is observed and (iii) identify epidemic curves different from those in the library. Additionally, we also forecast influenza outbreaks in the United States from 1997–2013 based on ILI data from the Centers for Disease Control and Prevention (CDC). There are several advantages to using this simulation and classification approach for classifying and forecasting epidemic curves. First, the method captures the temporal trend of the epidemic curve. Second, the epidemic curve can be estimated under different scenarios by implementing changes to the individual-based model. Third, the approach identifies curves similar or different from those in the library.
The individual-based model used in generating the data in this study has been used for studying influenza dynamics and transmission, and evaluating control strategies [8, 9, 18]. The construction of the individual-based model involves the creation of a state-of-the-art behavioral model and a disease model. To create the behavioral model, a synthetic population for a specific geographic region is constructed using United States census data . Each individual is assigned a set of demographic variables such as age, household size, household income etc. Households in the synthetic population are located such that a census of the synthetic population is statistically identical to the real census data at the block level . Each synthetic individual is also assigned a schedule based on data from an activity survey . Synthetic individuals come in contact with other individuals at different activity locations resulting in a dynamic social contact network through which disease transmission occurs. The individual-based model is not the focus of this study since similar data can be simulated using a compartmental SEIR model. However, we use the individual-based model since the overall aim of the project is to forecast the epidemic curve, and investigate the possible effects of interventions and changes in individual behavior during an outbreak. Details of the individual-based model are summarized in the Additional file 1. A detailed description can also be found in .
During most infectious disease outbreaks that confer immunity upon recovery, the number of new cases increases until it peaks and then declines thereafter . Under the assumption of a single peak, we modeled the simulated daily infected-counts using parametric models based on select statistical distributions. We used nonlinear models based on negative binomial, Poisson, Weibull, normal, Pareto, generalized extreme values (GEV) and Cauchy basis functions. The models were selected to allow exploration of different discrete and continuous distributions that appeared to capture different aspects of the epidemic curves. For example, the normal model was selected since it appeared to represent the shape of the simulated epidemic curves. On the other hand, heavy tailed distributions such as the Weibull, Pareto and Cauchy were selected to capture the heavy tails observed in some epidemic curves. In addition, since the epidemic curves were daily counts of infected persons over time, we also used models based on discrete distributions such as Poisson and negative binomial. Finally, the GEV, which is a three parameter (location, scale and shape) family distribution, was chosen to model deviations in the shape of the epidemic curves. We provide explicit details on the model formulation in a later section.
Fit of data to parametric models
Mean of mean
Variance of mean
As stated, epidemic curves simulated using the same disease model parameters were assigned to the same group. In addition, each group also had a different set of parameters for the same family of distributions such as the normal distribution. Parameter references for the remainder of this section refer to the later. The number of possible groups and the parameters describing each curve were considered to be random variables. Furthermore, the prior probability distribution for the number of groups was described by a Dirichlet process prior. The Dirichlet process model was used in classifying epidemic curves and parameters in the fitted model were used in forecasting the epidemic peak.
The Dirichlet process (DP) represents a process prior in nonparametric Bayesian mixture models. Here nonparametric implies that distributions from the Dirichlet process have an infinite number of parameters . The Dirichlet process can be defined as follows:
The two parameters H and α denote the mean: E[ G(B)]=H(B) and inverse variance: V[ G(B)]=H(B)(1−H(B))/(α+1) of the DP respectively for any measurable partition B⊂Γ. Since α represents the inverse variance, when α is large, which implies a small variance, the DP is concentrated around its mean. As α→∞ for any measurable B, G(B)→H(B). Additionally, draws from a DP are discrete since G is discrete with countably infinite point masses, even when H is small .
A simple property of a finite-dimensional Dirichlet distribution is that the sum of the probabilities of disjoint partitions is also a joint Dirichlet distribution whose parameters are sums of the parameters of the original Dirichlet distribution . This property also holds true for the Dirichlet process. Moreover, samples from a DP are discrete, which leads to the observation of ties useful for grouping. The grouping property of the Dirichlet process can be best described using the Chinese restaurant process.
Where α/(α+k+1) is the probability that the k t h customer takes a new table. Note that the number of tables grows logarithmically in the number of observations. A large α will result in a large number of tables a priori.
The classification problem was formulated using the CRP representation of the Dirichlet process. We used the epidemic curves and reference type for each epidemic group to represent customers and tables in the restaurant respectively. Note that all members of an epidemic group were assigned to the same table. At each point of classification, there were at most (k+1) tables, since a new epidemic curve was either classified to a pre-existing epidemic group or to a new one based on the posterior probability of belonging to each of the groups. Initially, we developed a Dirichlet process model for each of the three previously selected parametric distributions (normal, Poisson and negative binomial) and used a Markov Chain Monte Carlo (MCMC) procedure  for parameter estimation. After comparing the performance of the three models, we decided to use the normal model because the model fit well to the epidemic curve data and the model parameters were easy to interpret.
The normal Dirichlet process procedure was implemented in four steps. First, we grouped epidemics with the same disease model parameters (transmissibility value, incubation and infectious period distributions). Second, for each day j, for each group, we inferred normal model parameters using the slice sampling procedure. Slice sampling is also an MCMC method, which enables random sampling from probability distributions . See Neal  for additional information. This procedure is equivalent to parameter estimation in a nonlinear regression framework.
where N was the number of epidemic curves in each group and T was the total number of time points. Third, at each day j, we calculated the posterior predictive probability of a new curve belonging to each of the groups in the library. Last, we classified the new curve to one of the groups in the library or created a new group. We performed the prediction procedure independently for six hundred simulated epidemic curves.
Additionally, we forecasted the timing of the peak and evaluated the accuracy of the DP model in identifying epidemic curves different from those in the library. As stated, if an epidemic curve was different from the curves in the library, the DP model was expected to create a new group. The number of new groups created by the DP model is dependent on the choice of α. To select an appropriate value for α, we performed a sensitivity analysis by perturbing the value of α and measuring the prediction accuracy. We set α at 0.001 after comparing the error rates of higher and lower values.
We used both random forest and the DP model to forecast peaks of influenza outbreaks observed in the US from 1997–2013. Data on estimated percent ILI were obtained from the CDC influenza surveillance website (http://www.cdc.gov/flu/weekly/pastreports.htm). We divided the data into training and test sets. The training set consisted of yearly ILI data from 1997–2007 and the test set contained data from 2007–2013. Data for the first five influenza seasons started on week 40 in one year and ended on week 20 in the next year. For consistency, we defined the epidemic curve for each influenza season starting from week 40. The data in the training set was used to the train the random forest algorithm and the DP model, whereas data in the test set was used for predictions. Each of the ILI epidemic curves in the training set was placed in a separate group in the library and each epidemic curve in the test set was independently classified. We made predictions using sixteen, twenty-five and thirty-three weeks of data. The numbers of weeks were selected to make predictions before and after the peaks of most of the outbreaks, and towards the end of the influenza season. We evaluated accuracy of forecast based on prediction of the peak time.
The results are organized into three sections. In the first section, we discuss the results from classifying epidemic curves similar to those in the epidemic library. In the next section, we present additional features of the DP model; namely the forecasting of the epidemic peak time, and the identification of novel epidemic curves. Finally, we discuss the forecasting of the peak time of seasonal epidemics in the United States using percent ILI data.
Although the DP model performed extremely well in identifying previously discussed epidemic curves, the model encountered difficulties in classifying moderate and mild epidemics (Figure 4(d) and (e)). This was likely due to the fact that mild and moderate epidemics were sandwiched between strong and milder epidemics. Both methods appeared unstable in identifying these epidemics, although the instability was most evident in the DP model. In several instances, moderate epidemics were misclassified as strong and mild, while mild epidemics were misclassified as milder. However, as the epidemics neared their peaks, both methods could distinguish the curves from the other groups. Both methods had an accuracy of over 90% before day 84, which was several weeks from the mean peak day; day 121 for moderate epidemics. On the contrary, the DP model achieved over 90% accuracy around day 98 in the classification of mild epidemics, which had a mean peak day of 136.
The accuracy of classifying partial epidemic curves was higher for simulated epidemics with a higher reproduction number (R). The mean R at the start of the simulated epidemics were approximately 1.42, 1.45, 1.51, 1.59, 1.60 and 1.79 for milder, mild, moderate, strong, severe and catastrophic epidemics respectively. Both methods achieved significantly higher accuracy in the classification of catastrophic, severe and strong epidemics compared to moderate, mild and milder epidemics. In addition, the accuracy of both methods were more reliable and consistent as epidemics neared their peaks. This agrees with other published studies, which suggest that the accuracy of forecasting methods can be sensitive to the point at which forecasts are made [5, 7, 28].
As stated, the DP model has additional features, which are not inherent in the RF approach. The DP model can be used to identify epidemic curves different from those in the library and forecast the expected peak based on parameters estimated using the slice sampling approach. On the other hand, RF is a fully supervised classification algorithm, which implies that epidemic curves that are significantly different from those in the library will always be misclassified. In addition, RF can only forecast the peaks of epidemic curves similar to those in the library.
Simulated epidemics in the various groups peaked within the following time ranges (days): severe [63–72], catastrophic [73–81], strong [94–106], moderate [116–128], mild [127–150] and milder [149–179]. Peak days falling within the true range for severe epidemics were forecasted on day 70 with a 95% credible interval of [67.385–67.726]. For catastrophic epidemics, the peaks were correctly forecasted on day 77 with a 95% credible interval of [78.964–79.544]. Peak timing for strong epidemics were correctly predicted on day 91 with a 95% credible interval of [104.42–105.08]. Moderate, mild and milder peak days were also correctly forecasted in the correct range on days 126, 91, and 133 with 95% credible intervals of [125.07–125.53], [126.57–127] and [149.78–150.1] respectively. After the specified days, the peak time was accurately forecasted.
Random forest correctly forecasted the peak time for 3/5, 2/5 and 3/5 of the seasonal epidemic curves given sixteen, twenty-five and thirty-three weeks of data, respectively. Similarly, the DP model correctly forecasted the peak time for 2/5, 3/5 and 3/5 of the epidemics in the test set given sixteen, twenty-five and thirty-three weeks of data, respectively. This suggests that the methods performed about the same in the forecasting of seasonal epidemics.
In contrast, the 2009–2010 epidemic curve that was observed during a pandemic peaked in week 42 of 2009, which was much earlier compared to the seasonal epidemics. Instead of evaluating accuracy based on prediction of the peak time, we focused on the accuracy of classification. The epidemic curve was incorrectly classified by random forest. Since random forest is a fully supervised classification algorithm, epidemic curves that are significantly different from those in the library will always be misclassified. The DP model identified the epidemic curve as being distinct from those in the library and was also able to accurately model the trend of the curve.
In this study we presented a Dirichlet process model for classifying epidemic curves and forecasting the expected peak time. The focus of this initial study was to establish the accuracy and usefulness of the method. We achieved our main objectives which were to: (i) compare the accuracy of the Dirichlet process model to that of random forest, (ii) forecast the epidemic peak time before the peak was observed and (iii) identify epidemic curves different from those in the library. We also observed that epidemics with higher transmission rates, implying higher R values, were easier to classify.
We acknowledge that there are limitations to using the proposed approach. For instance, using a parametric model to describe the epidemic curve might not be the most suitable option since the shape of the epidemic curve can vary from one epidemic to another. However, modeling the dynamics of epidemics using a nonparametric functional is a viable option (e.g. Gaussian processes, wavelets, or splines). In addition to limitations in the parametric model, inconsistencies in the accuracy of the DP model could also be attributed to the shape of the epidemic curves and the stochasticity inherent in MCMC methods. Since the DP model tries to capture the shape of the curve, it is also significantly influenced by the shape of the curve. In theory, one would expect the accuracy of the DP model to consistently improve with additional data. However, this is not always the case especially in the early stages of an epidemic, when the shape of the epidemic curve is not yet evident. Random forest on the other hand fails to take into account the shape of the curve and is therefore not heavily influenced by the variability introduced due to the shape of the curves.
The performance of random forest is likely due to its classification scheme. In this study, random forest classifies each epidemic curve one-thousand times and the final prediction is based on a voting process . The epidemic group to which the new epidemic is classified the majority of the times is assigned the epidemic curve. Random forest also requires a shorter computation time compared to the DP model. This is because rather than considering every data point in the library and test set, each of the one-thousand classifications is based on a random sample of size from each curve where j is the day of prediction. The number of classifications enables the method to capture different aspects of the curve by drawing different samples each time. Although the results seem to indicate that random forest predicts epidemic curves slightly better than the DP model, the performances are comparable especially in the classification of the real epidemics and epidemics simulated with high transmissibility (R) values.
The results present situations in which the DP model performs well and situations in which it is likely to encounter problems. In most scenarios, the DP model encounters problems when distinguishing epidemic curves that are extremely similar or if the curve shape is not yet visible. However, there are several advantages to using the DP model over random forest. First, the DP model captures the shape and temporal structure which is not the case with random forest. This implies that under the random forest method, the data on day twenty can be moved to day five and the performance of random forest will not be affected. Second, classification algorithms such as random forest can suffer from the curse of dimensionality; the accuracy of the method depreciates as the dimensionality of the data increases. However this does not present an issue in this study. Third, the DP model does not only capture the trend of the epidemic but also the mean and variance of the curve. This information can be used to predict the peak time and the spread of the epidemic curves. Lastly, the DP model can also identify epidemics different from those in the library. However, since there are benefits to using both methods in prediction, an ensemble approach might be beneficial. In addition, classification using ILI data suggest that the performance of both methods could be improved. In future studies, we will explore ways to make the DP method more applicable to real-time forecasting of the epidemic curve.
We thank our external collaborators and members of the Network Dynamics and Simulation Science Laboratory (NDSSL) for their suggestions and comments. This work has been partially supported by Defense Threat Reduction Agency Validation Grant HDTRA1-11-1-0016, HDTRA1-11-D-0016-0001, NIH MIDAS grant, 2U01GM070694-09, NSF OCI-1032677, and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (Do(/NBC) contract number D12PC00337, the U.S. Governmnet is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.