A Dirichlet process model for classifying and forecasting epidemic curves

Background: A forecast can be defined as an endeavor to quantitatively estimate a future event or the probabilities assigned to a future occurrence. Forecasting stochastic processes such as epidemics is challenging because numerous biological, behavioral, and environmental factors influence the number of cases observed at each point during an epidemic. Accurate forecasts of epidemics, however, would support the timely and effective implementation of public health interventions. In this study, we introduce a Dirichlet process (DP) model for classifying and forecasting influenza epidemic curves.

Methods: The DP model is a nonparametric Bayesian approach that matches current influenza activity to simulated and historical patterns, identifies epidemic curves unlike those observed in the past, and predicts the expected epidemic peak time. The method was validated using influenza epidemics simulated with an individual-based model, and its accuracy was compared to that of Random Forest (RF), a tree-based classification technique that has been shown to achieve high accuracy in the early prediction of epidemic curves. We also applied the method to forecasting influenza outbreaks in the United States from 1997–2013 using influenza-like illness (ILI) data from the Centers for Disease Control and Prevention (CDC).

Results: We made the following observations. First, the DP model performed as well as RF in identifying several of the simulated epidemics. Second, the DP model correctly forecasted the peak time several days in advance for most of the simulated epidemics. Third, the accuracy of identifying epidemics unlike those already observed improved with additional data, as expected. Fourth, both methods classified epidemics with higher reproduction numbers (R) more accurately than epidemics with lower R values. Lastly, the methods performed comparably when classifying seasonal influenza epidemics based on ILI data from the CDC.

Conclusions: Although RF requires less computational time than the DP model, the algorithm is fully supervised, implying that epidemic curves different from those previously observed will always be misclassified. In contrast, the DP model can be unsupervised, semi-supervised, or fully supervised. Since both methods have their relative merits, an approach that uses both RF and the DP model could be beneficial.

The model consists of two components: (i) a synthetic social contact network representing detailed contacts between individuals and (ii) a dynamical model that simulates the spatial spread of disease and the effectiveness of public health interventions. The synthetic social contact network is constructed using various open source and commercially available data combined with social and behavioral theories. The synthetic social contact network of an urban population is a particular kind of random network that is statistically comparable to a realistic social contact network and preserves the anonymity of individuals. To construct the network, a synthetic population is first created using an iterative proportional fitting technique. The synthetic population consists of synthetic people with assigned demographic attributes based on data from the US census. Each individual is placed in a household, and each household is located in a realistic geographical location such that, when aggregated at the block group level, the synthetic population is statistically identical to the original census data [2][3][4][5].
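The iterative proportional fitting step can be sketched in a few lines. The following Python example is purely illustrative (the seed table and marginals are made up): it repeatedly rescales a survey-derived seed contingency table until its row and column sums match target census margins.

```python
import numpy as np

def ipf(seed, row_targets, col_targets, tol=1e-6, max_iter=100):
    """Iteratively scale a seed table so its margins match the targets."""
    table = seed.astype(float).copy()
    for _ in range(max_iter):
        # Scale rows to match the row marginals (e.g., household-size counts).
        table *= (row_targets / table.sum(axis=1))[:, None]
        # Scale columns to match the column marginals (e.g., income brackets).
        table *= col_targets / table.sum(axis=0)
        # Columns now match exactly; stop once rows also match.
        if np.abs(table.sum(axis=1) - row_targets).max() < tol:
            break
    return table

# Toy example: a 2x2 survey seed fit to hypothetical census margins.
seed = np.array([[30.0, 20.0], [10.0, 40.0]])
fitted = ipf(seed, row_targets=np.array([60.0, 40.0]),
             col_targets=np.array([45.0, 55.0]))
```

The fitted table preserves the association structure of the survey seed while reproducing the census margins, which is the sense in which the synthetic population is "statistically identical" to census data at the aggregate level.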
Next, each household is allotted activity templates by time of day, based on several thousand responses to an activity or time-use survey for a specific region. The activity templates provide a detailed description of activities for each household member throughout the day. Using a decision tree based on demographics such as the number of workers in the household, the number of children of various ages, etc., each synthetic household is matched to and assigned the activity template of a household in the survey. Each activity performed by individuals in each household is assigned a location based on land-use patterns, tax data, etc., and the assigned locations are calibrated against data on travel-time distributions. These steps result in a synthetic population representing individuals and their activity patterns in a specified urban region. Synthetic individuals in the population interact with each other at various activity locations to produce realistic contact graphs, where vertices represent individuals and edges represent contacts between individuals [6].
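The final step above, deriving a contact graph from co-located activities, can be sketched as follows. This Python example is illustrative only: the people, locations, and schedules are hypothetical, and edges are weighted by the duration of overlapping stays at the same location, which is the contact-duration quantity the disease model uses.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical activity schedule: (person, location, start_hour, end_hour).
visits = [
    ("alice", "school", 8, 15),
    ("bob",   "school", 9, 14),
    ("carol", "office", 9, 17),
    ("dave",  "office", 8, 12),
    ("bob",   "office", 15, 17),
]

# Group visits by location, then connect every pair whose stays overlap.
by_location = defaultdict(list)
for person, loc, start, end in visits:
    by_location[loc].append((person, start, end))

contacts = {}  # (person_a, person_b) -> total overlap in hours
for loc, stays in by_location.items():
    for (p1, s1, e1), (p2, s2, e2) in combinations(stays, 2):
        overlap = min(e1, e2) - max(s1, s2)
        if overlap > 0:
            key = tuple(sorted((p1, p2)))
            contacts[key] = contacts.get(key, 0) + overlap
```

Here vertices are people and each weighted edge records how long two people were co-located, matching the contact-graph description in the text.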
In addition to the time-varying social contact network, a dynamical model that simulates the spatial propagation of disease is also developed. The model is based on a Susceptible, Exposed, Infectious, Recovered (SEIR) representation. Individuals progress through the disease states according to probabilistically timed incubation and infectious periods. The transitions between disease states can be affected by the attributes of the individuals (such as age and health status) and the type of contact (casual or intimate). The probability of transmission between a susceptible individual i and an infectious individual j is given by:

p(i, j) = 1 − (1 − r)^w(i, j)

where w(i, j) represents the contact duration and r is the disease transmission rate, defined per unit of contact time. Each individual in the model has a separate disease model, such that at each time step of a simulation an individual is either susceptible, exposed, infectious, or recovered. See [5], [2], and [13] for examples.
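A minimal sketch of the exposure step, assuming the transmission probability takes the common network-SEIR form p(i, j) = 1 − (1 − r)^w(i, j) described in the text. The function and state names below are illustrative, not taken from the authors' simulation code.

```python
import random

def transmission_probability(r, w):
    # Probability that a contact of duration w with an infectious
    # individual transmits, given transmission rate r per unit contact time.
    return 1.0 - (1.0 - r) ** w

def exposure_step(state, contacts, r, rng=random):
    """One time step: susceptible individuals may become exposed through
    weighted contacts (duration w) with infectious individuals."""
    newly_exposed = []
    for (i, j), w in contacts.items():
        for sus, inf in ((i, j), (j, i)):
            if state[sus] == "S" and state[inf] == "I":
                if rng.random() < transmission_probability(r, w):
                    newly_exposed.append(sus)
    for person in newly_exposed:
        state[person] = "E"  # exposed; a timer would later move E -> I -> R
    return state
```

Note that the probability increases monotonically with contact duration w and equals r for a unit-duration contact, consistent with r being a per-unit-time rate.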

Contacts between infectious and susceptible individuals
The approaches used in constructing this model can be found in several publications. See [7], [8], and [9] for information on urban population mobility models. See [10], [11], [12], [13], and [14] for information on disease transmission models and the natural history of the disease. For further information on contact networks, see [12], [5], and [15].

Random Forest
Random Forest is an extension of bagging, an approach for combining several predictors to reduce the variance of an estimated prediction function [16,17]. Hastie et al. [16] define the random forest algorithm as follows:

1. For b = 1 to B:
   (a) Draw a bootstrap sample of size N from the training data.
   (b) Grow a random-forest tree T_b on the bootstrapped data by recursively repeating the following steps for each terminal node, until the minimum node size is reached: (i) select m variables at random from the p variables; (ii) pick the best variable/split-point among the m; (iii) split the node into two daughter nodes.
2. Output the ensemble of trees {T_b; b = 1, …, B}.

For classification, a new point is assigned the majority vote of the B trees. Advantages of random forest include efficiency on large databases and estimation of variable importance [18]. For the analysis in this paper, we used the randomForest package in R [19].
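The analysis itself used the randomForest package in R; purely as a language-neutral illustration of the bootstrap-plus-random-feature-subset idea above, the following Python sketch builds a forest of one-level decision stumps instead of full trees (all names and the toy data are made up).

```python
import numpy as np

def fit_stump(X, y, features):
    """Best single-feature threshold split among the candidate features."""
    best, best_acc = None, -1.0
    for f in features:
        for t in np.unique(X[:, f]):
            pred = (X[:, f] > t).astype(int)
            acc_plain = (pred == y).mean()
            acc_flip = (1 - pred == y).mean()  # allow inverted labels
            if max(acc_plain, acc_flip) > best_acc:
                best_acc = max(acc_plain, acc_flip)
                best = (f, t, acc_flip > acc_plain)
    return best

def fit_forest(X, y, n_trees=25, m=2, seed=0):
    """Step 1: for each tree, bootstrap the data and restrict the
    split search to m randomly chosen features."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)              # (a) bootstrap sample
        feats = rng.choice(p, size=m, replace=False)  # random feature subset
        forest.append(fit_stump(X[idx], y[idx], feats))
    return forest

def predict(forest, X):
    """Step 2: classify by majority vote over the ensemble."""
    votes = np.zeros(len(X))
    for f, t, flip in forest:
        pred = (X[:, f] > t).astype(int)
        votes += (1 - pred) if flip else pred
    return (votes > len(forest) / 2).astype(int)

# Toy data: feature 0 separates the classes; feature 1 is uninformative.
X = np.array([[0, 4], [0, 4], [0, 4], [10, 4], [10, 4], [10, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
forest = fit_forest(X, y)
```

Real random forests grow deep trees and re-draw the m candidate features at every split; the stump version keeps only the two ingredients that distinguish the method from plain bagging.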