As stated, epidemic curves simulated using the same disease model parameters were assigned to the same group. In addition, each group had its own set of parameters for the same family of distributions, such as the normal distribution; parameter references in the remainder of this section refer to the latter. The number of possible groups and the parameters describing each curve were treated as random variables, and the prior probability distribution for the number of groups was described by a Dirichlet process prior. The Dirichlet process model was used to classify epidemic curves, and the fitted model parameters were used to forecast the epidemic peak.

### Dirichlet process models

The Dirichlet process (DP) serves as a prior in nonparametric Bayesian mixture models. Here, nonparametric implies that distributions drawn from the Dirichlet process have an infinite number of parameters [24]. The Dirichlet process can be defined as follows:

Let H be a distribution over a measurable space *Γ* and *α* a positive real number. For any finite measurable partition *B*_{1},…,*B*_{r} of *Γ*, the random vector (*G*(*B*_{1}),…,*G*(*B*_{r})) has a finite-dimensional Dirichlet distribution. G is a draw from a Dirichlet process with base measure H and concentration parameter *α*, written *G*∼*DP*(*α*,*H*), if:

$$
\left(G(B_{1}),\dots,G(B_{r})\right) \sim \mathrm{Dir}\left(\alpha H(B_{1}),\dots,\alpha H(B_{r})\right)
\tag{1}
$$

The two parameters H and *α* determine the mean and variance of the DP: for any measurable set B⊂*Γ*, *E*[*G*(*B*)]=*H*(*B*) and *V*[*G*(*B*)]=*H*(*B*)(1−*H*(*B*))/(*α*+1). Since *α* acts as an inverse variance, a large *α* implies a small variance, and the DP is concentrated around its mean; as *α*→*∞*, *G*(*B*)→*H*(*B*) for any measurable B. Additionally, draws from a DP are discrete, since G consists of countably infinite point masses, even when H is continuous [25].
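The discreteness of DP draws and the role of *α* can be illustrated with the stick-breaking construction, a standard equivalent definition of the DP not introduced above. The following sketch draws a truncated *G*∼*DP*(*α*,*H*) with a standard normal base measure; all function names and parameter values are illustrative.

```python
import numpy as np

def dp_stick_breaking(alpha, base_sampler, n_atoms, rng):
    """Draw a truncated sample G ~ DP(alpha, H) via stick-breaking.

    Atom locations are i.i.d. draws from the base measure H and the
    weights come from breaking a unit stick; the draw is discrete by
    construction, even though H here is continuous."""
    betas = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining
    atoms = base_sampler(n_atoms, rng)
    return atoms, weights

rng = np.random.default_rng(0)
# Base measure H = N(0, 1); a large alpha spreads mass over many atoms,
# so G(B) concentrates around H(B), as stated in the text.
atoms, weights = dp_stick_breaking(
    alpha=50.0,
    base_sampler=lambda n, r: r.normal(size=n),
    n_atoms=2000,
    rng=rng,
)
# E[G(B)] = H(B): the mass on B = (-inf, 0] should be close to 0.5.
mass_neg = weights[atoms <= 0].sum()
```

With a small *α* the same construction puts almost all mass on a handful of atoms, which is the behavior the grouping model exploits.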

A simple property of the finite-dimensional Dirichlet distribution is that aggregating disjoint partition elements yields another Dirichlet distribution whose parameters are the corresponding sums of the original parameters [13]. This property also holds for the Dirichlet process. Moreover, because samples from a DP are discrete, repeated draws produce ties, which is useful for grouping. The grouping property of the Dirichlet process is best described using the Chinese restaurant process.

#### Chinese restaurant process

The Chinese Restaurant Process (CRP) can be described as follows: consider a restaurant with infinitely many tables, each of which can seat an infinite number of customers. Each customer enters the restaurant and selects a table. In general, the (*n*+1)^{st} customer sits at an occupied table *k* with probability proportional to the number *n*_{k} of customers already at that table, or sits at a new table with probability proportional to *α*. This can be represented as:

$$
X_{n} \mid X_{1},\dots,X_{n-1} \sim
\begin{cases}
X_{k}^{\ast} & \text{with probability } \dfrac{n_{k}}{\alpha+n-1}, \quad k=1,\dots \\[6pt]
\text{new draw from } H & \text{with probability } \dfrac{\alpha}{\alpha+n-1}
\end{cases}
\tag{2}
$$

An important aspect of the CRP is that Chinese restaurants are imagined to have round tables: with *n* customers in the restaurant, the seating arrangement defines both a distribution over permutations of the *n* customers and a distribution over partitions of *n*. The expected number of tables *m* among the *n* customers is given by:

$$
\mathbb{E}\left[\,m \mid n\,\right] = \sum_{k=1}^{n} \frac{\alpha}{\alpha+k-1} \in O(\alpha \log n)
\tag{3}
$$

where *α*/(*α*+*k*−1) is the probability that the *k*^{th} customer takes a new table. Note that the number of tables grows logarithmically in the number of observations; a large *α* will result in a large number of tables a priori.
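Equation (3) can be checked empirically with a short simulation of the CRP seating process; the function names below are illustrative.

```python
import random

def expected_tables(alpha, n):
    """E[m | n] = sum_{k=1}^{n} alpha / (alpha + k - 1), Eq. (3)."""
    return sum(alpha / (alpha + k - 1) for k in range(1, n + 1))

def crp_tables(alpha, n, rng):
    """Seat n customers by the CRP and return the number of occupied tables."""
    counts = []  # counts[k] = number of customers at table k
    for i in range(n):  # the (i+1)-st customer arrives
        r = rng.random() * (i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1  # join table k, probability c / (alpha + i)
                break
        else:
            counts.append(1)    # new table, probability alpha / (alpha + i)
    return len(counts)

rng = random.Random(0)
alpha, n, trials = 1.0, 100, 2000
avg = sum(crp_tables(alpha, n, rng) for _ in range(trials)) / trials
# For alpha = 1 the expectation is the harmonic number H_100, about 5.19,
# consistent with O(alpha log n) growth.
```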

### Semi-supervised Dirichlet process model

The classification problem was formulated using the CRP representation of the Dirichlet process. Epidemic curves represented customers, and the reference type for each epidemic group represented a table in the restaurant; all members of an epidemic group were assigned to the same table. At each classification step there were at most (*k*+1) candidate tables, since a new epidemic curve was either assigned to one of the *k* pre-existing epidemic groups or to a new one, based on the posterior probability of belonging to each group. Initially, we developed a Dirichlet process model for each of the three previously selected parametric distributions (normal, Poisson and negative binomial) and used a Markov chain Monte Carlo (MCMC) procedure [26] for parameter estimation. After comparing the performance of the three models, we chose the normal model because it fit the epidemic curve data well and its parameters were easy to interpret.
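The CRP classification rule described above — assign a new observation to group *k* with probability proportional to *n*_{k} times its predictive likelihood, or to a new group with probability proportional to *α* times the likelihood under the base measure — can be sketched as follows. This is an illustrative one-dimensional example with normal predictive densities, not the paper's curve likelihoods.

```python
import math

def classify_crp(x, groups, alpha, base_loglik, group_loglik):
    """Posterior probability of assigning x to each group or to a new one.

    groups: list of (n_k, params) pairs. Existing group k gets weight
    n_k * likelihood; a new group gets weight alpha * base likelihood."""
    weights = [n_k * math.exp(group_loglik(x, params))
               for n_k, params in groups]
    weights.append(alpha * math.exp(base_loglik(x)))
    total = sum(weights)
    return [w / total for w in weights]  # last entry: open a new group

def norm_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Hypothetical library: two groups with 10 and 5 members respectively.
groups = [(10, (0.0, 1.0)), (5, (6.0, 1.0))]
probs = classify_crp(
    0.2, groups, alpha=0.001,
    base_loglik=lambda x: norm_logpdf(x, 0.0, 10.0),
    group_loglik=lambda x, p: norm_logpdf(x, p[0], p[1]),
)
# An observation near the first group's mean is assigned to it with
# probability close to 1; a small alpha makes new groups rare.
```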

The normal Dirichlet process procedure was implemented in four steps. First, we grouped epidemics with the same disease model parameters (transmissibility value, incubation and infectious period distributions). Second, for each group, we inferred the normal model parameters from the data observed up to each day using the slice sampling procedure. Slice sampling is an MCMC method for drawing random samples from a probability distribution [27]; see Neal [27] for additional information. This procedure is equivalent to parameter estimation in a nonlinear regression framework.
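A minimal univariate slice sampler in the style of Neal's stepping-out procedure [27] might look as follows. This is an illustrative sketch, not the authors' implementation, shown sampling from a standard normal target.

```python
import numpy as np

def slice_sample(logp, x0, n_samples, w=1.0, rng=None):
    """Univariate slice sampler with stepping-out (Neal, 2003).

    logp: log of an (unnormalized) target density; w: initial bracket width."""
    rng = rng or np.random.default_rng()
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        log_y = logp(x) + np.log(rng.random())   # auxiliary "height" variable
        left = x - w * rng.random()              # randomly positioned bracket
        right = left + w
        while logp(left) > log_y:                # step out until outside slice
            left -= w
        while logp(right) > log_y:
            right += w
        while True:                              # shrink bracket until accepted
            x_new = rng.uniform(left, right)
            if logp(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples[i] = x
    return samples

rng = np.random.default_rng(1)
# Standard normal target: logp(x) = -x^2 / 2 up to a constant.
draws = slice_sample(lambda x: -0.5 * x**2, x0=0.0, n_samples=5000, rng=rng)
```

In the paper's setting, `logp` would be the log posterior of the normal curve parameters for one group, sampled one coordinate at a time.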

The nonlinear regression model relating the daily infected count *y*_{j,t} for epidemic curve *j* at time *t* was given by:

$$
y_{j,t} = f(\theta, t) + \epsilon_{t}
\tag{4}
$$

*θ* was the vector of parameters and *ε*_{t} was the random error. *f*(*θ*,*t*) represented the nonlinear basis function, a normal curve with parameters *θ*=(*ϕ*,*μ*,*σ*) given by:

$$
f(\theta, t) = \phi\, e^{-\frac{(t-\mu)^{2}}{2\sigma^{2}}},
\tag{5}
$$

where *ϕ* scaled the height of the function, *μ* was the mean of the function (representing the peak day), and *σ*^{2} modeled the variability of the epidemic curve. We defined *ε*_{t}∼*N*(0,1/*γ*) and used the standard reference priors *p*(*γ*)∝1/*γ* and *p*(*θ*)∝1 to arrive at the posterior:

$$
\begin{aligned}
p(\theta, \gamma \mid y) &\propto \left( \prod_{j=1}^{N} \prod_{t=1}^{T} \gamma^{1/2}\, e^{-\gamma \frac{(y_{j,t} - f(\theta, t))^{2}}{2}} \right) p(\gamma)\, p(\theta) \\
&= \gamma^{(NT-2)/2}\, e^{-(\gamma/2) \sum_{j=1}^{N} \sum_{t=1}^{T} (y_{j,t} - f(\theta, t))^{2}},
\end{aligned}
$$

where *N* was the number of epidemic curves in each group and *T* was the total number of time points. Third, on each day, we calculated the posterior predictive probability that a new curve belonged to each of the groups in the library. Last, we classified the new curve into one of the groups in the library or created a new group. We performed the prediction procedure independently for six hundred simulated epidemic curves.
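For intuition, the posterior above yields a closed-form conditional for the precision: with *p*(*γ*)∝1/*γ*, *γ* | *θ*, *y* ∼ Gamma(*NT*/2, Σ residual²/2). A sketch of the normal basis function of Eq. (5) together with this conditional draw, on hypothetical simulated data, follows.

```python
import numpy as np

def f_normal_curve(theta, t):
    """Eq. (5): phi * exp(-(t - mu)^2 / (2 sigma^2))."""
    phi, mu, sigma = theta
    return phi * np.exp(-(t - mu) ** 2 / (2.0 * sigma**2))

def sample_gamma_given_theta(y, theta, rng):
    """Conditional draw of the precision gamma implied by the posterior.

    With prior p(gamma) ∝ 1/gamma:
    gamma | theta, y ~ Gamma(shape = N*T/2, rate = sum of squared residuals / 2)."""
    t = np.arange(1, y.shape[1] + 1)
    resid = y - f_normal_curve(theta, t)   # broadcasts over the N curves
    shape = y.size / 2.0
    rate = (resid**2).sum() / 2.0
    return rng.gamma(shape, 1.0 / rate)    # NumPy parameterizes by scale

# Hypothetical group of N = 3 curves over T = 30 days, peaking on day 15
# with height 100 and spread sigma = 4; noise sd 2 gives precision 0.25.
rng = np.random.default_rng(2)
theta = (100.0, 15.0, 4.0)
t = np.arange(1, 31)
y = f_normal_curve(theta, t) + rng.normal(0.0, 2.0, size=(3, 30))
gamma_draw = sample_gamma_given_theta(y, theta, rng)
```

Alternating this draw with slice-sampling updates of *θ* gives one possible MCMC scheme for the model; the data above are synthetic and the names are illustrative.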

Additionally, we forecasted the timing of the peak and evaluated the accuracy of the DP model in identifying epidemic curves different from those in the library. As stated, if an epidemic curve differed from the curves in the library, the DP model was expected to create a new group. The number of new groups created by the DP model depends on the choice of *α*. To select an appropriate value for *α*, we performed a sensitivity analysis, perturbing the value of *α* and measuring the prediction accuracy. We set *α* to 0.001 after comparing the error rates of higher and lower values.

### Forecast of peak time of epidemics in the U.S.

We used both the random forest and the DP model to forecast peaks of influenza outbreaks observed in the US from 1997 to 2013. Data on estimated percent ILI were obtained from the CDC influenza surveillance website (http://www.cdc.gov/flu/weekly/pastreports.htm). We divided the data into training and test sets: the training set consisted of yearly ILI data from 1997–2007 and the test set contained data from 2007–2013. Data for the first five influenza seasons started on week 40 of one year and ended on week 20 of the next year; for consistency, we defined the epidemic curve for each influenza season as starting from week 40. The data in the training set were used to train the random forest algorithm and the DP model, whereas the data in the test set were used for predictions. Each of the ILI epidemic curves in the training set was placed in a separate group in the library, and each epidemic curve in the test set was independently classified. We made predictions using sixteen, twenty-five and thirty-three weeks of data; these horizons were selected to fall before and after the peaks of most of the outbreaks, and towards the end of the influenza season. We evaluated forecast accuracy based on prediction of the peak time.
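The week-40 season convention and the peak-time evaluation described above can be sketched as follows. The input format (a mapping from (year, week) to percent ILI) and all names are hypothetical, and 53-week years are ignored for simplicity.

```python
def season_curve(weekly_ili, start_year):
    """Build one influenza season's curve: week 40 of start_year through
    week 20 of the following year (assumes 52-week years).

    weekly_ili: dict mapping (year, week) -> percent ILI (hypothetical format)."""
    weeks = [(start_year, w) for w in range(40, 53)] + \
            [(start_year + 1, w) for w in range(1, 21)]
    return [weekly_ili[yw] for yw in weeks if yw in weekly_ili]

def peak_week_error(predicted_curve, observed_curve):
    """Forecast accuracy as the absolute difference in peak timing (weeks)."""
    pred_peak = max(range(len(predicted_curve)), key=predicted_curve.__getitem__)
    obs_peak = max(range(len(observed_curve)), key=observed_curve.__getitem__)
    return abs(pred_peak - obs_peak)

# Toy season: ILI rises through week 52, then declines into the new year.
toy = {(2007, w): float(w - 39) for w in range(40, 53)}
toy.update({(2008, w): float(max(14 - w, 0)) for w in range(1, 21)})
curve = season_curve(toy, 2007)
```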