### The empirical data

Two datasets were employed in this study, both of which have been described in detail elsewhere [14, 15]. For the first dataset, all children from three rooms (*n*
_{
c
} = 47) of a day care centre (DCC) with approximately 150 children in Lisbon, Portugal, were due to 11 scheduled nasopharyngeal samples with approximately one month sampling interval from February 1998 to February 1999. No samples were collected in July and August as the DCC was closed for the summer break. There were 16, 15 and 16 attendees in the three rooms, with mean age of 2 years (range 1.2-3.1) at the beginning of the study. Altogether, 80% of the scheduled samples were obtained, resulting in a total of 416 samples from the three DCC rooms.

Approval for the Portuguese study was obtained from the Ministry of Education and the director of the day care centre. Signed informed consent was obtained from the parents or guardians of all children.

For the second dataset, a portion of attendees (*n*
_{
c
} = 61) in all seven rooms in three DCCs with a total of approximately 150 children in the Tampere region, Finland, were due to 10 scheduled nasopharyngeal samples with approximately one month sampling interval from September 2001 to May 2002. In the three DCCs, 25, 18 and 18 children were enrolled, respectively, and 90%, 82% and 99% of the samples were obtained, resulting in a total of 215, 148 and 179 samples. At the beginning of the study, the mean age of the enrolled children was 4.2 years (range 1.2-6.6).

In the Finnish study, informed consent was obtained from parents or guardians of the children. The study was conducted in compliance with the Helsinki Declaration and the ethics committee of the Pirkanmaa Hospital District (PSHP) gave a favourable opinion on the study protocol.

In both studies, calcium alginate swabs were taken from the deep nasopharynx. Pneumococcal identification and serotyping was done following routine procedures as previously described [14, 15]. Only one serotype per positive sample was identified. In the Portuguese dataset, the genotype and antibiotic resistance of pneumococcal isolates were also identified but only the serotype information is used in the present study. Among the few clones that were merged into one serotype, most were segregated in different rooms and represented only few samples. In a reanalysis of samples from one of the Finnish day care centres (rooms F1 and F2), samples were incubated and cultured after an enrichment step. Pneumococci were identified from both blood and blood agar with gentamycin plates and serotyped. In addition, multiple serotypes were searched directly by the Quellung method [16].

### Transmission model

We modelled transmission dynamics of pneumococcal serotypes in children as previously described [6, 13]. Children were assigned a state corresponding to their carriage status. Thus, at any given time, a child was either a non-carrier (state *s* = 0) or carrier of one of the *n*
_{
s
} serotypes (state *s* ∈ {1, …, *n*
_{
s
}}). The rate of acquisition was taken to depend on the prevalence of pneumococci in the DCC room as well as on exposure from outside the room. We assumed that a carrier could only acquire serotypes different from the currently colonising serotype and that the rate of acquisition may be affected by current carriage. All serotypes were assumed to share the same acquisition and clearance rates.

For a non-carrying child *i*, the rate of acquisition of serotype *s* at time *t* was defined as

{\alpha}_{i}^{s}\left(t\right)=\frac{\beta {C}_{i}^{s}\left(t\right)}{{n}^{i}-1}+\kappa

where *n*
^{i} is the number of attendees in the room of the child and {C}_{i}^{s}\left(t\right) is the number of carriers of serotype *s* in his/her room just before time *t*. Parameter *β* is the rate at which one carrier transmits carriage to other children in the room. Parameter *к* is the community force of infection (per serotype), which can be interpreted as the part of acquisition rate which cannot be assigned to observed exposure within the room.

For an individual currently carrying pneumococcus, the rate of acquisition was multiplied by a competition parameter (relative rate) *ϕ*. The clearance rate within child *i* was modelled using a Weibull hazard with shape parameter *ρ* and rate parameter \mu :{\alpha}_{i}^{0}\left(t\right)=\rho {\mu}^{\rho}{\left(t-{t}_{i}^{\mathit{\text{acq}}}\right)}^{\rho -1}, where {t}_{i}^{\mathit{\text{acq}}} is the acquisition time of the carriage episode. The shape parameter ρ of the Weibull distribution was set to value 3 to prefer carriage episodes with lengths close to the mean over very short carriage episodes. In particular, with this choice the median and mean of the Weibull distribution are approximately the same. In summary, the hazard rate for child *i* moving from state *r* to state *s* at time *t* was defined as

{\alpha}_{i}^{r,s}\left(t\right)=\left\{\begin{array}{ll}{\alpha}_{i}^{s}\left(t\right),\hfill & \mathrm{if}\phantom{\rule{0.25em}{0ex}}r=0,\text{i}.\text{e}.\phantom{\rule{0.25em}{0ex}}\mathrm{for}\phantom{\rule{0.25em}{0ex}}\mathrm{acquiring}\phantom{\rule{0.25em}{0ex}}\mathrm{serotype}\phantom{\rule{0.25em}{0ex}}s\phantom{\rule{0.5em}{0ex}}\mathrm{when}\phantom{\rule{0.25em}{0ex}}\mathrm{non-carrier};\hfill \\ \phi {\alpha}_{i}^{s}\left(t\right),\hfill & \mathrm{if}\phantom{\rule{0.25em}{0ex}}r,s>0,\text{i}.\text{e}.\phantom{\rule{0.25em}{0ex}}\mathrm{for}\phantom{\rule{0.25em}{0ex}}\mathrm{acquiring}\phantom{\rule{0.25em}{0ex}}\mathrm{serotype}\phantom{\rule{0.25em}{0ex}}s\phantom{\rule{0.25em}{0ex}}\mathrm{when}\phantom{\rule{0.25em}{0ex}}\mathrm{carrier}\phantom{\rule{0.25em}{0ex}}\mathrm{of}\phantom{\rule{0.25em}{0ex}}\mathrm{serotype}\phantom{\rule{0.25em}{0ex}}r;\hfill \\ {\alpha}_{i}^{0}\left(t\right),\hfill & \mathrm{if}\phantom{\rule{0.25em}{0ex}}s=0,\text{i}.\text{e}.\phantom{\rule{0.25em}{0ex}}\mathrm{for}\phantom{\rule{0.25em}{0ex}}\mathrm{clearing}\phantom{\rule{0.25em}{0ex}}\mathrm{carriage}\phantom{\rule{0.25em}{0ex}}\mathrm{of}\phantom{\rule{0.25em}{0ex}}\mathrm{any}\phantom{\rule{0.25em}{0ex}}\mathrm{serotype}\phantom{\rule{0.5em}{0ex}}r.\hfill \end{array}\right.

(1)

The Portuguese DCC data referred to children from three rooms. Due to the clear temporal separation caused by the summer break, these were analysed as 6 rooms (3 rooms before and 3 rooms after the summer break). The Finnish children corresponded to a total of 7 rooms (2, 3 and 2 rooms the three DCCs). For simplicity, we only specified exposure and transmission within the same room and considered transmission between rooms as part of the general community exposure.

Exposure from the community was assumed to remain constant during the study period. The number of serotypes (*n*
_{
s
}) the children were at risk of acquiring from contacts with people outside the rooms was assumed to be constant and equal to the total number of observed serotypes in the dataset (14 in Portugal, 20 in Finland). This choice affects the estimates of community acquisition rates per serotype but has negligible effect on the overall rate of community acquisition.

### Statistical methods

Adopting a Bayesian latent process approach, a likelihood-based estimation of the model parameters *θ* = {*β*, *κ*, *μ*, *ϕ*} was applied using the same algorithm and setup as Hoti et. al [6] with a new modification which handles the left censoring of the first episodes of carriage at the start of the follow-up (see below). A Markov chain Monte Carlo (MCMC) algorithm was constructed to sample from the joint posterior distribution of the model parameters and carriage histories (unobserved events) compatible with the observed data, including the carriage histories of children with completely missing data (59% of attendees in the Finnish dataset; no child had completely missing data in the rooms of the Portuguese dataset).

Given the carriage histories, i.e. the initial state of carriage and all times at which the state changes for all the children in the rooms, the complete data likelihood can be calculated. For child *i*, let {T}_{i}^{r,s} denote the set of times the carriage status changes from state *r* to state *s* in the time interval ]*t*
_{min}, *t*
_{max}], where *t*
_{min} is one day before the first nasopharyngeal sample and *t*
_{max} is the day after the last nasopharyngeal sample is taken in the dataset. Let {T}_{i}=\left\{{T}_{i}^{r,s}:r,s=1,\dots ,{n}_{s}\right\} be the collection of all times child *i* changes carriage status. The likelihood function of the model parameters *θ*, based on data from *n*
_{
c
} individuals on the time interval ]*t*
_{min}, *t*
_{max}] and defined by model (1), is

\begin{array}{c}p\left({T}_{1},\dots ,{T}_{{n}_{c}}|\theta \right)=\left({\displaystyle \prod _{i=1}^{{n}_{c}}{\displaystyle \prod _{s=0}^{{n}_{s}}{\displaystyle \prod _{r\ne s}{\displaystyle \prod _{t\in {T}_{i}^{r,s}}{\alpha}_{i}^{r,s}\left(t\right)}}}}\right)\\ \phantom{\rule{8.9em}{0ex}}\times \text{exp}\left(-{\displaystyle {\sum}_{i=1}^{{n}_{c}}{\displaystyle \sum _{s=0}^{{n}_{s}}{\displaystyle \sum _{r\ne s}{\displaystyle \underset{{t}_{\text{min}}}{\overset{{t}_{\text{max}}}{\int}}{\alpha}_{i}^{r,s}\left(u\right)\phantom{\rule{0.12em}{0ex}}{I}_{i}^{r}\left(u\right)\phantom{\rule{0.12em}{0ex}}\mathit{\text{du}}}}}}\right)\end{array}

where {I}_{i}^{r}\left(u\right) is the indicator function of child *i* being in state *r* just before time *u*.

The prior for each of the rate parameters (*β*, *к*, *μ*) was taken to be Normal with mean 0 and standard deviation 100, restricted on positive values, independently of the other parameters. The prior for ϕ was assumed to be Exponential with scale parameter 1/ln(2), corresponding to equal *a priori* probabilities for this parameter to be less or more than one.

In our model, the hazard of clearing carriage depends on the time of acquisition. Therefore, assuming that those found as carriers at the beginning of the follow-up had just started carrying may bias the estimation of the clearance rate as the durations of these episodes of colonisation appear shorter than they should. To correct for such left censoring of initial episodes of carriage at *t*
_{min}, we included the unobserved length ℓ of the carrying time preceding time *t*
_{min} into the model. The preceding time ℓ was assigned an Exponential prior distribution with mean 60 days. Sampling of ℓ was done by proposing a new value from the prior distribution and then accepting or rejecting it according to the Metropolis-Hastings algorithm. Since the proposal distribution was taken to be the same as the prior distribution, the acceptance ratio depended only on the complete data likelihood ratio.

The joint posterior distribution was defined as the product of the prior probabilities and the complete data likelihood (cf. [6]). Three separate MCMC chains of length 200,000 were realised for both datasets. The convergence was checked by inspection of the trace plots (Figure 1). After discarding 10,000 initial iterations from each chain, the posterior distribution was investigated from the combined 570,000 iterations, separately for the two datasets. Parameter estimates were given in terms of their posterior means and 90% credible intervals.

### Crude estimates

For comparison, crude estimates of the model parameters were calculated. These estimates were derived from a simplified model of acquisition and clearance by assuming that children acquired or cleared pneumococcus at most once between any two consecutive visits one month apart and that the events could take place only at the end of each one-month time period. Crude estimates of rates were calculated by dividing the appropriate numbers of events (acquisition or clearance) by the respective person-time at risk for the event in question, as summarised in the Appendix (see also [13]).

### Model checking

To check the model fit, data were simulated from model (1) based on a subsample of size 30,000 from the posterior of parameters *θ*. For each of the rooms, the transmission process was simulated for the total number of attendees in the room, with the initial states sampled from the overall state distribution in the respective dataset. The process was simulated for one year, following which monthly samples were taken from all children, corresponding to the number of visits in that room in the dataset. The posterior predictive distributions of the prevalence and crude estimates of the transition rates were then compared to the actually observed prevalence and crude rates. For simplicity, the transitions relating to acquisition were not adjusted for exposure within DCC rooms in this analysis.