The complete data likelihood function
For individual i, denote by
the set of times the carriage status changes from state r to state s in the time interval ]t
min, t
max], where t
max is the day after the last NP sample in the data is taken. Let T
i
= {
, r, s = 1,...,n
s
} be the collection of all times individual i changes carriage status. The likelihood function for n individuals on the time interval ]t
min, t
max] defined by model (1) is
where the unknown model parameters are gathered into the vector θ = {β
fam, β
dcc, κ, μ, φ, η, δ} [33]. Within each day care cohort the transmission of 13 serotypes (n
s
= 13) was considered, which was the maximum number of serotypes observed in one day care cohort during the follow-up.
Prior distributions
The prior distribution of the community acquisition rate κ, the within-family transmission rate β
fam, the within-DCC transmission rate β
dcc, and the clearance parameter μ were assigned a normal distribution with mean zero and standard deviation 3000 (rate per month), constrained to positive values only. The relative acquisition rate η and the relative clearance rate δ were both assigned a uniform prior on the interval [0,10]. For the competition parameter φ we assumed a prior proportional to φ
-1, reflecting equal prior probabilities for the events φ < 1 and φ < 1. In addition the probability to carry a serotype at the beginning of the follow up was fixed to 0.25 for the day care attendees and to 0.10 for the others. The maximum number of carriage episodes per individual was set to 10.
The non-participants
To ensure the correct contact structure and level of exposure, the non-participating members of the day care cohort were included in the statistical analysis. These included the ten family members of participating day care attendees that had no observations during the follow-up. In addition, carriage histories were augmented for 77 non-participating day care attendees and their family members. The family size of the non-participating day care individuals was assumed to be four according to the mean reported family size. Also in line with the observed data, half of the non-participating day care attendees were assumed to have a sibling in the same day care. This resulted in augmenting 9(7), 51(38), and 27(20) day care attendees (families) in DCC1, DCC2, and DCC3, respectively. Thus the analyses are based on 213 participants (at least one NP sample) and 270 non-participants (no NP samples).
Since no measurements were available from the non-participating individuals, their carriage histories relied solely on the model parameters and on the carriage histories of the participating members. Therefore, in the parameter estimation step of the Markov chain Monte Carlo (MCMC) algorithm an ad hoc approach (cf. the "cut" function in WinBUGS User Manual [34]) was adopted, where the information flow from the non-participants was discarded, i.e., the likelihood function (2) was calculated as a product over the participants only. The non-participants were taken into account in determining exposure to pneumococci in the participants.
The Markov chain Monte Carlo (MCMC) algorithm
The MCMC algorithm used to produce estimates of the model parameters was tailor-made using Matlab (version 7.5). The sketch of the MCMC algorithm is as follows:
1. Initialise model parameters θ
2. Initialise latent processes
3. Update parameters θ
a. one at a time, update κ, β and μ
b. update η and δ as a block
4. Update latent processes for a random sample of 20% of the individuals
a. propose a new episode as described below → accept/reject proposal
b. propose moving an event time → accecpt/reject proposal
5. Iterate steps 3 and 4 for a predefined number of rounds
In total the MCMC algorithm was run for 15000 rounds after 5000 burn-in rounds.
Updating the latent processes: proposing a new episode
In the MCMC algorithm for each individual we first initialise a path consistent with the observed panel data. The path H
i
of individual i consists of the carriage status at time t
0 and a series of events times, acquisition/clearance times T
i
, together with the corresponding event types.
At each MCMC round the path of a chosen individual is updated by the following algorithm.
1.Choose randomly one of the sampling intervals [S
v
, S
v+1], v = 1,...,(N - 1), where S
1 and S
N
are the beginning and the end of the follow-up, and S
2,...,S
N-1 are the individuals sampling times in ascending order.
2. Within the chosen sampling interval choose randomly an episode [t
k
, t
k+1], k = 1,...,(M - 1), where t
1 = S
v
and t
N
= S
v+1 are the beginning and the end of the interval, and t
2,...,t
M-1 are the individual's acquisition/clearance times within the interval in ascending order.
3. Define conjoin probabilities P
left
and P
right
(needed later)
a. if S
v
= t
k
, then P
left
= 0, otherwise P
left
= 0.5
b. if S
v+1 = t
k+1, then P
right
= 0, otherwise P
right
= 0.5
4. Propose limits [
,
] for a new episode
a. with probability (1 - P
left
)(1-Pright), pick randomly
,
∈ [t
k
, t
k+1], so that
<
b. with probability P
left
P
right
,
= t
k
and
= t
k+1
c. with probability (P
left
)(1 - P
right
),
= t
k
and pick randomly
∈ ]t
k
, t
k+1]
d. with probability (1 - P
left
)(P
right
), pick randomly
∈ [t
k
, t
k+1[ and
= t
k+1
5. Propose a "sero"type for the new episode
a. if episode [t, t
k+1] is non-carriage episode, propose a serotype randomly from the n
s
possibilities
b. if episode [t
t
, t
k+1] is carriage of one of the serotypes, propose a non-carriage episode
6. Merge similar types
a. if
= t
k
, and the "sero"type of the proposed episode and the previous episode are the same, merge the episodes, i.e.,
= t
k-1
b. if
= t
k+1, and the "sero"type of the proposed episode and the following episode are the same, merge the episodes, i.e.,
= t
k+2
7. In order to calculate the probabilities of the back-proposal we define
and
a. if S
v
=
,
= 0, else
= 0.5
b. if S
v+1 =
,
= 0, else
= 0.5
8. Accept the proposed episode with probability
Where H
i
and
are the present and the proposed path (history) for individual i, M
i
is the observed data for individual i, P is the prior of the complete data (the likelihood function of the model parameters), P
c
is the likelihood function of the complete data (is one if the complete data is consistent with the observed data and is zero otherwise), and Q(u|v) is the probability of proposing path u given path v. The exact form of Q can be derived from steps 1 to 7. For example, if we propose to add a carriage episode [
,
] within a non carriage episode [t
k
, t
k+1], where t
k
<
<
<t
k+1, the proposal probability is
and the back proposal probability is