Mining social mixing patterns for infectious disease models based on a two-day population survey in Belgium

Hens, Niel; Goeyvaerts, Nele; Aerts, Marc; Shkedy, Ziv; Van Damme, Pierre; Beutels, Philippe

doi:10.1186/1471-2334-9-5

Research article
Open access
Published: 20 January 2009

Mining social mixing patterns for infectious disease models based on a two-day population survey in Belgium

Niel Hens¹,
Nele Goeyvaerts¹,
Marc Aerts¹,
Ziv Shkedy¹,
Pierre Van Damme² &
…
Philippe Beutels²

BMC Infectious Diseases volume 9, Article number: 5 (2009) Cite this article

8834 Accesses
77 Citations
14 Altmetric
Metrics details

Abstract

Background

Until recently, mathematical models of person to person infectious diseases transmission had to make assumptions on transmissions enabled by personal contacts by estimating the so-called WAIFW-matrix. In order to better inform such estimates, a population based contact survey has been carried out in Belgium over the period March-May 2006. In contrast to other European surveys conducted simultaneously, each respondent recorded contacts over two days. Special attention was given to holiday periods, and respondents with large numbers of professional contacts.

Methods

Participants kept a paper diary with information on their contacts over two different days. A contact was defined as a two-way conversation of at least three words in each others proximity. The contact information included the age of the contact, gender, location, duration, frequency, and whether or not touching was involved.

For data analysis, we used association rules and classification trees. Weighted generalized estimating equations were used to analyze contact frequency while accounting for the correlation between contacts reported on the two different days.

A contact surface, expressing the average number of contacts between persons of different ages was obtained by a bivariate smoothing approach and the relation to the so-called next-generation matrix was established.

Results

People mostly mixed with people of similar age, or with their offspring, their parents and their grandparents. By imputing professional contacts, the average number of daily contacts increased from 11.84 to 15.70. The number of reported contacts depended heavily on the household size, class size for children and number of professional contacts for adults. Adults living with children had on average 2 daily contacts more than adults living without children. In the holiday period, the daily contact frequency for children and adolescents decreased with about 19% while a similar observation is made for adults in the weekend. These findings can be used to estimate the impact of school closure.

Conclusion

We conducted a diary based contact survey in Belgium to gain insights in social interactions relevant to the spread of infectious diseases. The resulting contact patterns are useful to improve estimating crucial parameters for infectious disease transmission models.

Peer Review reports

Background

Mathematical models of person to person transmission of infectious agents require assumptions of social contact patterns between these persons. Such assumptions, in combination with the age-specific cumulative incidence of infection (for instance based on seroprevalence data [1–4]), enable estimating the so-called "Who Acquires Infection From Whom" (WAIFW) matrix [1]. The WAIFW matrix is central to dynamic transmission models, as are closely related parameters like the basic reproduction number and the force of infection. Over the last decade, some small and unrepresentative studies have explored ways of collecting data on social mixing behaviour relevant to the spread of close contact infections [5–9]. Alternatively, mixing patterns between age groups were estimated in a social network context [10]. Recently, the European commission project POLYMOD published the methods and broadly comparative results of large and representative population based surveys on social contacts recorded on a randomly assigned day in 8 countries (Belgium (BE), England & Wales (EW), Finland (FI), Germany (DE), Italy (IT), Luxemburg (LU), Poland (PL) and The Netherlands (NL)) [11].

The POLYMOD contact survey from Belgium differed from the surveys conducted in the other countries in several aspects. First, each participant registered contacts during 2 random days (one week day and one weekend day), enabling more refined coupled analyses on weekend versus week days. Second, participants could be linked to their geographical location enabling the comparison of contact behavior between regions. Third, a school holiday period was included in the sampling period, enabling more refined analyses on weekend versus holiday period days, and different simulations of the effects of school closure. Fourth, people with more than 20 professional contacts were requested to estimate these separately in terms of number and age to reduce reporting bias for persons with many professional contacts.

In the current paper, we aim to describe and analyze the complete Belgian contact survey in depth. The next section outlines the data collection procedure. We then explain the different methods of data analysis applied in this paper, starting from standard data mining techniques to simulations of the spread of newly emerging infections and the impact of school closure.

Finally, we end with a discussion on how these contact data can be used to improve estimation of infectious disease parameters.

Methods

Data collection

Over the period March – May 2006, 750 persons living in Belgium were recruited by random digit dialing on fixed telephone lines. Only one person per household could participate. In line with the total Belgian population (Chi-square test for equality of distributions, p-value 0.85), participants were recruited from the Flemish (n = 441), Walloon (n = 239) and Brussels geographic regions (n = 70), making sure also that the areas in these regions were representative of the overall urban and rural divide in Belgium. With regards to age, sampling was undertaken in order to obtain an age distribution with 10% in each of the age groups: 0–4, 5–9, 10–14, 15–19 years, and 6% in each of the adult age groups 20–24, 25–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59, 60–64, 65+ years. Most (480) participants were adults aged 18 years and older of whom 50% were male, representative for the Belgian population (p-value 0.70). There were also 130 teenage participants aged 9–17 years (49% males, p-value 0.60) and 140 children aged 0–8 years (56% males, p-value 0.053).

The contact survey conducted as part of this study required participants to complete a survey anonymously, without changing their usual behaviour. No persons were subject to interventions and no physical samples were collected as part of this study. The study protocol was approved by the ethical committee of the Antwerp University Hospital.

Two types of physical contacts were defined: (1) two-way conversations during which at least three words were spoken; (2) contacts which involved skin to skin touching. Each participant was asked to fill in a paper diary recording their contacts during one randomly assigned weekday and one randomly assigned day on the weekend (not always in that order). Teenagers (9–17 y) filled in a simplified version of the diary, and were closely followed up to anticipate interpretation problems (all surveyors had received identical training instructions to accomplish this). For the children (< 9 y) a parent or exceptionally another adult caregiver filled in the diary. In addition to a pilot test in Luxemburg, specific Belgian Dutch and French language versions of each type of diary were made and tested in Belgium (see also [6, 11]).

Information recorded in the diary included the exact or estimated age and gender of each contacted person, the duration of contacts per person over the entire day, frequency (habitual nature) and location or circumstance of contact (multiple options possible). A list of variables used for our analysis is given in Table 1.

Table 1 Classification Tree Variables.

Full size table

The diaries were sent and collected by mail. Each participant was reminded by phone that they had to fill in the diary, one day prior to each assigned day, and was followed up after the first day to check whether they had. If they had not filled in the diary, they were assigned to a new random day. If they had not returned the diary, they were reminded by a maximum of three follow up calls to send it in. If participants repeatedly failed to fill in the diary on their assigned day, they were excluded, and replaced by a new recruit. Each participant received a small token of appreciation for the amount of 5 Euro. All diaries were double entered in a computer database and checked manually.

Virtually all (98%) participants recorded their contacts in the four weeks from March 18 to April 14 2006, and 49% and 73% of the recorded week days and weekend days, respectively, were during the Eastern holiday period. Note that weekends at the start and the end of the holiday period were considered as holiday and represented 72% of all recorded weekend days.

About half the children aged 0–2 attended childcare (55%), and almost all children aged 3–8 years attended childcare or school (93%), while school participation was 100% for participants aged 8–17 years. Adults aged 18 years or older were mostly employed (49%), unemployed (35%) or in further education (9%). Overall, 10% of the participants lived alone and these were all 18 years or older. Nearly 1 in 5 children aged 0–8 years lived in a single parent family, and 26%, 28% and 11% of the participants lived in a household of size 2, 3 and 4, respectively. Larger household sizes were only rarely observed (4%). These characteristics of the sample are all broadly in line with general Belgian population statistics (National Institute for Statistics, 2006, Belgium).

Household contacts

Although in order to keep the diary manageable, we intentionally did not record the relationship with the contacted persons, we extracted household-like contact data from the database by identifying those contacts with the same age as the registered ages of the household members (knowing that the contacts occurred at home and that an exact age was given, and not an estimated age range as for the ages of contacted persons from whom the exact age is unknown). We performed a sensitivity analysis with respect to the selected contacts of the same age.

Work contacts

As mentioned in the introduction, only a subsample of the Belgian contact survey (BE), was described and analyzed by [11], together with those from the seven other POLYMOD countries. These contact surveys, although broadly aiming to record similar outcomes, were not quite the same due to dissimilar implementations in each of the countries (see [11]). One important difference between these surveys was the registration of work contacts.

In half of the POLYMOD surveys (namely BE, DE, FI and NL), a threshold value was set for reporting contacts at work, requiring people with more estimated work contacts than a predefined threshold not to record these contacts in their diary. In the Belgian diaries, the participants were asked prior to filling in their dairy for the first time the following sequence of questions. (1) Do you have a profession through which you have a large number of contacts (e.g., clients, patients, students, etc)? YES/NO; (2) If YES, please make an estimate of the average number of persons you contact professionally per day? .... persons; (3) Please tick in which of these age categories these professional contacts mostly occur (multiple options possible): 0–5 y, 6–11 y, 12–17 y, 18–60 y and over 60 years. (4) If you estimated the number of these contacts at more than 20, then do not record these contacts in the diary, but only record the other (non-professional) contacts. The complete Belgian surveys (including diaries) are available [see Additional file 1, 2, 3, 4, 5, 6 and 7].

Although these questions were asked with the common intention to reduce reporting bias for people with many professional contacts (eg bus drivers), there are some important differences in how we approached this, in comparison to DE, FI and NL. First, based on information from the pilot studies, we set the threshold value at 20, whereas in DE, FI and NL it was set at 10. Second, we asked first if the participants thought they had many professional contacts, and only if they did subjectively think so, how many they estimated these to be. Third, we only revealed in the last question what the consequence of their estimate was for the effort required to complete the diary. Fourth, we did not only ask about the number of professional contacts, but also about their usual age range.

In reporting the results of all the POLYMOD diaries combined, [11] excluded these extra non-recorded contacts, thus presenting, in this respect, underestimates for the number of contacts in BE, DE, FI and NL, in comparison to the other 4 countries. In the current paper, we used imputation to complete the full Belgian data set. However, this too had its limitations, since imputation enabled generating data from which reliable inferences can be made, but could not recreate the values that were not recorded [12].

More specifically, the age of contacted persons provided a basis for imputation [see Additional file 7]: Consider $n_{i}^{w}$ > 20 as the number of work contacts to be imputed for participant i and (s)he indicated that these professional contacts were in specific age-categories denoted by a set $I_{i}^{a}$ (e.g. $I_{i}^{a}$ = [6, 11] for a primary school teacher). We sampled $n_{i}^{w}$ age-values from $I_{i}^{a}$ with probabilities according to the population age-distribution. Contrasting this method with data for EW, IT, LU and PL indicated good performance.

In order to impute the other contact characteristics, plausible assumptions were made based on the available information from other POLYMOD countries. Mossong et al. [11] did not find a significant association between age and whether contacts involved skin-to-skin touching for EW, IT, LU and PL. Therefore we imputed, this variable in the Belgian dataset using the same distribution as when 10 < $n_{i}^{w}$ ≤ 20. The imputation of gender of contacts was performed based on the same reasoning. For the imputation of contact characteristics like duration and usual frequency, we considered it unlikely that a single professional contact in a large set of such contacts would last longer than 4 hours and reoccur daily. Therefore, we imputed these two variables jointly by sampling from the bivariate distribution of duration and frequency for work contacts of participants for whom 10 < $n_{i}^{w}$ ≤ 20. This method could also be more widely applied to all characteristics in an attempt to avoid disrupting dependencies, but could just the same also enforce dependencies. Whereas the imputation for age and gender was well founded (negligible change in distribution with $n_{i}^{w}$ ), other imputations seemed more speculative. Indeed, the higher the recorded $n_{i}^{w}$ was, the shorter the contact durations were, but the distribution of contact frequency remained quite stable. Furthermore, the choice of distribution for the duration of contacts would be subjective since (a) it was clearly unknown for what was missing and (b) the distributions varied substantially between countries. For the remainder of the paper, we focus on this augmented data set but note that the results were similar whenever these imputations were left out, except when estimating the number of contacts. This will be illustrated for one of the methods introduced in the next section.

Statistical methodology

Since the contact survey contains a lot of information, we will, next to some descriptive statistics, use up-to-date statistical methodology to highlight different aspects from the data. We will first describe the use of association rules and classification trees to identify associations and factors related to type, location, frequency and duration of contact. Then we will use weighted generalized estimating equations to properly model the number of contacts taking into account the correlation of contacts recorded by the same participant on two different days. Finally, we relate contact patterns to the spread of infectious disease using the next generation matrix. While this methodology provides more insight into how close contact infectious diseases are spread, we note that we do not aim to describe the stochastic nature of an emerging epidemic.

In this section we present some more detail on the methods used throughout this manuscript and refer to the literature for an extensive description.

Association rules

In large databases, like the contact database, patterns between variables can be discovered by data mining techniques such as association rules [13]. If A, B denote properties of contacts (e.g., frequency, location, touching), then association rules would focus on relations A → B that estimate how likely the event B is given the occurrence of event A. While A could consist of more than one contact property (e.g. a home contact which also involved skin-to-skin touching), B is restricted to be a single property. The length of a rule is the total number of properties constituting that rule. Since numerous combinations of contact properties exist, various interestingness measures could be studied. We focused on the support, the confidence and the lift of a rule. The support of a rule expresses the relative frequency of all contacts in that rule and searching for rules with a high support can be seen as a simplification of 'bump hunting' [14]. The confidence of a rule expresses the conditional probability associated with that rule P(B|A). Finding rules with high confidence is similar to finding rules with a high association between the items, as is finding rules with a high lift value. In the latter case, a higher lift value indicates that the support of the rule is higher than the product of the support of the items in that rule (expressing independence between items [14]).

Classification trees

To gain more insight in factors determining contact intensity (physical contact, location, frequency and duration), the binary classification tree methodology as introduced by [15] was used. This is a recursive partitioning method which is nonparametric in nature, simple and intuitively appealing. At each step, the recursive partitioning algorithm determined an optimal cut off point such that all the contacts were split in two subpopulations to achieve high predictive classification with respect to the variable of interest; touching (yes or no), location (home, school, work, school, transport, leisure or other), frequency (daily, weekly, monthly, a few times a year or first time) and duration (0–5 min, 5–15 min, 15 min – 1 hour, 1–4 hours, more than 4 hours). The resulting subpopulations were split repeatedly until no additional partitioning was warranted: either a subpopulation contained only one class or it was too small to divide any further. The program Rpart [16], implemented in R, was used to generate the decision trees depicting the classification rules generated through recursive partitioning. When growing a tree equal misclassification costs were assigned to the categories of the response variable. Pruning of the trees (to correct for overtraining) was undertaken using the 1 SE rule described by [15] in combination with a maximal tree-depth of 4 layers. Error rates associated with these trees were determined by simple resubstitution, and a 10-fold cross-validation (CV) was used to evaluate the performance of the tree. Table 1 lists the variables used in the tree construction.

Modelling the number of contacts

While [11] used a censored negative binomial distribution to relate the overall number of contacts to different participant characteristics, we used generalized estimating equations (GEE, [17]). This approach allowed us to account for the correlation between the number of contacts recorded by the same individual on the two different days. Weights were included in the analyses, in the first instance to adjust for relative differences in age and household size as compared to Belgian demographic data (source: EUROSTAT), and in the second instance to adjust for differences in sampling proportions with respect to weekdays and holiday periods.

Model building was done a priori, using a nonparametric method called 'random forests' [18]. Random forests were constructed using binary regression trees [15] which relate a response variable, the contact number, to explanatory variables in a similar way as with the classification trees. However, in regression trees, the response is continuous and the objective is to minimise the mean squared error (MSE). In view of the skewness of the distribution of the number of contacts, a log-transform was used. Random forests were constructed by joining several of these regression trees, each based on a random sample of the observations and the explanatory variables, to explicitly take into account the variability associated with the construction of a single tree. A by-product of this random forests methodology, the so-called 'variable importance' list, reflects how often a variable is used as a splitting-criterion. The variables with a high importance, more explicitly using a threshold of 5% increase in MSE, were then selected and used in further model building. We refer the reader to [18] for more details on how 'random forests' are constructed.

Using the most important variables from the 'random forests' importance list, one can opt to look at the interactions between them. Since this inevitably leads to sparse cells, we only analysed interactions with respect to weekdays and holidays and for the remainder only considered main effect terms. We then further reduced the model using a stepwise Poisson regression (with log-link) based on backward selection and the BIC-criterion [19] and finally applied the weighted GEE analysis.

This procedure is applied separately for the three different types of diaries and for all types of contacts, resulting in 57 analyses.

Who mixes with whom?

Consider the random variable Y _ij, i.e. the number of contacts in age class j during one day as reported by a respondent in age class i, which has observed values y _{ij, t}= 1,..., T _i, where T _idenotes the number of participants in the contact survey belonging to age class i. The elements m _ij= E(Y _ij) make up a J × J matrix, which is called the 'social contact matrix'. Now, the contact rates c _ijare related to the social contact matrix as follows:

c_{i j} = 365 \cdot \frac{m_{j i}}{w_{i}}

(1)

where w _idenotes the population size in age class i, obtained from demographic data (EUROSTAT). When estimating the social contact matrix, the reciprocal nature of contacts needs to be taken into account (see [5]): m _ij w _i= m _ji w _j, which means that the total number of contacts from age class i to age class j must equal the total number of contacts from age class j to age class i. We used a negative binomial distribution with mean m _ijand variance $m_{i j} + \frac{m_{i j}^{2}}{k}$ to model the number of contacts, while allowing overdispersion. The following negative binomial likelihood function was maximized in order to estimate the means m _ijand the dispersion parameters k, assuming the Y _ijwere independent (i=1,...,I,j = 1,..., J):

\prod_{i = 1}^{I} \prod_{j = 1}^{J} \prod_{t = 1}^{T_{i}} {[NegBin (y_{i j, t}; m_{i j}, k)]}^{w_{i t}^{d}},

(2)

where $w_{i t}^{d}$ is the diary weight of the t ^thparticipant in age class i. In order to achieve the necessary flexibility and to exploit the continuous nature of the data, we estimated the mean contact rate using one-year age-intervals and a bivariate smoothing approach as described by Wood [20]. The basis was a tensor-product spline ensuring flexibility when modeling the average number of contacts as a function of the respondent's and contact's age. By applying a smooth-then-constrain-approach as proposed by [21], the reciprocal nature of contacts was taken into account.

Mimicking the Spread of an Epidemic and the Impact of School Closure

In order to mimic the initial spread of a newly emerging pathogen in Belgium, we used the next generation operator as defined in (3).

G [i] (a) = \frac{N D}{L} \exp {- \int_{0}^{a} μ (t) d t} \int_{0}^{+ \infty} β (a, a^{'}) i (a^{'}) d a^{'},

(3)

For Belgium, population size (N = 10 547 958), life expectancy at birth (L = 80) and age-specific mortality rates, μ(t) were obtained from official government statistics (source: EUROSTAT). The infectious period was set to five days D = 5/365, similar to the infectious period typically estimated for influenza (see e.g. [22]). We assumed β(a, a'), i.e. the effective transmission rate was proportional to the contact rate c(a, a') as estimated by c _ijusing one-year age-intervals and contacts involving skin-to-skin touching only and chose the proportionality factor such that the largest Eigen value of the next generation operator (3), i.e. the basic reproduction number R ₀, equaled 2 (thus mimicking the emergence of pandemic influenza [23, 24]). Note that daily contacts were down weighted with a factor 1/5 to reflect the unlikely occurrence of transmission of infections during a second such contact. We first focused on a regular period and used the leading eigenvector to simulate the initial spread of a newly emerging pathogen [25]. We investigated the effect of school closure by using contact rates from the non-holiday period, the holiday period and in the weekend (excluding holidays), and comparing the resulting relative impact on R ₀.

Results

Figure 1 shows the histograms of the number of contacts on a log-scale per day in the week, distinguishing between holiday and non-holiday periods. The densities (especially on the original scale) were highly skewed, and more so for the holiday period. Median values were 7–9 for the holiday period and 9–12 for the non-holiday period. Figure 2 shows contact intensity distributions in- and outside households: close or non-close (left figure); duration (middle panel); frequency (right panel), clearly indicating that most intense contacts (darker shading) take place with household members. As a final descriptive statistic, Figure 3 shows boxplots of the number of (close) contacts in and outside households (left upper panel); the number of contacts per location (right upper panel); the number of contacts for the different frequencies (left lower panel) and duration (right lower panel). Whereas most contacts inside a household involve skin-to-skin touching, this is not true outside the household. There is substantial heterogeneity in the number of contacts recorded at work/school and during leisure activities with a high number of contacts mostly observed at work/school. Although there are no apparent differences, the lower left panel in Figure 3 indicates that there is a larger number of contacts with longer duration. Note that most contacts are frequent contacts whereas fewer contacts are yearly or first time contacts.