An infectious disease model on empirical networks of human contact: bridging the gap between dynamic network data and contact matrices

Background The integration of empirical data in computational frameworks designed to model the spread of infectious diseases poses a number of challenges that are becoming more pressing with the increasing availability of high-resolution information on human mobility and contacts. This deluge of data has the potential to revolutionize the computational efforts aimed at simulating scenarios, designing containment strategies, and evaluating outcomes. However, the integration of highly detailed data sources yields models that are less transparent and general in their applicability. Hence, given a specific disease model, it is crucial to assess which representations of the raw data work best to inform the model, striking a balance between simplicity and detail. Methods We consider high-resolution data on the face-to-face interactions of individuals in a pediatric hospital ward, obtained by using wearable proximity sensors. We simulate the spread of a disease in this community by using an SEIR model on top of different mathematical representations of the empirical contact patterns. At the most detailed level, we take into account all contacts between individuals and their exact timing and order. Then, we build a hierarchy of coarse-grained representations of the contact patterns that preserve only partially the temporal and structural information available in the data. We compare the dynamics of the SEIR model across these representations. Results We show that a contact matrix that only contains average contact durations between role classes fails to reproduce the size of the epidemic obtained using the high-resolution contact data and also fails to identify the most at-risk classes. We introduce a contact matrix of probability distributions that takes into account the heterogeneity of contact durations between (and within) classes of individuals, and we show that, in the case study presented, this representation yields a good approximation of the epidemic spreading properties obtained by using the high-resolution data. Conclusions Our results mark a first step towards the definition of synopses of high-resolution dynamic contact networks, providing a compact representation of contact patterns that can correctly inform computational models designed to discover risk groups and evaluate containment policies. We show in a typical case of a structured population that this novel kind of representation can preserve in simulation quantitative features of the epidemics that are crucial for their study and management.

displays some characteristics of the aggregated network of contacts between individuals as gathered by the SocioPatterns infrastructure. The degree of an individual in the aggregated network gives the number of distinct individuals with whom she has been in contact with at least once. The strength of a node is the sum of the durations of all the contacts that node had with other individuals. The detailed properties of the network are reported in Ref. [1].

Fitting the weight distributions for each role pair
As discussed in the main text, for each pair (X, Y ) of role classes we consider the empirical cumulated duration of the contacts between all pairs of individuals x of class X and y of class Y . We used the fitdistr function of the R package MASS (http://www.stats.ox.ac.uk/pub/MASS4), which implements maximum-likelihood fitting of univariate distributions, to fit each empirical distribution with a negative binomial with parameters m and r, where m is the average of the fitted distribution and its variance is s = m + m * m/r. The fitting parameters are given in the main text. Figure 2 shows the comparison between the empirical and the fitted distributions.

An intermediate data representation
In the main text we described several data representations, which range from a very detailed record of the temporally-resolved contacts (DYN) to a much coarser representation in terms of a contact matrix (CM) that contains the average time spent in contact by members of given classes. The CMD (Contact Matrix of Distributions) representation takes into account the heterogeneity of contact durations between pairs of individuals who belong to a given class pair, as it is constructed by fitting the entire distribution of these durations.
Here we consider an intermediate representation between the CM and CMD ones. For each pair of roles (X, Y ) we describe the distribution of weights using a bimodal distribution (instead of the negative binomial used for CMD), where values are either 0 (corresponding to an absence of link) or the average of all actual weights between individuals of roles X and Y . More in detail, let N X and N Y be the numbers of individuals in classes X and Y respectively, E XY the number of links empirically observed between individuals of classes X and Y , and W XY the the total time spent in contact by any individual of class X with any individual of class Y . Then, for each pair of individuals (x, y) with x in class X and y in class Y , x is in contact with y with probability E XY /(N X N Y ), and the weight of the corresponding edge is W XY /E XY . With probability 1 − E XY /(N X N Y ), x and y are not in contact and the associated weight is taken as zero. Notice that for X = Y , N X N Y is replaced by N X (N X −1)/2, the correct number of within-class links. The above representation, which we call CMB (Contact Matrix of Bimodal distributions), takes into account the fact that not all pairs of individuals have been in contact, in contrast with the customary CM representation.

Properties of the different data representations
Tables 1 and 2 report some properties of the weights and topology of the observed contact patterns according to the various data representations we consider. The CMB, CMD and HOM representations all capture the main topological properties of the HET network, as well as the average link weight, while the CMD representation is the only one that accounts for the large dispersion in the cumulated durations of the contacts.

Rescaling the rate of infection to compare the spread over different data representations
As the probability of disease transmission between an infectious and a susceptible individual depends on the time spent in contact, in order to meaningfully compare the evolution of spreading processes over dynamical and static networks, it is necessary to rescale adequately the rate of infection β (see also the discussion in Ref. [2]). Let us consider the contact between an infectious individual A and a susceptible individual B: the probability that the infection is transmitted from A to B during a time interval dt is βdt. On representing contact patterns by means of a static weighted network (e.g., the HOM/HET cases) in which the weight W AB of the link between A and B represents the total duration of the contacts between those nodes, the infection probability over the interval dt needs to be set as βW AB dt/∆T , where ∆T is the total temporal span of the data set, so that we obtain the same average infection probability in all networks.
2 Simulation of epidemic spread

Additional measures
In the main text we focused on the probability of extinction and on the attack rate, as these quantities are the most important to quantify the impact of a disease and to assess the effectiveness of interventions. Other properties can also be measured and can help to assess the differences among the simulated epidemic dynamics based on the different data representations. In particular, here we consider the peak time of the epidemic, the duration of the epidemic, and the respective distributions of these quantities over an ensemble of stochastic realizations of the spreading dynamics. We also consider the issue of the reproductive number R 0 , defined as the expected number of secondary infections from an initial infected individual in a completely susceptible host population [3]. Several methods can be used to compute this quantity [4,5] (40, 41), possibly yielding different estimates [6] for the same epidemiological parameters.
Similarly to other works [7,8], we compute here the number of secondary cases from each single initial randomly chosen infectious individual, and obtain the distribution of these numbers S over an ensemble of stochastic realizations of the dynamics. The basic reproductive number is then approximated by the average S of the number of secondary cases over this distribution.

Simulated spread on the fully connected network
and on the CMB data representation  Figure 4 reports the distributions of the number of secondary cases from the initial seed, averaged over the initial seed of the epidemic and over the    Table 3 gives the value of the average number of secondary cases S for each case. We notice that while Ref. [2] reported similar average values for the DYN and HET cases for the case of a spread in an unstructured population, here we observe a smaller value for the DYN network. For the first set of 7 parameters of the SEIR model, corresponding to slower disease propagation, the difference between the values for DYN and HET (not shown) is smaller. It is important to notice that for both parameter sets, the HET and CMD representations yield very close values of S , while the HOM and CM cases both yield higher values. This is remarkable because the contact matrix of distributions (CMD) is a compact representation that is not individual-based and does not preserve the topology of the empirical contact network, which is instead preserved by the finer-grained HET contact network.

Reproductive number and epidemic peak and ending times
Finally, Figure 5 shows the distributions of the peak and end times of the epidemic. Although the distribution is slightly more peaked at earlier times for the HOM case, no strong differences are observed between the different cases. Each representation is thus able to yield a good estimation of the epidemic timing. 3 Results of the spreading simulations for the second set of parameters of the SEIR model In order to assess the robustness of our results we have considered, as mentioned in the main text, two different sets of parameters for the SEIR model. Here we report the results of spreading simulations with β = 6.9 × 10 −4 s −1 , 1/σ = 1 day, and 1/ν = 2days on the different representations of contact patterns. This set of parameters corresponds to a slower propagation with respect to the one discussed in the main text. Figure 6, similarly to Figure 2A of the main text, reports the global distribution of the fraction of the final number of cases, averaged over all possible seeds, for the various contact patterns representations. The results are very similar to the ones obtained with the other set of parameters, except that the results of the DYN, HET and CMD are even closer. All the other representations lead to a clear underestimation of the probability of having a small final attack rate and, in the case of a large attack rate, to a strong overestimation of the number of final cases (see Table 4). Table 5 breaks down these results according to the role class of the initial seed, and Figure 7 shows the distributions of the final attack rates within each class. Once again, at this very detailed level the similarity between the results for DYN, HET and CMD is clear, while the CM and HOM representations do not capture correctly the relative risks of the various classes.
Finally, Table 4 shows that the peak time and end time of the epidemic are reasonably well approximated even by the coarser representations.    Table 5: Extinction probability (EP) and fraction of runs leading to an attack rate of at most 10% (AR10), for the various contact pattern representations and as a function of the role class of the seed. Parameter values are β = 6.9 × 10 −4 s −1 , 1/σ = 1day, and 1/ν = 2days.

Daily aggregated networks and contact matrices
As mentioned in the main text, based on the dynamical network that describes the contact patterns in the ward of an hospital over a 1-week interval, we have build static networks by aggregating all the interactions that takes place during the course of the week. Therefore, all of the data representations we considered assume that all invididuals are present at all times, and that the contact patterns are the same every day. In Ref. [2], on the other hand, which focuses on contact patterns during a 2-day conference, two different daily networks were constructed. In this case, the epidemic spreading simulations performed over the DYN and HET networks yielded almost indistinguishable results.
Therefore, here we also consider a set of data representations aggregated on the daily scale: for each day we aggregate the contacts recorded during that day and we construct daily HET, HOM, CM, CMD representations. These representations take into account the fact that not all individuals are present each day, and that the patterns of contacts may be different from day to day.
As shown in Figure 8, the SEIR spreading simulations performed on the various data representations yield results that are only slightly different from the ones obtained when the dynamical data is aggregated at the weekly scale. The results obtained for the HET representation become even closer to the ones of the DYN case, especially for the first set of SEIR parameters, in agreement with the results of Ref. [2]. In the case of the CMD representation, the results do not change significantly. Moreover, the results obtained with the HOM and CM representations are still very far from the ones based on the DYN, HET and CMD representations, despite the additional information at the daily scale. Thus, these results highlight the interesting properties of the CMD representation, which, despite using only very parsimonious information, yields results that are very close to the case of the full dynamical representation. 5 Epidemic spreading simulations for fixed S Figure 4 and Table 3 show that the reproductive number R 0 , approximated by the average S over different realizations of the number of secondary cases from one randomly chosen initial infectious individual, is strongly overestimated when using the HOM and CM representation of contact patterns. One could thus ask whether the discrepancies between the results obtained by spreading simulations on the HOM and CM contact patterns on the one hand, and on the HET network on the other hand, may be simply due to differences in S . To check this point, here we study the dynamics of spreading processes at fixed S .

Procedure
We simulate an SEIR spreading process for the HOM and CM representations, calibrating the spreading probability with a rescaling factor p, so that the spreading probability per unit time is pβ. We compute the resulting S as a function of p, as shown in Figure 9. We can thus calibrate the SEIR parameters by choosing rescaling factors p HOM and p CM that lead to the same S measured for the HET network. We call the corresponding cases HOM(c) and CM(c).

Results
Figures 10, 11, 12, and Tables 6 and 7 report the results of the spreading simulations performed on the HOM and CM contact patterns with calibrated spreading parameter, so that the resulting value of S is the same as in the case of the HET network. In this case, although the results are closer to each other when the calibration (rescaling factor p) is not carried out, the attack rate is still overestimated (in particular for the HOM case) and the relative attack rates within the various classes are still not correctly accounted for, especially for CM(c).  Table 6: Extinction probability (EP) and fraction of runs leading to a final AR of at most 10% (AR10) in the calibrated contact pattern representations and as a function of the class of the seed. Parameter values are β = 6.9 × 10 −4 s −1 , 1/σ = 1day, and 1/ν = 2 days.  Table 7: Summary of the average properties of the realizations that lead to a final AR higher than 10%, for β = 6.9 × 10 −4 s −1 , 1/σ = 1day, and 1/ν = 2 days.  Fig. 11 when the final attack rate is larger than 10%, for the HOM and CM representations with calibrated parameters (β = p × 6.9 × 10 −4 s −1 , 1/σ = 1day, 1/ν = 2 days) and for the HET and CMD cases (β = 6.9 × 10 −4 s −1 , 1/σ = 1day, 1/ν = 2 days).