Computational models and multi-scale numerical simulations represent essential tools in the understanding of the epidemic spread of infections, in particular to study scenarios, and to design, evaluate and compare containment strategies. The applicability of such computational studies crucially depends on informing transmission models with actual data. We are currently witnessing an important evolution as more and more data on human mobility and behavioral patterns become accessible [1–9]. For instance, data on human travel patterns and mobility have been fed into large-scale models of epidemic spread at regional or planetary scale, providing important modeling and prediction tools [10–14].

When considering smaller scales, the knowledge of contact patterns among individuals becomes relevant for identifying transmission routes, recognizing specific transmission mechanisms, and targeting groups of individuals at risk with appropriate prevention strategies or interventions such as prophylaxis or vaccination [15]. Several properties of the contact patterns are known to bear a strong influence on spreading patterns, such as the topological structure of the contact network, the presence of individuals with a particularly large number of contacts, the frequency and duration of contacts, and the existence of communities [16–24].

The simplest approach to the description of contact patterns is the homogeneous mixing assumption, which postulates that each individual has an equal probability of having contacts with any other individual in the population [25]. A widely used refinement of this approach consists in dividing individuals into classes (corresponding for instance to different age groups or to different social activities), and defining a contact matrix between classes in terms of the average number (or duration) of the contacts that individuals in one given class have with individuals in another given class. These matrices are constructed using data from questionnaires or diaries, time-use data [23, 26–31], and more recently by sensing room co-presence and close-range proximity between individuals [32, 33].
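As an illustration of the contact matrix construction described above, the following minimal sketch computes, for each pair of classes, the average cumulative contact duration per pair of individuals. The record format `(i, j, duration)`, the `role_of` mapping and the per-pair normalization are assumptions made for the example; other normalizations (for instance, per individual) are equally common.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def contact_matrix(contacts, role_of):
    """Average cumulative contact duration per pair of individuals, by class pair.

    contacts: iterable of (i, j, duration) records (hypothetical format).
    role_of:  dict mapping each individual to its class label.
    """
    # Total contact time accumulated between each unordered pair of classes.
    total = defaultdict(float)
    for i, j, duration in contacts:
        key = tuple(sorted((role_of[i], role_of[j])))
        total[key] += duration

    # Number of individuals in each class.
    size = defaultdict(int)
    for role in role_of.values():
        size[role] += 1

    matrix = {}
    for a, b in combinations_with_replacement(sorted(size), 2):
        # Number of distinct pairs with one individual in class a and one in class b.
        n_pairs = size[a] * (size[a] - 1) // 2 if a == b else size[a] * size[b]
        matrix[(a, b)] = total[(a, b)] / n_pairs if n_pairs else 0.0
    return matrix
```

Each matrix element is a single average, so two class pairs with the same mean but very different spreads of contact durations are indistinguishable in this representation.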

The use of models with structured populations can help to define more refined analytical approaches [34]. In addition, it makes it possible to perform numerical simulations of epidemic spread in synthetic structured populations with fixed rates of contacts between individuals belonging to different classes, given by the (possibly empirical) contact matrix elements [10, 12, 35–37]. These simulations can also be used to compare vaccination strategies targeting specific groups and to estimate which strategies are most effective [35–41]. The use of contact matrices for modeling contact patterns relies on a set of restricted homogeneous mixing assumptions within each class and on the representativeness of the average mixing behavior between classes. However, such an approach neglects the strong fluctuations that are usually observed in the distributions of the numbers and durations of contacts between two individuals of given classes, which are often modeled with negative binomial distributions [23, 29, 31, 42, 43].

Overall, in the context of a specific modeling problem, little is known about the level of detail that should be incorporated in modeling contact patterns. Coarse representations such as the homogeneous mixing assumption leave out crucial elements, but are analytically tractable and can provide a coarse understanding of epidemic processes. More realistic approaches are, however, needed when the aim is to predict with quantitative accuracy the outcomes of specific scenarios, to target groups of individuals who are most at risk, and to quantitatively evaluate and rank interventions and containment strategies according to their effectiveness. In these cases it is crucial to achieve realistic simulations to estimate quantities such as the extinction probability or the attack rate. However, very detailed representations may critically lack transparency about the specific role of the many modeling assumptions they incorporate, and might yield unnecessarily fine-grained predictions. For instance, in order to evaluate the relative efficacy of targeted vaccination of different groups of persons, we do not need to know the risk of infection of each individual: it is sufficient to estimate the average attack rate in each group.

As technological advances keep enhancing our ability to gather high-resolution data on proximity and face-to-face interactions [4–9] and to integrate such data in computational models, understanding which data representation is most adequate becomes crucial, and the answer may depend on the specific epidemic process under study as well as on the specific goals of the modeling approach. This points to general problems for data-rich scenarios such as those involving wearable sensors in real-world settings: What amount of information about individual interactions is appropriate for a given modeling task, so that the relevant information is retained but the model stays as parsimonious as possible [44]? What are the most useful synopses of high-resolution contact network data?

Several studies have started to tackle such issues within the framework of the description of human interactions as static or dynamic contact networks, in the case of unstructured populations [22, 24, 45–48]. It is known that the heterogeneity of contact durations is a crucial element that needs to be taken into account. On the other hand, in a particular case [48] it was shown that the dynamics of an SEIR (Susceptible, Exposed, Infectious, Recovered) process over an aggregated network that only takes into account daily contact durations achieves a good approximation of the transmission dynamics over the full time-varying contact network with a temporal resolution of the order of minutes.

To date, the more complex case of a structured population has not been investigated under this perspective of data summarization. Here we address this problem motivated by the needs of applications and interventions aimed at the containment of diseases, such as identifying groups of individuals that need to be prioritized in deploying containment strategies such as vaccination and prophylaxis.

The aim of the present paper is therefore to understand whether and how different representations of the contact patterns within a structured population might lead, in simulation, to different estimates for the outcomes of an epidemic process and for the attack rates within groups of individuals: How faithful to the high-resolution empirical data are the summarized representations of contact patterns, in terms of the spreading process they yield and of the risk groups they identify? Given a specific epidemic process to be modeled, what is the optimal tradeoff between the compactness of the data representation and the amount of detail retained in the model? The answer to these questions lies in designing compact yet informative representations of contact patterns that can be used to model epidemics in structured communities (e.g., the case of nosocomial infections in hospitals). In such contexts, the design and evaluation of prevention measures or containment policies are naturally based on targeting specific classes of individuals. For example, it is possible to prioritize the vaccination of a certain class of individuals (e.g., nurses in a hospital [35]), whereas no simple policy can be based on a list of specific individuals. Therefore, it is important to build data representations that yield accurate results at the level of groups of individuals and that, at the same time, can be generalized in order to support generic conclusions on the relative efficacy of different strategies, and in particular of strategies targeting only high-risk groups.

Here we put forward such a representation and validate its appropriateness and usefulness in a case study of a structured population. To this aim, we leverage a high-resolution dataset on the face-to-face proximity of individuals in a hospital ward, collected by the SocioPatterns collaboration using wireless wearable sensors. The population is structured in different classes that correspond to different roles in the hospital ward. The collected data, described in detail in Ref. [33], include the detailed time-ordered sequence of contacts between all pairs of subjects.

We consider various representations of the hospital ward data, corresponding to different types and degrees of aggregation. At the most detailed level, the raw data can be viewed as a dynamic contact network, where all the available information on the interactions between pairs of individuals (contact times and durations) is explicit. We then construct static networks that project away the temporal structure of the contacts but retain the identity of each individual and the structure of her contact network. At a coarser level, we consider the customary contact matrix representation of the data, which aggregates along the class dimension: All individuals belonging to the same role class are grouped together, and the contact rate between two individuals of given classes is given by the corresponding element of the contact matrix. This representation, by definition, discards many heterogeneities of contact behavior at the individual level as well as the heterogeneities among contacts. Between the contact matrix view and the individual-based network view, other intermediate representations are possible. Here we introduce a novel representation based on a contact matrix of distributions, designed to retain the heterogeneous properties of contact durations among pairs of individuals belonging to given role classes.
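To make the idea of a contact matrix of distributions concrete, the sketch below stores, for each pair of classes, the full list of cumulative contact durations observed between pairs of individuals (including zeros for pairs that never met), and then draws a synthetic weighted network from these lists. Resampling the empirical durations is an assumption made purely for illustration and is not necessarily the construction used in this paper; a parametric fit per class pair would be another option. The record format `(i, j, duration)` and the `role_of` mapping are likewise hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def duration_distributions(contacts, role_of):
    """For each class pair, list the cumulative contact duration of every
    pair of individuals (0.0 for pairs that were never in contact)."""
    pair_total = defaultdict(float)
    for i, j, duration in contacts:
        pair_total[frozenset((i, j))] += duration

    dists = defaultdict(list)
    for i, j in combinations(sorted(role_of), 2):
        key = tuple(sorted((role_of[i], role_of[j])))
        dists[key].append(pair_total.get(frozenset((i, j)), 0.0))
    return dict(dists)


def sample_network(dists, role_of, rng):
    """Draw a synthetic weighted network: each pair of individuals receives a
    duration resampled from the distribution of its class pair."""
    weights = {}
    for i, j in combinations(sorted(role_of), 2):
        key = tuple(sorted((role_of[i], role_of[j])))
        w = rng.choice(dists[key])
        if w > 0:
            weights[(i, j)] = w  # pairs drawn with zero duration get no link
    return weights
```

The mean of each stored list equals the corresponding element of a per-pair contact matrix, so this representation contains the usual contact matrix while also keeping the spread of durations within each class pair. The identity of the individuals involved in each link is still discarded.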

We then use an SEIR process to model the spread of an infectious disease in a structured hospital population whose contact patterns are described by using all of the above representations, computed from the raw proximity data. We simulate the distribution of attack rates for the various role classes and compare these outcomes against individual-based simulations that take into account the most detailed description of the empirical contacts. Our results highlight similarities and differences in the dynamics obtained for various aggregation levels of the contact pattern representation. In particular, they show that coarse representations that do not take into account the heterogeneity of contact durations, such as the usual contact matrix, yield poor estimates of the global attack rate. Such representations also misclassify role classes in terms of their relative risk and could therefore lead to incorrect prioritization for interventions such as targeted vaccination or prophylaxis. The novel representation that we put forward is shown to strike a promising balance between simplicity and generalizability on the one hand, and the ability to yield accurate evaluations of risk and a correct prioritization of groups of individuals in the design of targeted measures on the other.
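The kind of simulation referred to above can be sketched as a discrete-time stochastic SEIR process on a static weighted network, followed by the computation of per-class attack rates. This is a minimal sketch under stated assumptions, not the simulation protocol of the paper: the parameter names `beta`, `sigma` and `mu`, the infection probability `1 - exp(-beta * exposure)` per time step, and the `weights` dictionary keyed by pairs of individuals are all choices made for the example.

```python
import math
import random
from collections import defaultdict


def simulate_seir(weights, nodes, seed_node, beta, sigma, mu, rng, max_steps=10_000):
    """Run one stochastic SEIR outbreak on a weighted static network.

    weights: dict mapping (i, j) to the contact duration per time step (assumed format).
    Returns the set of individuals ever infected, including the seed.
    """
    neighbors = defaultdict(dict)
    for (i, j), w in weights.items():
        neighbors[i][j] = w
        neighbors[j][i] = w

    state = {n: "S" for n in nodes}
    state[seed_node] = "I"
    for _ in range(max_steps):
        # The outbreak is over once nobody is exposed or infectious.
        if not any(s in ("E", "I") for s in state.values()):
            break
        new_state = dict(state)
        for n, s in state.items():
            if s == "S":
                # Contact time with currently infectious neighbors.
                exposure = sum(w for m, w in neighbors[n].items() if state[m] == "I")
                if rng.random() < 1.0 - math.exp(-beta * exposure):
                    new_state[n] = "E"
            elif s == "E" and rng.random() < sigma:
                new_state[n] = "I"
            elif s == "I" and rng.random() < mu:
                new_state[n] = "R"
        state = new_state
    return {n for n, s in state.items() if s != "S"}


def attack_rate_by_class(infected, role_of):
    """Fraction of each class infected in a single run."""
    size = defaultdict(int)
    hit = defaultdict(int)
    for n, role in role_of.items():
        size[role] += 1
        hit[role] += n in infected
    return {role: hit[role] / size[role] for role in size}
```

Repeating `simulate_seir` over many random seeds and seed individuals, and collecting the output of `attack_rate_by_class` for each run, gives a distribution of attack rates per class that can be compared across contact representations.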