User and API interfaces
Django allows the development of a simple front-end interface (see examples in Fig. 2). This interface allows a user to search the database, see connections between diseases and related surveillance systems, find information about a disease, and see where that information was obtained. In addition to the front-end interface, we implemented a REST API using Django’s REST API framework [28], which allows users to query the database and export the results as JSON or XML. Further, we designed an export of the database to RDF/XML compatible with OWL, the format currently used by ontologists. Our own biosurveillance tools3 take advantage of the database and the API. Others may choose to use other formats (e.g., RDF/XML) as needed. Of note, references are not currently included in the exports or in the API.
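As a rough illustration of the two export formats mentioned above, the following sketch serializes a single disease record to JSON and to XML using only the Python standard library. The record contents, the field names (`name`, `organisms`), and the helper functions `to_json` and `to_xml` are hypothetical; they are assumptions made for illustration and do not reflect the actual database schema or the output of the API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record; the field names are illustrative assumptions,
# not the database's actual schema.
record = {
    "name": "anthrax",
    "organisms": ["Bacillus anthracis"],
}


def to_json(rec):
    """Serialize a disease record to a JSON string."""
    return json.dumps(rec, sort_keys=True)


def to_xml(rec):
    """Serialize a disease record to an XML string."""
    root = ET.Element("disease", name=rec["name"])
    for organism in rec["organisms"]:
        ET.SubElement(root, "organism").text = organism
    return ET.tostring(root, encoding="unicode")


print(to_json(record))
print(to_xml(record))
```

In a Django project the same choice of format would more likely be handled by the framework's own serializers and renderers; the sketch only shows that one record can back both representations.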
Utility for other applications
Using the above methods, we have characterized 280 diseases encompassing 69 animal diseases, 70 human diseases, 55 plant diseases, and 63 diseases that affect both humans and animals (i.e., zoonotic). Figure 2 shows the web-application interface for three such diseases as an example. Both the name and any alternate names are shown, along with the hierarchical disease parent and all relevant organisms. Organisms are listed starting from the most specific information collected (e.g., Bacillus anthracis), followed by all organism parents (e.g., Bacillus). Names are given either as common names (e.g., human) or as scientific names in parentheses (e.g., Homo sapiens sapiens). The anthrax example illustrates a disease with varying levels of organism knowledge: the causal agent is known to the species level, but an exhaustive list of populations that could be infected by anthrax was not available in the literature. Thus, we have specified humans as well as “herbivorous mammals”.
Using this database, we have associated specific diseases, or types of diseases, with relevant biosurveillance resources and disease models in the Biosurveillance Resource Directory [3]4. The anthrax example has 29 associated biosurveillance resources, including various ministries of health and several animal health networks. This allows a user to identify precisely which diseases are related to particular biosurveillance systems, and vice versa.
Limitations
Describing diseases in a useful, extensible, but detailed manner is difficult. We recognize several specific limitations in the current design of our database.
First, there are numerous ways to classify relationships between diseases, and the appropriate classification is difficult to determine and can depend on context and application. Different types of influenza, for example, can be classified by their surface glycoproteins (typical for influenza A) or by their lineage and strain (typical for influenza B) [22, 29]. Other viruses are classified by morphology [30], by the location where the first recognized outbreak occurred (e.g., Ebola) [31], or by other metrics entirely.
Within the field of biosurveillance, this difficulty manifests itself in specific ways. Most surveillance systems are broad enough that they do not discriminate between subcategories of illnesses (e.g., a surveillance system is likely to include all Ebola viruses rather than restrict itself to particular strains). However, those same surveillance systems often want to track the subcategories of common illnesses to discover and study important epidemiological trends. Thus, a correct hierarchy is important in this database.
Currently, most of the diseases included have straightforward parent-child relationships. Most diseases are included in a syndromic category but have few, if any, relationships with other diseases. Influenza is the current exception, with subcategories that include “avian Influenza A” and “Swine Influenza”. The next iteration of the database should be expanded to include more specific relationships (e.g., influenza A H5N1 as a child of “avian influenza A”). We plan to follow standard practice for hierarchies, based on practices accepted in the literature (e.g., influenza B will be described by lineages, and influenza A by glycoproteins). It is highly likely that situations will arise in which a child belongs to multiple subcategories. Fortunately, the current database architecture makes such relationships simple to represent. Hierarchies can also be refined as epidemiological practices change.
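To make the multiple-parent case concrete, the sketch below models a disease hierarchy in which a child may have more than one parent, and collects all ancestors of a given disease. It is a minimal illustration under stated assumptions, not the database's actual architecture: the class `DiseaseHierarchy`, its method names, the placeholder disease “strain X”, and the particular parent assignments are all hypothetical.

```python
from collections import defaultdict


class DiseaseHierarchy:
    """Hypothetical many-to-many parent/child structure for diseases."""

    def __init__(self):
        # Maps each child to the set of its direct parents.
        self._parents = defaultdict(set)

    def add_relationship(self, child, parent):
        self._parents[child].add(parent)

    def ancestors(self, disease):
        """Return every direct and indirect parent of a disease."""
        seen = set()
        stack = [disease]
        while stack:
            current = stack.pop()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


hierarchy = DiseaseHierarchy()
# Illustrative assignments only; "strain X" is a placeholder, not a real entry.
hierarchy.add_relationship("avian influenza A", "influenza")
hierarchy.add_relationship("swine influenza", "influenza")
hierarchy.add_relationship("influenza A H5N1", "avian influenza A")
hierarchy.add_relationship("strain X", "avian influenza A")
hierarchy.add_relationship("strain X", "swine influenza")

print(sorted(hierarchy.ancestors("strain X")))
```

Storing parents as a set per child, rather than a single parent field, is what allows a child to sit under several subcategories without any change to the lookup logic.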
Second, requirements for this database were identified through our team’s specific needs with respect to other biosurveillance tools. We believe this framework and the resulting database are useful more broadly. However, it is possible that our list of requirements was not exhaustive. As additional work is done in this field, requirements will likely be modified and added. The framework supports such extension. Interview-based studies with surveillance system users, public health analysts, and epidemiologists would be of tremendous use in identifying further requirements.
Third, diseases are not currently associated with particular geographic locations. Geospatial analyses are critically important to disease surveillance, especially as diseases emerge, re-emerge, or develop various types of antibiotic resistance. However, associating a disease with specific locations can be difficult because it inherently requires some temporal association. For example, a geographic field could describe whether (1) the disease has ever been present, (2) the disease has been present within the past N years, (3) the disease is currently present, or (4) the disease is projected to be present soon (within N years). All of these might provide useful information, but designing the related database components requires careful thought.
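One possible way to capture the four temporal meanings listed above is to store an explicit status alongside each disease-location pair. The sketch below is only an assumption about how such a field might look; the names `PresenceStatus` and `DiseaseLocation`, the `window_years` parameter, and the location “Region A” are hypothetical and are not part of the current database.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PresenceStatus(Enum):
    EVER_PRESENT = 1              # (1) disease has ever been present
    PRESENT_WITHIN_N_YEARS = 2    # (2) present within the past N years
    CURRENTLY_PRESENT = 3         # (3) currently present
    PROJECTED_WITHIN_N_YEARS = 4  # (4) projected to be present within N years


@dataclass(frozen=True)
class DiseaseLocation:
    disease: str
    location: str
    status: PresenceStatus
    window_years: Optional[int] = None  # the "N" for statuses (2) and (4)

    def __post_init__(self):
        # Statuses defined relative to N years are ambiguous without N.
        needs_window = self.status in (
            PresenceStatus.PRESENT_WITHIN_N_YEARS,
            PresenceStatus.PROJECTED_WITHIN_N_YEARS,
        )
        if needs_window and self.window_years is None:
            raise ValueError("window_years is required for this status")


# Placeholder values for illustration only.
entry = DiseaseLocation(
    "anthrax", "Region A", PresenceStatus.PRESENT_WITHIN_N_YEARS, window_years=5
)
print(entry)
```

Making the status explicit forces each record to say which of the four meanings it carries, which is the ambiguity the paragraph above describes.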
Last, the current process for developing this database relies substantially on manual curation by a team of biologists and public health experts. This has allowed us to include a level of detail in the database that we believe is beneficial. However, we also recognize that maintaining the database requires a substantial number of hours.