epiPATH: an information system for the storage and management of molecular epidemiology data from infectious pathogens
© Amadoz and González-Candelas; licensee BioMed Central Ltd. 2007
Received: 11 November 2006
Accepted: 20 April 2007
Published: 20 April 2007
Most research scientists working in molecular epidemiology, population genetics and evolutionary genetics must manage large volumes of data. Moreover, the data used in studies of infectious diseases are complex and usually come from different institutions such as hospitals or laboratories. Since no public database schema incorporating clinical and epidemiological information about patients and molecular information about pathogens is currently available, we have developed an information system, composed of a main database and a web-based interface, which integrates both types of data and satisfies requirements of good organization, simple accessibility, data security and multi-user support.
From the moment a patient arrives at a hospital or health centre until the processing and analysis of molecular sequences obtained from infectious pathogens in the laboratory, a great deal of information is collected from different sources. We have divided the most relevant data into 12 conceptual modules around which we have organized the database schema. The schema covers sample sources, samples, laboratory processes, molecular sequences, phylogenetic results, clinical tests and results, clinical information, treatments, pathogens, transmissions, outbreaks and bibliographic information. Communication between end-users and the selected Relational Database Management System (RDMS) is carried out by default through a command-line window or through a user-friendly, web-based interface which provides access and management tools for the data.
epiPATH is an information system for managing clinical and molecular information from infectious diseases. It facilitates daily work related to infectious pathogens and the sequences obtained from them. The software is intended for local installation in order to safeguard private data, and it gives advanced SQL users the flexibility to adapt it to their needs.
The database schema, tool scripts and web-based interface are free software, but the data stored in our database server are not publicly available. epiPATH is distributed under the terms of the GNU General Public License. More details about epiPATH can be found at http://genevo.uv.es/epipath.
Infectious disease is defined as any illness caused by a specific pathogenic microorganism or its toxic product that results from the transmission of that agent or its product from an infected to a susceptible host, either directly or indirectly through an intermediate. Many different infectious pathogens cause diseases of this kind, and some of them can occur as multiple infections in the same individual, as in patients coinfected with Hepatitis C Virus (HCV) and Human Immunodeficiency Virus (HIV).
Molecular genetic information is increasingly used in the epidemiological study of infectious diseases to understand transmission, virulence or resistance patterns of microorganisms, pathogen evolution and host-pathogen interactions [3, 4], pathogen resistance to treatments, susceptibility to disease, etc. In the study of pathogenic processes, different kinds of demographic, epidemiological, clinical and molecular data are needed to obtain a complete view and a more comprehensive understanding of the infectious disease. In most cases, studies of this kind are carried out by different institutions, sometimes geographically dispersed, which collaborate by contributing different types of information.
To study the interaction between a patient and a pathogen from a population, evolutionary and epidemiological perspective, a large number of molecular sequences is needed, and current molecular techniques provide this kind of data with increasing affordability. An essential requirement for such studies is having all these data properly organized and accessible [5, 6] to all researchers involved. While a study is still in progress, users may not want their information to be publicly accessible. These points, together with multi-user support, are central aspects that must be taken into account when choosing software for these tasks.
Until now, software for molecular epidemiology studies of infectious diseases has been devoted either to the storage of information on patients or, in the case of public database repositories or internal flat files, to the storage and retrieval of pathogen sequences. Technologies for developing an information system are now available as free software, and researchers are able to build their own data management systems. But this is a time-consuming, costly process which requires considerable expertise. Consequently, and since there is no public database schema that incorporates clinical and epidemiological information about patients and molecular information about pathogens, we have developed a flexible information system, composed of a main database and a web-based interface, that integrates both types of data and satisfies requirements of good organization, simple accessibility, data security and multi-user support.
An information system consists of a group of elements related according to some rules, which provides the necessary information to fulfil its purposes. The goals of such an information system are the collection, processing, storage, production and presentation of data. Our information system is composed of a main MySQL database and a web-based interface.
Conceptual modules of the database schema.
Bibliographic references, authors, researchers and research groups
Diseases, symptoms, signs, risk factors, protector factors, clinical information, pathogen's tests, vaccines and vaccinations
Laboratory processes, extractions, regions, primers, amplifications, sequencing and typing
Alignments and phylogenetic trees
Patients, environments, hospitals/health centers and patients' information
Samples and storage
Test and Results
Tests1, laboratories, test results and results
Treatment names and dates
In designing the database we have taken into account infectious pathogens of all kinds, whether viruses or bacteria, so the system can store information from different organisms at the same time. This is especially useful in cases of multiple infections, when studies involve more than one pathogen per patient and all relevant information (clinical, epidemiological and molecular) should be available. In these situations, it is easier for users and system administrators to deposit and control all the information in the same database schema than to use separate, independent schemas. Moreover, this pathogen-independent design contributes to the versatility of the software, which can be used in different ways as needed by different users. For example, epiPATH can be used for studies of a single pathogen or of several pathogens simultaneously: a user may start by studying a single pathogen, such as HCV, and later begin a new project with HCV/HIV coinfected patients or with a completely new pathogen. In this case there is no need to modify the software, and the learning curve is kept to a minimum. Furthermore, although the database schema has been designed for multiple pathogens, it can be used as a database for individual pathogens on a per-user basis by means of the Views utility of MySQL. A view is a virtual or logical table defined by the result of a query. Views allow different kinds of access to the same database schema, whether by user, by pathogen or by any other database characteristic of interest. Views can be created as needed by the system administrator, and they also contribute to the security and integrity of the stored data by preventing a user from editing or deleting another user's data.
The conceptual modules into which the information is divided add versatility to the software: users can choose which kind of information is relevant for their studies and use only that part of the database schema. Moreover, advanced SQL users can adapt it to their needs since epiPATH is distributed under the GNU license.
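The per-pathogen use of views described above can be illustrated with a minimal sketch. It assumes a much simplified, hypothetical pair of tables (`pathogen`, `sample`) and a hypothetical view name (`hcv_sample`); none of these are taken from the actual epiPATH schema. Python's built-in SQLite module stands in for the MySQL server so that the example is self-contained.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server described in the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified tables (not the real epiPATH schema).
    CREATE TABLE pathogen (
        pathogen_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL UNIQUE
    );
    CREATE TABLE sample (
        sample_id   INTEGER PRIMARY KEY,
        patient_ref TEXT NOT NULL,
        pathogen_id INTEGER NOT NULL REFERENCES pathogen(pathogen_id)
    );
    -- Hypothetical per-pathogen view: exposes only the HCV rows.
    CREATE VIEW hcv_sample AS
        SELECT s.sample_id, s.patient_ref
        FROM sample AS s
        JOIN pathogen AS p ON p.pathogen_id = s.pathogen_id
        WHERE p.name = 'HCV';
""")

with conn:  # commit the illustrative rows
    conn.executemany(
        "INSERT INTO pathogen (pathogen_id, name) VALUES (?, ?)",
        [(1, "HCV"), (2, "HIV")],
    )
    conn.executemany(
        "INSERT INTO sample (sample_id, patient_ref, pathogen_id) VALUES (?, ?, ?)",
        [(1, "P001", 1), (2, "P001", 2), (3, "P002", 1)],
    )

# A user restricted to the view sees the HCV samples only,
# even though patient P001 is coinfected and also has an HIV sample.
hcv_rows = conn.execute(
    "SELECT sample_id, patient_ref FROM hcv_sample ORDER BY sample_id"
).fetchall()
print(hcv_rows)
```

On a MySQL server, the administrator would presumably complete this arrangement by granting each user privileges on the relevant views rather than on the underlying tables.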
The molecular epidemiology database consists of 50 tables and was built using the MySQL RDMS running on a Fedora Linux system. We used InnoDB tables because they support transactions. It is currently implemented in MySQL Server version 5, but it has also been implemented in earlier versions of MySQL Server without any problem. The database design and the scripts that create the schema were developed with DBDesigner 4. Scripts were debugged and loaded into the RDMS with MySQL Query Browser. These tools were selected because of their free availability and wide use, as well as their reliability and flexibility to run under different environments.
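Transaction support matters when one logical operation touches several tables. The following sketch shows the idea under stated assumptions: the `sample` and `sequence` tables and the function `store_sample_with_sequences` are hypothetical, and SQLite (via Python's standard library) is used in place of MySQL with InnoDB. Either a sample and all of its sequences are stored, or nothing is.

```python
import sqlite3

def store_sample_with_sequences(conn, sample_id, sequences):
    """Insert one sample and all its sequences, or nothing at all."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("INSERT INTO sample (sample_id) VALUES (?)", (sample_id,))
            conn.executemany(
                "INSERT INTO sequence (sample_id, residues) VALUES (?, ?)",
                [(sample_id, seq) for seq in sequences],
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified tables (not the real epiPATH schema).
    CREATE TABLE sample (sample_id INTEGER PRIMARY KEY);
    CREATE TABLE sequence (
        sequence_id INTEGER PRIMARY KEY,
        sample_id   INTEGER NOT NULL,
        residues    TEXT NOT NULL CHECK (length(residues) > 0)
    );
""")

ok = store_sample_with_sequences(conn, 1, ["ACGT", "ACGA"])
# The empty sequence violates the CHECK constraint, so sample 2 is rolled back too.
failed = store_sample_with_sequences(conn, 2, ["ACGT", ""])

sample_count = conn.execute("SELECT COUNT(*) FROM sample").fetchone()[0]
sequence_count = conn.execute("SELECT COUNT(*) FROM sequence").fetchone()[0]
print(ok, failed, sample_count, sequence_count)
```

Without a transaction, the failed call would leave sample 2 and its first sequence behind, which is the kind of partial record a transactional storage engine avoids.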
Web-based interface design
There are two ways to enter data into the database: users can upload a file containing all the data to be inserted, or they can fill in a specific web form. The former method is appropriate when the user wants to add many records to a database table, and the latter when the number of records to insert is reasonably low. There is a specific web form for nearly every conceptual module of the database schema. Searches can be performed in three ways: through specific web forms, as with the insertion tools; by entering a complex query, which offers a wide range of options for constructing an appropriate query; or by obtaining a report of a single table from the database. Reports make it possible to retrieve information already stored in each database table without knowing exactly what to look for. For security reasons, data can be updated or deleted only one record at a time. When the web-based interface is used, all data changes (insertions, deletions and updates) are recorded in another database that can only be accessed by the system administrator. This feature improves the security of our information system, and it is also distributed with the software.
In molecular epidemiology, the information needed for a study can originate from different sources, such as hospitals/health centres and laboratories. As a consequence, some inconsistencies in the data might appear, mostly related to the limits and genotype of sequences. To overcome this problem, we have developed and installed scripts in the web-based interface that warn users about the existence of inconsistent data.
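The combination of single-record changes and a separate change log can be sketched as follows. This is only an illustration, not the epiPATH implementation (which runs as web scripts against MySQL): the table names, the `change_log` columns and the function `delete_sample` are assumptions, and a second SQLite database attached as `audit` plays the role of the administrator-only database.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# A second, separate database holds the change log (hypothetical layout).
conn.execute("ATTACH DATABASE ':memory:' AS audit")
conn.executescript("""
    CREATE TABLE main.sample (sample_id INTEGER PRIMARY KEY, storage TEXT);
    CREATE TABLE audit.change_log (
        log_id      INTEGER PRIMARY KEY,
        username    TEXT NOT NULL,
        client_host TEXT NOT NULL,
        changed_at  TEXT NOT NULL,
        action      TEXT NOT NULL CHECK (action IN ('INSERT', 'UPDATE', 'DELETE')),
        table_name  TEXT NOT NULL,
        record_id   INTEGER NOT NULL
    );
""")

def delete_sample(conn, username, client_host, sample_id):
    """Delete exactly one sample and log who, where and when in one transaction."""
    with conn:
        cur = conn.execute(
            "DELETE FROM main.sample WHERE sample_id = ?", (sample_id,)
        )
        if cur.rowcount != 1:
            return False  # nothing deleted, so nothing is logged
        conn.execute(
            "INSERT INTO audit.change_log "
            "(username, client_host, changed_at, action, table_name, record_id) "
            "VALUES (?, ?, ?, 'DELETE', 'sample', ?)",
            (username, client_host, datetime.now(timezone.utc).isoformat(), sample_id),
        )
    return True

with conn:
    conn.execute("INSERT INTO main.sample (sample_id, storage) VALUES (7, 'freezer A')")

first = delete_sample(conn, "alice", "10.0.0.5", 7)
second = delete_sample(conn, "alice", "10.0.0.5", 7)  # record no longer exists
log_rows = conn.execute(
    "SELECT username, action, table_name, record_id FROM audit.change_log"
).fetchall()
print(first, second, log_rows)
```

Addressing the record by its primary key keeps each change limited to one row, which mirrors the one-record-at-a-time restriction described above.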
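The text does not spell out the checks these scripts perform. As a hedged illustration only, the sketch below assumes two plausible rules: a sequence's coordinates must lie within the limits of the genome region it is assigned to, and its genotype must match the genotype recorded for the sample it comes from. The classes, field names and coordinate values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Region:
    name: str
    start: int  # first position of the region (hypothetical coordinates)
    end: int    # last position of the region

@dataclass(frozen=True)
class SequenceRecord:
    sequence_id: str
    region: Region
    start: int
    end: int
    genotype: str         # genotype assigned to the sequence in the laboratory
    sample_genotype: str  # genotype recorded for the sample by the clinical source

def find_inconsistencies(rec: SequenceRecord) -> List[str]:
    """Return human-readable warnings; an empty list means no problem was found."""
    warnings = []
    if rec.start > rec.end:
        warnings.append(f"{rec.sequence_id}: start {rec.start} is after end {rec.end}")
    elif rec.start < rec.region.start or rec.end > rec.region.end:
        warnings.append(
            f"{rec.sequence_id}: limits {rec.start}-{rec.end} fall outside region "
            f"{rec.region.name} ({rec.region.start}-{rec.region.end})"
        )
    if rec.genotype != rec.sample_genotype:
        warnings.append(
            f"{rec.sequence_id}: sequence genotype {rec.genotype} differs from "
            f"sample genotype {rec.sample_genotype}"
        )
    return warnings

region = Region("hypothetical-region", 1000, 1500)
good = SequenceRecord("seq-1", region, 1000, 1500, "1b", "1b")
bad = SequenceRecord("seq-2", region, 900, 1400, "1a", "1b")

good_warnings = find_inconsistencies(good)
bad_warnings = find_inconsistencies(bad)
for message in bad_warnings:
    print("WARNING:", message)
```

Returning warnings instead of rejecting the record matches the behaviour described in the text, where users are alerted to inconsistent data rather than blocked.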
Web-based interface architecture
A first level of security implemented in this software is user identification at login. Whether through the command-line shell or the web-based interface, the user must be identified with a username and a password, both corresponding to an account on the local MySQL server. A second level of security is provided by the possibility of using Views to allow different users to access and manage different parts of the database, whether by pathogen, by project or by any other selective criterion. We distribute this software both with and without this utility implemented in the web-based interface and the database schema. With the Views utility it is possible to create more levels of virtual tables and therefore to increase the security level as desired. A third level is a log-out script implemented in the web-based interface that alerts users after 25 minutes of inactivity and automatically ends the current session 5 minutes later if inactivity continues. A fourth level is a secondary database schema, accessible only to the system administrator, that stores information about who performed an insertion, deletion or update through the web-based interface, and where and when it was performed. A final security level for the data is a backup of the entire database, which must be performed by the local system administrator.
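The inactivity rule of the third security level (a warning after 25 minutes, session end 5 minutes later) can be expressed as a small decision function. This is a sketch of the timing logic only; the function name `session_state` and the state labels are assumptions, and the actual log-out script in the web-based interface is not reproduced here.

```python
from datetime import datetime, timedelta

WARN_AFTER = timedelta(minutes=25)               # alert the user after this idle time
END_AFTER = WARN_AFTER + timedelta(minutes=5)    # end the session 5 minutes later

def session_state(last_activity: datetime, now: datetime) -> str:
    """Classify a session as 'active', 'warn' or 'expired' from its idle time."""
    idle = now - last_activity
    if idle >= END_AFTER:
        return "expired"
    if idle >= WARN_AFTER:
        return "warn"
    return "active"

start = datetime(2007, 4, 20, 9, 0, 0)
state_10 = session_state(start, start + timedelta(minutes=10))
state_26 = session_state(start, start + timedelta(minutes=26))
state_30 = session_state(start, start + timedelta(minutes=30))
print(state_10, state_26, state_30)
```

Any user action would reset `last_activity`, so a session in the warning state returns to active as soon as the user resumes work.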
Results and discussion
There is a need for an information system to manage data obtained from different molecular epidemiology studies related to infectious pathogens. We aim to fill this gap efficiently with our database schema and software tools. epiPATH is an information system for managing clinical and molecular information on infectious diseases. It is available for download and local installation, in order to help researchers in related fields with their daily laboratory tasks and with collaborative projects involving other institutions, while keeping the information out of public access given the sensitivity of the data under analysis. We provide both the database schema and the web-based interface, with and without the Views utility implemented; both are available from the epiPATH website. Additionally, a complete manual for users and administrators and detailed documentation of the database schema can be found at the same site.
Once the information system was implemented, we tested it with data from the Hepatitis C Virus and its associated disease. Sequences and clinical data were derived from previous studies in our research group. Tests included 73 samples, 11952 viral sequences and clinical information from 88 patients.
epiPATH is designed as local software for laboratories and institutions interested in the molecular epidemiology, population genetics and evolutionary biology of infectious diseases. The database schema covers the relevant information in these fields, which makes it a complete and versatile tool. Because its tables are grouped into conceptual modules, it can be used in full or only for those kinds of information in which a researcher is actually interested. Being designed for multiple pathogens, it can be used as a virtual database for a single pathogen or on a per-user basis by means of the Views utility. At the same time, it holds all data in a single schema for those cases in which multiple infections and comparisons between different pathogens are relevant. In addition, having a single piece of software for many different users and projects helps the internal daily work and data organization of different institutions. The web-based interface is an easy-to-use tool for managing users' data stored in a local MySQL database server. The database schema and the web-based interface of epiPATH are open source software under the GNU license and can be modified by end users as needed.
Summary of relevant features of the participants in the usability test of epiPATH.
Number of participants
100% PhD students in Biology
50% Male, 50% Female
Previous experience with databases
83.33% Yes, 16.66% No
Previous experience with SQL language
Previous experience with epiPATH
Previous experience in navigation through web sites
Summary of results in the usability test. Five tasks were considered: (1) entering into epiPATH interface, (2) adding data into epiPATH database through our interface, (3) searching data through epiPATH interface, (4) updating data through epiPATH interface, and (5) deleting data from epiPATH database.
Average time in sec (range)
Average number of errors
Average difficulty grade1
In the near future we plan to develop new analytical tools for this kind of data and to integrate them, together with existing ones, into the web-based interface, in order to study infectious pathogens thoroughly from molecular, population, epidemiological and evolutionary points of view.
epiPATH is an open source information system for storing and managing clinical and molecular information on infectious diseases. It is available for download and local installation, in order to help researchers in related fields with their daily laboratory tasks and with collaborative projects involving other institutions, while keeping the information out of public access given the sensitivity of the data under analysis. It is a complete and versatile tool, and a single piece of software for many different users and projects, that satisfies requirements of good organization, simple accessibility, data security and multi-user support.
Availability and requirements
Project name: epiPATH
Project home page: http://genevo.uv.es/epipath
Operating system(s): Client: a web browser (Firefox, Mozilla, Explorer...); Server: a MySQL Server (epiPATH distribution with views requires MySQL 5 or higher) running on UNIX/Linux or Microsoft Windows systems.
Other requirements: Apache HTTP Server, MySQL client and MySQL Query Browser.
License: GNU General Public License.
The epiPATH software is freely available from the above web site. Data stored in our database server are not publicly available.
RDMS: relational database management system
SQL: structured query language
HCV: hepatitis C virus
HIV: human immunodeficiency virus
HTML: hypertext markup language
PHP: PHP hypertext pre-processor
HTTP: hypertext transfer protocol
We thank two anonymous reviewers and the participants in the usability study for their help, comments and suggestions. This work was sponsored by Conselleria de Sanitat, Generalitat Valenciana, and by project BFU2005-00503/BMC from Ministerio de Ciencia y Tecnología (Spain).
- Barreto ML, Teixeira MG, Carmo EH: Infectious diseases epidemiology. J Epidemiol Community Health. 2006, 60: 192-195. 10.1136/jech.2003.011593.
- Louie B, Mork P, Martin-Sanchez F, Halevy A, Tarczy-Hornoch P: Data integration and genomic medicine. Journal of Biomedical Informatics. 2007, 40: 5-16. 10.1016/j.jbi.2006.02.007.
- Nuismer SL, Otto SP: Host-parasite interactions and the evolution of gene expression. PLoS Biology. 2005, 3 (7): e203. 10.1371/journal.pbio.0030203.
- Editorial: Infection biology. Nature. 2006, 441: 255-256.
- Nelson MR, Reisinger SJ, Henry SG: Designing databases to store biological information. Biosilico. 2003, 1: 134-142. 10.1016/S1478-5382(03)02357-6.
- Morris PJ: Relational database design and implementation for biodiversity informatics. Phyloinformatics. 2005, 2: 1-66.
- Galperin MY: The Molecular Biology Database Collection: 2007 update. Nucl Acids Res. 2007, 35: D3-D4. 10.1093/nar/gkl1008.
- MySQL. 2007, [http://www.mysql.com]
- Codd EF: A relational model of data for large shared data banks. Communications of the ACM. 1970, 13: 377-387. 10.1145/362384.362685.
- DBDesigner 4. [http://fabforce.net/dbdesigner4/]
- MySQL Query Browser. [http://www.mysql.com/products/tools/query-browser/]
- PHP. [http://www.php.net]
- The Apache HTTP Server. [http://httpd.apache.org/]
- Torres-Puente M: Variabilidad genetica y respuesta al tratamiento antiviral en el virus de la hepatitis C (VHC). 2004, Universitat de Valencia.
- Jiménez Hernández N: Evolución del virus de la Hepatitis C en muestras hospitalarias de la Comunidad Valenciana. 2004, University of Valencia.
- Nielsen J, Molich R: Heuristic evaluation of user interfaces. Proc ACM CHI'90 Conf. 1990, 249-256.
- Nielsen J: Usability engineering. 1994, Morgan Kaufmann Publishers.
- Nielsen J, Mack R: Usability inspection methods. 1994, John Wiley & Sons.
- US Food and Drug Administration: Clinical Laboratory Improvement Amendments (CLIA). 2005.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2334/7/32/prepub