The scientific response to the COVID-19 pandemic has produced an abundance of publications, including peer-reviewed articles and preprints, across a wide array of disciplines, from microbiology to medicine and social sciences. Genomics and precision health (GPH) technologies have had a particularly prominent role in medical and public health investigations and response; however, these domains are not simply defined and it is difficult to search for relevant information using traditional strategies. To quantify and track the ongoing contributions of GPH to the COVID-19 response, the Office of Genomics and Precision Public Health at the Centers for Disease Control and Prevention created the COVID-19 Genomics and Precision Health database (COVID-19 GPH), an open access knowledge management system and publications database that is continuously updated through machine learning and manual curation. As of February 11, 2022, COVID-19 GPH contained 31,597 articles, mostly on pathogen and human genomics (72%). The database also includes articles describing applications of machine learning and artificial intelligence to the investigation and control of COVID-19 (28%). COVID-19 GPH represents about 10% (22983/221241) of the literature on COVID-19 on PubMed. This unique knowledge management database makes it easier to explore, describe, and track how the pandemic response is accelerating the applications of genomics and precision health technologies. COVID-19 GPH can be freely accessed via https://phgkb.cdc.gov/PHGKB/coVInfoStartPage.action.
The COVID-19 pandemic, caused by SARS-CoV-2, broke out at the start of 2020. The global scientific community responded with extraordinary effort, sharing information in online databases, preprints, and scientific publications. Based on PubMed and preprint server searches, more than 200,000 scientific articles and preprints were published during the first two years of the pandemic. They report the results of basic, clinical, and population-based investigations, ranging from studies of the virus itself to the global impact of the pandemic on health, economics, and daily life. Rapid growth of the scientific literature on COVID-19 makes it difficult for scientists, clinical and public health professionals, and the community in general to keep up. Databases such as LitCovid and the World Health Organization COVID-19 database are a key resource for researchers, policy-makers, and the public.
Genomics and data science (including computational methods often referred to as machine learning or artificial intelligence) have been instrumental in many aspects of research on COVID-19. These methods have provided insights into SARS-CoV-2 and how it evolves and spreads in populations, as well as susceptibility to COVID-19 infection, risk of severe outcomes, and role of COVID-19 treatments. Genomic surveillance has documented the emergence and spread of the Omicron and Delta SARS-CoV-2 variants in Denmark as well as the unique mutations that differ between these two variants. Other studies have described the role of human genetic polymorphisms in COVID-19 susceptibility, the upregulation of proinflammatory cytokine genes in severe COVID-19 patients, and the toxicity profiles of 90 possible COVID-19 treatments using machine learning.
To track and provide easier access to the application of genomics and precision health in the COVID-19 response, the CDC Office of Genomics and Precision Public Health launched the COVID-19 Genomics and Precision Health knowledge management system and database (COVID-19 GPH) on April 1, 2020. COVID-19 GPH is a component of the Public Health Genomics and Precision Health Knowledge Base (PHGKB). PHGKB features a suite of curated and continuously updated, searchable databases of published scientific literature, CDC resources, and other materials that address the translation of genomics and precision health discoveries into improved health care and disease prevention. Two databases that capture the broad spectrum of biomedical research on COVID-19 have been established by separate groups at the US National Institutes of Health (NIH): LitCovid at the National Library of Medicine, and the iSearch COVID-19 Portfolio at the Office of Portfolio Analysis. In contrast to these databases, COVID-19 GPH was developed to select a subset of the technology-intense scientific literature on COVID-19 that is most relevant to public health and population medicine. Because COVID-19 GPH is curated, users can quickly identify information related to genomics and precision public health without having to compose a complex search query. COVID-19 GPH also links to news, reports, and other relevant information from CDC, NIH and other public health organizations, all updated daily. Thus, in addition to a searchable archive of scientific literature, COVID-19 GPH offers an easily accessible, online update that helps users keep abreast of the latest developments. Here we describe this unique database and its contribution to organizing the rapidly expanding knowledge base on COVID-19.
Construction and content
COVID-19 GPH is a web-based application built on J2EE technology with Java open-source frameworks including Hibernate and Struts. As a component of the PHGKB system, COVID-19 GPH has been built on and integrated into the overall architecture of PHGKB described previously [14, 15].
Data are collected mainly from PubMed, the NIH iSearch COVID-19 Portfolio, LitCovid, and common media sources by an automatic retrieval and text mining strategy, combined with manual curation by domain experts at the Centers for Disease Control and Prevention (CDC) (Fig. 1). Data are retrieved by four main approaches. First, scientific publications are retrieved from PubMed daily by an automated script that runs two specifically designed queries through NCBI E-utilities (Additional file 1: Appendix I). Second, we use the same queries to search the NIH iSearch COVID-19 Portfolio website and download the retrieved records in spreadsheet format; these are subsequently uploaded to the database by an automatic script. Third, we automatically retrieve records classified to the epidemic forecasting category in the LitCovid database using the LitCovid RSS feed. Finally, CDC staff select online news and other reports from our weekly horizon scan for the Genomics Health Impact Update and Advanced Molecular Detection Clips and from other sources. The inclusion and exclusion criteria for these weekly scans are described in detail in Additional file 1: Appendix II. The curation pipelines include a series of computer scripts for scheduled automatic data retrieval and uploading, along with a web-based curation interface that CDC domain experts use to select and curate important news, reports, and articles. The PubTator web service is used to annotate gene information in PubMed records. A text mining technique is used to identify and standardize the country information associated with the authors in PubMed records. All data selection processes are performed daily. To prevent record duplication across the multiple retrieval processes, we use a de-duplication mechanism based on unique PubMed IDs or publication titles.
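The de-duplication step described above can be illustrated with a minimal sketch. It is not the production implementation, which is not shown in this article; the record fields (`pmid`, `title`, `source`) and the title normalization rule are assumptions made for illustration only.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace.

    Assumption: titles that differ only in case or punctuation refer to
    the same publication.
    """
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records):
    """Keep the first record seen for each PubMed ID or normalized title.

    `records` is a list of dicts with hypothetical keys "pmid" (may be
    None for news items and preprints) and "title".
    """
    seen_pmids = set()
    seen_titles = set()
    unique = []
    for rec in records:
        pmid = rec.get("pmid")
        title_key = normalize_title(rec.get("title") or "")
        # A record is a duplicate if either identifier was already seen.
        if pmid and pmid in seen_pmids:
            continue
        if title_key and title_key in seen_titles:
            continue
        if pmid:
            seen_pmids.add(pmid)
        if title_key:
            seen_titles.add(title_key)
        unique.append(rec)
    return unique


# Made-up records standing in for the four retrieval routes.
sample = [
    {"pmid": "1", "title": "SARS-CoV-2 variants in Denmark", "source": "PubMed"},
    {"pmid": "1", "title": "SARS-CoV-2 variants in Denmark", "source": "iSearch"},
    {"pmid": None, "title": "Sars-CoV-2 Variants in Denmark.", "source": "news"},
    {"pmid": None, "title": "Another report", "source": "news"},
]
print([rec["source"] for rec in deduplicate(sample)])
```

Matching on title as well as PubMed ID lets records without a PubMed ID, such as preprints and news items, be compared against PubMed records; the cost is that distinct items sharing a generic title would be merged.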
Data are classified into two main groups: Genomics Precision Health and Non-Genomics Precision Health (Additional file 1: Appendix II). They are then further classified automatically into 11 different categories: eight based on the PubTator classifier in the LitCovid database (mechanism, treatment, prevention, diagnosis, forecasting, surveillance, transmission), obtained by querying and parsing LitCovid RSS feeds, and three created by text mining scripts (vaccine, variant, health equity) that use keyword searching (keywords in Additional file 1: Appendix III). Data are also classified to 12 topics with their own sub-databases in PHGKB (Cancer; Diabetes; Heart, Lung, Blood and Sleep Diseases; Rare Diseases; Health Equity; Family Health History; Reproductive and Child Health; Pharmacogenomics; Neurological Disorders; Primary Immune Deficiency; Environmental Health).
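Keyword-based assignment of the three locally defined categories might look like the following sketch. The keyword stems here are hypothetical placeholders, not the lists in Additional file 1: Appendix III, and prefix matching on word starts is an assumed design choice.

```python
import re

# Hypothetical keyword stems for illustration only; the actual keywords
# are listed in Additional file 1: Appendix III.
CATEGORY_KEYWORDS = {
    "vaccine": ["vaccin", "immuniz"],
    "variant": ["variant", "lineage"],
    "health equity": ["health equity", "disparit"],
}


def classify(text):
    """Return every category whose keyword stem starts a word in `text`.

    Categories are not mutually exclusive, so a record can match several.
    """
    lowered = text.lower()
    matched = []
    for category, stems in CATEGORY_KEYWORDS.items():
        # \b anchors the stem at a word start; no trailing boundary, so
        # "vaccin" also matches "vaccines" and "vaccination".
        if any(re.search(r"\b" + re.escape(stem), lowered) for stem in stems):
            matched.append(category)
    return matched


print(classify("Vaccine effectiveness against the Omicron variant"))
print(classify("Racial disparities in vaccination uptake"))
```

Because this approach needs only a title or abstract, it can be applied to preprints and news items as well as PubMed records.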
Evaluation of data retrieval performance
To validate our automated data retrieval process, we generated a 499-item random sample from the LitCovid database on April 23, 2021. These records were screened automatically as shown in Fig. 1 and classified as positive (included in the database) or negative (excluded from the database). The automatic query included 55 articles and excluded 444 articles. At the same time, two domain experts independently reviewed the same 499 records manually and classified them according to the database inclusion and exclusion criteria. They discussed all 23 instances of disagreement and arrived at a final classification by consensus. The experts included 50 articles and excluded 449 articles. The performance of the automated retrieval process was evaluated by calculating its specificity and sensitivity, using expert classification as the gold standard. The automatic curation process has an estimated sensitivity of 0.82 and specificity of 0.97 for PubMed articles (Table 1).
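The reported performance figures can be reproduced from the counts given above. Table 1 is not reproduced in this text, so the sketch below rests on one inference: 41 true positives is the only whole number that yields a sensitivity of 0.82 with 50 expert-included articles. The remaining cells follow from the marginal totals (55 automatically included, 50 expert included, 499 overall).

```python
def confusion_from_marginals(auto_pos, expert_pos, total, true_pos):
    """Derive the 2x2 confusion matrix from marginal totals and true positives."""
    false_pos = auto_pos - true_pos      # included automatically, excluded by experts
    false_neg = expert_pos - true_pos    # excluded automatically, included by experts
    true_neg = total - true_pos - false_pos - false_neg
    return true_pos, false_pos, false_neg, true_neg


def sensitivity(tp, fn):
    """Share of expert-included articles that the automated process also included."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of expert-excluded articles that the automated process also excluded."""
    return tn / (tn + fp)


# true_pos=41 is inferred from the reported sensitivity, not read from Table 1.
tp, fp, fn, tn = confusion_from_marginals(
    auto_pos=55, expert_pos=50, total=499, true_pos=41
)
print(tp, fp, fn, tn)                  # 41 14 9 435
print(round(sensitivity(tp, fn), 2))   # 0.82
print(round(specificity(tn, fp), 2))   # 0.97
```

Under this reconstruction, the automated process missed 9 of the 50 articles the experts included and admitted 14 of the 449 they excluded.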
User interface and features
The COVID-19 GPH web-based user interface is shown in Fig. 2. The landing page of the site provides two main sections that list important publications picked by a CDC domain expert (Spotlight) and the most recent records added to the database (Latest News and Publications). Summary statistics are on the left side of the page. The user interface allows users to perform a free text search on any topic. The search results can be further stratified by five filters (Country, Journal, Gene, Publication Type and Publication Category). The filtering process can be repeated until a desired search result is achieved. Users can also perform a search on sub-datasets for 10 special topics in PHGKB. Two graphs can be drawn dynamically to summarize the search results: (1) Distribution of Publications by Month and (2) Distribution of Publication by Category. Users also can sign up for a COVID-19 GPH Weekly Update email newsletter that includes COVID-19 related items selected by CDC staff in these categories: Pathogen and Human Genomics Studies, Non-Genomics Precision Health Studies and News/Reviews/Commentaries.
Utility and discussion
COVID-19 GPH is an open access, online database containing links to original studies, reviews, commentaries, and news relevant to genomics, machine learning, or the use of big data in COVID-19 research. Although most records are extracted from PubMed, the database also contains preprints as well as selected online news, reports, and publications (Table 2). Included articles reference 845 human genes, with ACE2 being the most common.
The database contains information on the surveillance, investigation, diagnosis, treatment, prevention, and control of COVID-19. The contents are divided into two main sections, Genomics Precision Health (GPH) and Non-Genomics Precision Health (Non-GPH). GPH contains literature focused on applications of pathogen and human genomics. The literature in Non-GPH relates to the use of big data, data science, digital health, machine learning, predictive analytics and forecasting methods. As of February 11, 2022, the database contains 31,597 articles (22,597 GPH, 9,000 Non-GPH). Articles in both sections may be classified into one or more of 11 publication categories (Table 3); the categories are not mutually exclusive. In the entire database, the largest category is “Variants” (n = 6735) and the smallest is “Health Equity” (n = 804); however, the relative sizes of these categories differ between the GPH and Non-GPH groups (Fig. 3). Some common topics among articles included in the database are listed in Table 4, along with examples [16,17,18,19,20,21,22,23,24,25,26,27,28,29]. We estimated the fraction of the scientific literature on COVID-19 that was selected for COVID-19 GPH by dividing the number of PubMed records in COVID-19 GPH by the number of PubMed records in LitCovid: 22983/221241 (10%), based on data retrieved on February 11, 2022.
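The percentages quoted in this paragraph and in the abstract follow directly from the counts above, as this short check shows. The variable and function names are illustrative only.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100 * part / whole


# Counts as reported for February 11, 2022.
total_articles = 31597
gph_articles = 22597
non_gph_articles = 9000
pubmed_in_gph = 22983
pubmed_in_litcovid = 221241

print(f"GPH share: {percent(gph_articles, total_articles):.1f}%")              # 71.5%
print(f"Non-GPH share: {percent(non_gph_articles, total_articles):.1f}%")      # 28.5%
print(f"Share of LitCovid: {percent(pubmed_in_gph, pubmed_in_litcovid):.1f}%") # 10.4%
```

Rounded to whole numbers, these give the 72%, 28%, and 10% figures cited in the text.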
The database can be used to analyze publication trends by month (Fig. 4). After increasing rapidly in early 2020, the number of articles published per month has generally remained between 1000 and 1700. (Note that because of processing time at PubMed, the number for January 2022 may be incomplete.) Trends by category tend to be consistent overall, except for prevention and forecasting, which peaked in 2020 (Fig. 5). Articles in several other categories (variants, vaccine, mechanism, and diagnosis) generally increased in 2021 (Fig. 5).
For each PubMed publication, the database also captures the Altmetric score, a numerical value indicating the amount of attention an article has received. Of the articles with the top 100 Altmetric scores, the vaccine category accounted for the largest share (26%) and the variant category was second (12%) (Fig. 6).
The database simplifies the search for articles on COVID-19 and certain rare diseases, including articles related to 471 of the approximately 7,000 rare diseases on the NIH Genetic and Rare Diseases Information Center website. Users can also search for articles common to COVID-19 GPH and other specialized PHGKB databases. Of the specialized databases, Rare Diseases has the most overlap with COVID-19 GPH (6811 articles), while Family Health History shares the fewest articles (10) (Table 5).
The COVID-19 pandemic has produced a surge of original studies, reviews, commentaries, and news available to the scientific community and the public. Two broad emerging technology areas, genomics (pathogen and human) and precision health (big data, machine learning, artificial intelligence, and predictive analytics), have been widely used in COVID-19 research, surveillance and response. However, the contribution and evolution of these technologies may be difficult to discern in the midst of the rapid growth of COVID-19 publications. COVID-19 GPH is an online, continuously updated database that captures the evolving contribution of genomics and digital technologies to the COVID-19 response. Its domain encompasses a wide range of topics, from phylogenetic analysis of SARS-CoV-2 to artificial intelligence for COVID-19 diagnosis. Overall, the contents of COVID-19 GPH represent about 10% of all the COVID-19 literature available from PubMed. Databases such as LitCovid, the World Health Organization COVID-19 database, iSearch COVID-19 Portfolio, CORD-19 and PubMed are comprehensive sources for published scientific articles on COVID-19.
The COVID-19 GPH database is designed for researchers interested specifically in the domains of genomics and precision health, providing several key advantages over more general COVID-19 databases or PubMed. The data are updated daily from multiple sources. The web interface allows users to search the data in a free-text manner and to stratify the search results with meaningful, pre-classified categories and types. For example, users can search by country, journal, gene, publication category, or publication type (PubMed, Preprint, or other), within either the GPH or Non-GPH category or overall. The database also allows users to follow publication trends and monitor online impact.
The emerging fields of precision medicine and precision public health are driven by advances in genomics and digital technologies. These approaches have found novel and urgent applications in the response to the COVID-19 pandemic and catalyzed international scientific collaboration. For example, international collaboration on pathogen genomics has been crucial for monitoring the emergence of SARS-CoV-2 variants. The COVID-19 Host Genetics Initiative has organized researchers from many countries to study human genetic variation in relation to COVID-19. Beyond genomics, machine learning has played a role in the COVID-19 response in forecasting disease spread, monitoring adherence to public health recommendations, supporting diagnosis, and assessing health equity [35, 36]. For example, a study in the United States used machine learning to identify disparities in COVID-19 infection and mortality among minority populations.
To our knowledge, ours is the only database focused on genomics and precision health for COVID-19. Although the combination of computer and manual curation processes improves quality, it is not perfect; our validation study, which found sensitivity of 0.82 and specificity of 0.97, was limited to PubMed records. We plan to rerun the validation study at the end of the year, after accumulating more data. Our classification by categories is also limited; eight of eleven categories are assigned by LitCovid and thus apply only to PubMed records. Other records are eligible only for the three categories we assign by keyword (vaccine, variant, and health equity), which increases the relative proportions of articles in these categories. In the future, we intend to conduct additional studies to explore the acceptability and functionality of the database for researchers interested in genomics and precision public health in relation to COVID-19.
COVID-19 GPH is a continuously updated, online database that captures publications describing the applications of genomics and digital technologies to control of the COVID-19 pandemic. Compared with larger, more wide-ranging databases, it simplifies searching and offers users additional tools for filtering and displaying search results, including charts to display trends over time.
Ito K, Piantham C, Nishiura H. Relative instantaneous reproduction number of Omicron SARS-CoV-2 variant with respect to the Delta variant in Denmark. J Med Virol. 2021. https://doi.org/10.1002/jmv.27560.
Aminpour M, Delgado W, Wacker S, et al. Computational determination of toxicity risks associated with a selection of approved drugs having demonstrated activity against COVID-19. BMC Pharmacol Toxicol. 2021;22(1):61.
Yu W, Yesupriya A, Wulf A, et al. An open source infrastructure for managing knowledge and finding potential collaborators in a domain-specific subset of PubMed, with an example from human genome epidemiology. BMC Bioinform. 2007;8:436.
Thompson C, Hughes S, Ngai S, et al. Rapid emergence and epidemiologic characteristics of the SARS-CoV-2 B.1.526 Variant—New York City, New York, January 1-April 5, 2021. MMWR Morb Mortal Wkly Rep. 2021;70(19):712–6.
Pati A, Padhi S, Panda D, et al. A cluster of differentiation 14 (CD14) polymorphism (C-159T rs2569190) is associated with SARS-CoV-2 infection and mortality in the European population. J Infect Dis. 2021;5:jiab180.
WY designed the infrastructure of the application, constructed the database, developed the application, and drafted the manuscript. ED drafted the manuscript, defined the content, and performed the data analysis. MG was involved in writing the paper draft. MJK oversaw the project, defined the scope of the data collection and revised the draft manuscript. WY, ED, MG and MJK were involved in curating the database. All authors read and approved the final manuscript.
Additional file 1: Appendix I. PubMed complex queries. Appendix II. The inclusion and exclusion criteria. Appendix III. Keywords for searching categories.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Yu, W., Drzymalla, E., Gwinn, M. et al. COVID-19 GPH: tracking the contribution of genomics and precision health to the COVID-19 pandemic response.
BMC Infect Dis 22, 402 (2022). https://doi.org/10.1186/s12879-022-07219-3