Using Baidu search values to monitor and predict the confirmed cases of COVID-19 in China: evidence from Baidu index

Background: The novel coronavirus disease 2019 (COVID-19) has posed a severe threat to human life and caused a global pandemic. The current research aimed to explore whether search-engine query patterns could serve as a potential tool for monitoring the outbreak of COVID-19.
Methods: We collected the number of COVID-19 confirmed cases between January 11, 2020, and April 22, 2020, from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU). The search index values of the most common symptoms of COVID-19 (e.g., fever, cough, fatigue) were retrieved from the Baidu Index. Spearman's correlation analysis was used to analyze the association between the Baidu index values for each COVID-19-related symptom and the number of confirmed cases. Regional distributions among 34 provinces/regions in China were also analyzed.
Results: Daily growth of confirmed cases and Baidu index values for each COVID-19-related symptom presented robust positive correlations during the outbreak (fever: rs = 0.705, p = 9.623 × 10⁻⁶; cough: rs = 0.592, p = 4.485 × 10⁻⁴; fatigue: rs = 0.629, p = 1.494 × 10⁻⁴; sputum production: rs = 0.648, p = 8.206 × 10⁻⁵; shortness of breath: rs = 0.656, p = 6.182 × 10⁻⁵). The average search-to-confirmed interval (STCI) was 19.8 days in China. The optimal time lags of the daily Baidu Index values were 4 days for cough, 2 days for fatigue, 3 days for sputum production, 1 day for shortness of breath, and 0 days for fever.
Conclusion: Searches for COVID-19-related symptoms on the Baidu search engine were significantly correlated with the number of confirmed cases. Since the Baidu search engine can reflect the public's attention to the pandemic and the regional epidemics of viruses, relevant departments need to pay more attention to areas with high search volumes for COVID-19-related symptoms and take precautionary measures to prevent potentially infected persons from further spreading the virus.
Supplementary Information The online version contains supplementary material available at 10.1186/s12879-020-05740-x.


Background
The outbreak of the novel coronavirus disease 2019 (COVID-19), characterized by fever, cough, fatigue, sputum production, and shortness of breath, has attracted global attention [1,2]. As of April 22, 2020, COVID-19 had spread to more than 188 countries and regions, resulting in over 9.6 million confirmed cases and 490 thousand deaths worldwide [3]. The rapid spread of the epidemic was, to some extent, due to the failure to monitor and manage potentially infected persons, which may pose a substantial infection control challenge [4]. Therefore, timely estimation of the number of potentially infected persons, together with corresponding management measures to control the further spread of COVID-19, is urgently needed.
Because international public health emergencies are unpredictable, novel methods for monitoring the development of an epidemic are important. Real-time network data can be easily obtained from the web thanks to the wide availability of the Internet. According to the 45th China Statistical Report on Internet Development, there were over 904 million Internet users in China, and the penetration rate of search engine use had reached 83% [5]. Among Internet users, 80% tended to use electronic devices to acquire the information they were interested in [6].
People can now easily obtain health-related information via Internet search engines, and these searches can reflect the physical condition of the searchers or of the relatives and friends they are concerned about [7]. Moreover, to interrupt the transmission of the epidemic, the Chinese government put in place strong quarantine measures, which also affected routine outpatient services. Previous studies have shown that public search behaviors have been used to predict several epidemic diseases, such as influenza [8], epidemic erythromelalgia [9], dengue [10], and HIV/AIDS [11].
Surveillance of network searches about the clinical symptoms of COVID-19 is more predictive and timely than earlier surveillance channels (e.g., official announcements, news reports, and mass media) [12,13,14]. Baidu, the most popular search engine in China, is used by more than 90% of the country's Internet users [15]. In this study, we obtained the Baidu index values of COVID-19-related symptoms and the data on confirmed cases of COVID-19 across China to analyze the association between these variables and to explore whether the Baidu index could act as a novel tool for monitoring and predicting the epidemic of COVID-19 in China.

Data from Baidu index
More than 90% of Chinese search engine users use Baidu to retrieve the information they are interested in [16,17]. The weighted sum of the Baidu search values can describe the characteristics of people's search behaviors [18]. A Baidu Index value is obtained by calculating the number of searches for specific keywords entered by searchers [18]. Using the keyword analysis function, Baidu Index automatically matches related words to the keywords typed by users. Previous studies reported that the five most common symptoms of COVID-19 were fever (present in 88.7% of confirmed cases during hospitalization), cough (67.8%), fatigue (38.1%), sputum production (33.7%), and shortness of breath (18.7%) [1]. Accordingly, we selected those symptoms as the keywords in the current study. Based on the keyword analysis function, 26 search terms that could represent the most common symptoms of COVID-19 were selected (Table S1). We summed the search values of each symptom and its related keywords to obtain the composite Baidu Index values used in our analysis. In addition, we compared the search values of the 5 keywords year by year between 2011 and 2020 to investigate whether the Baidu Index changes during the outbreak were an accidental event (Figure S1). To explore whether people's search behaviors precede the epidemic of COVID-19, we defined the search-to-confirmed interval (STCI). The STCI is the time interval between the peak growth rate of the Baidu Index (where the growth rate is the daily Baidu Index value (DBIV) minus the previous day's value) and the peak daily growth of confirmed cases (DGCC). The top ten provinces/regions ranked by cumulative confirmed cases were selected for STCI analysis.
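The composite index and the STCI described above can be illustrated with a minimal Python sketch. The function names and the toy numbers below are assumptions made for illustration only; they are not the authors' code or data, and ties between equal peaks are resolved by taking the earliest day, which the text does not specify.

```python
from datetime import date, timedelta

def composite_index(keyword_series):
    """Sum the daily search values of a symptom keyword and its related terms."""
    return [sum(day_values) for day_values in zip(*keyword_series)]

def daily_growth(values):
    """Growth rate as defined in the text: each day's value minus the previous day's."""
    return [today - yesterday for yesterday, today in zip(values, values[1:])]

def peak_day(dates, values):
    """Date of the maximum value (earliest date if the maximum is tied; an assumption)."""
    return dates[values.index(max(values))]

def stci(dates, dbiv, dgcc):
    """Days from the peak DBIV growth rate to the peak DGCC (positive: searches lead)."""
    growth = daily_growth(dbiv)
    search_peak = peak_day(dates[1:], growth)  # growth[i] belongs to dates[i + 1]
    case_peak = peak_day(dates, dgcc)
    return (case_peak - search_peak).days

# Toy data (hypothetical): two search terms for one symptom over eight days.
dates = [date(2020, 1, 11) + timedelta(days=i) for i in range(8)]
fever_terms = [
    [10, 12, 40, 45, 44, 30, 20, 15],
    [5, 6, 20, 22, 21, 15, 10, 8],
]
dbiv = composite_index(fever_terms)
dgcc = [0, 1, 2, 5, 9, 20, 12, 6]  # hypothetical daily new confirmed cases

print(dbiv)                     # [15, 18, 60, 67, 65, 45, 30, 23]
print(stci(dates, dbiv, dgcc))  # 3
```

In this toy series the largest one-day jump in searches falls on January 13 and the largest daily case count on January 16, so the interval is 3 days. Because the measure uses only the two peak dates, it ignores the rest of each curve.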

Confirmed cases of COVID-19
We obtained the data on confirmed cases of COVID-19 from accessible official channels, including the official website of Johns Hopkins University [2], the World Health Organization (WHO) [19], and the National Health Commission of the People's Republic of China [20]. Since China's epidemic was gradually brought under control after April 22, 2020, we divided the study period (January 11, 2020, to April 22, 2020) into a growth period and a decline period. February 10, 2020, when the government announced that closed roads were re-opened and production fully resumed, was set as the cut-off date [21].
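As a small illustration of the period split, the sketch below labels each day of the study window as belonging to the growth or the decline period. The text does not say which period the cut-off date itself belongs to; assigning it to the growth period is an assumption of this sketch, and the names `period_of`, `STUDY_START`, `CUTOFF`, and `STUDY_END` are hypothetical.

```python
from datetime import date, timedelta

STUDY_START = date(2020, 1, 11)
CUTOFF = date(2020, 2, 10)
STUDY_END = date(2020, 4, 22)

def period_of(day, cutoff=CUTOFF):
    """Label a study day; the cut-off day is counted as 'growth' (an assumption)."""
    if not STUDY_START <= day <= STUDY_END:
        raise ValueError(f"{day} is outside the study window")
    return "growth" if day <= cutoff else "decline"

# Enumerate every day of the study window, both endpoints included.
study_days = [
    STUDY_START + timedelta(days=i)
    for i in range((STUDY_END - STUDY_START).days + 1)
]
growth_days = [d for d in study_days if period_of(d) == "growth"]
decline_days = [d for d in study_days if period_of(d) == "decline"]

print(len(study_days), len(growth_days), len(decline_days))  # 103 31 72
```

Under this convention the growth period covers 31 days and the decline period 72 days; moving the cut-off day to the decline period would shift one day between them.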

Statistical analysis
Using SPSS (version 23.0), we applied Spearman correlation analysis to explore the relationships between DGCC and the DBIV of COVID-19-related symptoms from January 11, 2020, to April 22, 2020. Using the same statistical method, we also explored the time-lag pattern between DGCC and the DBIV of COVID-19-related symptoms. P < 0.05 (two-sided) was set as the level of statistical significance. GraphPad Prism 8.2 was used to draw the figures.
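The analysis itself was run in SPSS. Purely as an illustration of the idea, the following pure-Python sketch computes Spearman's coefficient as the Pearson correlation of average ranks and scans a range of lags. It assumes that a lag of k days means searches on day t are paired with cases on day t + k, which the text does not state explicitly. All function names and the toy series are hypothetical, and the sketch does not compute p-values.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for a constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def lagged_spearman(dbiv, dgcc, lag):
    """Correlate searches on day t with cases on day t + lag (lag >= 0)."""
    if lag == 0:
        return spearman(dbiv, dgcc)
    return spearman(dbiv[:-lag], dgcc[lag:])

def best_lag(dbiv, dgcc, max_lag):
    """Lag in 0..max_lag with the highest coefficient (smallest lag on ties)."""
    return max(range(max_lag + 1), key=lambda lag: lagged_spearman(dbiv, dgcc, lag))

# Toy data (hypothetical): the case curve repeats the search curve two days later.
dbiv = [1, 3, 7, 12, 9, 6, 4, 2, 1, 1]
dgcc = [0, 0, 1, 3, 7, 12, 9, 6, 4, 2]
print(best_lag(dbiv, dgcc, max_lag=4))  # 2
```

Unlike a peak-to-peak interval, this approach uses every paired observation in the window, at the cost of shortening the overlapping series by one day per unit of lag.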
STCI analysis for people's search behaviors of COVID-19-related symptoms and the epidemic of COVID-19
Figure 3 shows that the peak of the growth rate of the Baidu Index occurred 19-22 days earlier than the peak of DGCC across China (STCI for fever: 22 days; cough: 19 days; fatigue: 20 days; sputum production: 19 days; shortness of breath: 19 days). Moreover, the top 10 provinces/regions ranked by confirmed cases presented similar results except for sputum production (Fig. 3). Notably, in Heilongjiang the peak of the Baidu Index's growth rate occurred 17 days later than the peak of DGCC.
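As a quick arithmetic check, the five nationwide STCI values listed above average to the 19.8 days reported in the abstract. The dictionary below simply restates those values; the variable names are illustrative.

```python
# Nationwide STCI per symptom, in days, as listed in the text.
stci_days = {
    "fever": 22,
    "cough": 19,
    "fatigue": 20,
    "sputum production": 19,
    "shortness of breath": 19,
}

mean_stci = sum(stci_days.values()) / len(stci_days)
print(mean_stci)  # 19.8
```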

Discussion
People with a history of travel to or exposure in high-risk areas with COVID-19 patients are required to be quarantined to control the spread of the pandemic. Since the characteristics of the new coronavirus and effective treatments for it remain uncertain, people have often compared COVID-19 with SARS, which broke out in China in 2003 with a mortality rate of 11% [19,22]. Because of the isolation policy and the fear of an unknown virus, people with an exposure history are likely to conceal their own and their family's high-risk behaviors, which undermines the government's early attempts to control suspected cases of COVID-19 [23]. Using Internet search engines, we could predict the potential number of affected persons, and the real-time data of the Baidu Index can help monitor the development of the epidemic and inform the corresponding government policies. China had achieved preliminary success in controlling the COVID-19 pandemic by April 22, 2020. The correlation analysis between Chinese public searches of COVID-19-related symptoms and the actual number of confirmed cases helps to explore the relationship between Internet search values and the COVID-19 pandemic and provides novel insights for controlling the epidemic.
The current research shows that the related DBIV reached a peak earlier than the DGCC, and the dynamic changes of DBIV also preceded those of DGCC. We noticed that, during the growth period, higher search values were associated with higher cumulative confirmed cases, which indicated that the searchers could be potentially infected persons. Besides, DGCC and DBIV were positively correlated during the whole observation period (even in the decline period), which implied that the DBIV declined as the DGCC decreased. However, while DGCC was declining, the number of cumulative cases continued to increase, which could explain the negative correlation between cumulative cases and DBIV during the decline period. The public's search behaviors for health-related information can reflect their potential physical and psychological problems [7,24]. The decline in searches for COVID-19-related symptoms indicated that the public might have been more relaxed in the decline period than in the growth period. We can tell from the time plots of Baidu searches for COVID-19-related symptoms and the number of confirmed cases that the dynamic changes of the former appeared earlier than those of the latter. Among the 34 provinces/regions in China, although most areas in this research showed statistically significant correlations between DBIV and DGCC (except for sputum production), Hong Kong, Macao, Taiwan, and Tibet did not. One possible reason could be that Baidu is not the primary search tool in these places [4]. Additionally, there was only one confirmed case in Tibet, which was insufficient for statistical analysis.
Fig. 1 Correlation and time plots among cumulative confirmed cases and each keyword of COVID-19-related symptoms. a-e represent Baidu searches for "fever", "cough", "fatigue", "sputum production", and "shortness of breath", respectively
Besides, there was no correlation between DGCC and the DBIV of cough in Shanghai, which might be due to the incompleteness of the search keywords related to cough. Of interest, no correlation between the DBIV of sputum production and DGCC was observed. A reasonable explanation could be that sputum production is more common in elderly people with chronic respiratory diseases, and such searches might be correlated with seasonal influenza, which occurs every year from late autumn to early spring [25]. Based on our research, an increase in the DBIV of COVID-19-related symptoms could be treated as an abnormal signal that warrants early action by the relevant government departments.
An increased number of relevant searches indicates that there are more potentially infected persons. Around 97.5% of people with an identifiable exposure history would develop symptoms within 11.5 days, and 1% of them had a longer incubation period of more than 14 days [26]. We found that the peak of the DBIV growth rate occurred, on average, about 20 days earlier than the peak of DGCC in most areas except Heilongjiang. On May 10, 2020, the Heilongjiang government reported that the pandemic had relapsed; thus, the apex of DBIV appeared later than in other provinces [27]. Rather than following the traditional diagnosis and treatment process, most potential patients are inclined to search the Internet for help, whereas publicly reported data overrepresent severe cases of COVID-19 [7,28,29]. These potentially infected persons were likely to use search engines (usually Baidu in China) to look for related information, so the Baidu index could reflect their approximate number. Mildly affected persons may theoretically appear to have a longer incubation period because of the lag of several days before they are confirmed [30]. A soaring DBIV of COVID-19-related symptoms in a certain area might be a precursor of a future outbreak. The STCI analysis shows that the peak DBIV of COVID-19-related symptoms appeared 19-22 days earlier than the peak DGCC. However, the time-lag correlation analysis yielded a shorter lag than the STCI. Since the STCI analysis only compared the interval between the peak DBIV of COVID-19-related symptoms and the peak DGCC, it did not take the other data points into account. Therefore, time-lag correlation analysis could be better suited to exploring the lag patterns of DBIV and DGCC.
We found that the optimal time lags of the DBIV of fever, cough, fatigue, sputum production, and shortness of breath were 0, 4, 2, 3, and 1 day(s), respectively. According to Cuilian et al., the peak of Internet searches about COVID-19 appeared 10-14 days earlier than the peak of reported daily growth cases in China [31], and 10 days earlier in America [32]. People who searched the terms "新冠" (short for "novel coronavirus") or "冠状病毒" ("coronavirus"), the keywords in Cuilian's study, were more likely to be in the incubation period, while the searchers querying COVID-19-related symptoms were likely those who were infected and had already passed through the incubation period. Moreover, there was no time lag for "fever"; this may be attributable to the body temperature reporting mechanism adopted by both the Chinese government and local institutions. This reporting system required that people with fever be isolated and quarantined immediately to prevent the potential further spread of COVID-19 [33,34]. Therefore, people with fever would be isolated and subsequently confirmed. As a result, no time lag was observed.

Limitations
Some limitations need to be recognized. Firstly, we only used Baidu's data to perform our research; other platforms, such as Weibo and Twitter, were not included. Secondly, some keywords related to the symptoms of COVID-19 were not included in the current study, and the keywords used in the current work cannot guarantee the consistency and efficiency of long-term prediction in the future. Therefore, future studies are advised to add or delete keywords for COVID-19-related symptoms to confirm that the time-lag patterns between DBIV and DGCC exist. Thirdly, detailed information about the individual searchers remains unavailable, so it is impossible to identify specific potentially infected persons. Besides, there are several documented issues with the predictability of disease incidence trends using search engines. To avoid failures in predicting an epidemic from Internet search engine data, a random forest regression model is suggested for future studies to strengthen our observations [35].

Conclusion
Our research suggested that there was a significant correlation between the DBIV of COVID-19-related symptoms and DGCC. The dynamic changes of DGCC lagged behind those of the DBIV by several days. Besides, the DBIV of COVID-19-related symptoms could serve as a potential indicator for predicting the epidemic of emerging infectious diseases and guide targeted intervention and prevention of COVID-19, further assisting in the overall control of the pandemic.