Early detection of COVID-19 pandemic: evidence from Baidu Index

Background: Coronavirus disease 2019 (COVID-19) poses a severe threat to human life and has caused a global pandemic. The purpose of this research is to explore the onset and progress of the pandemic from a novel perspective using the Baidu Index. Methods: We collected data on confirmed COVID-19 infections between January 11, 2020, and April 22, 2020, from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU). Based on the existing literature, we obtained the search index values of the most common symptoms of COVID-19: fever, cough, fatigue, sputum production, and shortness of breath. Spearman's correlation analysis was used to analyze the association between the Baidu Index values for each COVID-19-related symptom and the number of confirmed cases. Regional differences among 34 provinces/regions were also analyzed. Results: The daily growth of confirmed cases and the Baidu Index values for each symptom showed a robust positive correlation during the outbreak (fever: r_s = 0.705, p = 9.623×10^-6; cough: r_s = 0.592, p = 4.485×10^-4; fatigue: r_s = 0.629, p = 1.494×10^-4; sputum production: r_s = 0.648, p = 8.206×10^-5; shortness of breath: r_s = 0.656, p = 6.182×10^-5). The average search-to-confirmed interval was 19.8 days in China (fever: 22 days; cough: 19 days; fatigue: 20 days; sputum production: 19 days; shortness of breath: 19 days). We found similar results in the 10 provinces/regions with the highest cumulative case counts. Conclusion: Searches for COVID-19-related symptoms on the Baidu search engine can provide early warning of an outbreak. Relevant departments need to pay more attention to areas with high search index values and take precautionary measures to prevent potentially infected persons from spreading the virus further. The Baidu search engine can reflect the public's attention to the pandemic and to regional epidemics of viruses. Based on changes in the Baidu


Background
Since the outbreak of the COVID-19 pandemic in late December 2019 [1], SARS-CoV-2 has spread to more than 188 countries and regions, resulting in over 9.6 million cumulative confirmed cases and 490 thousand deaths worldwide [2]. The astonishing speed of the epidemic's spread is, to some extent, due to a failure to monitor and manage potentially infected persons who are only later confirmed, which may pose a substantial infection control challenge [3]. Therefore, timely and accurate recognition of the potential number of infected persons is urgently needed for the control of COVID-19.
Because international public health emergencies are unpredictable, novel methods for monitoring the development of epidemic diseases are essential. Real-time network data can be easily obtained from the web owing to the wide availability of the Internet. According to the 45th China Statistical Report on Internet Development, the number of Chinese Internet users reached 904 million, and the penetration rate of search engine use reached 83 percent [4]. Big data show that 80% of Internet users tend to use electronic devices to acquire the information they are interested in [5]. This high coverage makes network data representative. People can easily get health-related information via Internet search engines, and their searches could, to some extent, reflect the physical condition of the searchers or of the relatives and friends they are concerned about [6]. Search behavior has been used to predict several epidemic diseases, such as influenza [7], epidemic erythromelalgia [8], dengue [9], and HIV/AIDS [10]. Many approaches have been applied to achieve near real-time surveillance of the emergence and spread of COVID-19, including official announcements, news reports, and mass media [11,12]. Surveillance of network searches about the clinical characteristics of COVID-19 is more predictive and timely than previous detection methods [13]. Baidu is the largest search engine in China, and more than 90% of Internet users search for information of interest using the Baidu search engine [14].
In this study, we monitored the Baidu Index for COVID-19-related symptoms and the confirmed cases of COVID-19 across China to explore the association between these variables. Moreover, we conducted a search-to-confirmed study to explore whether the Baidu Index can give earlier warning of the peak of confirmed cases, so that relevant departments have sufficient time to prepare to control the spread of the epidemic.

Data from Baidu Index
More than 90% of Chinese search engine users tend to use Baidu, which holds a considerable market share in China, to retrieve information of interest [15,16]. The weighted sum of the Baidu search values can describe the characteristics of people's search behavior [17]. The Baidu Index (BI) is defined by counting the searches for specific keywords entered by searchers [17]. Using its keyword analysis function, the Baidu Index automatically matches related words to the keywords typed by users. According to previous research, the five most common symptoms of COVID-19 were fever (present in 88.7% of confirmed cases during hospitalization), cough (67.8%), fatigue (38.1%), sputum production (33.7%), and shortness of breath (18.7%) [18]. We therefore used these symptoms as the keywords in the current study. Based on the keyword analysis function, 26 search terms representing the symptoms of COVID-19 were defined (Table S1). We summed the search values of each keyword and its related words to obtain the Baidu Index value of that keyword. In addition, we compared the search volume of the keywords with that in other years to investigate whether the change in the Baidu Index during the outbreak was an accidental event (Figure S1).
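The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual procedure: the `RELATED_TERMS` mapping, the function name `aggregate_index`, and the input layout are hypothetical, and the real 26 search terms are those listed in Table S1.

```python
from collections import defaultdict

# Hypothetical mapping from a symptom keyword to its related search terms.
# The actual terms are listed in Table S1 and are not reproduced here.
RELATED_TERMS = {
    "fever": ["fever", "low fever", "fever symptoms"],
    "cough": ["cough", "dry cough"],
}


def aggregate_index(daily_values, related_terms):
    """Sum the daily search values of each keyword and its related terms.

    daily_values:  {term: {date_string: search value}}
    related_terms: {keyword: [terms counted under that keyword]}
    returns:       {keyword: {date_string: summed search value}}
    """
    totals = {}
    for keyword, terms in related_terms.items():
        per_day = defaultdict(int)
        for term in terms:
            # Terms with no recorded values simply contribute nothing.
            for day, value in daily_values.get(term, {}).items():
                per_day[day] += value
        totals[keyword] = dict(per_day)
    return totals
```

Keeping the keyword-to-term mapping as data rather than hard-coding it makes it straightforward to swap in the full term list or to test how sensitive the index is to individual terms.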

Data of confirmed cases of COVID-19
We obtained information about confirmed cases of COVID-19 from accessible official channels [20]. We conducted a statistical analysis of confirmed cases in 34 provinces/regions in China. Since the epidemic situation in China tended to stabilize in the later period, we divided the COVID-19 situation into a growth period (GP) and a decline period (DP). The GP and DP are separated at February 10, 2020, when officials announced that closed roads would reopen and production would fully resume [21].
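As a small illustration of the period split, the sketch below partitions a daily series at February 10, 2020. The text does not state which period the boundary day itself belongs to, so assigning it to the GP here is an assumption, and the function name `split_periods` is hypothetical.

```python
from datetime import date

# Boundary between growth period (GP) and decline period (DP), as given in the text.
SPLIT_DATE = date(2020, 2, 10)


def split_periods(series):
    """Split a list of (date, value) pairs into (GP, DP).

    Assumption: the boundary day is counted as part of the GP.
    """
    growth = [(d, v) for d, v in series if d <= SPLIT_DATE]
    decline = [(d, v) for d, v in series if d > SPLIT_DATE]
    return growth, decline
```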

Statistical analysis
First, we compared the correlation between cumulative confirmed cases in China and the daily search volume during the whole pandemic, the GP, and the DP, respectively. Spearman correlation analysis in SPSS (version 23.0) was applied to explore the relationships between the daily growth of confirmed cases (DGCC) and daily Baidu Index values (DBIV). DBIV for each keyword was treated as the independent variable, while the cumulative confirmed cases and DGCC were each treated as the dependent variable. We analyzed the association between DBIV and DGCC nationwide and in each of the 34 provinces/regions. We defined the growth of the Baidu Index as DBIV minus its previous day's value; the larger this growth, the more searchers and potentially affected persons there are compared with the previous day. To explore whether people's search behavior runs ahead of the outbreak, we compared the day on which the growth of the Baidu Index reached its apex with the day on which DGCC reached its maximum, and defined the gap between them as the search-to-confirmed interval (STCI). We selected only the top ten provinces/regions ranked by cumulative confirmed cases for the STCI study because the other provinces/regions had too few accumulated confirmed cases. P < 0.05 (two-sided) was considered statistically significant. In addition, GraphPad Prism 8.2 was used to draw the figures.
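For readers who want to reproduce the two quantities defined above without SPSS, the following is a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the toy numbers are invented for illustration, and no p-value is computed. Spearman's r_s is obtained as the Pearson correlation of the rank-transformed series (ties receive their average rank), and the STCI is taken as the number of days from the peak of the day-over-day Baidu Index growth to the peak of DGCC.

```python
from datetime import date, timedelta


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either series is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def daily_growth(values):
    """Day-over-day difference: value[t] - value[t-1]."""
    return [b - a for a, b in zip(values, values[1:])]


def stci_days(dates, dbiv, dgcc):
    """Days from the peak of Baidu Index growth to the peak of DGCC.

    A positive result means the search peak precedes the case peak.
    """
    growth = daily_growth(dbiv)
    # growth[0] is the change observed on dates[1], hence the offset of 1.
    search_peak = dates[1 + growth.index(max(growth))]
    case_peak = dates[dgcc.index(max(dgcc))]
    return (case_peak - search_peak).days


# Toy data, not from the study.
start = date(2020, 1, 11)
dates = [start + timedelta(days=i) for i in range(6)]
toy_dbiv = [10, 50, 60, 65, 60, 50]
toy_dgcc = [0, 1, 2, 5, 9, 3]
lead = stci_days(dates, toy_dbiv, toy_dgcc)
```

With real data, a library routine such as `scipy.stats.spearmanr` would also return the p-value that the hand-rolled version above omits.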
Table 2 and Figure 2 show a strong, statistically significant positive correlation nationwide between DGCC and the Baidu Index search values related to fever (r_s = 0.786, p = 8.013×10^-23), cough (r_s = 0.556, p = 1.087×10^-9), fatigue (r_s = 0.763, p = 7.930×10^-21), sputum production (r_s = 0.665, p = 1.793×10^-14), and shortness of breath (r_s = 0.780, p = 2.673×10^-22). Across the 34 provinces/regions in China, the number of daily confirmed cases increased as Baidu searches for terms related to fever, cough, fatigue, and shortness of breath increased, except in Hong Kong, Macao, Taiwan, and Tibet. However, DBIV for cough in Shanghai showed no correlation with DGCC (r_s = 0.133, p = 0.184). In addition, the correlation between sputum production and DGCC was inconspicuous in several provinces/regions.

3.2 Baidu Index maximum growth rate earlier than DGCC
Figure 3 shows that the peak of the growth rate of the Baidu Index occurred 19-22 days earlier than the peak of DGCC across China (STCI for fever: 22 days; cough: 19 days; fatigue: 20 days; sputum production: 19 days; shortness of breath: 19 days). The top 10 provinces/regions ranked by confirmed cases presented similar results, except for sputum production (Figure 3). However, the peak of the growth rate of the Baidu Index related to fatigue in Heilongjiang lagged behind the peak of DGCC by 17 days.

Discussion
The current study found that Internet big data could be used to warn of outbreaks of epidemic diseases. We analyzed the correlation between search behavior for COVID-19-related keywords and the number of confirmed cases using Internet big data. We found that the search volume of several COVID-19-related keywords correlates strongly with the number of confirmed cases. Moreover, the STCI analysis anticipated epidemic peaks earlier than previous big data monitoring (usually about a week in advance), with a lead time longer than the incubation period of the disease.
China has achieved notable success in controlling the COVID-19 pandemic and is one of the few countries in which production has resumed across society. The search behavior of Chinese citizens during the epidemic helps in analyzing the correlation between the clinical symptoms of affected people and search values. People with a history of travel to highly regulated areas or of exposure to confirmed patients are required to quarantine. Our research suggests that dynamic changes in DGCC lag behind the public Baidu Index values for COVID-19-related symptoms. When search volumes increased, the cumulative confirmed cases increased as well, which indicates that the searchers could be potentially infected persons. Although the number of confirmed cases was still increasing, the public's attention dropped significantly, as shown by the decline in Baidu Index values during the DP. This shows that the related daily Baidu Index values peaked earlier than DGCC and also began to decline earlier in the later period. Based on this result, we hypothesize that the related DBIV can be used as an indicator of epidemic development. Public search behavior can reflect potential physical and psychological problems [6,24]. The decline in search values also indicates that the public paid less attention to COVID-19 in the later stage of the pandemic than in the earlier stage. The Baidu Index can therefore be used to monitor both the epidemic situation and public attention to COVID-19.
Overall, DBIV for all five keywords was positively correlated with DGCC during the outbreak. From the dynamic fluctuations, we can identify coordinated changes in DBIV and DGCC, with the former staying ahead of the latter. Among the 34 provinces/regions, most areas showed statistically significant correlations between DBIV and DGCC (except for sputum production), but Hong Kong, Macao, Taiwan, and Tibet did not. This is probably because the Baidu search engine is not the primary search tool in non-mainland areas such as Hong Kong, Macao, and Taiwan [4]. Tibet had very few cumulative confirmed cases (only one), which left too few cases to calculate the correlation using SPSS 23.0. In addition, there was no correlation between DGCC and DBIV for cough in Shanghai, probably owing to the incompleteness of the search words related to the keyword. Based on our research, an increase in the related DBIV relative to a preceding period can be treated as an abnormal signal that warrants advance action by government departments. Sputum production is more common in the elderly with chronic respiratory diseases and tends to be strongly associated with seasonal influenza, which occurs every year from late autumn to early spring [25].
The growth rate of the Baidu Index represents the number of new searchers compared with the previous day. An increase in the number of relevant searchers indicates more potentially exposed persons. Around 97.5% of people with an identifiable exposure history will develop symptoms within 11.5 days; onset after more than 14 days accounts for 2% (99th percentile) [26]. We found that the maximum of DBIV's growth rate occurred on average 20 days earlier than that of DGCC in most areas except Heilongjiang. The abnormality in Heilongjiang may suggest insufficient preparation for the pandemic. People used the Internet to search for symptoms rather than going to the hospital, which suggests that publicly reported data may overrepresent severe cases [6,27,28]. Since the government implemented isolation measures during the epidemic, the standard medical treatment process has been slower and more complicated [29]. Moreover, many community hospitals cannot prescribe medicine for fever patients, so potential patients with minor symptoms are missed. These people are likely to use search engines (usually Baidu) for related information, so the Baidu Index provides an original way to reflect the number of these potentially infected persons. Those with mild infections may theoretically appear to have a longer incubation period because of a lag of several days before cases are confirmed [30]. The longer the search-to-confirmed interval, the more time relevant departments have to prepare adequately. The results mean that big data on public search behavior can, to some extent, detect the COVID-19 pandemic situation in advance, highlighting the importance of including search engine data in follow-up prevention and control. The key message is that network search values for the clinical characteristics of COVID-19, as measured by the Baidu Index, can be used to monitor the development of the epidemic. The results would be more convincing if data from all mainstream search engines were included.

Strengths And Limitations
Previous research based on search engine data has focused on assessing epidemic burden [31] and progression [7-10, 27, 28]. This is the first study to explore public search behavior for COVID-19-related symptom terms as a means of warning of the outbreak and prevalence of COVID-19.
However, some limitations need to be recognized. Other platforms, such as Weibo and Twitter, also hold a prominent market share, and we did not combine the search values of all search engines to obtain a more representative database. People who never search on the Internet should also be taken into account. Search engine values are unavoidably influenced by mainstream media, because public search behavior is primarily guided by media promotion [32]. The Baidu Index does not provide specific information about searchers, such as gender, age, and location, so it is difficult to monitor potentially infected persons in a targeted manner. A useful model is urgently needed to overcome the shortcomings of relying on a single mainstream search engine and the unavailability of searcher information.

Conclusion
Public search behavior shows that the Baidu Index can provide practical real-time information to predict the development of the epidemic during the COVID-19 outbreak. The Baidu Index could guide more effective and targeted intervention and prevention of COVID-19 and assist in the overall control of this pandemic.

Declarations
Ethics approval and consent to participate: Not applicable.

Consent for publication:
Not applicable.

Data availability statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflict of interest statement
The authors declare that there is no conflict of interest.

Author contribution
JQ conceived the study idea. BZ collected the data. BZ, YY, and LF contributed to the analysis of the data and wrote the initial draft, with all authors providing critical feedback and edits to subsequent revisions. All authors approved the final draft of the manuscript. All authors are accountable for all aspects of the work, ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. JQ is the guarantor. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

Search-to-confirmed interval of Baidu Index for COVID-19-related symptoms in the top ten provinces/regions with the most confirmed cases. E.g., the purple line represents the absolute value of a