Effect of Sociodemographic Factors on COVID-19 Incidence of 342 Cities in China: An Analysis Using Geographically Weighted Regression Model

Background: The coronavirus disease 2019 (COVID-19) has spread rapidly and has had a serious global impact since December 2019. However, there were considerable geographical disparities in the distribution of COVID-19 incidence among different cities. In this study, we aimed to explore the effect of sociodemographic factors on the COVID-19 incidence of 342 cities in China from a geographic perspective. Methods: Official surveillance data on COVID-19 and sociodemographic information for the 342 cities of China were collected. The local geographically weighted Poisson regression (GWPR) model and the global generalized linear model (GLM) Poisson regression model were compared to find the optimal one for analysis. Results: The GWPR model had a markedly lower corrected Akaike information criterion (AICc) than the GLM Poisson regression model (43218.9 in GWPR vs. 61953.0 in GLM). No spatial auto-correlation of residuals was found in the GWPR model (global Moran's I = -0.005, p = 0.468), indicating that the spatial auto-correlation had been captured by the GWPR model. Cities with higher GDP, limited health resources, and a shorter distance to Wuhan were at higher risk for COVID-19. As population density increased, the incidence of COVID-19 decreased for most of the cities, except parts of the southeastern cities. Conclusions: Sociodemographic factors have potential effects on COVID-19 incidence. Furthermore, the findings and methodology of our study could be used as a guide for other countries to help understand the local transmission of COVID-19 and tailor site-specific intervention strategies.


Background
The coronavirus disease 2019 (COVID-19) pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), began in December 2019 and spread rapidly through the population [1]. Since the outbreak occurred in Wuhan, Hubei Province, the Chinese government has taken unprecedented measures in response to this serious public health issue [2]. The epidemic has been largely brought under control in China, with 80744 confirmed cases as of March 25, 2020 [3], after which almost all newly confirmed cases have been imported from abroad. However, COVID-19 has had a truly global impact, with approximately 29 million confirmed cases and over 820,000 deaths across 188 countries as of September 14, 2020 [4]. To prevent and control the epidemic, it is therefore of great significance to explore the essential features and potential risk factors of COVID-19. Although researchers have studied the epidemiological characteristics, clinical diagnosis, and treatment of COVID-19 [5][6][7][8][9], there are few reports on the geographical distribution of COVID-19 in relation to the sociodemographic factors of different regions.
To our knowledge, most previous medical studies have relied on conventional regression models such as ordinary least squares regression and generalized linear models (GLM) [10][11][12]. Conventional regression models produce average parameters over the whole study region and do not take potential geographical variation in rates into consideration, which may cause bias. Geographically weighted regression (GWR), in contrast, is a powerful approach for exploring possible geographical variation in the mortality and incidence of infectious diseases and other health problems across space [13,14]. Geographically weighted Poisson regression (GWPR), an extension of GWR, was initially developed for modeling small-scale mortality that follows the Poisson distribution. Recently, GWPR has been increasingly used to explore the relationships between disease incidence or mortality and geographically varying factors [15][16][17][18].
Therefore, this study had two aims: (a) to describe the geographical characteristics of COVID-19 incidence across different cities in China; and (b) to explore the spatially varying relationship between COVID-19 incidence and distance to Wuhan, gross domestic product (GDP), health resources, and population density.

Data sources and data setting
Using the available surveillance data on COVID-19 in China, we conducted a geographic epidemiological study with the city as the basic geographical unit. Data on the confirmed cases of COVID-19 in each city as of March 25, 2020 were extracted from reports of the National Health Commission of the People's Republic of China and the provincial health commissions. From the 2019 China Statistical Yearbooks, we also extracted data on GDP, resident population, land area, and health resource indicators (number of health personnel per 1000 people, number of hospital beds per 1000 people, and number of health institutions per 1000 people) for each city of China.
The population density of each city was calculated by dividing its resident population by its land area. Geographic information on Chinese cities, including the longitude, latitude, and distance to Wuhan, was acquired from Google Earth. Because information on COVID-19 was lacking for Hong Kong, Macau, Taiwan, and the Dongsha Islands, a total of 342 cities were included in the analysis.

Data Analyses
The incidence of COVID-19 in each city was measured as the number of confirmed cases per million people. To summarize the overall level of health resources in each city, a principal component analysis combining the three health resource indicators was performed to extract a synthesized variable with a variance contribution of 81.12%. GDP was used as a proxy for the socioeconomic status of each study city. The synthesized health resources variable, GDP, population density, and distance to Wuhan of each city were defined as the explanatory variables in this study. The geographic distributions of COVID-19 incidence and the explanatory variables were mapped using ArcGIS software.
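The two derived variables described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the choice to standardize the indicators before the decomposition, and the randomly generated toy data are all assumptions made for illustration.

```python
import numpy as np

def incidence_per_million(cases, population):
    """Confirmed cases per million inhabitants."""
    cases = np.asarray(cases, dtype=float)
    population = np.asarray(population, dtype=float)
    return cases * 1e6 / population

def first_principal_component(indicators):
    """Return (scores, explained_ratio) for the first principal component.

    indicators: array of shape (n_cities, n_indicators).
    Assumption of this sketch: columns are standardized first, so the
    decomposition is performed on the correlation matrix.
    """
    X = np.asarray(indicators, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loading = eigvecs[:, 0]
    if loading.sum() < 0:                        # make the sign interpretable
        loading = -loading
    scores = Z @ loading
    return scores, eigvals[0] / eigvals.sum()

# Hypothetical toy data: three correlated indicators for 50 cities.
rng = np.random.default_rng(0)
base = rng.normal(size=50)
toy = np.column_stack([base + 0.3 * rng.normal(size=50) for _ in range(3)])
hea_scores, explained = first_principal_component(toy)
```

Here `explained` plays the role of the variance contribution reported in the text, and `hea_scores` the role of the synthesized health resources variable.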
Based on the assumption that the COVID-19 incidence follows the Poisson distribution, the traditional GLM Poisson regression model was fitted and is expressed as follows:
$$\ln O_i = \beta_0 + \beta_1\,\mathrm{DEN} + \beta_2\,\mathrm{GDP} + \beta_3\,\mathrm{DIST} + \beta_4\,\mathrm{HEA} + \varepsilon_i$$
where $O_i$ denotes the incidence of COVID-19 in city $i$, $\beta_0$ is the global intercept, and $\beta_j$ ($j = 1, 2, 3, 4$) are the model parameters corresponding to the explanatory variables. DEN represents the average population density (100 inhabitants/km²) in each city. GDP is defined as the gross domestic product (100 million Renminbi Yuan) in each city. DIST is the straight-line distance (100 km) between the municipal government of city $i$ and the municipal government of Wuhan. HEA is the synthesized variable of health resources calculated through principal component analysis, and $\varepsilon_i$ is the error term of city $i$. The non-spatial GLM Poisson regression analysis was performed in R.
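To make the global model concrete, the following Python sketch fits a Poisson regression with a log link by iteratively reweighted least squares (IRLS). It is an illustration only, not the R analysis used in the study: the function name `fit_poisson_glm`, the simulated data, and the "true" coefficients are hypothetical, and the sketch follows the form of the equation above without an offset term.

```python
import numpy as np

def fit_poisson_glm(X, y, n_iter=50, tol=1e-10):
    """Poisson regression with log link, fitted by IRLS.

    Returns the coefficient vector [intercept, slopes...].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])    # add intercept column
    beta = np.zeros(A.shape[1])
    beta[0] = np.log(y.mean())                   # start from the null model
    for _ in range(n_iter):
        eta = A @ beta                           # linear predictor
        mu = np.exp(eta)                         # fitted mean
        z = eta + (y - mu) / mu                  # working response
        AtW = A.T * mu                           # IRLS weights equal mu
        beta_new = np.linalg.solve(AtW @ A, AtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Hypothetical simulated data with known coefficients.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(2000, 2))
true_beta = np.array([0.5, 0.3, -0.2])
y_toy = rng.poisson(np.exp(true_beta[0] + X_toy @ true_beta[1:]))
beta_hat = fit_poisson_glm(X_toy, y_toy)
```

On the simulated data, `beta_hat` should lie close to `true_beta`; each slope is on the log scale, which is the scale used by the coefficients discussed in this paper.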
In the GWPR model, the coefficients are allowed to vary across spatial locations, which means that the GWPR model can capture the non-stationarity of the spatial data and find the local association between the dependent variable and the explanatory variables. The GWPR model is expressed as follows:
$$\ln O_i = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\,\mathrm{DEN} + \beta_2(u_i, v_i)\,\mathrm{GDP} + \beta_3(u_i, v_i)\,\mathrm{DIST} + \beta_4(u_i, v_i)\,\mathrm{HEA} + \varepsilon_i$$
where $(u_i, v_i)$ denotes the two-dimensional coordinates of city $i$. The definitions of the other parameters are the same as in the GLM Poisson regression model above. In the GWPR model, a distance-based weighting scheme was used to allocate weights to each city. The GWR4 software was used to calibrate the GWPR model.
The optimal model was selected using the corrected Akaike information criterion (AICc) and Moran's I coefficient. Spatial auto-correlation is an important issue in the GLM Poisson regression model. In contrast, the spatial auto-correlation of the observations is expected to be removed after adjusting for the non-stationary effect in the GWPR model. Moran's I coefficient, which ranges from -1 to 1, is commonly used to assess spatial auto-correlation [19]. A Moran's I equal to zero indicates no spatial auto-correlation. The global Moran's I was used to test the spatial auto-correlation of the residuals of both the GLM Poisson regression model and the GWPR model in our study.
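The idea of locally varying coefficients can be sketched in Python as one weighted Poisson fit per city. This is not the GWR4 implementation used in the study: the Gaussian kernel, the fixed bandwidth, the function names, and the simulated data are assumptions chosen for illustration, since the text specifies only that the weighting scheme is distance-based.

```python
import numpy as np

def gaussian_weights(coords, i, bandwidth):
    """Kernel weights of all cities relative to city i (assumed Gaussian)."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def weighted_poisson_fit(A, y, w, n_iter=50, tol=1e-8):
    """Poisson log-link fit by IRLS with observation weights w."""
    beta = np.zeros(A.shape[1])
    beta[0] = np.log(max(np.average(y, weights=w), 1e-8))
    for _ in range(n_iter):
        eta = A @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu                  # working response
        AtW = A.T * (w * mu)                     # kernel weight times IRLS weight
        beta_new = np.linalg.solve(AtW @ A, AtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

def fit_gwpr(coords, X, y, bandwidth):
    """Return an (n_cities, n_coefficients) array of local coefficients."""
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    return np.vstack([
        weighted_poisson_fit(A, y, gaussian_weights(coords, i, bandwidth))
        for i in range(len(y))
    ])

# Hypothetical data: the slope grows from west (u = 0) to east (u = 10).
rng = np.random.default_rng(7)
n = 400
coords_toy = rng.uniform(0, 10, size=(n, 2))
x_toy = rng.normal(size=n)
slope_true = 0.1 * coords_toy[:, 0]
y_toy = rng.poisson(np.exp(1.5 + slope_true * x_toy))
local_betas = fit_gwpr(coords_toy, x_toy[:, None], y_toy, bandwidth=2.0)
```

Each row of `local_betas` holds the intercept and slope estimated at one location. With a very large bandwidth all kernel weights approach 1, and every local fit collapses to the global Poisson regression, which is one way to see the GLM as a special case of the GWPR model.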

Results
By the end of March 25, 2020, a total of 80744 confirmed cases of COVID-19 had been diagnosed in the 342 cities, giving an overall incidence of 57.9 per million people across China. Among the study cities, the highest incidence of COVID-19 was in Wuhan, Hubei Province (4512.8 per million people), and the lowest incidence was in the western cities, which had few confirmed cases (Fig. 1). According to the global Moran's I statistic (Moran's I = 0.039, p < 0.05), the incidence of COVID-19 showed positive spatial auto-correlation, that is, a clustered pattern.
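For readers unfamiliar with the statistic, the global Moran's I can be computed as in the following Python sketch. The binary adjacency matrix, the four-unit toy layout, and the permutation-based pseudo p-value are assumptions for illustration; the paper does not state which spatial weights matrix or inference method was used.

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I for values observed on units with spatial weights W."""
    x = np.asarray(values, dtype=float)
    W = np.asarray(W, dtype=float)
    z = x - x.mean()
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

def permutation_p_value(values, W, n_perm=999, seed=0):
    """One-sided pseudo p-value for positive spatial auto-correlation."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    observed = morans_i(x, W)
    sims = np.array([morans_i(rng.permutation(x), W) for _ in range(n_perm)])
    return observed, (np.sum(sims >= observed) + 1) / (n_perm + 1)

# Hypothetical layout: four units in a row, neighbours share an edge.
W_chain = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

i_trend = morans_i([1, 2, 3, 4], W_chain)        # similar values adjacent
i_alternating = morans_i([1, -1, 1, -1], W_chain)  # dissimilar values adjacent
```

In this toy layout the smoothly increasing values give a positive Moran's I and the alternating values give a negative one, matching the interpretation of clustered versus dispersed patterns.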
There were considerable geographical disparities in the distribution of the explanatory variables among the study cities. Compared with the western cities, the socioeconomic level and health resources were higher in the central and eastern cities of China (Fig. 2A and 2C). Similarly, population density was higher in the eastern and central cities than in the western cities (Fig. 2B). The distance between Wuhan and each study city is presented in Fig. 2D. A more detailed description of the study variables is provided in Table 1, in which Min denotes the minimum value, X25% the first quartile, X75% the third quartile, and Max the maximum value.
In the GLM Poisson regression model, the intercept and all four explanatory variables were significant at the 1% level (Table 2). The distance of each city to Wuhan was negatively associated with the incidence of COVID-19; when the distance increased by 100 km, the incidence of COVID-19 would decrease by a factor of approximately 0.7818. Furthermore, local population density and health resources in each city were inversely associated with the incidence of COVID-19, suggesting that higher population density and more health resources might reduce the incidence of COVID-19. Higher GDP, in contrast, was associated with a higher incidence, but the association was very weak (coefficient 0.0002). After controlling for all explanatory variables in the GLM Poisson regression model, the residuals still exhibited positive spatial auto-correlation (global Moran's I = 0.128, p < 0.001), which indicated that the spatially non-stationary relationships were not adequately addressed by the GLM Poisson analysis.
The GWPR model with a spatially varying intercept and explanatory variables was then fitted (Table 3). The GWPR model had a markedly lower AICc than the GLM Poisson regression model (43218.9 in GWPR vs. 61953.0 in GLM). No spatial auto-correlation of residuals was found in the GWPR model (global Moran's I = -0.005, p = 0.468), indicating that the spatial auto-correlation had been captured by the GWPR model.
Figure 3 presents the spatially varying coefficients of the four explanatory variables in the GWPR model. The economic indicator GDP was positively associated with the incidence of COVID-19, with higher coefficients in the central and northern cities (Fig. 3A). As population density increased, the incidence of COVID-19 decreased for most of the cities, except parts of the southeastern cities (Fig. 3B). Health resources also had a negative impact on the incidence of COVID-19, with higher coefficients in the central and eastern cities and lower coefficients in the western and northeastern cities (Fig. 3C). A greater distance between Wuhan and the study cities might decrease the risk of COVID-19, with the coefficient ranging from -1.0596 to -0.6655 among different cities (Fig. 3D).
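Because both models use a log link, a coefficient can be read as a multiplicative change in expected incidence per unit increase of the explanatory variable. The short Python sketch below applies this standard conversion to the range of local distance coefficients and computes the AICc difference, using only the values reported above; the helper name `rate_ratio` is hypothetical.

```python
import math

def rate_ratio(coef, delta=1.0):
    """Multiplicative change in expected incidence for a change of
    `delta` units in the explanatory variable under a log link."""
    return math.exp(coef * delta)

# Range of local GWPR coefficients for distance to Wuhan (per 100 km).
rr_strongest = rate_ratio(-1.0596)   # about 0.35
rr_weakest = rate_ratio(-0.6655)     # about 0.51

# Difference in AICc between the global and local models.
delta_aicc = 61953.0 - 43218.9       # about 18734.1
```

Read this way, each additional 100 km from Wuhan corresponds to the expected incidence being multiplied by roughly 0.35 to 0.51, depending on the city, under the fitted GWPR model.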

Discussion
To explore potential risk factors of COVID-19, a geographic information system (GIS) was used to visualize the geographic distribution of COVID-19 incidence in relation to sociodemographic factors including GDP, population density, distance to Wuhan, and health resources. The local GWPR model and the global GLM Poisson regression model were compared to find the optimal fitting model for exploring the association between the sociodemographic factors and COVID-19 incidence. Compared with the GLM Poisson regression model, the GWPR model clearly improved the model fit.
According to both the GLM Poisson regression model and the GWPR model, study cities with higher GDP might have an increased risk for COVID-19. A recent study found that the rapid spread of COVID-19 around the world tended to appear first in the most economically developed regions, where high-level international trade and commercial activities were prevalent. After initially spreading along the routes of international trade between the developed regions, the virus later spread to the developing regions [20]. In our study, the GWPR model showed higher coefficients in the central and northern cities than in the southern cities of China. A possible explanation for this phenomenon is that southern cities have more robust economies than the northern cities, so that economic improvement might have a more extensive and significant influence on northern cities, accordingly increasing the infection density of COVID-19 [21]. More detailed causes require further investigation.
As the distance to Wuhan increased, the incidence of COVID-19 decreased in all of the cities according to both the GLM Poisson regression model and the GWPR model. The spatially varying coefficients showed a decreasing trend from the southeast to the northwest in the GWPR model. Before Wuhan was officially sealed off, more than 5 million people had left the city, which made it impossible to track exactly where these people went; the distance to Wuhan can partly represent this massive level of human movement. Cities positioned farther away from Wuhan had little or no contact with the infectious sources, which hindered the spread of COVID-19. In contrast, in cities near Wuhan, with convenient transportation and daily population movement, residents had more opportunities for contact with the infectious sources, which promoted the spread of COVID-19. Many studies have revealed the aggregation characteristics of the virus and stressed the importance of blocking the epidemic areas and isolating the infectious sources [22], which is consistent with our findings in this paper and with other studies [23].
According to the GWPR model, the coefficients of health resources were negative in all 342 cities and showed a decreasing trend from the southeast to the northwest, indicating that a higher level of health resources might mitigate the epidemic of COVID-19. A higher level of health resources could help identify the sources of infection and give suspected patients and close contacts better access to quarantine measures, which would prevent the spread of the epidemic and reduce the COVID-19 incidence. Many studies have also emphasized the importance of controlling the sources of infection and cutting off the routes of transmission [24]. However, it is worth noting that health resources were poorer in the western cities than in the central and eastern cities of China. Previous reports have confirmed that the availability and accessibility of health resources in China have substantial regional disparities [25]. Since the outbreak of COVID-19, the Chinese government has made great efforts to construct new medical facilities, mobilize the country's large and robust medical forces, and accelerate the delivery of medical supplies, and the epidemic was quickly brought under control. Our study also suggests that the government should increase the input of medical and health resources in various regions to control infectious diseases effectively. The situation in China could provide guidance to other countries on how to prepare for possible local outbreaks, especially resource-limited countries [26].
Both the GWPR model and the GLM Poisson regression model showed that the population density of the study cities was negatively associated with the COVID-19 incidence. In the GWPR model, this effect decreased from the north, with a lower population density, to the south, with a greater population density. The paradox was that the incidence of COVID-19 was higher in cities with lower population density. Considering the very particular background of the virus spread in China, there are two possible reasons. First, the outbreak coincided with the Spring Festival and the accompanying travel rush in China, which resulted in a considerable movement of people from the large cities to medium-sized and small cities or even rural areas for family reunions. Many large cities were therefore much less populated during this period. Second, after the outbreak of COVID-19 in Wuhan, many residents of some large cities, including Wuhan, took "evasive action" and returned to small cities or even rural areas. During this particular period, these two factors might have resulted in a greater transmission risk of COVID-19 in small cities or even rural areas than in the big cities. A study from the US reported that household size, rather than overall population density, was more strongly associated with the prevalence of COVID-19 [27]. In another study, moreover, population density was considered a more effective predictor of COVID-19 infections and mortality for metropolitan areas than for rural areas [28]. Thus, the relationship between population density and the COVID-19 incidence requires further in-depth exploration.
Some limitations of the study should be acknowledged. First, the observed differences were subject to many unobserved confounding factors. For example, age, gender, nationality, and other natural factors were not available and thus could not be controlled for in the multivariate analysis. Second, because this research was based on surveillance data, the causal relationship between sociodemographic characteristics and the incidence of COVID-19 could not be demonstrated. Third, due to the different policies and measures adopted in response to COVID-19 in each country, the results of our study cannot be extrapolated to other countries. Nevertheless, to the best of our knowledge, this study is the first to combine COVID-19 surveillance and sociodemographic data in a GIS and to analyze possible risk factors of COVID-19 incidence in China from a spatial perspective, filling a gap in the knowledge of this geographical region.

Conclusions
In summary, our results showed that the local GWPR model fitted better than the global GLM Poisson regression model for investigating the effects of sociodemographic factors on COVID-19. Cities with higher GDP, limited health resources, and a shorter distance to Wuhan were at higher risk for COVID-19. Moreover, the relationship between population density and COVID-19 incidence might be influenced by the particular background of the virus spread in China, i.e., the Spring Festival and the accompanying travel rush. These findings might have important public health policy implications for COVID-19 control and prevention in China, and could be used as a guide for other countries to help understand the local transmission of COVID-19.

Table 1
Summary of descriptive statistics of independent variables and dependent variable.

Table 2
Summary statistics of global GLM Poisson regression model.

Table 3
Summary statistics of local GWPR model.