Skip to main content

The fine-scale associations between socioeconomic status, density, functionality, and spread of COVID-19 within a high-density city



Motivated by the need for precise epidemic control and epidemic-resilient urban design, this study aims to reveal the joint and interactive associations between urban socioeconomic, density, connectivity, and functionality characteristics and the COVID-19 spread within a high-density city. Many studies have been made on the associations between urban characteristics and the COVID-19 spread, but there is a scarcity of such studies in the intra-city scale and as regards complex joint and interactive associations by using advanced machine learning approaches.


Differential-evolution-based association rule mining was used to investigate the joint and interactive associations between the urban characteristics and the spatiotemporal distribution of COVID-19 confirmed cases, at the neighborhood scale in Hong Kong. The associations were comparatively studied for the distribution of the cases in four waves of COVID-19 transmission: before Jun 2020 (wave 1 and 2), Jul–Oct 2020 (wave 3), and Nov 2020–Feb 2021 (wave 4), and for local and imported confirmed cases.


The first two waves of COVID-19 were found mainly characterized by higher-socioeconomic-status (SES) imported cases. The third-wave outbreak concentrated in densely populated and usually lower-SES neighborhoods, showing a high risk of within-neighborhood virus transmissions jointly contributed by high density and unfavorable SES. Starting with a super-spread which considerably involved high-SES population, the fourth-wave outbreak showed a stronger link to cross-neighborhood transmissions driven by urban functionality. Then the outbreak diffused to lower-SES neighborhoods and interactively aggravated the within-neighborhood pandemic transmissions. Association was also found between a higher SES and a slightly longer waiting period (i.e., the period from symptom onset to diagnosis of symptomatic cases), which further indicated the potential contribution of higher-SES population to the pandemic transmission.


The results of this study may provide references to developing precise anti-pandemic measures for specific neighborhoods and virus transmission routes. The study also highlights the essentiality of reliving co-locating overcrowdedness and unfavorable SES for developing epidemic-resilient compact cities, and the higher obligation of higher-SES population to conform anti-pandemic policies.

Peer Review reports


Although the COVID-19 pandemic seems to become gradually under control with available vaccines and public health measures, continuous control of COVID-19 and prevention of recurrent outbreaks appears a must. Herd immunity through vaccination can take a long time or even be unachievable [1], current vaccines have reduced effectiveness against certain new SARS-CoV-2 variants [2], including that against both symptomatic and asymptomatic infections [3]. Precise control measures, tailored to specific needs of different neighborhoods and population groups within a city, are pressingly needed if the great socioeconomic and medical costs concerning the long-term epidemic control are to be reduced. Also, urban design needs to be informed about how to make cities more resilient to the prospective future epidemics.

The precise epidemic control and epidemic-resilient urban design depend highly on the understanding of the complex associations between many urban characteristics and the spread of COVID-19. To understand these associations is the key to identify high-risk neighborhoods and population groups, to pinpoint COVID-19 transmission routes causing the high risk, and finally to determine epidemic control measures and urban designs pointed to these risk factors and transmission routes. Here, intra-city COVID-19 transmission routes may be roughly divided into (a) within-neighborhood transmissions, including those between family members and between residents in the same neighborhood, often due to failing to keep social distance or defected built environment (e.g., ventilation); and (b) cross-neighborhood transmissions, typically caused by cross-neighborhood activities and interactions between residents in different neighborhoods.

Studies on the associations between urban characteristics and the COVID-19 spread have resulted in many valuable findings [4,5,6,7,8,9,10,11,12,13]. However, the following issues are pending to be settled:

  1. (a)

    Limited by the granularity of data, most studies on these associations used cities/counties/towns as the spatial units. Only a couple of studies pioneered the intra-city associations [10, 11]. At a city or coarser scale, it is hard to observe the associations between intra-city COVID-19 spread and urban characteristics. As a result, experts consider some socioeconomic, density, and functionality characteristics as jointly and interactively contribute to COVID-19 spread [14], but there is a lack of empirical evidence for this argument at an intra-city scale.

  2. (b)

    There are insufficient comparative studies on different waves of COVID-19 outbreaks in specific countries or regions. Also, the associations for the cases imported from other countries/regions and for locally infected cases are not sufficiently compared.

  3. (c)

    By mainly using regressive models and cross-sectional analysis, the studies normally evaluated the associations between individual urban characteristics and the confirmed case distribution, or the combined association of all characteristics on the confirmed case distribution. There were few studies on more complex combined associations, for example, a characteristic A and the COVID-19 spread can be positively associated when another characteristic B has low values, while negatively associated when B has high values. These combined associations and the comparative analysis in (b) are very important for inferring the joint and interactive contributions of urban characteristics to each COVID-19 transmission route.

Motivated by these issues, this study investigates the joint and interactive intra-city associations between urban socioeconomic, density, connectivity, and functionality characteristics and the COVID-19 spread, through both within-neighborhood and cross-neighborhood transmission routes. The study took place in Hong Kong, China, a metropolis with the world’s most densely populated neighborhoods. A modified version of the association rule mining (ARM) algorithm DESigFAR [15] was used to investigate the associations between the urban characteristics and COVID-19 confirmed case rate as well as the waiting period (i.e., the time duration between symptom onset and diagnosis). Based on differential evolution (DE), DESigFAR can optimize the resultant rules in terms of the strength of associations and capture combined associations between any subsets of variables. The associations for the first four waves of COVID-19 in Hong Kong and for local and imported cases were comparatively studied.

The results of this study can be used to anticipate the intra-city spread pattern from early increases of the cases, thereby taking pointed countermeasures to prevent recurrent outbreaks. The results can also provide references to the development of precise intra-city anti-pandemic measures and the improvement of urban design corresponding to specific pandemic transmission routes. These results would be particularly useful for high-density cities, which are usually prone to COVID-19 spread and play key roles in the pandemic control, due to their high density, extensive traffic networks, and complex uses of urban space. The ARM method described in this study can also serve to investigate the intra-city epidemic transmissions in other cities.


Data and variables

The study investigated two sets of response variables: the rate of the COVID-19 confirmed cases (in ‰ of the total population), and the median/average waiting period (in number of days) from symptom onset to diagnosis of the symptomatic COVID-19 local cases, at the Tertiary Planning Unit (TPU) level in Hong Kong as of Feb 18th, 2021. The values of both response variables were computed from the government’s open confirmed cases data [16]. In the data, each case had available reporting date; the location the case stayed prior to diagnosis, mostly the residence address; and the type of the case, inluding local case, epidemiologically linked with local case, imported case, or epidemiologically linked with imported case. Most cases were symptomatic and had available symptom onset dates. In this study, the cases epidemiologically linked with local and imported cases were also classified as local and imported cases. The addresses of the reported locations of the cases were transferred to latitudes and longitudes by using Google Maps Geocoding API.

The COVID-19 spread in Hong Kong was divided into four waves: wave 1 and 2 (before Jun 2020), wave 3 (Jul–Oct 2020), and wave 4 (Nov 2020–Feb 2021). In wave 1, there were more imported cases than local ones, and most imported cases had travel histories in developed countries badly hit by COVID-19, for example, the UK, the US, and France [17]. Wave 3 started from the Kwai Tsing Container Terminal Cluster with 77 confirmed cases related to overseas crews [18, 19]. Wave 4 started from the Dancing/Singing Cluster with 734 cases related to visitors to 28 local dancing/singing venues [20]. TPUs with high rates of cases in the Dancing/Singing Cluster had a moderate tendency to have higher income and education level (Additional file 1: Table S1).

To link the confirmed cases to the socioeconomic status (SES) of the residents in the neighborhoods, imported cases that were confirmed on border ports upon entering Hong Kong and cases reported in hotels were excluded in this study. After these exclusions, there were 391 local cases and 442 imported cases in wave 1 and 2, 3305 local cases and 68 imported cases in wave 3, and 4634 local cases and 29 imported cases in wave 4.

Then, the confirmed case rate and waiting period were computed for each of the 214 TPU-level areal units covering the entire city defined by Hong Kong 2016 By-Census. The rates for local and imported cases were computed separately (Fig. 1; Table 1). The imported case rates in waves 3 and 4 were not computed, since the cases were too few to be analyzed based the 214 areal units of the study area.

Fig. 1
figure 1

COVID-19 confirmed case rate in Hong Kong as of Feb 18th, 2021. a The local confirmed case rate in wave 1&2 of the COVID-19 spread in Hong Kong (by May 2020). b The imported confirmed case rate in wave 1&2. c The local confirmed case rate in wave 3 (Jul–Oct 2020). d The local confirmed case rate in wave 4 (Nov 2020–Feb 2021). e The entropy-based land use index value, together with boundaries of the main urban area and New Towns (i.e., satellite towns). The choropleth maps are colored according to quantile classification, and “p” in the legend means the percentile range of the variable values in each class among all TPUs

Table 1 Explanatory and response variables of the study

A total of 38 explanatory variables, including urban socioeconomic, density, connectivity, and functionality characteristics, were also computed for the 214 TPU-level areal units by using governmental open data sources [21,22,23,24] (Table 1). For computing the density and connectivity characteristics, areas of TPUs were extracted from the Hong Kong official digital boundaries of TPUs and street blocks [25].

Since TPUs are usually of small areas and specialized functionalities, the daily activities of most people are across TPU boundaries. Thus, the POI explanatory variables (Table 1) for investigating the risk of COVID-19 related to people’s daily activities could not be isolated in a TPU. Instead, following the distance decay law of the trips, the accessibility of POIs of type O from a TPU u, was computed by:

$$accessbility\left( {O,u} \right) = \sum\limits_{i \in O} {p\left( {i,u} \right)} ,$$
$$p\left( {i,u} \right) = \left\{ \begin{gathered} 1,\quad i{\text{ is within }}u \hfill \\ \exp \left( { - \beta \cdot 1.3ED\left( {i,u} \right)} \right),\quad i{\text{ is out of }}u \hfill \\ \end{gathered} \right..$$

where i represents each POI of type O, β = 0.3·S−0.17 = 0.22688 is the empirically most probable value of β in a gravity model [26], S = 5.172 km2 is the average area of TPU-level areal units in Hong Kong. ED(i, u) is the Euclidean distance between i and the boundary of u, and 1.3ED(i, u) is the approximated road network distance between i and u [27]. The values of density and per-capita accessibility of POIs (Table 1d) were the value of accessibility(i, u) over the area and over the population of the TPU, respectively.

Investigating the associations between urban characteristics and COVID-19 incidences

The associations between the explanatory and response variables were investigated by a modified version of the ARM algorithm DESigFAR [15]. ARM aims to discover implicit association rules in the form of “antecedent → consequent” from data. In this study, ARM was used to discover association rules in the form of “interval(s) of explanatory variable(s) → interval of confirmed case rate/waiting period”. For example, a resultant rule “prop_higher_edu > 0.364 (p85) → rate_imported12 > 0.124 (p71)” suggested that TPUs with very high higher-education rates (above the 85th percentile among all TPU-level areal units) tended to have high wave-1 and 2 imported case rates (above the 71st percentile).

The main advantage of the DESigFAR algorithm is that it can discover highly informative rules with strong associations and high interestingness, as a result of optimization based on DE, one of the best-performing evolutionary computing techniques for solving real-world problems [28]. In ARM with numerical data, data discretization is the process to divide the range of each variable into intervals (e.g., the interval prop_higher_edu > 0.364 in the above exemplary rule). Then the association rules will be generated from these intervals. A data discretization scheme includes:

  1. (a)

    The number of intervals for each variable, for example, whether the range of higher-education rate should be divided into two or three intervals;

  2. (b)

    The numerical data value for each interval, for example, whether the “boundary” of the higher-education rate interval in the above exemplary rule should be at 0.364 or 0.2.

DESigFAR can optimize the data discretization schemes towards those containing strongest associations between the intervals of the variables. Thus, it can discover much stronger rules with summed rule interestingness measure (RIM) values up to 10 times as high as the results of conventional, non-optimized ARM [15]. Also, the resultant rules of DE-based optimization are automatically limited to only those with high RIM values, thus the workload to interpret the rules is greatly reduced, and no attributes need to be precluded to limit the number of rules. Consequently, DESigFAR can address the major challenges in the application of ARM in public health studies, including (a) the discovered rules can be too weak; (b) experts need to conduct tedious manual analysis on the interestingness of the rules; and (c) many data attributes are precluded to limit the number of rules, reducing the chance for obtaining novel findings [29].

The procedure of the modified DESigFAR algorithm in this study are outlined as follows. More algorithmic details can be found from the publication proposing DESigFAR [15].

Step 1. Population initialization. To prepare for the DE, a population P with NP individuals are initialized. Each individual is a vector which encodes a rule template and the corresponding data discretization scheme. The rule template took the form of “any interval(s) of explanatory variable(s) → any interval of the confirmed case rate/waiting period”. For example, the above exemplary rule belonged to the rule template “any interval of higher-education rate → any interval of the confirmed case rate”. DESigFAR adopts a Gaussian-curve-based fuzzy data discretization model (Fig. 2). Under this model, each numerical variable v is divided into a number of intervals (e.g., I1, I2, and I3. in Fig. 2). For each interval I, a fuzzy membership function μI(v) [0, 1] is defined to represent the degree to which each value in v belongs to I. In each interval of v where 0 < μI(v) < 1, μI is a Gaussian curve.

Fig. 2
figure 2

An example of the fuzzy data discretization in the modified DESigFAR algorithm. In each interval (a, c) of v where 0 < μI(v) < 1 (e.g., (0.1, 0.2) between I1 and I2), μI is a Gaussian curve with a standard deviation equal to (c-a)/2.473

Step 2. DE. Three operators, namely mutation, crossover, and generation jumping, are applied to alter the individuals. Then the selection operator is used to select the individual that represents the better data discretization scheme from the original and altered individuals. This step repeats for G generations, to let the individuals continuously evolve to containing better data discretization schemes.

Step 2.1. Mutation. Given a mutation scale F, NP mutant vectors V1, …, VP are created. Each mutant vector is generated by using three randomly selected individuals, Ma, Mb, and Mc. In the t-th generation,

$$V_{i}^{t} = M_{a}^{t} + F\left( {M_{b}^{t} - M_{c}^{t} } \right),\quad i = 1 \ldots N_{P} .$$

Step 2.2. Crossover. Given a crossover rate Cr [0, 1], each individual is recombined with a mutant vector obtained from the mutation operation into a trial vector U:

$$u_{j,i}^{t} = \left\{ {\begin{array}{*{20}c} {v_{j,i}^{t} \quad {\text{if }}rand_{i} [0, \, 1] \le Cr{\text{ or }}j = j_{rand} } \\ {m_{j,i}^{t} \quad {\text{otherwise, }}} \\ \end{array} } \right.$$

where \(m_{j,i}^{t} ,u_{j,i}^{t} {\text{ and }}v_{j,i}^{t}\) are the sub-vectors that contain the encoding for the j-th variable in \(M_{i}^{t} ,U_{i}^{t} {\text{ and }}V_{i}^{t}\); randi[0,1] is a random number selected from [0, 1]; and jrand is a random index to ensure that the trial vector includes at least one variable from the mutant vector.

Step 2.3. Selection. In each pair of individual and trial vector obtained from the crossover operation, the vector having a higher fitness value will survive to the (t + 1)-th generation:

$$M_{i}^{t + 1} = \left\{ {\begin{array}{*{20}c} {U_{i}^{t} {\text{ if }}fitness\_value\left( {U_{i}^{t} } \right) \ge \, fitness\_value\left( {M_{i}^{t} } \right)} \\ {M_{i}^{t} {\text{ otherwise}}{. }} \\ \end{array} } \right.$$

In this study, the fitness value was defined as a combination of two RIMs, leverage [30], lev and improvement [31], imp:

$$fitness\_value\left( {M_{i}^{t} } \right) = \sum\limits_{r} {\left[ {{{lev\left( r \right)} \mathord{\left/ {\vphantom {{lev\left( r \right)} {30}}} \right. \kern-\nulldelimiterspace} {30}} + imp\left( r \right)} \right]} ,r{\text{ belongs to }}M_{i}^{t} ,lev\left( r \right) > 0,imp\left( r \right) > 0.$$

That is, the fitness value is equal to summed leverages over 30 plus summed improvements of all rules belonging to \(M_{i}^{t}\) that have both a positive leverage and a positive improvement. The same applies to \(U_{i}^{t}\). The leverage and improvement are computed by:

$$lev\left( {r:X \to Y} \right) = supp\left( {X \to Y} \right) - {{supp\left( X \right)supp\left( Y \right)} \mathord{\left/ {\vphantom {{supp\left( X \right)supp\left( Y \right)} {\left| D \right|}}} \right. \kern-\nulldelimiterspace} {\left| D \right|}},$$
$$imp\left( {X \to Y} \right) = conf\left( {X \to Y} \right) - \mathop {\max }\limits_{Z \subset X} \left( {conf\left( {Z \to Y} \right)} \right) > 0,$$


$$supp\left( {X \to Y} \right) = supp\left( {X \cup Y} \right) = \left| {R \in D:X \cup Y \subseteq R} \right|,$$
$$conf\left( {X \to Y} \right) = {{supp\left( {X \to Y} \right)} \mathord{\left/ {\vphantom {{supp\left( {X \to Y} \right)} {supp\left( X \right)}}} \right. \kern-\nulldelimiterspace} {supp\left( X \right)}}.$$

X and Y are the antecedent and consequent of the rule r; supp and conf denote support and confidence, two basic RIMs in ARM. |D│is the number of records in the dataset D, │D│ = 214 in this study. Let \(X = \left\{ {{}^\backprime v_{1} = I_{1}^{\prime } , \ldots ,^{\prime}v_{m} = I_{m}^{\prime } } \right\}\), where v1,…,vm are a series of variables, and I1,…,Im are the intervals of v1,…,vm in X. Then

$$supp\left( X \right) = \sum\limits_{R \in D} {\mu_{{I_{1} }} \left( {v_{R1} } \right) \ldots \mu_{{I_{m} }} \left( {v_{Rm} } \right)} ,$$

where vR1,…,vRm are numerical values for v1vm in record R of dataset D. The same applies to supp (Y) and supp (X → Y).

A positive leverage means that there were more TPUs with the values of all variables contained by the rule falling in their value ranges specified in the rule, compared with if the antecedent and consequent of the rule is independent. A positive improvement indicates that every explanatory variable increased the confidence of the rule. For example, if the rule “prop_higher_edu > 0.364 (p85) → rate_imported12 > 0.124 (p71)” has lev = 15.8 and imp = 0.49, then among the TPUs with higher-education rate above 0.364, about 15.8 more TPUs had wave-1&2 imported case rates above 0.124‰, compared with if the higher-education rate and confirmed case rate were irrelevant. Also, the probability for TPUs with high higher-education rate to have high wave-1&2 imported case rates was 49% higher than average TPUs. Leverage is a portion of the data size and normally much larger than 1, while improvement is normally only a small fraction of 1. Therefore, leverage was divided by 30 in (6), to balance the weight of the two RIMs in the computation of the fitness values.

Step 2.4. Opposition-based generation jumping. This step is to prevent the population from being trapped in local optima and, thus, failing to search for better data discretization schemes. Each generation has a probability Jr to conduct the generation jumping, instead of mutation and crossover. From each individual in current population P, an opposite individual is generated, by replacing each number x in the original individual with \(\mathop x\limits^{ \cup }\):

$$\mathop x\limits^{ \cup } = rank^{ - 1} \left( {1 + \left| D \right| - rank\left( x \right)} \right),$$

where rank(x) is the rank of x among all data values of the variable (e.g., elderly rates of all different TPUs); and rank−1(r) is the data value with rank r among all data values of this variable. All the NP opposite individuals form an opposite population OP, and NP individuals with the highest fitness values in OPP are selected to survive to the next generation.

The following values of the DE parameters were used in this study: P = 300 and 100 for rules about the confirmed case rate and the waiting period, respectively; G = 3000; Cr = 0.5; F = 0.5; Jr = 0.04. The P and G values were such determined that the optimization result generally converged, that is, the number of rules and fitness values almost stayed the same with more generations. The Cr, F, Jr values were such determined to speed up the convergence of the optimization result. The maximum number of variables in antecedent of a rule, maxL, was set to 3 to allow the combined association of up to three explanatory variables on the response variables to be analyzed. The minimum fraction of transition in the fuzzy sets, ftmin, was set to 0.5, following the study proposing DESigFAR [15]. The explanation of ftmin is detailed in [15]. Also, the relative RIM values among the rules were not sensitive to the ftmin value.

Step 3. Statistical evaluation and result output. After the DE, rules with positive improvement and leverage values are extracted from the optimized individuals as the ARM result. In this study, chi-square test was conducted on the statistical significance of positive improvement of each rule X → Y, that is, imp(X → Y) > 0. Following [15], a simplified test was conducted with

$$\begin{gathered} {\text{Null hypothesis }}H_{{0}} :\exists x_{m} \in X{, }\Pr \left( {Y|X} \right) \le \Pr \left( {Y|X\backslash \left\{ {x_{m} } \right\}} \right) \\ {\text{Alternative hypothesis }}H_{{1}} :\forall x_{m} \in X{, }\Pr \left( {Y|X} \right) > \Pr \left( {Y|X\backslash \left\{ {x_{m} } \right\}} \right) \\ \end{gathered}$$

For each fuzzy value interval of explanatory variable \(I_{m} \in X\),

$$\chi_{m}^{2} = \frac{{\left( {ad - bc} \right)\left( {a + b + c + d} \right)}}{{\left( {a + b} \right)\left( {c + d} \right)\left( {a + c} \right)\left( {b + d} \right)}},$$


$$\begin{gathered} a = supp\left( {X \cup \left\{ Y \right\}} \right) \hfill \\ b = supp\left( {X \cup \neg \left\{ Y \right\}} \right) \hfill \\ c = supp\left( {\left( {X\backslash \left\{ {I_{m} } \right\}} \right) \cup \neg \left\{ {I_{m} } \right\} \cup \left\{ Y \right\}} \right) \hfill \\ d = supp\left( {\left( {X\backslash \left\{ {I_{m} } \right\}} \right) \cup \neg \left\{ {I_{m} } \right\} \cup \neg \left\{ Y \right\}} \right), \hfill \\ \end{gathered}$$

and ¬ means to that the corresponding explanatory variable value of the TPU was not in the interval defined in Im or Y. For each \(\chi_{m}^{2}\), a P value was looked up from the chi-square table with one degree of freedom. The final P value of the rule was equal to the largest P value resultant from all \(I_{m} \in X\).

To make the resultant rules more readable, each fuzzy interval I of variable v is represented by a crisp interval of v, where μI(v) is the largest among the membership degrees of different fuzzy intervals in v. For instance, the fuzzy interval for “low higher-education rate” in Fig. 2 is represented as “higher-education rate < 0.15” in resultant rules.

Two datasets were generated to contain each of the five confirmed case rate response variables, one with the POI density variables and the other with per-capita POI accessibility, together with all other explanatory variables. This was to avoid the possible confusion caused by the appearance of both the density and per-capita accessibility of a POI type in the same rule. Also, two datasets were generated to contain the response variables of average and median waiting period, the per-capita POI accessibility, and other explanatory variables. This resulted in a total of 12 datasets. Due to the randomness in DE, DESigFAR results in slightly different rules each time it is applied on the same dataset. Thus, on each of the 12 datasets, DESigFAR was ran for 10 times and output 10 sets of resultant rules. The set of rules containing the largest number of rules for “high confirmed case rate” or “long waiting period” was selected as the final result.


The rules resulting from the modified DESigFAR algorithm, together with their RIM and P values, are shown in Table 2. The strength of the rules was evaluated by two RIMs, leverage and improvement. As stated in Methods, all resultant rules had positive values for both the RIMs. In this case, two variables had an overall positive association, if a high value of one variable was associated with a high value of the other and the same went for the low value. Two variables had an overall negative association, if a high value of one variable is associated with a low value of the other and vice versa.

Table 2 Selected resultant rules from the ARM algorithm DESigFAR

Demographic and socioeconomic characteristics

Among all urban characteristics, high values of higher-education rate, median monthly income, and average accommodation area had the strongest and most significant associations with a high wave-1&2 imported confirmed case rate, in terms of the largest leverage and smallest P values (rule 1, 3, 5, Table 2b). Among all explanatory variables, higher-education rate and median monthly income also had the largest positive Spearman rank-order correlation coefficient values between the wave-1 imported case rate, which were 0.52 and 0.48, respectively. Corresponding to rule 5, Table 2b, all 21 TPUs with median accommodation areas over 25.4 m2/person and imported case rates over 0.169‰ had median incomes of at least 25,000 HKD/month (78th percentile in all TPUs), showing that this association also came from high-income population, instead of large housings in low-density rural areas. The wave-1&2 local case rate had similarly positive but weaker associations with these three variables, in terms of smaller leverage and improvement values (rule 1–3, Table 2a).

The wave-3 local case rate, on contrary, was negatively associated with higher-education rate, income, and accommodation area (rule 1–6, Table 2c), showing that the cases tended to be occur in lower-SES population. In wave 4, the accommodation area continued being negatively associated with the local case rate (rule 4–5, Table 2d), but the association between income and the local cases rate became much weaker and involved only the 7% TPUs with the lowest income (rule 1, Table 2d). Combined with indicators of urban area (e.g., low vegetation coverage, high building density), income and the local cases rate became negatively associated in more TPUs, as reflected by the larger supports of rule 2–3 than rule 1, Table 2d. Higher-education rate was negatively associated with the wave-4 local case rate only if excluding the cases from the Dancing/Singing Cluster and in urban area (rule 1–2, Table 2e). This shows that the wave-4 local cases were less concentrated in lower-SES population than the wave-3 cases but more concentrated in the urban area.

A high gender ratio of over 94.8, which seemed also related to a high income, was associated with a high wave-1&2 imported case rate of over 0.207‰ (rule 6, Table 2b). In the 26 TPUs fulfilling rule 6, Table 2b, 23 TPUs had the income higher than Hong Kong median of HK$16,250/month (Table 1a). In wave 4, a very high gender ratio was associated with a low local case rate (rule 6, Table 2d), reflecting very sparsely populated TPUs. The 11 TPUs fulfilling rule 6, Table 2d had an average population density of 2,124 persons/Km2, much lower than the Hong Kong median of 18,402 persons/km2 (Table 1b).

Elderly rate showed a negative association with the wave-1&2 imported case rate (rule 7–8, Table 2b) but a positive association with wave-3 local case rate (rule 7–8, Table 2c). Meanwhile, the elderly rate had a considerable negative correlation with the monthly income, with a Spearman’s r value of -0.49 between the two variables. In wave 4, a very high elderly rate was associated with a low local case rate (rule 7, Table 2d), mostly reflecting TPUs with low population densities below 5,000 person/Km2.

A small average household size below 2.6–2.7 was associated with high local case rates (rule 9, Table 2c; rule 8, Table 2d). Oppositely, a high wave-1&2 imported case rate was associated with a large average household size above 3.25 (rule 9, Table 2b) which also tended to co-occur with a high income. The income of all 21 TPUs fulfilling rule 9, Table 2b was above the Hong Kong median of HK$16,250/month, with an average of HK$38,952/month.

Density and connectivity characteristics

All four density and connectivity variables, namely the densities of population, buildings, roads, and public transport stations, showed generally positive associations with the wave-3 and wave-4 local case rates (rule 11–17, Table 2c; rule 9–16, Table 2d). Judged by larger leverage and improvement of the rules, the wave-3 local case rate was more strongly associated with population density, while the wave-4 rate was more strongly associated with road and building densities (rule 11–14, Table 2c; rule 9–12, Table 2d). The wave-4 rate was most statistically significantly associated with road density, that is, connectivity (P value in rule 9 compared with rule 10–12, Table 2d).

A high wave-1&2 imported case rate was, instead, associated with a low-to-medium population density. Also, very low density and connectivity were associated with low confirmed case rates (rule 10–12, Table 2b), which represented low-density rural TPUs with few imported cases.

Functionality: urban residence and related variables

The proportion of private residential LU generally showed positive associations with the confirmed case rates in all waves (rule 6–7, Table 2a; rule 13–14, Table 2b; rule 18–19, Table 2c; rule 17–18, Table 2d). A low proportion of industrial land, which was usually far from major residential areas, was also associated with high confirmed case rates in all waves (rule 8, Table 2a; rule 16, Table 2b; rule 22, Table 2c; rule 20, Table 2d).

Public housings in Hong Kong were reserved for lower-income residents. A high proportion of public residential LU was associated with a high wave-3 local case rate but a medium wave-4 local case rate (rule 20–21, Table 2c; rule 19, Table 2d). A low proportion of public residential LU, indicating a higher income, was again associated with a high wave-1&2 imported case rate (rule 15, Table 2b).

The LU mix index value was negatively associated with the rates of wave-1&2 imported cases as well as wave-1&2 and wave-4 local cases (rule 9–10, Table 2a; rule 17–18, Table 2b; rule 21–22, Table 2d). This echoed the association between residential LU and the confirmed case rate, since TPUs with high proportions of residential area tended to have low LU mix index values. The TPUs ranked the lower half in terms of LU mix index had an average of 17.2% residential area, while the Hong Kong average was 11%. The negative association between the local case rate and LU mix index disappeared in wave 3, during which some New Towns and rural areas with high LU mix index values also had high local case rates (Fig. 1c, e). These areas had high LU mix because they contained both typical urban LUs (e.g., residential and business area) and typical suburban or rural LUs (e.g., rural settlement and agricultural land).

Average per-capita floor area (ave_area_all) was equal to the total floor area of buildings divided by the number of residents in the TPU. In the experimental data, high ave_area_all values appeared in industrial or hotel area, remote rural TPUs, and high-income TPUs, while lower ave_area_all values mainly occurred in densely populated lower-income TPUs. This variable showed overall negative associations with wave-3 and wave-4 local case rates (rule 23–24, Table 2c; rule 23–24, Table 2d). TPUs fulfilling these rules were mainly densely populated lower-income TPUs with higher local case rates. A mid-to-high ave_area_all value was associated with a high wave-1&2 imported rate, while a high ave_area_all value was associated with a low imported case rate (rule 19–20, Table 2b). Looking into the data, rule 19 and 20 largely corresponded to high-income TPUs and remote rural TPUs, respectively.

Functionality: rural area, urban area, and POI density

In general, the proportion of LUs concentrating in rural area, including rural settlement, agriculture land, and vegetations (woodland, shrubland, and grassland), had negative associations with the confirmed case rates in all waves (rule 11, Table 2a; rule 21, Table 2b; rule 25–32, Table 2c; rule 25–32, Table 2d). As an exception, vegetation coverage had no obvious associations with the wave-1&2 confirmed case rates. Indeed, some high-income TPUs with high wave-1&2 imported case rate, especially those in hilly urban areas on Hong Kong Island, also had high vegetation coverage.

In contrast, the five types of POIs in this study, as well as business, recreation, governmental, institutions and facilities, and transport LUs, concentrated in urban areas. High densities of all five types of POIs and high proportions of all four LUs were associated with high wave-3 and wave-4 local case rates (rule 33–36, Table 2c; rule 33–36, Table 2d; Table 2f, g). Yet since these POIs and LUs also concentrated in densely populated areas, it was not very clear whether these associations were more related to the high level of activities brought by these POIs and LUs, or instead to the high population density.

Functionality: POI accessibility

High per-capita accessibilities to all five types of POIs were associated with high wave-1&2 imported case rates (rule 24–28, Table 2b). These associations mostly reflected the wealthy areas in or around the downtown, which had convenient access to a great number of POIs in the downtown and also mid-to-low population density (Fig. 3a). Low per-capita POI accessibilities were associated with medium wave-1&2 imported case rates (rule 29–34, Table 2b), which mainly reflected some New Towns with mid-to-high population density and relatively limited access to POIs due to the farness to the main urban area (Fig. 3b).

Fig. 3
figure 3

Exemplary rules. a TPUs fulfilling rule 27, Table 2b, i.e., TPUs with POI_pp_mall_mkt > 10.941 (p75) and rate_imported12 > 0.202 (p80), together with median monthly income. b TPUs fulfilling rule 32, Table 2b, i.e., TPUs with POI_pp_mall_mkt < 1.922 (p22) and rate_imported12 = 0.003–0.083 (p36-66), together with population density. c TPUs fulfilling rule 38, Table 2d, i.e., TPUs with med_income < 24,082.337 (p79), den_road > 10.205 (p48), POI_pp_mall_mkt > 6.459 (p57), and rate_local4 > 0.895 (p81), together with the median monthly income and entertainment venues involved in the super-spread of Dancing/Singing Cluster [32]

Low or mid-to-low accessibilities to all five types of POIs, alone or combined with a relatively low income or higher-education rate, were associated with mid-to-high wave-3 local case rates (rule 37–41, Table 2c). Looking into the data, TPUs involved in these rules were largely densely populated lower-income urban areas, where the per-capita POI accessibility was low due to the large populations.

In wave 4, mid-to-low-income urban TPUs with high accessibilities to mall and market, sports, and transport POIs, in contrast to low accessibilities in wave 3, were associated with very high local case rate (rule 37–39, Table 2d). The associations mainly reflected the mid-to-low-income TPUs located in major commercial and entertainment areas and in adjacent to high-income TPUs (Fig. 3c). These TPUs also contained most entertainment venues involved in the super-spread of the Dancing/Singing Cluster (Fig. 3c), and their associations with very high wave-4 local case rate disappeared when the Dancing/Singing Cluster were excluded (Additional file 1: Table S2f).

In addition, very high per-capita POI accessibilities were associated with very low local case rates in all waves (rules 222, 324, 326, 329 and 332, Additional file 1: Table S2a; rules 169, 176, 180, 181 and 185, Additional file 1: Table S2c; rule 40–44, Table 2d). These rules mainly covered some remote TPUs with high per-capita POI accessibilities due to their small populations.

Associations for the waiting period

A high median accommodation area and a high median income were associated with longer average and median waiting periods (Table 2h, i). Long average waiting periods were also associated with high per-capita POI accessibilities (Table 2h). As stated earlier, large median accommodation areas and high per-capita POI accessibilities tended to occur in high-income TPUs and sparsely populated rural TPUs. It is of note that high-income TPUs still contributed to a minority of cases with long waiting periods, since they had much smaller populations and number of confirmed cases than lower-income TPUs. For example, the TPUs with median incomes of at least HK$20,000/month (top 40% TPUs) contributed 1226 out of the 8,238 (14.9%) local cases with available waiting periods, and contributed 76 out of 357 cases (21.2%) with waiting periods of 12 or more days.


Interpreted from the ARM results, main characteristics of the first four waves of COVID-19 in Hong Kong are as follows. The first and second waves (by May 2020) tended to spread among higher-SES population, represented by higher income, higher education level, and more spacious accommodations. The rules related to high income and education level (rule 1 and 3, Table 2b) had higher RIM values and statistical significance than the rules related to other demographic or density characteristics, making SES more likely to be the driving force of the distribution of imported cases. The results, in contrary to some previous results at coarser spatial scales [7, 8] (Table 3), could be explained by that most wave-1&2 imported cases, who had studied or lived in developed countries, tended to have higher SES. Wave-1&2 imported cases also tended to distribute in neighborhoods with mid-to-low population density, agreeing to the fact that higher-SES people in Hong Kong tend to live in mid-to-low-density neighborhoods. The association between a high gender ratio, a low elderly rate, and a high confirmed case rate appeared to link to the higher male proportions and lower elderly rate in higher-SES TPUs, instead of gender or age difference in physiological susceptibility. These high-SES neighborhoods also had high accessibility to POIs, mainly attributed to POIs in nearby higher-density commercial and entertainment areas, since the POI density within high-SES neighborhoods was not usually high. The associations between urban characteristics and the local case rate were similar to but weaker than those for the imported case rate. These weaker associations might be shaped by the distribution of imported cases who were the infection sources for local cases, since local transmission was limited during the first two waves.

Table 3 Selected literature on associations between urban characteristics and COVID-19 confirmed case rate

The wave-3 spread (Jul–Oct 2020) appeared to be more driven by the within-neighborhood transmission that was severer in densely populated neighborhoods, especially lower-SES ones. The wave-4 spread (Nov 2020–Feb 2021) appeared to be more driven by the cross-neighborhood transmission due to high activity level and connectivity. The spread patterns in both waves reconfirmed the vulnerability of lower-SES population against COVID-19 infections. High wave-3 local case rates were associated with high population density and connectivity (e.g., road density), low income and education level, crowded accommodations, small households, public residences for lower-income population, low vegetation coverage, and high density but low per-capita accessibility of POIs. Starting from the Dancing/Singing Cluster which heavily involved high-SES population, wave-4 local cases were less associated with SES or population density than wave-3 ones, but more concentrated in urban area and area with high connectivity and activity level. A high rate of wave-4 cases was most strongly and statistically significantly associated with road density, building density, and a low vegetation coverage (rule 13, 14, 27, 29, Table 2d). The associations for wave-4 cases excluding the Dancing/Singing Cluster became more similar to those for wave-3 cases, confirming that lower-SES population was still more vulnerable. The mid-to-low-income, high-density neighborhoods in main commercial and entertainment areas, with high activity level of lower-income residents as well as higher-income visitors from nearby neighborhoods, was worst hit.

In addition, higher-income population was associated with longer waiting periods between symptom onset and diagnosis, which might be attributed to their higher concern on the economic loss due to seeking medical advice related to COVID-19 and less anxiety for being infected. Since wealthier people normally have better medical resources, their longer waiting periods were likely due to longer delays in seeking medical advice, rather than slower diagnoses. Such delays might not be due to the privacy concern about health data, which were reported to be similar among people in different income levels, or even lower for higher-income people [33, 34]. Instead, higher-income people were reported to concern more about the economic impact of COVID-19 but less about being personally infected [35]. Therefore, wealthier people could have higher concern about the economic loss, such as being quarantined and unable to work, and higher confidence that they were well protected and did not really get COVID-19, which might have delayed their hospital or clinic visits.

Household size played different roles in household and community transmission. A high rate of wave-1&2 imported cases, including the cases directly infected by imported cases, was associated with a large average household size. This might be due to aggravated household transmission in larger households [36], since quarantine at home was allowed until November 2020. Also, high-SES families tended to have larger household sizes by including FDHs. Oppositely, a small average household size below 2.6–2.7 was associated with high local case rates, which might be linked with possibly more activities outside homes for one-person or two-person households (e.g., couples without children).

A high proportion of private residential LU was associated with high confirmed case rates in all waves. Since the reported locations of the cases were mostly their residences, these associations are expected and do not indicate a higher risk of infection in residential area. Rural area was generally associated with lower confirmed case rates, likely because of its much lower density and connectivity.

By comparing previous studies on the same area (e.g., the US) but different time periods, at the city/county scale, an urban characteristic may have variant associations with the COVID-19 spread in different waves of outbreaks [5, 6, 8, 9, 12, 13] (Table 3). This study shows that such temporally variant associations also exist at the intra-city scale, and, further, relates these associations with the intra-city distribution of specific population groups and activities. The study also reveals the intra-city local variations of COVID-19 transmissions in main urban areas with different SES levels and densities, satellite towns, and rural areas. These findings may provide references to investigate the local variations of the associations between urban characteristics and the COVID-19 spread at a coarser spatial scale (e.g., in different counties of a country) [12].

The study results have the following further implications for long-term pandemic control. First, the study reveals the joint and interactive contribution of density, connectivity, and functionality to COVID-19 spread within and across neighborhoods, especially in lower-SES neighborhoods. As a result, to relieve both overcrowdedness and overconcentration of facilities at the neighborhood scale is likely a critical task to improve the epidemic resilience in high-density cities. At the city scale, a significant causal effect of high population and employment density on the confirmed case distribution has been reported unable to be identified [7]. However, within a city, at least a high-density one, the particularly densely populated areas often indicate an overcrowded life with reduced quality. Residents in such areas, therefore, tend to be lower-income ones who do not afford a more spacious, higher-quality life. Overcrowdedness and low SES are linked with multiple conditions which could jointly or even synergistically contribute to extensive within-neighborhood transmission. These conditions include, for example, the difficulty to keep social distance in crowded accommodations and facilities, the tendency to spend more time outside less comfortable homes, the lower feasibility for manual labors to work from home, and the worse ventilation in old apartments.

Meanwhile, concentrated facilities and increased connectivity (e.g., transport hubs, easily accessible locations) can mutually attract and thus tend to co-locate, jointly leading to a high infection risk within the neighborhoods with concentrated facilities and high connectivity. Worse still, in many cities, high-SES people tend to reside in high-income neighborhoods, while the nearby commercial and entertainment areas they frequently visit often have lower SES. Those commercial and entertainment areas can suffer from an extreme risk of infection dually caused by intensive within-neighborhood transmissions due to high density and low SES, as well as intensive cross-neighborhood transmissions due to the high level of activities conducted by visitors of diverse SES. In the Hong Kong case, this was particularly reflected by the suffering of lower-SES TPUs in the major commercial and entertainment areas from the singing/dancing super-spread event. These TPUs also contained most entertainment venues involved in the super-spread of the Dancing/Singing Cluster (Fig. 3c). Their nearby high-SES TPUs contained almost no such entertainment venues but had even higher rate of confirmed cases in the Dancing/Singing Cluster (Additional file 1: Table S1), meaning that the cases in those TPUs should have visited the entertainment venues in the nearby lower-SES TPUs.

Second, higher-SES population, if infected, may have a higher potential to infect others and contribute to super-spread events than the lower-SES one. This brings the wealthy more obligation to conform anti-pandemic policies. In wave 3, higher-income neighborhoods appear less affected by the outbreak initiated in lower-income ones, which might be attributed to the self-segregation of the wealthy in higher-income neighborhoods [37]. Yet the super-spread event in wave 4 which heavily involved high-income population diffused to lower-income neighborhoods shortly afterwards. Such asymmetric effect may relate to that higher-SES people have higher mobility to more diverse area and higher accessibility to POIs outside their neighborhoods. Thus, on average, they contact more persons in a larger geographic scope, leading to a higher risk of cross-neighborhood transmission and super-spread. Lower-SES population, in contrast, tend to contact less diverse people in fewer places, leading to more localized transmission. High-SES population is also obliged to seek medical advice faster when showing COVID-19 symptoms, to avoid infecting others during longer waiting periods.

Third, by referring to the study results, pointed countermeasures to early increases of the cases may be developed to forestall recurrent outbreaks. Intra-city COVID-19 spread patterns, major transmission routes, and their interrelations with urban characteristics varied greatly in different waves of the pandemic. This study has identified such transmission routes and interrelationships for different sources of outbreaks: imported cases from developed countries (wave 1&2), localized transmission concentrating in lower-SES neighborhoods (wave 3), and super-spread events which considerably engage higher-SES population (wave 4). Facing an early increase of the cases, the study result can be used to pre-estimate the confirmed case distributions and transmission routes from the likely sources of such increase. Pointed countermeasures to specific neighborhoods or transmission routes could be further developed to prevent the increase from developing into a recurrent outbreak.

This study has multiple limitations. First, this fine-scale study has an advantage in revealing and reasoning intra-city associations between urban characteristics and COVID-19 transmission. Yet at such a fine scale, reported locations of the cases tended to be their residences and deviated from where they got infected. Such deviation limited the discovery of the infection risk for different activities outside the residences. Massive fine-scale human mobility data which is relatively representative for the whole population, such as smartphone tracking data from the carriers, may help identify people’s daily activity areas and lead to more accurate evaluations on the infection risks for different LUs. Also, while statistical tests have been performed in this and many other studies on the associations between the spatial patterns of various factors and COVID-19 spread, the statistical evaluation results need to be interpreted with caution. Parametric statistical tests generally assume the independence between observations, but spatial autocorrelation is prevalent in geographically distributed data, including the data for the factors investigated and the COVID-19 spread. The authors propose to further tackle this issue by exploring the use of non-parametric tests in future association studies, which may allow the data to be spatially autocorrelated. In addition, this study did not involve factors that were relatively homogeneous within a city at a certain time or had no available data at neighborhood level. These factors include, for example, environmental factors (e.g., relative humidity, temperature, and pollution) [38], non-pharmaceutical interventions (e.g., closure of schools and entertainment venues) [39], human behaviors (e.g., wearing face masks) [40], and COVID-19 testing rate [41]. In particular, the potential impact of seasonal climate and change in non-pharmaceutical interventions on the intra-city COVID-19 spread pattern is very much worth investigation. Effective investigations into these factors at an intra-city scale, again, require these factors to be properly measured for the venue where the cases exposed to the virus, instead of their reported residences.


This study explores the intra-city associations in a high-density city between SES, density, functionality, and spread of COVID-19. Leveraging the advantage of DE-based ARM in studying optimized and complex associations, the associations were comparatively investigated for four waves of the pandemic in Hong Kong and for local and imported confirmed cases. Further analyzed based on these associations was how the urban characteristics might have jointly and interactively shaped the spatiotemporal patterns of COVID-19 cases, through different epidemic transmission routes within and across neighborhoods.

The study result showed that the first two waves of COVID-19 in Hong Kong (by May 2020) was mainly shaped by imported cases from developed countries. The high confirmed case rate was associated with high SES and related characteristics, such as mid-to-low population density and high accessibility to facilities. In the third (Jul–Oct 2020) and fourth (Nov 2020–Feb 2021) waves, densely populated and built neighborhoods, usually also lower-SES ones, were worse hit. The distribution of the wave-3 cases appeared more strongly shaped by the within-neighborhood transmission and lower SES. The patterns of wave-4 cases showed a stronger link to cross-neighborhood transmission and people’s activity level, likely due to the super-spread in dancing/singing venues. In particular, a diffusion was observed from the super-spread which considerably involved high-SES population to lower-SES neighborhoods and again the within-neighborhood transmission. Also, higher-SES population was found to be associated with mildly longer waiting periods.

The findings of this study provide potentially important references for precise control of COVID-19 at a neighborhood scale, as well as the pandemic-resilient design of compact cities. The usually co-locating overcrowededness and unfavored SES of residents can synergistically increase the vulnerability to epidemic of lower-SES neighborhoods and result in extensive within-neighborhood transmission. Lower-SES neighborhoods with concentrated facilities and non-residential activities can suffer from an extreme risk of infection dually caused by intensive within-neighborhood transmissions as well as intensive cross-neighborhood transmissions brought by visitors of diverse SES. To improve the epidemic resilience in high-density cities, it is, therefore, essential to relieve both overcrowdedness and overconcentration of facilities at the neighborhood scale. Also, higher-SES population is more obliged to conform anti-pandemic policies, due to their higher potential to participate extensive transmission and super-spread events. Facing early increases of the cases in a city, the study results may be used to develop pointed countermeasures against the likely sources of such increase and related specific neighborhoods or transmission routes, to forestall recurrent outbreaks.

Availability of data and materials

All the dataset used in this study are publicly available and cited in this article. These include: Hong Kong COVID-19 confirmed cases:, Hong Kong 2016 Population By-census:, Hong Kong 1:1000 and 1:5000 Digital Topographic Maps:, bus stop locations in Hong Kong:, Land Utilization in Hong Kong:!, TPU boundaries: The corresponding author could be contacted if the data needs to be accessed upon reasonable request.



Association rule mining


Differential evolution


Foreign domestic helpers


Geographically weighted regression


Ordinary least square


Rule interestingness measure

SAC model:

Spatial autoregressive combined model


Structural equation modeling


Socioeconomic Status


Tertiary Planning Unit


  1. Aschwanden C. Five reasons why COVID herd immunity is probably impossible. Nature. 2021;16:520–2.

    Article  Google Scholar 

  2. Bian L, Gao Q, Gao F, Wang Q, He Q, Wu X, Mao Q, Xu M, Liang Z. Impact of the Delta variant on vaccine efficacy and response strategies. Expert Rev Vaccines. 2021;20(10):1201–9.

    CAS  Article  Google Scholar 

  3. Tang P, Hasan MR, Chemaitelly H, Yassine HM, Benslimane FM, Al Khatib HA, AlMukdad S, Coyle P, Ayoub HH, Al KanaaniAl Kuwari ZKE. BNT162b2 and mRNA-1273 COVID-19 vaccine effectiveness against the SARS-CoV-2 Delta variant in Qatar. Nature Med. 2021;27:1–8.

    Article  Google Scholar 

  4. Yang C, Sha D, Liu Q, Li Y, Lan H, Guan WW, Hu T, Li Z, Zhang Z, Thompson JH, Wang Z. Taking the pulse of COVID-19: a spatiotemporal perspective. Int J Digital Earth. 2020;13(10):1186–211.

    Article  Google Scholar 

  5. Sun F, Matthews SA, Yang TC, Hu MH. A spatial analysis of the COVID-19 period prevalence in US counties through June 28, 2020: where geography matters? Ann Epidemiol. 2020;52:54–9.

    Article  Google Scholar 

  6. Zhang CH, Schwartz GG. Spatial disparities in coronavirus incidence and mortality in the United States: an ecological analysis as of May 2020. J Rural Health. 2020;36(3):433–45.

    Article  Google Scholar 

  7. Hamidi S, Sabouri S, Ewing R. Does density aggravate the COVID-19 pandemic? Early findings and lessons for planners. J Am Plann Assoc. 2020;86(4):495–509.

    Article  Google Scholar 

  8. Karmakar M, Lantz PM, Tipirneni R. Association of social and demographic factors with COVID-19 incidence and death rates in the US. JAMA Netw Open. 2021;4(1):e2036462.

    Article  Google Scholar 

  9. Ahmad K, Erqou S, Shah N, Nazir U, Morrison AR, Choudhary G, Wu WC. Association of poor housing conditions with COVID-19 incidence and mortality across US counties. PLoS ONE. 2020;15(11):e0241327.

    CAS  Article  Google Scholar 

  10. Kan Z, Kwan MP, Wong MS, Huang J, Liu D. Identifying the space-time patterns of COVID-19 risk and their associations with different built environment features in Hong Kong. Sci Total Environ. 2021;772:145379.

    CAS  Article  Google Scholar 

  11. Kwok CY, Wong MS, Chan KL, Kwan MP, Nichol JE, Liu CH, Wong JY, Wai AK, Chan LW, Xu Y, Li H. Spatial analysis of the impact of urban geometry and socio-demographic characteristics on COVID-19, a study in Hong Kong. Sci Total Environ. 2021;764:144455.

    CAS  Article  Google Scholar 

  12. Karaye IM, Horney JA. The impact of social vulnerability on COVID-19 in the US: an analysis of spatially varying relationships. Am J Prev Med. 2020;59(3):317–25.

    Article  Google Scholar 

  13. Ulimwengu J, Kibonge A. Spatial spillover and COVID-19 spread in the U.S. BMC Public Health. 2021;21(1):1765.

    CAS  Article  Google Scholar 

  14. Lakshmi Priyadarsini S, Suresh M. Factors influencing the epidemiological characteristics of pandemic COVID 19: a TISM approach. Int J Healthcare Manage. 2020;13(2):89–98.

    Article  Google Scholar 

  15. Zhang A, Shi W. Mining significant fuzzy association rules with differential evolution algorithm. Appl Soft Comput. 2020;97:105518.

    Article  Google Scholar 

  16. Hong Kong Information Services Department. – COVID-19. 2020. (2020). Accessed 26 Mar 2021.

  17. Center of Health Protection, HKSAR Government. Latest situation of cases of COVID-19 (as of 24 May 2020). 2020. Accessed 26 Mar 2021.

  18. Siu GK, Lee LK, Leung KS, Leung JS, Ng TT, Chan CT, Tam KK, Lao HY, Wu AK, Yau MC, Lai YW. Will a new clade of SARS-CoV-2 imported into the community spark a fourth wave of the COVID-19 outbreak in Hong Kong? Emerg Microb Infect. 2020;9(1):2497–500.

    CAS  Article  Google Scholar 

  19. To KK, Chan WM, Ip JD, Chu AW, Tam AR, Liu R, Wu AK, Lung KC, Tsang OT, Lau DP, To WK. Unique clusters of severe acute respiratory syndrome coronavirus 2 causing a large coronavirus disease 2019 outbreak in Hong Kong. Clin Infect Dis. 2021;73(1):137–42.

    CAS  Article  Google Scholar 

  20. Center of Health Protection, HKSAR Government. Latest situation of cases of COVID-19 (as of 12 March 2021). 2021. Accessed 26 Mar 2021.

  21. Census and Statistics Department, HKSAR Government. District Profiles | 2016 Population By-census. 2016. Accessed 20 Dec 2021.

  22. Surveying and Mapping Office, HKSAR Government. Maps and Services. 2016. Accessed 20 Dec 2021.

  23. Hong Kong Transport Department. Coordination of bus stop location in Hong Kong. 2021. Accessed 26 Mar 2021.

  24. Hong Kong Planning Department. Land Utilization in Hong Kong. 2021.!. Accessed 26 Mar 2021.

  25. HKU GIS Research Centre. Major Data Sources in Hong Kong: Planning Department - Digital boundary of Town Planning Unit (TPU) and Street Block (SB). 2016. Accessed 26 Mar 2021.

  26. Lenormand M, Bassolas A, Ramasco JJ. Systematic comparison of trip distribution laws and models. J Transp Geogr. 2016;51:158–69.

    Article  Google Scholar 

  27. Yang H, Ke J, Ye J. A universal distribution law of network detour ratios. Transp Res C. 2018;96:22–37.

    CAS  Article  Google Scholar 

  28. Telikani A, Gandomi AH, Shahbahrami A. A survey of evolutionary computation for association rule mining. Inf Sci. 2020;524:318–52.

    Article  Google Scholar 

  29. Altaf W, Shahbaz M, Guergachi A. Applications of association rule mining in health informatics: a survey. Artif Intell Rev. 2017;47(3):313–40.

    Article  Google Scholar 

  30. Piatetsky-Shapiro G. Discovery, analysis, and presentation of strong rules. Knowl Discov Databases. 1991;18:229–38.

    Google Scholar 

  31. Bayardo RJ, Agrawal R, Gunopulos D. Constraint-based rule mining in large, dense databases. Data Min Knowl Disc. 2000;4(2):217–40.

    Article  Google Scholar 

  32. HKSAR Government. Compulsory testing notice gazetted (November 22, 24, 26, 29 and December 1 2021). 2021.;;;; Accessed 12 May 2021.

  33. Kaufman DJ, Murphy-Bollinger J, Scott J, Hudson KL. Public opinion about the importance of privacy in biobank research. Am J Hum Genet. 2009;85(5):643–54.

    CAS  Article  Google Scholar 

  34. Ermakova T, Fabian B, Kelkel S, Wolff T, Zarnekow R. Antecedents of health information privacy concerns. Procedia Comput Sci. 2015;63:376–83.

    Article  Google Scholar 

  35. McPhillips D. Coronavirus Survey: Worry About the Economy Is Highest Among the Wealthy. In: U.S. News & World Report. 2021 Accessed 14th May 2021.

  36. Jarvis CI, Van Zandvoort K, Gimma A, Prem K, Klepac P, Rubin GJ, Edmunds WJ. Quantifying the impact of physical distance measures on the transmission of COVID-19 in the UK. BMC Med. 2020;18(1):1.

    Article  Google Scholar 

  37. Ng MK, Lau YT, Chen H, He S. Dual Land Regime, Income Inequalities and Multifaceted Socio-Economic and Spatial Segregation in Hong Kong. Urban Socio-Economic Segregation and Income Inequality. 2021;113.

  38. Fontal A, Bouma MJ, San-José A, López L, Pascual M, Rodó X. Climatic signatures in the different COVID-19 pandemic waves across both hemispheres. Nat Comput Sci. 2021;1(10):655–65.

    Article  Google Scholar 

  39. Cowling BJ, Ali ST, Ng TW, Tsang TK, Li JC, Fong MW, Liao Q, Kwan MY, Lee SL, Chiu SS, Wu JT. Impact assessment of non-pharmaceutical interventions against coronavirus disease 2019 and influenza in Hong Kong: an observational study. Lancet Public Health. 2020;5(5):e279–88.

    Article  Google Scholar 

  40. Howard J, Huang A, Li Z, Tufekci Z, Zdimal V, van der Westhuizen HM, von Delft A, Price A, Fridman L, Tang LH, Tang V. An evidence review of face masks against COVID-19. Proc Natl Acad Sci. 2021;118:4.

    Article  Google Scholar 

  41. Russell TW, Golding N, Hellewell J, Abbott S, Wright L, Pearson CA, van Zandvoort K, Jarvis CI, Gibbs H, Liu Y, Eggo RM. Reconstructing the early global dynamics of under-ascertained COVID-19 cases and infections. BMC Med. 2020;18(1):1–9.

    CAS  Article  Google Scholar 

Download references


We sincerely thank Prof. Michael F. Goodchild for his valuable comments that have greatly helped improving this article.


This study was supported by National Key R&D Program of China (2019YFB2103102) and the Hong Kong Polytechnic University (Otto Poon Charitable Foundation Smart Cities Research Institute Work Program CD03, 1-99XK, P0035181).

Author information

Authors and Affiliations



AZ developed the computation model, interpreted the results, wrote the manuscript, and helped designed the study. WS conceived and designed the study and helped develop the computation model. CT collected the data and helped wrote the manuscript. XZ, YL, ZL, YY, and ZS collected and preprocessed the data.

Corresponding author

Correspondence to Wenzhong Shi.

Ethics declarations

Ethics approval and consent to participate

No clinical data, experiment on humans, or the use of human tissue samples is included in the manuscript. All methods were carried out in accordance with relevant guidelines and regulations.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Table S1 (SES of the TPUs with different rates of cases in the Dancing/Singing Cluster) and Table S2 (Full sets of resultant rules).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Zhang, A., Shi, W., Tong, C. et al. The fine-scale associations between socioeconomic status, density, functionality, and spread of COVID-19 within a high-density city. BMC Infect Dis 22, 274 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Spatial association rule mining
  • Geographic knowledge discovery
  • COVID-19