- Research
- Open access
- Published:
A novel method to select time-varying multivariate time series models for the surveillance of infectious diseases
BMC Infectious Diseases volume 24, Article number: 832 (2024)
Abstract
Background
Describing the transmission dynamics of infectious diseases across different regions is crucial for effective disease surveillance. The multivariate time series (MTS) model has been widely adopted for constructing cross-regional infectious disease transmission networks due to its strengths in interpretability and predictive performance. Nevertheless, the assumption of constant parameters frequently disregards the dynamic shifts in disease transmission rates, thereby compromising the accuracy of early warnings. This study investigated the applicability of time-varying MTS models in multi-regional infectious disease monitoring and explored strategies for model selection.
Methods
This study focused on two prominent time-varying MTS models: the time-varying parameter-stochastic volatility-vector autoregression (TVP-SV-VAR) model and the time-varying VAR model using the generalized additive framework (tvvarGAM), and intended to explore and verify their applicable conditions for the surveillance of infectious diseases. For the first time, this study proposed the time delay coefficient and spatial sparsity indicators for model selection. These indicators quantify the temporal lags and spatial distribution of infectious disease data, respectively. Simulation study adopted from real-world infectious disease surveillance was carried out to compare model performances under various scenarios of spatio-temporal variation as well as random volatility. Meanwhile, we illustrated how the modelling process could help the surveillance of infectious diseases with an application to the influenza-like case in Sichuan Province, China.
Results
When the spatio-temporal variation was small (time delay coefficient: 0.1–0.2, spatial sparsity:0.1–0.3), the TVP-SV-VAR model was superior with smaller fitting residuals and standard errors of parameter estimation than those of the tvvarGAM model. In contrast, the tvvarGAM model was preferable when the spatio-temporal variation increased (time delay coefficient: 0.2–0.3, spatial sparsity: 0.6–0.9).
Conclusion
This study emphasized the importance of considering spatio-temporal variations when selecting appropriate models for infectious disease surveillance. By incorporating our novel indicators—the time delay coefficient and spatial sparsity—into the model selection process, the study could enhance the accuracy and effectiveness of infectious disease monitoring efforts. This approach was not only valuable in the context of this study, but also has broader implications for improving time-varying MTS analyses in various applications.
Introduction
Infectious disease surveillance is crucial in the early stages of a disease epidemic or outbreak [1,2,3]. Previous studies have demonstrated the advantages of using multivariate time series (MTS) models to construct cross-regional infectious disease transmission networks, particularly in terms of interpretability and predictive performance [4,5,6]. As a result, MTS models, along with mechanistic models, have emerged as significant research tools in the field of infectious disease surveillance [7, 8]. However, despite the potential of MTS models, significant challenges arise when applying them to infectious disease control efforts. Specifically, traditional MTS models often rely on a central assumption: parameters such as coefficients and disturbance variances are constant [9]. In other words, these models implicitly assume that the transmission rate of the disease or the transmission pattern among different regions and populations does not change over time. This assumption clearly struggles to accurately reflect the dynamic characteristics of infectious disease transmission. Furthermore, these models ignore the influence of neighboring incidence rates and other factors, making it difficult to ensure the accuracy and effectiveness of early warnings [10, 11]. More importantly, there may be extensive time-lag correlation effects among variables, meaning the incidence of local infectious diseases could be influenced by various factors from past periods and their own historical values [12, 13]. Consequently, due to their constant parameter settings, traditional MTS models cannot describe the dynamic characteristics of real-world situations and the changes in system structure over a certain period. This limitation may result in significant discrepancies between the model's predictions and the actual spread of the disease.
In the field of infectious disease surveillance, the consideration of time-varying parameters is gradually gaining attention. Previous research has employed various time-varying parameter models, such as Bayesian spatiotemporally varying coefficient models, varying coefficient distributed lag non-linear models, and quasi-Poisson varying coefficient regression models [12,13,14,15,16]. These models primarily focus on how a single independent variable (e.g., weather conditions or air pollutants) affects a dependent variable (e.g., infectious disease incidence) over time. However, when aiming to explore the interrelationships and changes in the incidence sequences of infectious diseases across regions, these models may not provide a comprehensive perspective. To comprehensively analyze the transmission dynamics of infectious diseases across different regions, we can draw upon two primary models widely used in fields like economics, finance, and psychology, despite their unprecedented application in infectious disease research: the time-varying model based on the Bayesian framework and the time-varying model based on the Generalized Additive Model (GAM) framework. These two models are common methodologies in Granger causality exploration. Firstly, the Bayesian framework model, represented by the TVP-SV-VAR model, stands out for its ability to flexibly capture temporal variations in data [17, 18]. This model assumes that the coefficients, variances, and covariances are time-varying, with all parameters following a random walk process assumption. This allows the model to effectively capture the asymptotic changes in the underlying structural and swiftly respond to any sudden shifts. Nevertheless, the downside of this model lies in its complexity, which may lead to overfitting issues. On the other hand, the tvvarGAM model, as a representative of the time-varying model using the GAM framework, excels in its flexible handling of time-lag coefficients [19, 20]. This model only assumes time-varying lag coefficients, thus reducing empirical assumptions about the data and lowering the risk of overfitting. However, this model may not be as sensitive as the TVP-SV-VAR model in capturing instant relationships and random volatility. These two representative models possess distinct characteristics, while many other models are variations based on these two representatives.
The great potential of time-varying multivariate time series models for infectious disease surveillance is still not fully exploited. This is primarily due to the fact that the epidemic of infectious disease exhibits its own temporal and spatial variational features, but so far there has been a lack of model comparison as well as model selection indicators tailored to such data features. In infectious disease research, it is imperative to consider the spatial variations induced by multiple complex factors such as the natural environment, pathogenesis, population transmission, and interaction. This makes the spatiotemporal distribution of infectious diseases more complex and variable than economic and financial data. Consequently, it is necessary to explore and validate the applicability conditions of both the TVP-SV-VAR and tvvarGAM models for infectious diseases surveillance.
In order to solve the above problem, we referred to the statistical framework proposed by Held [21] to help build model selection indicators. The original Held’s framework decomposes the disease incidence into an “endemic” and an “epidemic” part (e.g., an epidemic with region and epidemic among regions), which can monitor the spread of infectious disease and quantify the effect of different components. Furthermore, in our previous application of this framework to explore the spatio-temporal dynamics and potential ecological drivers of acute respiratory infectious diseases [22], we observed significant findings. Specifically, it was revealed that the auto-correlation within each area (time delay coefficient) and the cross-correlation between different areas (spatial sparsity) could indicate the pattern of disease-specific transmission. On this basis, this study further proposed both of them as indicators for time-varying MTS model selection to explore the infectious disease on surveillance data.
In this study, we first presented the details of the TVP-SV-VAR model and tvvarGAM model respectively before comparing their differences in theory. Since each model has its own advantages in handling either temporal or spatial variation, we proposed the time delay coefficient and spatial sparsity indicators for model selection. Furthermore, we used the simulation study adopted from real-world infectious disease surveillance to compare model performances under various scenarios of spatio-temporal variation as well as random volatility. Finally, we illustrated how the modelling process could help the surveillance of infectious diseases with an application to the influenza-like case in Sichuan Province, China.
Method
Related theoretical research
TVP-SV-VAR model
The impacts of random fluctuations and drifts are usually accompanied by disease data, and would lead to biased coefficient estimation if models with constant volatility are used. Hence, the time-varying parameters in the TVP-SV-VAR model with the random volatility assumption could effectively reveal the mutual influence relationships and characteristics among different time series of incidence data. Specifically, the general expression of the TVP-SV-VAR model is:
where \({\varvec{Y}}_{t}\) is a k-dimensional time series (e.g., the time series of incidence data from k different areas), s is the maximum time lag, k stands for dimensionality, and \({\varvec{Y}}_t^{*} { = }{\varvec{I}}_{k} \otimes ({1},{\varvec{Y}}_{{t{ - 1}}}^{\prime} , \cdots ,{\varvec{Y}}_{{t - {s}}}^{\prime} )\). Besides, let \(\beta_{t}\) be the time-varying auto-regression coefficients, \({\varvec{A}}_{t}\) be the innovation of one variable that will affect the others with different degrees during the temporal-dynamic process, and \({\varvec{h}}_{t} { = }({\varvec{h}}_{{{1}t}} , \cdots ,{\varvec{h}}_{kt} )^{\prime}\) be the logarithmic random volatility with \(h_{ik} = {\text{log }}\gamma_{it}^{2}\),\(j{ = }{1}, \cdots k,t{ = }s{ + }{1}, \cdots ,n\). As followed by the work of Nakajima [23,24,25] and Primiceri [26], all the non-zero and non-one elements in the lower triangular matrix \({\varvec{A}}_{t}\) were stacked as a new vector \({\varvec{a}}_{t} { = }(a_{{{21}}} ,a_{{{31}}} ,a_{{{32}}} ,a_{{{41}}} , \cdots ,a_{{k,k{ - 1}}} )^{\prime}\). Furthermore, the random walk process was adopted to characterize the changing process of time-varying parameters as below. The series of coefficient changes is \(\beta_{t}\), the series of structural information changes is \(a_{t}\), and the series of volatility changes is \(h_{t}\),
where \(u_{\beta t}\), \(u_{at}\) and \(u_{ht}\) are disturbance terms. Then the structural impact including \(\varepsilon_{t}\) is
where \(\beta_{s + 1} \sim N(u_{{\beta_{0} }} ,\Gamma_{{\beta_{0} }} ),a_{s + 1} \sim N(u_{{a_{0} }} ,\Gamma_{{\beta_{0} }} )\) and \(h_{s + 1} \sim N(u_{{h_{0} }} ,\Gamma_{{\beta_{0} }} ).\)
Since both \(\beta_{t}\) and \(a_{t}\) are subject to the random walk process, and the current change of parameters is determined according to the previous period values and random errors, it is reasonable that the process is a short-term constraint. At the same time, because \(h_{t}\) also satisfies the random walk process, the model can reflect the random wave characteristics in the time series.
Parameter estimation was carried out by the Markov chain Monte Carlo (MCMC, it is a powerful statistical simulation method that can approximate the numerical value of a target probability distribution through random sampling without knowing the specific form of the distribution).
After parameter estimation, time-varying impulse response analysis was used to explore the relationships between incidence time series among different areas both simultaneously and chronologically. Specifically, it describes the evolution of a model’s variables in reaction to a shock (i.e., one standard deviation unit applied to the random error term) in one or more variables. To this end, we rewrote (1) as follows:
where \({\varvec{B}}(L){ = }{\varvec{I}}_{n} { - }{\varvec{B}}_{{1}} L{ - }{\varvec{B}}_{{2}} L^{{2}} { - } \cdots { - }{\varvec{B}}_{s} L^{s}\), \({\varvec{\eta}}_{t} { = }{\varvec{A}}^{{{ - 1}}} {\sum }\varepsilon_{t}\), and \({\varvec{B}}(L)\) is the time-lag operator. Assuming \({\varvec{y}}_{t}\) is the covariance-stationary time series with mean \({\varvec{\mu}}\), the expectation on both sides of Eq. (4) yields,
From the model presented above, it could be seen that the TVP-SV-VAR model relies on the random fluctuation assumption of the time-varying parameters to capture the information of the fluctuations of first and second moments, so as to characterize the changes of the model parameter coefficients. It could deal with both auto-correlations within one area and cross-correlations between different areas in a unified framework. However, the TVP-SV-VAR model comes at the cost of more assumptions and higher degrees of freedom, which makes it liable to problems such as model misspecification and overfitting, especially when there is also a lack of appropriate model selection indicators. See Appendix A for more details about the TVP-SV-VAR model.
TvvarGAM model
The most important difference between the tvvarGAM and TVP-SV-VAR model is that the former does not concentrate on the higher moments of time series data (i.e., concurrent relationships among different time series, random volatility, and so on), so that it reduces to a relatively simple form.
where \({\varvec{Y}}_{t}\) is a \(k \times {1}\) vector with \(k\) observations (i.e., each observation comes from one area) at time t, \({\varvec{\beta}}_{0,t}\) is the intercept term, \({\varvec{B}}_{t}\) is the time-lag operator, and \({\varvec{\varepsilon}}\) is the random disturbance term that follows a normal distribution. The Generalized Additive Model (GAM) framework was used to model the time-varying parameters as splines of time to estimate the time-varying parameters. The independent variables were divided into several continuous intervals, and each interval used a separate spline function to generate a smooth curve. In this study, thin plate regression spline basis, ten basis functions, and wiggliness penalty scheme were adapted [20, 27, 28]. Finally, a 95% Bayesian confidence interval (CI) was used to estimate the uncertainty of the smoothing function. See Appendix B for more details about the tvvarGAM model.
Compared with the TVP-SV-VAR model, the tvvarGAM model relies on fewer assumptions to avoid the potential issue of overfitting, which is mainly reflected as fluctuations in the second moment. Based on its theoretical advantages, the tvvarGAM model was used in this study to estimate the time-varying time-lag correlation coefficients between regional infectious diseases. However, the tvvarGAM model may prove inadequate when both temporal and spatial variations are present in the data structure. In such scenarios, the key to selecting an optimal model lies in striking a balance between bias and variance. To achieve this equilibrium, it is crucial to identify suitable metrics that can guide the model selection process.
Simulation study
In the simulation study, we explored the performance of the time-varying multivariate time series model under different degrees of spatiotemporal heterogeneity by using the two key indicators of time delay coefficient and spatial sparsity. We first used the real percentage of Influenza-like Illness (ILI%) time series of Sichuan province and adopted the common VAR model to conduct a preliminary exploration. On this basis, the specific parameters boundary values of the simulation scene were set as follows, and more details are provided in Appendix C.
Simulation scenarios based on the time delay coefficient and spatial sparsity
As mentioned in the introduction part, since time delay coefficient and spatial sparsity are important indicators when characterizing the epidemics of infectious diseases and choosing the appropriate time-varying MTS models, the simulation scenarios were constructed on these two indicators.
The time delay coefficient measures the auto-correlation of incidence data between different time points within one area. In this study, a time-varying VAR model is generated by randomly assigning a time series to each non-zero parameter with randomness obeying a linear increasing, linear decreasing, Sigmoid increasing, Sigmoid decreasing, and normal distribution. If an edge is missing from the initial matrix, all entries in the parameter time series are set to zero. The maximum value of the time delay coefficient (noted as θ) was set for the different time-lag correlation coefficients, i.e., the maximum lagged correlation between regional incidence was 0.1, 0.2, and 0.3.
Meanwhile, the spatial sparsity indicates the probability of having cross correlations among incidence time series of different areas, in response to the spatial correlation of incidence. In this study it was set based on the range of magnitude of the time lag coefficients between incidences of neighbouring areas, and the time-lag correlation between the time series was set to varying degrees of spatial sparsity (noted as v_sparse): 0.1, 0.3, 0.6, and 0.9.
Specifically, a null matrix of VAR parameters initialized with p × p dimensions (the approximate number of ILI% variables in an analytical study of the dynamic hazard identification of actual neighboring municipal (prefecture) ILI%) was first incorporated into the VAR model, setting all p auto-correlation terms to be non-zero, since auto-correlation was considered to be present in most natural and social phenomena and can be observed in essentially any application. Next, randomly set x of the p × (p-1) non-diagonal elements to be present, i.e., there was a cross-lagged effect, corresponding to the probability that its edge was present as p = v_sparse.
By setting different maximum values of the time-varying time-lag coefficients and different degrees of spatial sparsity values, as well as whether the time series has random volatility, we could further explore and compare the performances of TVP-SV-VAR model and tvvarGAM model under various scenarios of spatio-temporal variation.
Performance evaluation
This study used the RMSE to evaluate the fitted residual of the two models. To evaluate the estimation error of the time-lag correlation coefficient of the model, the standard error of each time-lag correlation coefficient in the target city (prefecture) equation was used, and the mean values of different time points and different parameters were calculated successively for comparison.
Sensitivity analysis on the order of time vector elements
When applying MTS models to the surveillance of infectious diseases, it was used to set the target area (i.e., the area of interest) as the last element of the time vector, while its neighbouring areas were put into other elements. In order to verify the robustness of time delay coefficient estimates under different orderings of time vector elements, we further conducted sensitivity analysis by reshuffling the elements (i.e., area) in the time vector and checking whether changing the time vector orderings would affect the estimates of time delay coefficients among different areas. The results of the robustness analysis of this study show that the order of the variables does not have a large effect on the magnitude of the time-lag correlation coefficient of interest in the study, and the results are robust (see more details in Appendix C.3).
Case study
General description
ILI% is widely used for influenza surveillance to reflect the intensity of local influenza epidemics [29, 30]. In this study, the influenza-like case data came from Sichuan Provincial Center for Disease Control and Prevention and included sentinel surveillance data from 2015 to 2019 in Sichuan Province. The identification criteria for ILI patients were based on the World Health Organization's standard case definition [31]. Sichuan Province is a representative province with typical multi-ethnic genetic background, multi-cultural customs and multi-type natural landforms. This study takes Sichuan Province, which has obvious diversity in economic development, natural environment, cultural customs and biological genetics. As a key region for influenza surveillance in China, Sichuan's representativeness guarantees the applicability of this method in these regions and offers evidence for future applications.
Statistic analysis
-
Step 1. Data preprocessing
The original data used in this study comprised weekly counts of influenza cases and the total numbers of outpatient and emergency visits, segmented by age group, at each surveillance site. Subsequently, the ILI% was determined by calculating the proportion of ILI patients relative to the total outpatient and emergency department visits in designated sentinel hospitals. Influenza-like case surveillance is a routine surveillance activity, and the analysis of ILI data is not considered human subject research and does not require the associated ethical review.
In this study, the original data set was systematically collated and transformed to generate high-quality data samples suitable for subsequent analysis. The data sample presented a five-column structure (see more details in Appendix D). Specifically, the first column listed the monitoring points of each city, and the second column accurately calculated and recorded the percentage of ILI corresponding to each monitoring point, providing a solid data basis for in-depth analysis of the spatial distribution and dynamic changes of influenza epidemic. The third column represents this point in time and lays the foundation for subsequent construction of time series data. In addition, the fourth and fifth columns indicate the latitude and longitude of the monitoring points in each city, respectively.
-
Step 2. Model selection
Before conducting the time-varying MTS analysis, it is necessary to select appropriate city clusters for the estimation of the TVP-SV-VAR model or the tvvarGAM model. This selection is based on the size range of the ILI% delay coefficient between cities (prefectures) and their geographical location.
First, we separately took each city as the target city, so that its neighboring cities together constituted a cluster. In terms of the time series vector setting, the target city was set at the end of the vector, while the order of its neighboring cities was arranged from farthest to closest to the target based on the lengths of the boundary with the target city.
Second, in order to avoid pseudo-regression, the Augmented Dickey-Fuller (ADF) test and Johansen cointegration test were used (see more details in Appendix E.2).
Furthermore, to determine the optimal lag order, we built an ordinary VAR model within each cluster. The number of neighboring cities (prefectures) was used as a measure of spatial sparsity, and the time delay coefficient set in the simulation study corresponded to the lag order. The AIC, BIC and HQ information criteria were calculated in the model, and the optimal lag order should minimize the average value of the three criteria.
Lastly, we carefully selected an appropriate time-varying multivariate time series model for modeling the clusters based on the number of neighboring cities (prefectures) surrounding the central cities and the optimal lag order determined through calculations. Specifically, the quantity of adjacent cities (prefectures) serves as an indicator of spatial sparsity, reflecting the relative isolation or connectivity of the target city in space [4]. Meanwhile, the optimal lag order functions as a metric for the time delay coefficient. Collectively, these parameters form the foundation for our model selection process.
-
Step 3. Model estimation
Applying the time-varying multivariate series model to sentinel monitoring ILI% data of Sichuan Province from 2015 to 2019. According to the time-varying MTS model selection framework suggested by the simulation findings, the case analysis was carried out region-by-region from the perspective of infectious disease surveillance in the real world. Specifically, we employed the aforementioned time-varying MTS methods to visualize the dynamic risk of ILI% among neighboring cities. All data analyses in this study were based on R 4.2.1, the “bvarsv” package is used to estimate the TVP-SV-VAR model [32], the tvvarGAM model use the “tvvarGAM” package to evaluate the data [19], and the statistical test level was 0.05.
Result
Simulation result
Constant variance of time series residuals
When the variance of the time series residuals was set to be constant, the comparison results of the fitted residuals and the standard errors of the parameter estimates of the two models were shown in Table 1. The results showed that the fitted RMSE of the TVP-SV-VAR model was smaller under relatively low time delay coefficients and spatial sparsity values. Under all the settings of time delay coefficients and spatial sparsity values, the TVP-SV-VAR model consistently exhibited a smaller standard error for the estimated time delay coefficients. In addition to considering the time delay, the disturbance term of the TVP-SV-VAR model also included the influence of the simultaneous pulses between time series. When the time delay coefficient and spatial sparsity value were large, the other time series had a greater influence on the time delay of the central time series. In this case, the simultaneous pulses captured by the perturbation term in the TVP-SV-VAR model played a more prominent role in shaping the estimated time delay coefficient. Therefore, the RMSE of the TVP-SV-VAR model was larger than that of tvvarGAM when the time delay coefficient and spatial sparsity value were larger. On the contrary, the TVP-SV-VAR model was better with lower RMSE and standard error. In the case of large time delay coefficients and spatial sparsity values, if the concurrent relationships between time series were not priority for consideration, then the fitted residuals of the tvvarGAM model were smaller than those of the TVP-SV-VAR model.
Time-varying variance of time series residuals
According to the results in Table 2, the fitting RMSE of the TVP-SV-VAR model was smaller with a low time delay coefficient and low spatial sparsity value when the variance of the time series residuals was set to be time-varying. Under all the settings of time delay coefficients and spatial sparsity values, the TVP-SV-VAR model consistently demonstrated smaller standard errors for the estimated time delay coefficients. According to the characteristics of the model, for a series with random volatility, the tvvarGAM model with a time-varying delay coefficient and constant fluctuation might ignore the changes in the disturbance term when the time delay coefficient and spatial sparsity value were large, resulting in biased estimated coefficients. Therefore, for time series with random volatility, it is advisable to employ the TVP-SV-VAR model for estimating the time delay coefficient to avoid such bias.
Case study result
Descriptive analysis
Figure 1 showed the geostatistical heat map of the annual mean ILI% of neighboring cities from 2015 to 2019. According to the results, the changes in Aba, Guangyuan and Bazhong were relatively stable over the five years, while Bazhong had been showing a lower level of ILI% and Yibin had been showing a higher level of ILI%. See Appendix E.1 for more details on descriptive analysis.
Model selection
Table 3 showed the lag orders for each cluster model. Given the combined performance of sparsity and lag order, we chose Luzhou, Guangyuan, and Leshan as representative examples for TVP-SV-VAR model estimation in scenarios characterized by minimal delay coefficients and sparsity. In contrast, Chengdu, Nanchong and Meishan were selected as examples for tvvarGAM model estimation under the scenario of large delay coefficients and spatial sparsity.
Dynamic risk identification when time delay coefficients and spatial sparsity were small
According to the model selection results, Luzhou, Guangyuan and Leshan cities were selected as examples to perform the TVP-SV-VAR model estimation. More details of Luzhou and other cities were shown in Appendix F.
Analysis and comparison of the standard deviation of model residuals
Figure 2 compared the RMSE between TVP-SV-VAR model and VAR model in target cities (i.e., Luzhou, Guangyuan, and Leshan). Among them, ILI% of Luzhou had a large volatility in 2016, ILI% of Guangyuan had two volatility peaks at the beginning of 2017, and ILI% of Leshan had a large volatility peak at the end of 2017.
In addition, Table 4 showed that, on average, the standard deviation of shocks estimated by the TVP-SV-VAR model was smaller than that of the ordinary time-invariant parameter VAR model.
Dynamic change analysis of the relationship between stochastic volatility and the same period and detection of parameter anomalies
According to Fig. 3, the random volatility of ILI% in Luzhou generally presented cyclical fluctuations, with the peak appearing in the first half of each year and the bottom generally appearing in the second half of each year.
Dynamic change analysis of time delay correlation coefficients and detection of parameter anomalies
Figure 4 could help check whether the time delay coefficients over time exceeded the upper limit of the previous epidemic standards and the warning threshold. Based on infectious disease surveillance, any outliers in time delay coefficients could be monitored in real time. When the parameter value exceeded the upper limit of the epidemic standard or the warning threshold, the surveillance system would issue a real-time warning signal, focusing on the prevention and control of the risk of epidemic sources among the corresponding neighborhood.
Time-varying impulse response analysis
Figure 5 showed that one unit of standard deviation shock on the ILI% of Neijiang would cause a significant positive impulse response on that of Luzhou around the 193rd week, with the highest value close to 2, and a long response duration until after 26 weeks of lagging. In the 101st week, the ILI% of Luzhou showed the second-largest impulse response, with the positive and negative responses oscillating in the first period and soon converging to a positive response until it converged to zero. In the 57th week, the ILI% of Luzhou showed a relatively obvious negative impulse response with a maximum negative response of about -0.94. As for the ILI% of Yibin, the response for one unit of standard deviation shock was negative in the remaining period except for the early impulse response that oscillated positive and negative until the 110th week, and gradually converged to zero around the 26th week. In contrast, the ILI% of Zigong responded to one unit of standard deviation shock in a positive way, with the highest value close to 0.70 during the study period.
Prediction comparison
In this study, the rolling one-step-ahead forecasting was performed with 26 time periods. Gibbs sampling was set at 10,000 times with preheating 1,000 times. The prediction results were shown in Fig. 6 and Table 5.
The results from Fig. 6 and Table 5 showed that the average values of the prediction error evaluation index of TVP-SV-VAR model of ILI % in the above three cities (prefectures) were less than that of the VAR model. Therefore, the prediction performance of the TVP-SV-VAR model was better than that of the VAR model without time-varying parameters.
Dynamic risk identification when time delay coefficients and spatial sparsity were large
Based on the model selection results, the tvvarGAM model estimation under this scenario was selected for Chengdu, Nanchong and Meishan. More results of Chengdu and other cities were shown in Appendix G.
Figure 7 showed the suspected transmission network of ILI % among neighboring cities centered on the ILI% of Chengdu at nine equal intervals. At the first four time points, ILI % of Aba had a positive correlation with that of Chengdu. Meanwhile, during the first seven time points, ILI% of Chengdu had a positive correlation with that of Deyang, and the parameter gradually decreased over time. At the last three time points, ILI% of Chengdu had a positive correlation with that of Meishan, and the relationship between ILI% of Meishan and Chengdu changed from positive to negative.
Discussion
This study explored the performances of TVP-SV-VAR model and tvvarGAM model in the scenario of infectious diseases surveillance and early warning. The simulation results showed that the TVP-SV-VAR model had better predictive performance when the time delay coefficient and spatial sparsity were small. On the contrary, in the case of large time delay coefficient and spatial sparsity, the tvvarGAM model outperformed the TVP-SV-VAR model with smaller residuals. In addition, for time series with random volatility, the TVP-SV-VAR model should be used to estimate the time delay coefficients because its standard error was smaller, to avoid the estimation coefficient bias. In addition, based on our proposed model selection framework, the case study found some interesting time-lag relationships between Chengdu-the megacity with more than 10 million population-and its neighbors (e.g., Aba and Meishan), which deserved further attention in practical surveillance of infectious diseases.
Changes in the temporal and spatial distribution of infectious diseases cases are particularly important for epidemic monitoring and warning, which can help find key areas of infectious disease and judge the spread of the epidemic situation, to take timely prevention and control measures. Therefore, this study aims to discuss the characteristics and robustness of the time-varying parameters estimated by the two models as a means of exploring the applicability of the two models in different infectious disease surveillance and warning scenarios. As shown by the results of the simulations, there was a significant difference in the performance of the two models under different scenarios. Furthermore, this study proposed model selection for different scenarios under a statistical framework, i.e., using the time delay coefficient to capture temporal trends within regions and the spatial sparsity to capture the spatial trends between regions. On this basis, model selection indicators were constructed to indicate the transmission pattern of infectious diseases so that an appropriate model could be chosen for surveillance practice.
The two main models have strong practicability and flexibility, and can be widely applied to a variety of real-time surveillance and rapid response scenarios of infectious diseases with spatial and temporal heterogeneity. This includes, but is not limited to, acute respiratory infectious diseases such as SARS-CoV-2 [33], influenza [34] and pertussis [35], natural focal endemic diseases such as dengue fever [36], hemorrhagic fever with renal syndrome [37] and malaria [38], and intestinal infectious diseases such as bacillary dysentery [39] and hand, foot and mouth disease [40].
Meanwhile, our case study illustrated that the proposed model selection framework could benefit the surveillance of infectious diseases in several ways. First of all, the spatial–temporal correlations indicated by the time-varying MTS models could help identify some possible transmission paths of infectious diseases, which was extremely meaningful for local authorities to make plans for disease control and prevention. For example, for Luzhou, ILI% of Zigong lagging one phase had the greatest correlation to ILI% of Luzhou in the current phase (average time delay coefficient was 0.172), ILI% of Zigong had a positive impact on ILI% of Luzhou and the highest value was close to 0.70. Therefore, it is of great significance for the spread of the epidemic from Zigong to Luzhou, and more attention should be paid to the prevention and control of ILI transmission in Zigong. Besides, according to the time delay coefficient, it can also be reminded to maintain a long enough temporal window for early warning. For example, in the lag-2 phase, the correlation between ILI% of Luzhou and ILI% of Zigong turned into an obvious negative correlation (the average time delay coefficient was -0.226), which also indicated that the epidemic monitoring in Zigong should expand its monitoring window to one week in advance. The results also suggested that the time delay coefficient and spatial sparsity of infectious diseases between regions were very important. It showed that the interaction between regions plays an important role in the spread of infectious diseases. In our study area, a joint prevention and control mechanism for infectious disease events can be established to implement risk management, monitoring and early warning, information exchange, local linkage and collaborative disposal of major infectious disease events under specific circumstances.
There were some limitations of this study. From the strict perspective of infectious diseases surveillance, statistical analysis was only one part of the overall process. In other words, even some statistically significant spatio-temporal correlations were found by this study, it was still mandatory to take further steps (e.g., epidemiological investigation, etiology and laboratory studies) to confirm the transmission of diseases. In addition, since our model is a spatiotemporal model, we need to use historical data from each region, and how to ensure the timeliness and quality of the data reported by each region is an important challenge when implementing the model. In order to meet such challenges, establishing a solid data integration platform and strict data quality control specifications are necessary prerequisites for the successful use of the model. Finally, this study proposed the indicator of the time-varying MTS models in infectious disease monitoring from the spatial and temporal dimensions, respectively. However, considering that infectious disease data also have the characteristics of spatio-temporal interaction, more sophisticated spatial–temporal joint indicators could be considered in the future to help select appropriate time-varying MTS model for the surveillance of infectious diseases.
Availability of data and materials
The data analyzed in this study is subject to the following licenses/restrictions: the data that support the findings of this study are available from Sichuan Center for Disease Control and Prevention but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Sichuan Center for Disease Control and Prevention. Requests to access these datasets should be directed to TZ, statzhangtao@scu.edu.cn.
References
Yazidi R, Aissi W, Bouguerra H, Nouira M, Kharroubi G, Maazaoui L, Zorraga M, Abdeddaiem N, Chlif S, El Moussi A. Evaluation of the influenza-like illness surveillance system in Tunisia, 2012–2015. BMC Public Health. 2019;19:1–9.
Mao K, Zhang K, Du W, Ali W, Feng X, Zhang H. The potential of wastewater-based epidemiology as surveillance and early warning of infectious disease outbreaks. Curr Opinion Environ Sci Health. 2020;17:1–7.
Donaldson AL, Hardstaff JL, Harris JP, Vivancos R, O’brien SJ. School-based surveillance of acute infectious disease in children: a systematic review. BMC Infect Dis. 2021;21:1–10.
Wang H, Qiu J, Li C, Wan H, Yang C, Zhang T. Applying the Spatial Transmission Network to the Forecast of Infectious Diseases Across Multiple Regions. Front Public Health. 2022;10:774984.
Li H, Ge M, Wang C. Spatio-temporal evolution patterns of influenza incidence and its nonlinear spatial correlation with environmental pollutants in China. BMC Public Health. 2023;23(1):1685.
Qiu J, Wang H, Hu L, Yang C, Zhang T. Spatial transmission network construction of influenza-like illness using dynamic Bayesian network and vector-autoregressive moving average model. BMC Infect Dis. 2021;21:1–9.
Ray EL, Wattanachit N, Niemi J, Kanji AH, House K, Cramer EY, Bracher J, Zheng A, Yamana TK, Xiong X. Ensemble forecasts of coronavirus disease 2019 (COVID-19) in the US. MedRXiv 2020:2020.2008. 2019.20177493.
Osthus D, Hickmann KS, Caragea PC, Higdon D, Del Valle SY. Forecasting seasonal influenza with a state-space SIR model. Ann Appl Stat. 2017;11(1):202.
Lütkepohl H. Vector autoregressive models. In: Handbook of research methods and applications in empirical macroeconomics. edn. Cheltenham, UK: Edward Elgar Publishing; 2013. p. 139–64.
Osthus D, Moran KR. Multiscale influenza forecasting. Nat Commun. 2021;12(1):2991.
Pei S, Kandula S, Yang W, Shaman J. Forecasting the spatial transmission of influenza in the United States. Proc Natl Acad Sci. 2018;115(11):2752–7.
Chen Y, Zheng M, Lv J, Shi T, Liu P, Wu Y, Feng W, He W, Guo P. Interactions between ambient air pollutants and temperature on emergency department visits: analysis of varying-coefficient model in Guangzhou. Chin Sci Total Environ. 2019;668:825–34.
Wang F, Duan C, Li Y, Huang H, Shia B-C. Spatiotemporal varying coefficient model for respiratory disease mapping in Taiwan. Biostatistics. 2024;25(1):40–56.
Zhao X, Chen F, Feng Z, Li X, Zhou X-H. Characterizing the effect of temperature fluctuation on the incidence of malaria: an epidemiological study in south-west China using the varying coefficient distributed lag non-linear model. Malar J. 2014;13:1–10.
Wu Y, Qiao Z, Wang N, Yu H, Feng Z, Li X, Zhao X. Describing interaction effect between lagged rainfalls on malaria: an epidemiological study in south–west China. Malar J. 2017;16:1–10.
Song C, Shi X, Bo Y, Wang J, Wang Y, Huang D. Exploring spatiotemporal nonstationary effects of climate factors on hand, foot, and mouth disease using Bayesian Spatiotemporally Varying Coefficients (STVC) model in Sichuan. Chin Sci Total Environ. 2019;648:550–60.
Del Negro M, Primiceri GE. Time varying structural vector autoregressions and monetary policy: a corrigendum. Rev Econ Stud. 2015;82(4):1342–5.
Zhou D, Siddik AB, Guo L, Li H. Dynamic relationship among climate policy uncertainty, oil price and renewable energy consumption—findings from TVP-SV-VAR approach. Renewable Energy. 2023;204:722–32.
Bringmann LF, Ferrer E, Hamaker EL, Borsboom D, Tuerlinckx F. Modeling nonstationary emotion dynamics in dyads using a time-varying vector-autoregressive model. Multivar Behav Res. 2018;53(3):293–314.
Haslbeck JMB, Bringmann LF, Waldorp LJ. A tutorial on estimating time-varying vector autoregressive models. Multivar Behav Res. 2021;56(1):120–49.
Santillana M, Nguyen AT, Dredze M, Paul MJ, Nsoesie EO, Brownstein JS. Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS Comput Biol. 2015;11(10):e1004513.
Kogan NE, Clemente L, Liautaud P, Kaashoek J, Link NB, Nguyen AT, Lu FS, Huybers P, Resch B, Havas C. An early warning approach to monitor COVID-19 activity with multiple digital traces in near real time. Sci Adv. 2021;7(10):eabd6989.
Nakajima J. Time-varying parameter VAR model with stochastic volatility: An overview of methodology and empirical applications. 2011.
Nakajima J, Watanabe T. Bayesian analysis of time-varying parameter vector autoregressive model with the ordering of variables for the Japanese economy and monetary policy. In: Institute of Economic Research, Hitotsubashi University. 2011.
Nakajima J, Kasuya M, Watanabe T. Bayesian analysis of time-varying parameter vector autoregressive model for the Japanese economy and monetary policy. J Japan Int Econ. 2011;25(3):225–45.
Primiceri GE. Time varying structural vector autoregressions and monetary policy. Rev Econ Stud. 2005;72(3):821–52.
Wood SN. Generalized additive models: an introduction with R: chapman and hall/CRC. 2017.
Wood SN, Augustin NH. GAMs with integrated model selection using penalized regression splines and applications to environmental modelling. Ecol Model. 2002;157(2–3):157–77.
Lii A, Temte J, Barlow S, Goss M, Temte E, Zaborek J, Uzicanin A. Assessment and comparison of the ILI case definition in clinical and school-based community settings: ORCHARDS/IISP. In: Annals Family Med. 2023.
Baltrusaitis K, Vespignani A, Rosenfeld R, Gray J, Raymond D, Santillana M. Differences in regional patterns of influenza activity across surveillance systems in the United States: comparative evaluation. JMIR Public Health Surveill. 2019;5(4):e13403.
Fitzner J, Qasmieh S, Mounts AW, Alexander B, Besselaar T, Briand S, Brown C, Clark S, Dueger E, Gross D. Revision of clinical case definitions: influenza-like illness and severe acute respiratory infection. Bull World Health Organ. 2018;96(2):122.
Krueger F. bvarsv: Bayesian analysis of a vector autoregressive model with stochastic volatility and time-varying parameters. R package version 2015; 1.
Benimana TD, Lee N, Jung S, Lee W, Hwang S-S. Epidemiological and spatio-temporal characteristics of COVID-19 in Rwanda. Global Epidemiol. 2021;3:100058.
Lee SS, Wong NS. The clustering and transmission dynamics of pandemic influenza A (H1N1) 2009 cases in Hong Kong. J Infect. 2011;63(4):274–80.
Huang X, Lambert S, Lau C, Magalhaes RJS, Marquess J, Rajmokan M, Milinovich G, Hu W. Assessing the social and environmental determinants of pertussis epidemics in Queensland, Australia: a Bayesian spatio-temporal analysis. Epidemiol Infect. 2017;145(6):1221–30.
Liu K, Sun J, Liu X, Li R, Wang Y, Lu L, Wu H, Gao Y, Xu L, Liu Q. Spatiotemporal patterns and determinants of dengue at county level in China from 2005–2017. Int J Infect Dis. 2018;77:96–104.
Chen JJ, Guo TC, Song SX, Shao ZJ, Liu K. Epidemiological characteristics and the development of spatiotemporal analysis models on hemorrhagic fever with renal syndrome in China. Zhonghua Liu Xing Bing Xue Za Zhi. 2020;41(10):1735–40.
Seyoum D, Yewhalaw D, Duchateau L, Brandt P, Rosas-Aguirre A, Speybroeck N. Household level spatio-temporal analysis of Plasmodium falciparum and Plasmodium vivax malaria in Ethiopia. Parasit Vectors. 2017;10:1–11.
Zhang T, Zhang X, Ma Y, Zhou XA, Liu Y, Feng Z, Li X. Bayesian spatio-temporal random coefficient time series (BaST-RCTS) model of infectious disease. Math Biosci. 2014;258:93–100.
Tian L, Liang F, Xu M, Jia L, Pan X, Clements ACA. Spatio-temporal analysis of the relationship between meteorological factors and hand-foot-mouth disease in Beijing, China. BMC Infect Dis. 2018;18:1–10.
Acknowledgements
The authors would like to thank institute of New Productive Forces in Health for financial support.
Funding
This work was supported by the Sichuan Province Science and Technology Support Program (2021YFS0001-LH, 2022YFS0229, 2022YFS0641), National Natural Science Foundation of China (81602935), Chongqing Science and Technology Program (cstc2020jscx-cylhX0003), Sichuan Tianfu New Area Publice Health Center (H230821).
Author information
Authors and Affiliations
Contributions
JY collected the data, consulted the literature, analyzed the data, and was the major contributor in writing the manuscript. HMW consulted the literature, wrote the manuscript, and checked the full manuscript. MSC and XYH consulted and sorted out the literature. QD and CY made a review of the relevant literature. WHZ, YM, FY, YW wrote part of the manuscript and checked the full manuscript. CHY made a partial review of the relevant literature. TZ has made important contributions to analysis and manuscript preparation. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Yu, J., Wang, H., Chen, M. et al. A novel method to select time-varying multivariate time series models for the surveillance of infectious diseases. BMC Infect Dis 24, 832 (2024). https://doi.org/10.1186/s12879-024-09718-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12879-024-09718-x