Study on the prediction effect of a combined model of SARIMA and LSTM based on SSA for influenza in Shanxi Province, China
BMC Infectious Diseases volume 23, Article number: 71 (2023)
Abstract
Background
Influenza is an acute respiratory infectious disease that is highly infectious and seriously damages human health. Reasonable prediction is of great significance to control the epidemic of influenza.
Methods
Influenza data were obtained from the Shanxi Provincial Center for Disease Control and Prevention. Seasonal-trend decomposition using Loess (STL) was adopted to analyze the seasonal characteristics of influenza in Shanxi Province, China, from the 1st week of 2010 to the 52nd week of 2019. The seasonal autoregressive integrated moving average (SARIMA) model performs poorly on the nonlinear part of a series, and directly predicting the original sequence gives limited accuracy. To address both problems, this study established the SARIMA model, a combined model of SARIMA and Long Short-Term Memory neural network (SARIMA-LSTM), and a combined SARIMA-LSTM model based on singular spectrum analysis (SSA-SARIMA-LSTM), and compared their predictions to identify the best model. The Mean Squared Error (MSE), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) were used to evaluate the performance of the models.
Results
The influenza time series in Shanxi Province from the 1st week of 2010 to the 52nd week of 2019 showed a year-by-year decrease with obvious seasonal characteristics. The peak period was concentrated from the end of one year to the beginning of the next. The SSA-SARIMA-LSTM model had the best fitting and prediction performance. Compared with the SARIMA model, the MSE, MAE and RMSE of the SSA-SARIMA-LSTM model decreased by 38.12%, 17.39% and 21.34%, respectively, in fitting performance, and by 42.41%, 18.69% and 24.11%, respectively, in prediction performance. Compared with the SARIMA-LSTM model, the MSE, MAE and RMSE of the SSA-SARIMA-LSTM model decreased by 28.26%, 14.61% and 15.30%, respectively, in fitting performance, and by 36.99%, 7.22% and 20.62%, respectively, in prediction performance.
Conclusions
The fitting and prediction performances of the SSA-SARIMA-LSTM model were better than those of the SARIMA and SARIMA-LSTM models. The SSA-SARIMA-LSTM model can therefore be applied to the prediction of influenza and can offer a leg-up for public policy.
Background
Influenza, whose incidence often ranks first among notifiable infectious diseases, is an acute respiratory infectious disease caused by the influenza virus that seriously damages human health [1]. The main clinical manifestations are acute high fever, body aches and fatigue, accompanied by respiratory symptoms such as cough or sore throat. Some special groups, such as infants, pregnant women, the elderly and those with chronic underlying diseases, are prone to complications and even death. Because the influenza virus mutates easily and is readily transmitted, it leads to outbreaks and epidemics, and the occurrence and development of influenza often cause major public health problems [2]. In recent years, global integration has greatly increased population mobility, which raises the risk of influenza pandemics [3]. Accurate and reasonable prediction provides a reliable basis for prevention and control: it enables abnormal trends to be detected in time and the epidemic to be contained at an early stage, thus reducing health hazards and economic burdens [4].
In the past, linear models such as the grey prediction model [5], the exponential smoothing method [6], the autoregressive integrated moving average (ARIMA) model and the SARIMA model [7] were often used to predict infectious diseases. The SARIMA model, one of the classical prediction models for infectious diseases, is often used as a benchmark to evaluate new modelling methods [8]. In 2021, Song used a SARIMA model to predict the incidence of influenza-like illness (ILI) in high-risk regions of the United States from 2011 to 2020. The results showed that the SARIMA model was suitable for forecasting the ILI incidence of Mississippi [9]. However, these models are built on linear information and show certain limitations in the nonlinear part [10]. Moreover, the SARIMA model needs to make a nonstationary sequence stationary by differencing, which loses some information and reduces prediction accuracy. Therefore, nonlinear models based on machine learning theory, such as Support Vector Machines (SVM) [11], Multivariate Adaptive Regression Splines (MARS) [12], Random Forest (RF) [13] and Recurrent Neural Networks (RNN) [14], are widely used in the field of time series prediction. In 2022, Dai used a hybrid model combining XGBoost, four GARCH models and an MLP model (XGBoost-GARCH-MLP) to predict PM concentration values and volatility. The results showed that the combined model based on machine learning was more accurate in predicting PM values [15]. Compared with the SARIMA model, the RNN has strong nonlinear, mapping and adaptive characteristics, which can effectively improve prediction accuracy [14]. Moreover, compared with other machine learning models, the RNN has deeper hidden layers and stronger learning ability, which lets it express the nonlinearity of a time series and also account for temporal correlation. However, when the time series is long, the RNN suffers from gradient explosion and a lack of long-term memory [8].
Long Short-Term Memory (LSTM) neural networks introduce a unique memory unit structure, which makes up for the deficiency of the RNN and is more suitable for processing long time series data [16]. This property makes the LSTM one of the most powerful tools for predicting nonlinear time series in practical applications. In 2017, Li used SVM, Naive Bayes, Decision Tree, Multilayer Perceptron, RNN and LSTM to predict the stock data of China from 2008 to 2015. The results showed that the Multilayer Perceptron, RNN and LSTM performed better [17]. In recent years, LSTM and other neural network models have been gradually applied in the field of public health. In 2022, Zhu established an LSTM model to predict the incidence of influenza in Fujian Province, China, from 2019 to 2021. The results showed that LSTM had good predictive performance [18]. In 2021, Dai established a deep learning model for atmospheric pollutants by combining a one-dimensional multi-scale convolution kernel (ODMSCNN) with an LSTM on the basis of temporal and spatial characteristics. The results showed that the air pollutant concentration prediction model based on ODMSCNN-LSTM had a better prediction effect than the multilayer perceptron (MLP), CNN and LSTM models [19].
Time series are usually considered to consist of both linear and nonlinear components [20]. Neither a linear model nor a nonlinear model alone can fully capture all the information in a time series. Based on this, many scholars have proposed combination models composed of linear and nonlinear models. In 2016, Oliveira proposed the ARIMA-SVR combination model, and the results showed that the hybrid model can effectively improve prediction accuracy [21]. In 2021, Zhai used the combination models ARIMA-ERNN and ARIMA-BPNN to predict brucellosis in Shanxi Province, China. The results showed that the combination models were better than the single ARIMA model [10]. Nevertheless, the above models all predicted the original series. When the characteristics of the original series are complex, the accuracy of using a combined model to predict the original sequence directly is still insufficient [22].
To solve this problem, this research proposed a combination model construction strategy based on decomposition and recombination. Singular spectrum analysis (SSA) can decompose a complex original sequence into several simple and regular subsequences [23]. The prediction model can then be established indirectly by modeling the subsequences and superimposing the results, which can improve prediction accuracy. In recent years, prediction models based on SSA have been gradually applied to public health, stock price prediction and mechanical engineering. In 2021, Mahdi used the SSA method to analyze and predict COVID-19. The results showed that the combined prediction model had significantly higher accuracy than the single model [24].
In 2019, Xiao et al. used SSA to decompose and reconstruct stock prices and forecast them. The results showed that the combined prediction model performed better than the single prediction method [25]. In 2019, Zhang et al. used SSA to decompose and reconstruct short-term wind power, and modeled and predicted each decomposed sequence separately to improve prediction accuracy [26].
In this study, the SARIMA, SARIMA-LSTM and SSA-SARIMA-LSTM models were established based on the weekly Influenza-Like Illness (ILI) patient ratio from 2010 to 2018 in Shanxi Province, and their fitting effects were evaluated. The three models were then used to predict the 2019 influenza data to evaluate their prediction performance.
The innovation of this paper lies in establishing the indirect combination prediction model SSA-SARIMA-LSTM through the idea of decomposition and combination. Compared with a single prediction model and with direct modeling of the original series, SSA-SARIMA-LSTM is more accurate in predicting influenza. By making prevention more targeted and the allocation of medical resources more reasonable, this study will provide more effective theoretical support for the prevention and control of influenza in Shanxi Province, China, and help reduce the health hazards and economic burden caused by influenza.
Methods
Data sources
In this study, a total of 520 weeks of influenza data from the 1st week of 2010 to the 52nd week of 2019 were obtained from the Shanxi Provincial Center for Disease Control and Prevention, China. All cases were diagnosed under the ‘Diagnostic criteria for influenza (WS 285-2008)’ [27]. The influenza data include the number of ILI patients and the total number of outpatient and emergency cases in the same period. To eliminate the difference [28], the ILI patient ratio (ILI%) was calculated as
$$\text{ILI\%} = \frac{\text{number of ILI patients}}{\text{total number of outpatient and emergency cases}} \times 100\%$$(1)
The influenza cases from the 1st week in 2010 to the 52nd week in 2019 were assembled as weekly counts. The weekly ILI% from the 1st week in 2010 to the 52nd week in 2018 were used to build the SARIMA model. The fitted data of the SARIMA model were taken as the input of neural networks, which were divided into two sections: a training set and a verification set. The data from the 1st week of 2010 to the 52nd week of 2017 were used as the training set to construct the neural network, and the data from the 1st week of 2018 to the 52nd week of 2018 were used as the verification set to validate the neural network. The weekly ILI% from the 1st week to the 52nd week of 2019 were used as the test set to test the prediction performance of the three models.
Analysis of influenza sequence characteristics
STL [29] was used to decompose the influenza series of Shanxi Province from the 1st week of 2010 to the 52nd week of 2019 into a long-term trend, a seasonal trend and a random effect as follows:
$$X_{t} = T_{t} + S_{t} + I_{t}$$(2)
where \(X_{t}\) is the actual value of ILI% at time t, and \(T_{t}\), \(S_{t}\) and \(I_{t}\) are the long-term trend, seasonal trend and random effect, respectively.
SARIMA model
SARIMA, a classic model in time series analysis, is usually written as SARIMA (p, d, q) (P, D, Q) _{s} as follows [30]:
$$\theta_{p}(B)\,\Theta_{P}(B^{s})\,(1 - B)^{d}\,(1 - B^{s})^{D} X_{t} = \phi_{q}(B)\,\Phi_{Q}(B^{s})\,\varepsilon_{t}$$(3)
where B is the backshift operator, \(\varepsilon_{t}\) is the white noise term, and \(\Theta_{P}\), \(\theta_{p}\), \(\Phi_{Q}\) and \(\phi_{q}\) are polynomials of order P, order p, order Q and order q, respectively. p and q are the orders of the autoregressive and moving average terms. d and D are the orders of trend differencing and seasonal differencing. P, Q and s are the seasonal autoregressive order, the seasonal moving average order and the seasonal period, respectively.
In this study, the weekly ILI% from the 1st week of 2010 to the 52nd week of 2018 were used to build the SARIMA model in the following steps. First, the stationarity of the sequence was checked by using the Augmented Dickey-Fuller (ADF) test, the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test and the autocorrelation function (ACF) plot. If the p value of the ADF test was less than the significance level and the autocorrelation coefficient decayed rapidly to 0, the sequence was considered stationary. If the p value of the KPSS test was less than the significance level, the sequence was considered nonstationary. The Ljung-Box test was used to test whether the sequence was white noise; if the p value was less than the significance level, the sequence was not white noise. Second, when the original sequence was stationary and nonrandom, the model was constructed directly. When the original sequence was not stationary, d-order or D-order differencing was used to make the sequence stationary before the model was constructed. Afterwards, Python grid search was used to fit the SARIMA model automatically. The optimal model was selected by the minimum Akaike information criterion (AIC), and the success of model fitting was judged by the residual white noise test. Maximum likelihood estimation (MLE) was used to estimate and test the model parameters [8]. Finally, the data from the 1st week to the 52nd week of 2019 were predicted by this model, and the prediction effect of the model was tested.
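The grid search over SARIMA orders by minimum AIC can be sketched as follows. This is a minimal illustration, not the authors' script: the function names `sarima_grid`, `select_min_aic` and `statsmodels_aic` are hypothetical, and the use of the statsmodels `SARIMAX` class is an assumption, since the paper only states that a Python grid search was used.

```python
from itertools import product


def sarima_grid(p_values, q_values, P_values, Q_values, d, D, s):
    """Yield candidate (order, seasonal_order) pairs for fixed d, D and s."""
    for p, q, P, Q in product(p_values, q_values, P_values, Q_values):
        yield (p, d, q), (P, D, Q, s)


def select_min_aic(candidates, fit_aic):
    """Return (aic, order, seasonal_order) with the smallest AIC.

    fit_aic(order, seasonal_order) is any callable returning an AIC value.
    Candidates whose fit raises an exception are skipped.
    """
    best = None
    for order, seasonal_order in candidates:
        try:
            aic = fit_aic(order, seasonal_order)
        except Exception:
            continue
        if best is None or aic < best[0]:
            best = (aic, order, seasonal_order)
    return best


def statsmodels_aic(y):
    """Build a fit_aic callable backed by statsmodels (assumed library)."""
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    def fit_aic(order, seasonal_order):
        model = SARIMAX(y, order=order, seasonal_order=seasonal_order)
        return model.fit(disp=False).aic

    return fit_aic
```

With a weekly series `y`, a call such as `select_min_aic(sarima_grid(range(3), range(3), range(3), range(3), 1, 1, 52), statsmodels_aic(y))` would search the grid; the candidate ranges shown here are illustrative, and fitting with a seasonal period of 52 can be slow.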
LSTM model
The LSTM hidden layer module, also known as the memory module (A), is shown in Fig. 1. It consists of a cell and three gates: the input gate, the forget gate and the output gate [16].
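The mathematical formulas of the LSTM used in this study are as follows [31]. By receiving the output value \(h_{t-1}\) of the previous state and the input value \(X_{t}\) of the current moment, the forget gate uses the sigmoid function to determine the retention degree \(f_{t}\) of the transmitted information.
$$f_{t} = \sigma \left( W_{f} \cdot [h_{t-1}, X_{t}] + b_{f} \right)$$(4)
The input gate updates the current state \(C_{t}\) by using sigmoid and tanh functions to pass to the next memory cell.
$$i_{t} = \sigma \left( W_{i} \cdot [h_{t-1}, X_{t}] + b_{i} \right)$$(5)
$$\tilde{C}_{t} = \tanh \left( W_{C} \cdot [h_{t-1}, X_{t}] + b_{C} \right)$$(6)
$$C_{t} = f_{t} \odot C_{t-1} + i_{t} \odot \tilde{C}_{t}$$(7)
The output gate outputs the value \(h_{t}\) at the current time.
$$o_{t} = \sigma \left( W_{o} \cdot [h_{t-1}, X_{t}] + b_{o} \right)$$(8)
$$h_{t} = o_{t} \odot \tanh \left( C_{t} \right)$$(9)
The model controls the flow of information in memory units and neural networks through the gates. \(i_{t}\) and \(o_{t}\) are the activations of the input and output gates, \(\tilde{C}_{t}\) is the candidate state, and \(\odot\) is the element-wise product. W is the weight matrix, b is the bias term and \(\sigma\) is the sigmoid function. In this paper, the LSTM was trained for 1000 iterations with a batch size of 256, a learning rate of 0.001, a time step of 1 and one hidden layer; the Adam algorithm was adopted to optimize the parameters, and other parameters were left at their defaults. The range of the number of hidden layer neurons was calculated by the following empirical formula, where m and n are the numbers of neurons in the input layer and output layer respectively, both set to 1, and k is a constant between 1 and 10:
$$N_{\text{hidden}} = \sqrt{m + n} + k$$(10)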
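The gate computations described above can be sketched as one forward step of a single-unit LSTM cell with scalar input. This is a minimal sketch under the standard LSTM formulation, which is assumed here; the `params` layout and the function names are hypothetical, and the authors' model was built with a deep learning library rather than hand-written code.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a single-unit LSTM cell.

    params maps each of "f", "i", "c", "o" to a tuple (w_h, w_x, b):
    weight on h_prev, weight on x_t, and bias (hypothetical layout).
    Returns the new output h_t and the new cell state c_t.
    """
    def pre(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("f"))         # forget gate: share of c_prev that is kept
    i_t = sigmoid(pre("i"))         # input gate: share of the candidate admitted
    c_tilde = math.tanh(pre("c"))   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde
    o_t = sigmoid(pre("o"))         # output gate
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t
```

With all weights and biases at zero, every gate evaluates to 0.5 and the candidate state to 0, so the cell state is simply halved at each step; this makes the role of the forget gate easy to see.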
SARIMA-LSTM model and prediction process of influenza
The SARIMA model is suitable for extracting the linear part of the original time series, but it shows certain limitations in the nonlinear part [20]. The LSTM model has strong nonlinearity, mapping ability and adaptability, which can reduce the error of the SARIMA model. Therefore, the combined SARIMA-LSTM model was constructed in this study to improve prediction accuracy. Figure 2 shows the prediction process of influenza and the construction framework of the SARIMA-LSTM model, which includes four parts: data preprocessing, SARIMA model construction, SARIMA-LSTM model tuning and data prediction.

(1) Data processing preparation. The original influenza data were preprocessed, and the data set was divided into a training set and a test set. The training set, from the 1st week of 2010 to the 52nd week of 2018, was used to construct the optimal SARIMA model. The test set was used to verify the performance of the model.
(2) SARIMA model construction. The training set was input into SARIMA to build the optimal SARIMA model. The fitted values were obtained, and the error was calculated by the following formula:
$${\text{e}}_{t} = y_{t} - \hat{L}_{t}$$(11)
where \(y_{t}\) is the actual value of the original series, \(\hat{L}_{t}\) is the fitted value of the optimal SARIMA model, and \({\text{e}}_{t}\) is the error, also known as the residual. The ILI% data of the first 53 weeks were lost in this step because the optimal SARIMA model applies a first-order difference and a seasonal difference. The established SARIMA model was used to obtain the fitted values from the 2nd week of 2011 to the 52nd week of 2018. The data from the 2nd week of 2011 to the 52nd week of 2017 were used as the training part of the LSTM model, and the data from the 1st week of 2018 to the 52nd week of 2018 were used as the validation part of the LSTM model.
(3) SARIMA-LSTM model tuning. The training part was used as the input, and ILI% at the same time point was used as the output to construct the SARIMA-LSTM model; the validation part was used to optimize the model. To improve training speed and prediction accuracy, the Min-Max normalization method [30] was used to normalize the original data:
$$X^{*} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$(12)
where \(X^{*}\) is the normalized value of the data, X is the original data, and \(X_{\max}\) and \(X_{\min}\) are the maximum and minimum values, respectively. Since the number of hidden layer nodes has a great impact on the performance of the model, we chose MSE, MAE and RMSE as the evaluation indexes of network performance. Through experiments, the number of hidden layer neurons that gave the smallest MSE, MAE and RMSE was selected to construct the optimal SARIMA-LSTM model.
(4) Data prediction. The established SARIMA model was used to predict influenza data from the 1st week to the 52nd week of 2019, and the predicted values were used as the input values of the SARIMA-LSTM model to obtain the output values. Inverse normalization was used to restore the output values to the original scale. The predicted values were compared with the real values of the test set to evaluate the prediction performance of the model.
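The Min-Max normalization in step (3) and the inverse normalization in step (4) can be sketched as below. This is a minimal sketch with hypothetical function names; taking the minimum and maximum from the training portion only is an assumption, since the paper does not state which data the bounds were computed on.

```python
def fit_min_max(values):
    """Return (minimum, maximum) of the reference data used for scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max scaled")
    return lo, hi


def normalize(values, lo, hi):
    """Map values to [0, 1] relative to the bounds lo and hi."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Invert normalize(): map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]
```

For example, with bounds fitted on `[2.0, 4.0, 6.0]`, the value 4.0 is mapped to 0.5, and `denormalize` maps 0.5 back to 4.0. New values outside the fitted bounds map outside [0, 1], which is worth keeping in mind when the test year contains a new extreme.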
Singular spectrum analysis
Singular spectrum analysis, proposed by Broomhead and King in 1986, has been widely used for time series decomposition in recent years. By transforming the original sequence into a trajectory matrix for decomposition and reconstruction, SSA decomposes the sequence into a long-term trend, a periodic trend and noise, which can then be forecast. The specific decomposition process is as follows:

(1) Embedding. In this paper, the original sequence \(X = (X_{1}, X_{2}, \ldots, X_{N})\) was transformed into a sequence of K vectors of length L \((2 \le L \le N/2)\). L is an integer called the window length, and K is an integer chosen so that the trajectory matrix includes all values, \(K = N - L + 1\). When the time series has obvious periodic characteristics, the window length is set to an integer multiple of the period that is less than one-third of the total length [23].
$$X = \left[ {\begin{array}{*{20}c} {X_{1} } & {X_{2} } & \cdots & {X_{K} } \\ {X_{2} } & {X_{3} } & \cdots & {X_{K + 1} } \\ \vdots & \vdots & {} & \vdots \\ {X_{L} } & {X_{L + 1} } & \cdots & {X_{N} } \\ \end{array} } \right]$$(13)
(2) Singular value decomposition. Let \(S = XX^{T}\), let \(U_{1}, U_{2}, \ldots, U_{L}\) be the eigenvectors of S, and let \(\lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{L}\) be the corresponding eigenvalues. Let \(V_{i} = X^{T} U_{i} /\sqrt{\lambda_{i}}\). Then \(U_{i}\) and \(V_{i}\) are the left and right singular vectors of the matrix X, respectively, and \(\sqrt{\lambda_{i}}\) \((i = 1, 2, \ldots, L)\) are the corresponding singular values. X can then be expressed as \(X = E_{1} + E_{2} + \cdots + E_{L}\), where \(E_{i} = \sqrt{\lambda_{i}}\, U_{i} V_{i}^{T}\) \((i = 1, 2, \ldots, L)\).
(3) Grouping and diagonal averaging. We divided the matrices \(E_{i}\) into r disjoint groups according to the contribution rate of the singular values (\(X = E_{I_{1}} + E_{I_{2}} + \cdots + E_{I_{r}}\)). Then, using anti-diagonal averaging, we transformed each grouped matrix (\(E_{I_{1}}, E_{I_{2}}, \ldots, E_{I_{r}}\)) into a new sequence of length N, giving r sequences in total. In this way, the original sequence X was decomposed into r subsequences of length N whose sum was X.
$$e_{k} = \begin{cases} \dfrac{1}{k}\sum\limits_{p = 1}^{k} e_{p,k - p + 1}^{*} & \text{for } 1 \le k \le m^{*} \\ \dfrac{1}{m^{*}}\sum\limits_{p = 1}^{m^{*}} e_{p,k - p + 1}^{*} & \text{for } m^{*} \le k \le n^{*} \\ \dfrac{1}{N - k + 1}\sum\limits_{p = k - n^{*} + 1}^{N - n^{*} + 1} e_{p,k - p + 1}^{*} & \text{for } n^{*} \le k \le N \end{cases}$$(14)
where E is an \(m \times n\) matrix with elements \(e_{ij}\), \(m^{*} = \min\{m, n\}\), \(n^{*} = \max\{m, n\}\), \(e_{ij}^{*} = e_{ij}\) if \(m \le n\) and \(e_{ij}^{*} = e_{ji}\) otherwise, \(N = m + n - 1\) is the total number of anti-diagonals, and \(k = 1, 2, \ldots, N\). According to the above formulas, E is transformed into a one-dimensional time series \(e_{1}, e_{2}, \ldots, e_{N}\) [32].
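The three SSA steps above (embedding, singular value decomposition, grouping with diagonal averaging) can be sketched with NumPy as follows. This is an illustrative sketch only: the authors ran SSA in MATLAB, and the function name `ssa_decompose` and its arguments are hypothetical. Group indices are zero-based, so singular values 1, 2–6 and 7–52 correspond to `[0]`, `range(1, 6)` and `range(6, 52)`.

```python
import numpy as np


def ssa_decompose(series, window, groups):
    """Decompose a series into one reconstructed component per group.

    window is the window length L; groups is a list of iterables of
    zero-based singular-value indices. Returns a list of arrays, each of
    the same length as the input series.
    """
    x = np.asarray(series, dtype=float)
    n = x.size
    k = n - window + 1
    # Embedding: L x K trajectory matrix whose column j is x[j : j + L].
    traj = np.column_stack([x[j:j + window] for j in range(k)])
    # Singular value decomposition of the trajectory matrix.
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    components = []
    for idx in groups:
        idx = list(idx)
        # Grouping: sum of the rank-one matrices sqrt(lambda_i) U_i V_i^T.
        mat = (u[:, idx] * s[idx]) @ vt[idx, :]
        # Diagonal averaging: mean over each anti-diagonal (row + column = t),
        # read as an ordinary diagonal of the row-reversed matrix.
        flipped = mat[::-1, :]
        rec = np.array([flipped.diagonal(t - (window - 1)).mean() for t in range(n)])
        components.append(rec)
    return components
```

When the groups together cover all singular values, the components add back up to the original series, which is a convenient sanity check before modeling each component separately.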
SSA-SARIMA-LSTM model and prediction process of influenza
It is difficult for a single model to capture the comprehensive characteristics of a signal for accurate prediction. Therefore, we proposed an indirect prediction method based on SSA. Compared with the SARIMA-LSTM model, the construction of the SSA-SARIMA-LSTM model added a sequence decomposition step and a sequence combination step (Fig. 3).

(1) Data processing preparation and decomposition. First, we used SSA to decompose the original sequence from the 1st week of 2010 to the 52nd week of 2018 into multiple simple subsequences. The partitioning of the data sets was consistent with that of the SARIMA-LSTM model.
(2) SARIMA-LSTM model building. Second, we constructed a SARIMA-LSTM model for each subsequence.
(3) Data prediction. Third, the predicted values of the optimal SARIMA model for each subsequence from the 1st week to the 52nd week of 2019 were taken as the input values of the corresponding SARIMA-LSTM model to obtain the output predicted values, and inverse normalization was used to restore the output predicted values of the subsequences to the original scale.
(4) Sequence combination. Finally, the predicted values of the subsequences were added to obtain the predicted values of the SSA-SARIMA-LSTM model from the 1st week to the 52nd week of 2019. The predicted values were compared with the real values of the test set to evaluate the prediction performance of the model.
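The decomposition-and-recombination idea in the steps above reduces to forecasting each subsequence separately and adding the forecasts. The sketch below shows only that combination logic; `forecast` is a hypothetical user-supplied callable standing in for a fitted SARIMA-LSTM model of one subsequence, and the function name is an assumption.

```python
def combine_subsequence_forecasts(subsequences, forecast, horizon):
    """Forecast each subsequence separately and sum the forecasts.

    forecast(history, horizon) must return a list of `horizon` values.
    Returns the combined forecast as a list of length `horizon`.
    """
    per_component = [forecast(list(sub), horizon) for sub in subsequences]
    for values in per_component:
        if len(values) != horizon:
            raise ValueError("each forecast must contain `horizon` values")
    # Add the component forecasts step by step.
    return [sum(step) for step in zip(*per_component)]
```

For instance, with a naive `forecast` that repeats the last observed value, the subsequences `[1, 2, 3]` and `[10, 10, 10]` give a combined two-step forecast of `[13, 13]`.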
Indicators of model performance
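Three performance indexes, MSE, MAE and RMSE, were used to assess the fitting and prediction effects of the models:
$$\text{MSE} = \frac{1}{N}\sum\limits_{k = 1}^{N} \left( X_{k} - \hat{X}_{k} \right)^{2}$$(15)
$$\text{MAE} = \frac{1}{N}\sum\limits_{k = 1}^{N} \left| X_{k} - \hat{X}_{k} \right|$$(16)
$$\text{RMSE} = \sqrt{\frac{1}{N}\sum\limits_{k = 1}^{N} \left( X_{k} - \hat{X}_{k} \right)^{2}}$$(17)
where \(X_{k}\) is the actual value at time k, \(\hat{X}_{k}\) is the predicted value of the model, and N is the sample size.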
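The three indicators can be computed directly from paired actual and predicted values. The following is a minimal sketch using the standard definitions of MSE, MAE and RMSE; the function names are illustrative.

```python
import math


def _check(actual, predicted):
    """Validate that the two sequences are non-empty and of equal length."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def mse(actual, predicted):
    """Mean squared error."""
    _check(actual, predicted)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))
```

Because RMSE is the square root of MSE, the two always rank models in the same order; MAE can rank them differently because it penalizes large errors less heavily.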
Data analysis
Excel version 2021 was used for data collection and collation. Anaconda version 4.10.3 was used to establish STL, the SARIMA model, the SARIMA-LSTM combined model and the SSA-SARIMA-LSTM combined model. MATLAB version 2019 was used for SSA.
Results
Seasonal characteristics of influenza
STL was used to study the time series of ILI% in Shanxi Province from the 1st week of 2010 to the 52nd week of 2019, and the results are shown in Fig. 4. The original data, long-term trend, seasonal trend and residuals are shown from top to bottom. The long-term trend showed that influenza in Shanxi Province decreased year by year. The seasonal trend revealed obvious seasonality and periodicity, with a cycle of 1 year (52 weeks). Within a cycle, the peak of influenza in Shanxi Province, China, occurred mainly at the beginning and end of the year.
SARIMA model
Weekly ILI% from the 1st week of 2010 to the 52nd week of 2018 in Shanxi Province were used to build the SARIMA model. The ACF of the original series showed obvious seasonal characteristics (Fig. 5). The ADF test gave t = −5.249, P < 0.001; the KPSS test gave χ^{2} = 1.251, P = 0.010; and the Ljung-Box test gave χ^{2} = 352.724, P < 0.001. Given the KPSS result and the seasonal pattern in the ACF, the original sequence was judged to be nonstationary and nonrandom. The series became stationary after a first-order difference and a seasonal difference, and the differenced sequence was not white noise (Table 1). Therefore, SARIMA (p,1,q) (P,1,Q) _{52} was preliminarily selected.
By using Python grid search to find the minimum AIC, we finally determined the SARIMA (2,1,1) (2,1,0) _{52} model (AIC = 95.163). The parameters of the SARIMA (2,1,1) (2,1,0) _{52} model were statistically significant, and the residual sequence of the model was a random sequence (χ^{2} = 0.020, P = 0.900) (Table 2).
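The effect of applying a seasonal difference (lag 52) and a first-order difference on the usable sample can be checked with a short sketch. Following the paper, each year is treated as 52 weeks; the series values are placeholders, since only the lengths matter here, and the helper name `difference` is illustrative.

```python
def difference(series, lag=1):
    """Return the lag-differenced series x[i] - x[i - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Week labels for 2010-2018, with 52 weeks per year as in the paper.
weeks = [(year, week) for year in range(2010, 2019) for week in range(1, 53)]
series = list(range(len(weeks)))  # placeholder values, not real ILI% data

# Seasonal difference with period 52, followed by a first-order difference.
diffed = difference(difference(series, lag=52), lag=1)

lost = len(series) - len(diffed)   # observations consumed by differencing
first_usable = weeks[lost]         # first week that still has a value
```

Here `lost` is 53 and `first_usable` is week 2 of 2011: of the 468 training weeks, 415 remain after differencing.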
SARIMA-LSTM model
The fitted values of the SARIMA (2,1,1) (2,1,0) _{52} model from the 2nd week of 2011 to the 52nd week of 2017 were used as inputs, and the actual values at the same time points were used as outputs to establish the SARIMA-LSTM model. The data from weeks 1 through 52 of 2018 were used to validate the neural network. According to formula (10), the number of hidden layer neurons of the SARIMA-LSTM model was between 2 and 12. Through experiments, the MSE, MAE and RMSE of the validation set reached their minimum when the number of hidden layer neurons was 4; that is, the model used 1 hidden layer with 4 nodes (Table 3). Finally, the predicted values of the SARIMA (2,1,1) (2,1,0) _{52} model from weeks 1 through 52 of 2019 were used as the inputs. The established SARIMA-LSTM model was used to obtain the output predicted values, and inverse normalization was performed.
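The stated search range of 2 to 12 hidden neurons can be reproduced under one reading of the empirical rule. The sketch below assumes the commonly used form sqrt(m + n) + k with k from 1 to 10, and assumes that the bounds are rounded outward to integers; both are assumptions, and the function name is hypothetical.

```python
import math


def hidden_neuron_bounds(m, n, k_min=1, k_max=10):
    """Integer search bounds from the empirical rule sqrt(m + n) + k.

    m and n are the numbers of input and output neurons. The lower bound
    is rounded down and the upper bound rounded up (assumed convention).
    """
    base = math.sqrt(m + n)
    return math.floor(base + k_min), math.ceil(base + k_max)
```

With one input and one output neuron, sqrt(2) + k runs from about 2.41 to about 11.41, so outward rounding gives the bounds 2 and 12.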
SSA-SARIMA-LSTM model
The original influenza sequence in Shanxi Province was complex, and the accuracy of direct prediction was insufficient. In this study, SSA was used to decompose the ILI% of 468 weeks in Shanxi Province from the 1st week of 2010 to the 52nd week of 2018, and multiple simple and regular subsequences were obtained. The SSA-SARIMA-LSTM model was obtained by building a SARIMA-LSTM model for each subsequence.
SSA
L (window length) and r (reconstruction number) must be determined before the SSA decomposition. First, L was set to 52 because of the cyclical nature of influenza. SSA then yielded 52 singular values, sorted from largest to smallest. The contribution rate of the first singular value was the largest (90.86%), followed by singular values 2–6 (6.93%), and the contribution rate of singular values 7–52 was the smallest (2.18%). The matrices corresponding to singular values 1, 2–6 and 7–52 were grouped and reconstructed into three subsequences RC_{1}, RC_{2–6} and RC_{7–52} by diagonal averaging. The three subsequences showed different characteristics: RC_{1} showed a gradual downward trend, RC_{2–6} showed periodic fluctuation, and RC_{7–52} fluctuated around the mean with no obvious trend (Fig. 6).
Constructing SSA-SARIMA-LSTM model
SSA was used to reconstruct the original influenza sequence into three subsequences with different periodicity and stability. Afterwards, we tested each subsequence for stationarity and white noise. The three subsequences were nonrandom. RC_{1} became stationary after a second-order difference, and RC_{2–6} became stationary after a seasonal difference. RC_{7–52} was a stationary sequence. By using Python grid search to find the minimum AIC, SARIMA (2,2,0) (0,0,0)_{52}, SARIMA (2,0,2) (0,1,0)_{52} and SARIMA (2,0,2) (2,0,0)_{52} were used to fit RC_{1}, RC_{2–6} and RC_{7–52}, respectively (Table 4). Consistent with the construction method of the SARIMA-LSTM model, the number of hidden layers of the models was set to 1, and the numbers of hidden layer neurons for the RC_{1}, RC_{2–6} and RC_{7–52} sequences were 11, 11 and 7, respectively (Table 5). The SARIMA-LSTM models were used to predict the subsequences from weeks 1 through 52 of 2019, and the predicted values of the subsequences were added to obtain the predicted values of the SSA-SARIMA-LSTM model.
Comparison of the three models
The SARIMA model, the SARIMA-LSTM model and the SSA-SARIMA-LSTM model were used to predict the ILI% in Shanxi Province from the 1st week to the 52nd week of 2019. The predicted and fitted values of the three models and the actual ILI% are shown in Fig. 7. To evaluate model performance objectively, the fitting and prediction performances of the three models were compared by MSE, MAE and RMSE (Table 6). Compared with the SARIMA model, the MSE, MAE and RMSE of the SARIMA-LSTM model decreased by 13.75%, 3.26% and 7.13%, respectively, in fitting performance, and by 8.60%, 12.36% and 4.39%, respectively, in prediction performance. Compared with the SARIMA model, the MSE, MAE and RMSE of the SSA-SARIMA-LSTM model decreased by 38.12%, 17.39% and 21.34%, respectively, in fitting performance, and by 42.41%, 18.69% and 24.11%, respectively, in prediction performance. Compared with the SARIMA-LSTM model, the MSE, MAE and RMSE of the SSA-SARIMA-LSTM model decreased by 28.26%, 14.61% and 15.30%, respectively, in fitting performance, and by 36.99%, 7.22% and 20.62%, respectively, in prediction performance.
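Because RMSE is the square root of MSE, the percentage reduction in RMSE follows from the percentage reduction in MSE, which gives a quick consistency check on the figures above. The sketch below is an illustrative calculation added here, not part of the original analysis; the function name is hypothetical.

```python
import math


def rmse_reduction_from_mse_reduction(mse_reduction_pct):
    """Percentage drop in RMSE implied by a given percentage drop in MSE.

    If MSE falls by a fraction r, RMSE falls by 1 - sqrt(1 - r).
    """
    return (1.0 - math.sqrt(1.0 - mse_reduction_pct / 100.0)) * 100.0
```

For example, an MSE reduction of 38.12% implies an RMSE reduction of about 21.34%, and 42.41% implies about 24.11%, matching the reported values to within rounding. No such identity links MAE to MSE, so the MAE reductions cannot be checked this way.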
Discussion
As the influenza virus is prone to mutation, it can easily cause an epidemic or even a worldwide pandemic, which increases the burden on health services and economic losses [33]. Timely understanding of the epidemic trend and early detection of outbreaks are the key to preventing and controlling the harm of influenza. ILI% in Shanxi Province showed a downward trend from the 1st week of 2010 to the 52nd week of 2019, and the seasonal decomposition based on STL (Fig. 4) showed significant seasonal characteristics. The peak was mainly concentrated at the beginning and end of each year, which is a typical characteristic of the influenza epidemic in Northern China. The main reason may be related to the cold and dry weather in winter. Low temperature lets the virus survive longer, and low humidity lets the virus stay in the air longer, both of which increase the risk of infection and lead to a high incidence of influenza [34]. Therefore, it is necessary to strengthen health education, raise awareness of prevention and encourage influenza vaccination to avoid the economic and disease burden caused by influenza.
Many factors influence the occurrence of influenza, and it is difficult to collect data on these factors. A time series prediction model addresses this by attributing all the complex external factors to time and predicting future incidence from the series itself, which overcomes the disadvantages of traditional mathematical-statistical methods. The SARIMA model, one of the most classical time series methods in infectious disease prediction, has been proven to have high accuracy and is often used as the baseline for evaluating new models [7]. Therefore, we established the optimal model SARIMA (2,1,1) (2,1,0) _{52} as the baseline to evaluate the performance of other models. However, the results showed that the prediction performance of the optimal model SARIMA (2,1,1) (2,1,0) _{52} still had some deficiencies. The possible reason is that influenza, like most infectious diseases, is a combination of linear and nonlinear sequences [10]. The SARIMA model can accurately extract the linear components of the time series, but it has some limitations when dealing with nonlinear information. The LSTM has strong nonlinearity, mapping ability and self-adaptability, which can effectively improve prediction accuracy. In this paper, we used influenza data to compare the fitting and prediction performances of the SARIMA, SARIMA-LSTM and SSA-SARIMA-LSTM models. The study found that, compared with the SARIMA model, the MSE, MAE and RMSE of the SARIMA-LSTM and SSA-SARIMA-LSTM models declined to different degrees in both fitting and prediction. The possible reason is that the combined models made up for the lack of nonlinear mapping ability of the SARIMA model and improved prediction performance, consistent with the research results of other scholars [10, 30]. The SARIMA and SARIMA-LSTM models both predicted the original series directly.
However, the influenza series was nonstationary, with obvious seasonality, trend and other complex temporal characteristics, so the accuracy obtained by modeling the original series directly was insufficient. We therefore used SSA to decompose the original influenza sequence, predicted each subsequence with the SARIMA-LSTM model, and obtained the final predicted values by superposing the subsequence predictions; this procedure constitutes the SSA-SARIMA-LSTM model. The results showed that the SSA-SARIMA-LSTM model had the best fitting and prediction performances, possibly because the original sequence was decomposed into relatively simple, stable and regular subsequences. Regular subsequences are easier for the model to fit, which improves prediction accuracy; this is also consistent with the results of Kalantari [35].
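The decompose, predict and superpose procedure described above can be sketched in Python with NumPy. This is a minimal sketch of basic SSA (embedding, singular value decomposition, diagonal averaging), not the study's implementation: the synthetic series, the window length of 52, the grouping of components and the `naive_last_value` forecaster (a placeholder for a per-subsequence SARIMA-LSTM model) are all assumptions made for illustration.

```python
import numpy as np


def ssa_decompose(series, window):
    """Split a 1-D series into elementary SSA components that sum to it."""
    x = np.asarray(series, dtype=float)
    k = x.size - window + 1
    # Embedding: column j of the trajectory matrix holds x[j : j + window].
    trajectory = np.column_stack([x[j:j + window] for j in range(k)])
    u, s, vt = np.linalg.svd(trajectory, full_matrices=False)
    components = []
    for i in range(s.size):
        elementary = s[i] * np.outer(u[:, i], vt[i])
        # Diagonal averaging: entries whose row and column indices have the
        # same sum belong to the same time point, so average each
        # anti-diagonal to turn the matrix back into a series.
        flipped = elementary[::-1]
        components.append(
            [flipped.diagonal(d).mean() for d in range(-window + 1, k)]
        )
    return np.array(components)


def forecast_by_superposition(subseries, forecaster):
    """Forecast every subsequence separately and add the forecasts."""
    return sum(forecaster(sub) for sub in subseries)


def naive_last_value(sub):
    """Placeholder one-step forecaster (hypothetical stand-in for SARIMA-LSTM)."""
    return sub[-1]


# Synthetic weekly series with a downward trend and a 52-week cycle.
t = np.arange(104)
series = 5.0 - 0.01 * t + np.sin(2 * np.pi * t / 52)

components = ssa_decompose(series, window=52)

# Hypothetical grouping of the elementary components into three subsequences.
trend = components[0]
periodic = components[1:3].sum(axis=0)
remainder = components[3:].sum(axis=0)
subseries = [trend, periodic, remainder]

one_step = forecast_by_superposition(subseries, naive_last_value)
print(np.allclose(trend + periodic + remainder, series))
```

Because the subsequences add up to the original series, any forecaster can be applied to each one and the results summed; in practice the grouping would be chosen by inspecting the singular values rather than fixed in advance.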
To the best of our knowledge, this is the first study to explore the SSA-SARIMA-LSTM combination model for predicting the incidence of influenza. The study has several strengths. First, the SSA-SARIMA-LSTM model combines the strength of SARIMA on linear features with that of a neural network on nonlinear features, which enhances the capability of a single SARIMA model; at the same time, its decomposition-and-recombination design compensates for the limited accuracy of predicting the original sequence directly. Second, we selected the optimal SARIMA model with a grid search in Python that automatically finds the parameter combination with the minimum AIC, which made the model more accurate and better suited to the influenza data. Third, the SSA-SARIMA-LSTM model can help rationalize the allocation of limited public health resources and support the early prevention and control of influenza.
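The idea of a grid search that keeps the parameter combination with the minimum AIC can be sketched as follows. To keep the example self-contained, it fits a seasonal autoregression by ordinary least squares as a lightweight stand-in for SARIMA; it is not the study's code, and the function names, the synthetic series, the seasonal period of 12 and the candidate orders are assumptions. The AIC is the Gaussian form n ln(RSS/n) + 2k, with constants that do not depend on the model omitted.

```python
import itertools

import numpy as np


def seasonal_ar_aic(x, p, P, s, burn):
    """AIC of an AR model with lags 1..p and seasonal lags s, 2s, ..., P*s."""
    y = x[burn:]
    lags = list(range(1, p + 1)) + [s * m for m in range(1, P + 1)]
    columns = [np.ones_like(y)] + [x[burn - lag:x.size - lag] for lag in lags]
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    n = y.size
    k = len(lags) + 2  # intercept, lag coefficients and noise variance
    return n * np.log(resid @ resid / n) + 2 * k


def grid_search_aic(series, p_values, P_values, s):
    """Return the (p, P) pair with the smallest AIC and the full AIC table."""
    x = np.asarray(series, dtype=float)
    # Drop the same initial observations for every candidate so that all
    # AIC values are computed on identical targets and are comparable.
    burn = max(max(p_values), max(P_values) * s)
    table = {
        (p, P): seasonal_ar_aic(x, p, P, s, burn)
        for p, P in itertools.product(p_values, P_values)
    }
    best = min(table, key=table.get)
    return best, table


# Synthetic series with a strong dependence on the value one season earlier.
rng = np.random.default_rng(0)
season = 12
noise = rng.normal(size=600)
series = np.zeros(600)
for i in range(600):
    previous = series[i - season] if i >= season else 0.0
    series[i] = 0.8 * previous + noise[i]

best_order, aic_table = grid_search_aic(series, range(0, 4), range(0, 2), season)
print("selected (p, P):", best_order)
```

With an actual SARIMA implementation the loop has the same shape: each candidate (p, d, q)(P, D, Q) combination is fitted, its AIC recorded, and the combination with the smallest value retained.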
However, this study also has some limitations. First, the patterns and incidence of influenza vary from region to region [2], so whether the SSA-SARIMA-LSTM model is suitable for other areas needs further study. Second, this study established only two combination models, and the superiority of the SSA-SARIMA-LSTM model over other models remains to be verified. In the future, we will incorporate the influencing factors of influenza into the model, compare the SSA-SARIMA-LSTM model with other models, and explore different signal decomposition methods and additional neural networks to improve the accuracy of influenza prediction. We will also use this model to study predictive performance for influenza in different regions.
Conclusions
In this study, the time series of influenza in Shanxi Province from 2010 to 2019 showed obvious seasonal characteristics and a decreasing trend. The fitting and prediction performances of the SSA-SARIMA-LSTM model were better than those of the SARIMA-LSTM and SARIMA models, and the SARIMA-LSTM model performed better than the SARIMA model. The SSA-SARIMA-LSTM model is therefore more suitable than the SARIMA-LSTM and SARIMA models for predicting the incidence of influenza.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
Labella AM, et al. Influenza. J Med Clin North Am. 2013;97:621–45.
Keilman LJ, et al. Seasonal Influenza (Flu). J Nurs Clin North Am. 2019;54:227–43.
Saunders-Hastings PR, et al. Reviewing the history of pandemic influenza: understanding patterns of emergence and transmission. J Pathogens. 2016;5:66.
Zheng Y, et al. Study on the relationship between the incidence of influenza and climate indicators and the prediction of influenza incidence. J Environ Sci Pollut Res Int. 2021;28:473–81.
Yang X, et al. The analysis of GM (1, 1) grey model to predict the incidence trend of typhoid and paratyphoid fevers in Wuhan City, China. J Med. 2018;97:e11787.
Mahajan S, et al. Short-Term PM2.5 forecasting using exponential smoothing method: a comparative analysis. J Sensors Basel. 2018;18:3223.
Brian K, et al. Time series analysis using autoregressive integrated moving average (ARIMA) models. J Acad Emerg Med. 1998;5:1553–2712.
Mbah TJ, et al. Using LSTM and ARIMA to Simulate and Predict Limestone Price Variations. J Min Metall Explor. 2021;38:913–26.
Song Z, et al. Spatio-Temporal Analysis of Influenza-Like Illness and Prediction of Incidence in High-Risk Regions in the United States from 2011 to 2020. J Int J Environ Res Public Health. 2021;18:7120.
Zhai M, et al. Research on the predictive effect of a combined model of ARIMA and neural networks on human brucellosis in Shanxi Province, China: a time series predictive analysis. J BMC Infect Dis. 2021;21:280.
Zhou J, et al. Establishment of a SVM classifier to predict recurrence of ovarian cancer. J Mol Med Rep. 2018;18:3589–98.
Menon R, et al. Multivariate adaptive regression splines analysis to predict biomarkers of spontaneous preterm birth. J Acta Obstet Gynecol Scand. 2014;93:382–91.
Yang L, et al. Study of cardiovascular disease prediction model based on random forest in eastern China. J Sci Rep. 2020;10:5245.
Dai X, et al. A recurrent neural network using historical data to predict time series indoor PM2.5 concentrations for residential buildings. J Indoor Air. 2021;31:1228–37.
Hongbin D, et al. PM2.5 volatility prediction by XGBoost-MLP based on GARCH models. J Cleaner Product. 2022;356:131898.
Yu Y, et al. A review of recurrent neural networks: LSTM cells and network architectures. J Neural Comput. 2019;31:1235–70.
Wei L, et al. A Comparative Study on Trend Forecasting Approach for Stock Price Time Series. In: Proceedings of 2017 11th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID). Institute of Electrical and Electronics Engineers. p. 84–88.
Zhu H, et al. Study on the influence of meteorological factors on influenza in different regions and predictions based on an LSTM algorithm. J BMC Public Health. 2022;22:2335.
Hongbin D, et al. Prediction of Air Pollutant Concentration Based on One-Dimensional Multi-Scale CNN-LSTM Considering Spatial-Temporal Characteristics: A Case Study of Xi'an, China. J Atmosphere. 2021;12:1626.
Peter G, et al. Time series forecasting using a hybrid ARIMA and neural network model. J Neurocomputing. 2003;50:159–75.
Oliveira C, et al. A hybrid evolutionary decomposition system for time series forecasting. J Neurocomputing. 2016;180:27–34.
Hibon M, et al. To combine or not to combine: selecting among forecasts and their combinations. J Int J Forecast. 2005;21:15–24.
Hossein H, et al. Singular spectrum analysis: methodology and comparison. J Data Sci. 2007;5:396.
Nader A, et al. Forecasting the COVID-19 Pandemic in Saudi Arabia using a modified singular spectrum analysis approach: model development and data analysis. J JMIRx med. 2021;2:21044.
Jihong X, et al. A new approach for stock price analysis and prediction Based on SSA and SVM. Int J Inf Technol Decis Making. 2019;18:22.
Zhang Y, et al. A novel combination forecasting model for wind power integrating least square support vector machine, deep belief network, singular spectrum analysis and locality-sensitive hashing. J Energy. 2019;168:558–72.
Ministry of Health of the People's Republic of China. WS 285-2008 diagnostic criteria for influenza. Beijing: People's Health Publishing House; 2008.
Zhang J, et al. A comparative study on predicting influenza outbreaks. J Biosci Trends. 2017;11:533–41.
Sanchez-Vazquez MJ, et al. Using seasonal-trend decomposition based on loess (STL) to explore temporal patterns of pneumonic lesions in finishing pigs slaughtered in England, 2005–2011. J Prev Vet Med. 2012;104:65–73.
Wu W, et al. Time series analysis of human brucellosis in mainland China by using Elman and Jordan recurrent neural networks. J BMC Infect Dis. 2019;19:414.
Song W, et al. A Time Series Data Filling Method Based on LSTM-Taking the Stem Moisture as an Example. J Sensors. 2020;20:5045.
Sanei S, et al. A new adaptive line enhancer based on singular spectrum analysis. J IEEE Trans Biomed Eng. 2012;59:428–34.
Horm SV, et al. Epidemiological and virological characteristics of influenza viruses circulating in Cambodia from 2009 to 2011. J PLoS One. 2014;9:e110713.
Zhang Y, et al. The complex associations of climate variability with seasonal influenza A and B virus transmission in subtropical Shanghai, China. J Sci Total Environ. 2020;701:134607.
Kalantari M, et al. Forecasting COVID-19 pandemic using optimal singular spectrum analysis. J Chaos Solitons Fractals. 2021;142:110547.
Acknowledgements
Not applicable.
Funding
This study was supported by the National Natural Science Foundation of China Project (Grant Number: 81973155). The funders had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Contributions
ZYZ, MMZ analyzed and interpreted the data, and are major contributors in writing the manuscript; GHL, XFG collected data; WZS, XCW, HR were responsible for preprocessing the data; YC, YCQ, JHR were responsible for checking the results; LXQ, LMC gave constructive suggestions for the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
This study did not involve any human trials. The use of influenza data was approved by the Ethics Committee at Shanxi Center for Disease Control and Prevention, China. Informed consent was deemed unnecessary by the Ethics Review Board of the Shanxi Provincial Center for Disease Control and Prevention, because the data did not contain personal or health information that could be linked back to the original identifiers. The data used in this study were anonymized before use.
Consent to publish
Not applicable.
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhao, Z., Zhai, M., Li, G. et al. Study on the prediction effect of a combined model of SARIMA and LSTM based on SSA for influenza in Shanxi Province, China. BMC Infect Dis 23, 71 (2023). https://doi.org/10.1186/s12879-023-08025-1
DOI: https://doi.org/10.1186/s12879-023-08025-1