A hybrid model for tuberculosis forecasting based on empirical mode decomposition in China

Zhao, Ruiqing; Liu, Jing; Zhao, Zhiyang; Zhai, Mengmeng; Ren, Hao; Wang, Xuchun; Li, Yiting; Cui, Yu; Qiao, Yuchao; Ren, Jiahui; Chen, Limin; Qiu, Lixia

doi:10.1186/s12879-023-08609-x

Research
Open access
Published: 07 October 2023

A hybrid model for tuberculosis forecasting based on empirical mode decomposition in China

Ruiqing Zhao¹^na1,
Jing Liu¹^na1,
Zhiyang Zhao²,
Mengmeng Zhai¹,
Hao Ren¹,
Xuchun Wang¹,
Yiting Li¹,
Yu Cui¹,
Yuchao Qiao¹,
Jiahui Ren¹,
Limin Chen³ &
…
Lixia Qiu¹

BMC Infectious Diseases volume 23, Article number: 665 (2023) Cite this article

808 Accesses
1 Citations
Metrics details

Abstract

Background

Pulmonary Tuberculosis is a major public health problem endangering people's health, a scientifically accurate predictive model is of great practical significance for the prevention and treatment of pulmonary tuberculosis.

Methods

The reported incidence data of pulmonary tuberculosis were from the National Public Health Science Data Center (https://www.phsciencedata.cn/). The ARIMA, LSTM, EMD-SARIMA, EMD-LSTM, EMD-ARMA-LSTM models were established using the reported monthly incidence of tuberculosis reported in China from January 2008 to December 2018. The MSE, MAE, RMSE and MAPE were used to evaluate the performance of the models to determine the best model.

Results

Comparing decomposition-based single model with undecomposed single model, it was found that: when predicting the incidence trend in the next year, compared with SARIMA model, the MSE, MAE, RMSE and MAPE of EMD-SARIMA decreased by 39.3%, 19.0%, 22.1% and 19.8%, respectively. The MSE, MAE, RMSE and MAPE of EMD-LSTM were reduced by 40.5%, 12.8%, 22.9% and 12.7%, respectively, compared with the LSTM model; Comparing the decomposition-based hybrid model with the decomposition-based single model, it was found that: when predicting the incidence trend in the next year, compared with EMD-SARIMA model, the MSE, MAE, RMSE and MAPE of EMD-ARMA-LSTM model decreased by 21.7%, 10.6%, 11.5% and 11.2%, respectively. The MSE, MAE, RMSE and MAPE of EMD-ARMA-LSTM were reduced by 16.7%, 9.6%, 8.7% and 12.3%, respectively, compared with EMD-LSTM model. Furthermore, the performance of the model were consistent when predicting the incidence trend in the next 3 months, 6 months and 9 months.

Conclusion

The prediction performance of the decomposition-based single model is better than that of the undecomposed single model, and the prediction performance of the combined model using the advantages of different models is better than that of the decomposition-based single model, so the EMD-ARMA-LSTM combination model can improve the prediction accuracy better than other models, which can provide a theoretical basis for predicting the epidemic trend of pulmonary tuberculosis and formulating prevention and control policies.

Peer Review reports

Introduction

Tuberculosis (TB) is a contagious disease caused by infection with the bacterium Mycobacterium tuberculosis [1], which is spread when people who are sick with TB expel bacteria into the air (e.g. by coughing). TB typically affects the lungs (pulmonary TB) but can also affect other sites (extrapulmonary TB). China is one of the 30 countries with a high burden of TB in the World. The World Health Organization (WHO) estimated in its Global Tuberculosis Report 2021 that the number of new TB cases in China in 2020 would be 842,000 (833,000 in 2019). The TB incidence rate was 59 cases per 100,000 population per year (58/100,000 in 2019), accounting for 8.5% (8.4% in 2019) of the global total and the second highest after India [2]. In our country, pulmonary tuberculosis belongs to class B legal reportable infectious diseases, and its reported incidence number always ranks in the forefront of class A and B notifiable infectious diseases nationwide. Actually, in recent years, Since the extensive use of anti-tuberculosis drugs such as isoniazid and rifampin and the government-led mobilization of the whole society, pulmonary tuberculosis has been effectively controlled to a certain extent, and the incidence of tuberculosis has also shown a trend of decreasing year by year. Nevertheless the spread of Drug-Resistant Tuberculosis bacilli makes the situation of drug-resistant tuberculosis (DR-TB) not optimistic [3], which further aggravates the public health threat to tuberculosis control. With more and more challenges in the prevention and treatment of pulmonary tuberculosis, the prediction of its incidence has become a hot topic. It is of great practical significance to explore the trend and regularity of pulmonary tuberculosis and establish a scientifically accurate predictive model for the prevention and treatment of pulmonary tuberculosis.

The data of infectious diseases changing over time are random, but generally show an upward or downward trend, which makes it possible to predict the incidence trend, but it is still difficult to make accurate prediction [4]. Many scholars have predicted the trend of infectious diseases based on historical data, The relatively perfect and accurate algorithms for the analysis and prediction of time series data mainly include Autoregressive Integrated Moving Average (ARIMA) model based on traditional statistical method and Long-Short term Memory neural network (LSTM) model based on neural network method. Both ARIMA model and LSTM model can well predict the future data according to the laws extracted from the original data. However, The ARIMA model can only extract the linear information in the data, but the valuable nonlinear information in the data is not processed, while the LSTM model can extract the nonlinear components in the data. This suggests that, ARIMA model is suitable for relatively stable series, while LSTM model is suitable for relatively unstable series [5].

However, in practical application, a single prediction model or method has different emphasis on extracting time series data information, so its prediction accuracy is still insufficient when dealing with complex and dynamic time series. Compared with a single prediction model, combined prediction model can effectively reduce system risks while ensuring better prediction performance, thus becoming the mainstream trend in time series prediction research [6].

In recent years, the combinatorial model constructed based on the idea of decomposition and integration decomposes the original sequence to reduce the sequence complexity and obtain sequences with simpler structure, more stable changes and stronger regularity. The accuracy of time series prediction is improved by modeling the decomposed sequence. Empirical Mode Decomposition (EMD) is an adaptive decomposition method for nonlinear and non-stationary signals proposed by Huang and his co-authors in 1998 [7]. It decomposes the original time series into multiple Intrinsic mode functions (IMF) in different time scales and a residual signal (RES). Not only can it be directly applied to nonlinear and non-stationary time series, but also can reveal the changes of different time scales contained in time series [8]. In 2022, An [9] used the Back-Propagation Neural Network (BPNN) model based on EMD to predict incidence of Acquired Immune Deficiency Syndrome (AIDS). First, EMD method was used to decompose the original sequence into four relatively stable IMFs and a residual signal, and then all the decomposition results were respectively established BPNN models and summed to obtain the EMD-BPNN predicted value. Compared with the prediction results of single BPNN and ARIMA, it was found that the prediction effect of EMD-BPNN hybrid model is superior to the above models, that was, the hybrid model improved the prediction accuracy. In 2021, Wang [10] proposed a short-term generation combination forecasting model based on EMD-LSTM-ARMA. Firstly, the normalized IMF1 and IMF2 are input into the designed LSTM network to model and predict, then the IMF3 is modeled and predicted by the ARMA model, and then a low-frequency component is reconstructed from IMF4, IMF5 and residual components. The empirical results show that EMD-LSTM-ARMA combined forecasting model can produce higher forecasting accuracy than single model.

In order to further improve the prediction performance, this paper used EMD to break down the original sequence into several subsequences, and chose the sequences meeting the stationarity requirements to build an ARMA model and the unstable sequences to build an LSTM model after judging the stationarity of the decomposed sequences. On this basis, the EMD-ARMA-LSTM hybrid prediction model was constructed to provide a theoretical foundation for forecasting the epidemic trend of tuberculosis and developing prevention and control programs.

Material and methods

Data sources

The reported incidence data of pulmonary tuberculosis used in this study were from the National Public Health Science Data Center (https://www.phsciencedata.cn/), and a total of 132 months of reported incidence data of tuberculosis (per 100 000) from 2008 to 2018 were downloaded, the reported incidence rates of tuberculosis in China from January 2008 to December 2017 were used as the training set to predict the reported incidence of pulmonary tuberculosis in the next 3 months, 6 months, 9 months and 12 months.

Empirical modal decomposition (EMD)

EMD is a new signal decomposition method. Compared with traditional signal decomposition methods, it completely gets rid of the restriction of basis function and can decompose any signal (time series) in theory [11]. The core idea of this algorithm is to decompose complex original data into a finite number of IMFs with different scales, stationarity and periodic volatility characteristics and a residual signal representing the overall trend of the original signal. Therefore, it has good adaptability to nonlinear and non-stationary sequences. The IMF should meet the following two conditions: (1) The number of extremes does not differ from the number of zeros by more than 1. (2) At any point in an envelope represented by a local maximum and an envelope represented by a local minimum, the average of both is zero. The decomposition steps are as follows [12]:

(1) All maximum points and minimum points on the original tuberculosis sequence are calculated;

(2) By cubic spline interpolation method, the local maximum and local minimum points are constructed into the upper and lower envelope(${e}_{t\left(\mathrm{min}\right)}$、${e}_{t\left(\mathrm{max}\right)}$), and then the average value of the two envelope lines is calculated:

$${m}_{t}\text{=}\left({e}_{t\left(\mathrm{min}\right)}+{e}_{t\left(\mathrm{max}\right)}\right)/2$$

(1)

Subtract ${m}_{t}$ from the original signal:

$${d}_{t}\text{=}{X}_{t}-{m}_{t}$$

(2)

Determine whether ${d}_{t}$ meets the conditions of IMF, if so, ${d}_{t}$ at this time is the first IMF component obtained by decomposition, denoted as $IM{F}_{t}^{1}={d}_{t}$; If it is not satisfied, we need to take ${d}_{t}$ as the new original signal ${X}_{t}$ and repeat steps (1) and (2) until it is satisfied.

(3) The original sequence and the newly obtained intrinsic mode function component are calculated to obtain the residual components after the first decomposition.

$${r}_{t}={X}_{t}-IM{F}_{t}^{1}$$

(3)

Repeat step (1) until the loop stops. The original sequence at this point can be expressed as:

$${X}_{t}=\sum_{i=1}^{N}{IMF}_{t}^{i}+{r}_{n}(t)$$

(4)

where ${IMF}_{t}^{i}$ is the ith intrinsic mode function component, and ${r}_{n}(t)$ represents the nth residual sequence.

Seasonal Autoregressive Integrated Moving Average (SARIMA)

ARIMA is a time series forecasting method proposed by Box, an American statistician, and Jenkins, a British statistician. The basic idea is to regard the data formed by the predicted object as a random sequence, use the corresponding mathematical model to describe the autocorrelation in the sequence and predict future values from potential relationships between past and present values of a sequence [13]. It has two forms: non-seasonal ARIMA model and seasonal ARIMA model, and its expressions are ARIMA (p, d, q) and ARIMA (p, d, q) (P, D, Q) s. Where, p and P represent the autoregressive order and seasonal autoregressive order respectively. d and D are difference order and seasonal difference order respectively. q and Q are the moving average order and seasonal moving average order, and s is the cycle length [14]. The specific modeling process of ARIMA model is as follows: (1) Stationarity test: Autocorrelation Function (ACF) plot, and Augmented Dickey-fuller (ADF) test were used to comprehensively judge whether the time series data was stable. If it was non-stable, d or D-order difference processing was required. (2) Ljung-Box test: Ljung-Box test was performed on the sequence, if the p value was less than the signifcance level, the sequence had no randomness. Modeling can be continued if the sequence was non-random. (3) Model identification and order determination: Python Grid Search was used to automatically fit the SARIMA model. (4) Model selection: according to the minimum Akaike information criterion (AIC), the optimal model was selected. (5) Model test: the success of model fitting was judged by the residual white noise test. If the residual sequence was random, the model is fitted successfully. (6) Prediction: use the constructed model to make predictions.

Long-short term memory (LSTM) model

LSTM network is a kind of Recurrent Neural Networks (RNN) with special network structure proposed by Sepp Hochreiter and Jurgen Schmidhuber in 1997 for the gradient dispersion problem of RNN model [15]. It is characterized by introducing memory unit in each neuron of the hidden layer and solving the contradiction problem of input and output weights through input and output gates, so that it can make more effective use of long-distance time series data. Thus, the long-term dependence problem in traditional RNN model training was overcome [16]. LSTM unit structure is also known as memory unit (A). Its structure was shown in Fig. 1, including three gated structures [17], namely "forget Gates(${f}_{t}$)", " input Gates(${i}_{t}$)" and "output Gates(${o}_{t}$)", these three gating structures can selectively control passage of information [18], and also include the cell state ${C}_{t}$ representing long-term memory, and the candidate state ${m}_{t}$ waiting to be deposited in long-term memory [19]. The calculation formula of each calculation gate is as follows [20]:

$${f}_{t}=\sigma \left({W}_{fh}{h}_{t-1}+{W}_{fx}{x}_{t}+{b}_{f}\right)$$

(5)

$${i}_{t}=\sigma \left({W}_{ih}{h}_{t-1}+{W}_{ix}{x}_{t}+{b}_{i}\right)$$

(6)

$${o}_{t}=\sigma \left({W}_{oh}{h}_{t-1}+{W}_{ox}{x}_{t}+{b}_{o}\right)$$

(7)

$${m}_{t}=\mathrm{tanh}\left({W}_{mh}{h}_{t-1}+{W}_{mx}{x}_{t}+{b}_{m}\right)$$

(8)

$${C}_{t}={f}_{t}{C}_{t-1}+{i}_{t}{\mathrm{m}}_{t}$$

(9)

$${h}_{t}={o}_{t}\mathrm{tanh}\left({C}_{t}\right)$$

(10)

In the above equation, $W$ is the weight matrix connecting the two layers, $\upsigma$ is the sigmoid activation function, $b$ is the corresponding offset item and the tanh function represents the feed-forward network layer of the hyperbolic tangent function. ${h}_{t-1}$ represents the output at time $t-1$, and ${X}_{t}$ represents the input at time $t$.

The construction process of LSTM model is as follows [14]:

(1)
Data preprocessing, including normalization and reconstruction of data;
(2)
The original sequence is transformed into three-dimensional data;
(3)
The original sequence was divided into training set and test set;
(4)
Adjust model parameters.

EMD-SARIMA combined model

The pulmonary tuberculosis sequence was decomposed by EMD method to obtain a group of IMFs and a residual signal. The decomposed IMF components contain partial characteristic signals of different time scales of the original signals, and the EMD method completely throw away the constraint of the basis function, and has good compatibility for various signals. EMD-SARIMA model is a combination model based on the idea of "decomposition before integration" [21]. The specific steps are as follows:

(1)
The original pulmonary tuberculosis signal was decomposed into multiple IMFs and a residual signal using the EMD method;
(2)
Each IMF component and residual signal were predicted by the corresponding SARIMA model;
(3)
According to the completeness of EMD and the orthogonality of IMF, the predicted values of the above parts are summed and reconstructed [22] to get the final results.

EMD-LSTM combined model

Due to the characteristics of its internal structure, LSTM model can realize long-term learning of dependent information [11]. The specific steps of EMD-LSTM are as follows:

(1)
The original pulmonary tuberculosis signal was decomposed by EMD method to obtain finite IMFs and a residual signal representing the overall trend of the original signal;
(2)
Each IMF component and residual signal were predicted by the corresponding LSTM;
(3)
The predicted value of the original signal was obtained by superposition of the predicted value of each decomposition sequence.

EMD-ARMA-LSTM Combined Model

The results of single prediction model based on decomposition show that the subsequences obtained by EMD method all have stationary and non-stationary sequences. In view of the shortage of direct modeling without feature analysis after the decomposition of the above model, the stationarity of the decomposed sequences was judged, and, and then chose the sequences meeting the stationarity requirements to build an ARMA model and the unstable sequences to build an LSTM model. On this basis, the EMD-ARMA-LSTM hybrid prediction model was constructed. The modeling process was shown in Fig. 2:

Model effect evaluation

The model performance evaluation of continuous data mainly depends on the difference between the predicted value and the real value. The smaller the value is, the better the model prediction effect will be. Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) were used to compare the predictive performance of each model.

$$MAE=\frac{1}{N}\sum_{k=1}^{N}\left|{X}_{k}-{\widehat{X}}_{k}\right|$$

(11)

$$MSE=\frac{1}{N}\sum_{k=1}^{N}{\left({X}_{k}-{\widehat{X}}_{k}\right)}^{2}$$

(12)

$$RMSE=\sqrt{\frac{1}{N}\sum_{k=1}^{N}{\left({X}_{k}-{\widehat{X}}_{k}\right)}^{2}}$$

(13)

$$MAPE=\frac{100\mathrm{\%}}{n}\sum_{i=1}^{n}\left|\frac{{\widehat{X}}_{k}-{X}_{k}}{{X}_{k}}\right|$$

(14)

where, ${X}_{k}$ represents the real value at the moment $k$, ${\widehat{X}}_{k}$ represents the predicted value of each model, and $N$ represents the sample size during the test.

Statistical analysis

Excel software version 2021 was used for data collection and collation, Anaconda software version 4.10.3 was used to establish the SARIMA model and the LSTM model. MATLAB software version 2022 was used for EMD.

Results

Time distribution of pulmonary tuberculosis in China

The time series of the reported incidence of pulmonary tuberculosis in China from January 2008 to December 2018 was shown in Fig. 3. It can be seen that the reported incidence of pulmonary tuberculosis in China presents a decreasing trend year by year and has obvious seasonality, with two apparent epidemic peaks in January and March of each year, which were close to the results of previous research [23].

EMD

The original sequence of tuberculosis was decomposed by EMD, and three IMFs and a residual signal were obtained, among which IMF1 had the highest frequency and represented the high frequency component of tuberculosis signal, residual signal is the lowest frequency signal and represents the trend in the pulmonary tuberculosis signal. The original signal of pulmonary tuberculosis and each decomposition sequence after EMD decomposition were shown in Fig. 4.

SARIMA Model

The SARIMA model was constructed for the original sequence, three IMFs and a residual signal respectively. The original sequence, IMF1, IMF3 and residual signal were non-stationary sequences, so difference processing was carried out to make them stationary, all the sequences became stationary after d or D-order difference. IMF2 was a stationary sequence. Taking the original sequence as an example, the ACF/PACF figure before and after difference is shown in Fig. 5. All the adjusted sequences and IMF2 were non-white noise (Table 1). The d and D of the original sequence, IMF3 and residual signal were determined by the number of differences in the sequence. According to previous literature experience, the values of p, q, P and Q ranged from 0 to 2, and the SARIMA model was automatically fitted by Python grid search [24]. According to AIC minimum principle, the optimal models were determined as follows: Original sequence: SARIMA (2,1,0) (1,1,2) ₁₂; IMF1: SARIMA (2,0,2) (0,1,2) ₁₂; IMF2: ARMA (2,2); IMF3: SARIMA (2,1,1) (0,0,1) ₁₂; Residual signal: ARIMA (2,2,1). Ljung-Box test was performed on the residual sequence of the models, the results showed that the p value was more than the significance level and the sequences were random (Table 2). The above models were used to predict the corresponding IMF and residual signal, and the predicted values were integrated in the form of direct summation to obtain the final forecast results of the EMD-SARIMA model.

Table 1 Unit root test and white noise test results after difference of each sequence

Full size table

Table 2 Residual white noise test results of each prediction model

Full size table

LSTM Model

Since appropriate model parameters have a great impact on the prediction performance, the last 12 data of the training set were used as verification sets to adjust the model parameters. Due to the obvious periodicity of the original sequence of pulmonary tuberculosis, the window length of the LSTM network was set to 12, that was, the number of nodes in the input layer was set to 12. The following one-month data was used as the output for prediction, and the number of nodes in the output layer was set to 1. Since the number of hidden layer nodes has a great impact on the model accuracy, the empirical formula $M=\sqrt{m+n}+a (a=1\sim 10)$ [25] was used to determine the range of node number M. In this paper, the number of hidden layer nodes was first determined under the condition that the number of hidden layers was fixed as 1. The M calculated by the experimental formula was 5–13. When the number of hidden layer nodes was set to 13, the LSTM had the smallest error value (Table 3).

Table 3 Influence of the number of hidden layer nodes on the fitting performance of the model

Full size table

When the number of hidden layer nodes was fixed at 13, the experiment was carried out with the number of hidden layer layers 1–4, when the number of hidden layers was set to 1, the error value of LSTM was the lowest (Table 4).

Table 4 Influence of the number of hidden layers on the fitting performance of the model

Full size table

To sum up, this paper finally set the window length as 12, the number of hidden layers as 1, and the number of hidden layer nodes as 13, and fixed the number of seeds of the model, selected 1000 iterations, set batch size as 32, and used Adam optimizer to predict the model.

Comparative analysis of models

In this paper, the pulmonary tuberculosis sequence was transformed into a series of relatively stable subsequences by EMD decomposition. Then, the decomposition-based single model and the decomposition-based hybrid model were established respectively to predict the incidence trend in the next year, and compared with the prediction results of the undecomposed single model, as shown in Fig. 6. Additionally, we built corresponding prediction models using the next 3 months, 6 months, and 9 months as the test period in order to assess the robustness of the results.

(1)
Comparing decomposition-based single model with undecomposed single model, it was found that: when predicting the incidence trend in the next year, compared with SARIMA model, the MSE, MAE, RMSE and MAPE of EMD-SARIMA decreased by 39.3%, 19.0%, 22.1% and 19.8%, respectively. The MSE, MAE, RMSE and MAPE of EMD-LSTM were reduced by 40.5%, 12.8%, 22.9% and 12.7%, respectively, compared with the LSTM model. Furthermore, the performance of the model were consistent when predicting the incidence trend in the next 3 months, 6 months and 9 months (Table 5).
(2)
Comparing the decomposition-based hybrid model with the decomposition-based single model, it was found that: when predicting the incidence trend in the next year, compared with EMD-SARIMA model, the MSE, MAE, RMSE and MAPE of EMD-ARMA-LSTM model decreased by 21.7%, 10.6%, 11.5% and 11.2%, respectively. The MSE, MAE, RMSE and MAPE of EMD-ARMA-LSTM were reduced by 16.7%, 9.6%, 8.7% and 12.3%, respectively, compared with EMD-LSTM model. Furthermore, the performance of the model were consistent when predicting the incidence trend in the next 3 months, 6 months and 9 months (Table 6).

Table 5 Comparison of the decomposition-based single model and undecomposed single model

Full size table

Table 6 Comparison of the decomposition-based hybrid model and the decomposition-based single model

Full size table

Discussion

In the past two decades, great progress had been made in the prevention and control of pulmonary tuberculosis, but pulmonary tuberculosis is still a major public health problem endangering people's health [26], and "precise prevention" is the key to the current prevention and control of tuberculosis [27]. Therefore, the timely understanding of tuberculosis epidemic trend and the establishment of accurate tuberculosis prediction model can provide scientific basis for the formulation of disease prevention and control policies.

In recent years, most predictions of infectious diseases are based on the original time series model. Studies have shown that, compared with a single prediction model, the combination model built based on the decomposition and integration idea can reduce the complexity of the sequence and effectively improve the prediction performance of the model by decomposing the original sequence. Comparing the decomposition-based single model with the undecomposed single model, it was found that the prediction performance of the decomposition-based single model was better than that of the undecomposed single model, which was mainly due to the decomposition of the initial sequence, so as to obtain relatively simple, stable and regular subsequence, which reduced the difficulty of model and improved the accuracy of prediction. Secondly, in view of the limitation of using only a single model for prediction in the analysis process, this paper attempted to use SARIMA model for stationary series and LSTM model for non-stationary series to establish a decomposition-based hybrid model. Compared with the decomposition-based corresponding single model, it was found that constructing the combined model can improve prediction performance of the model, indicating that selecting the appropriate model according to the subsequence characteristics was beneficial to improve the performance of the model. In conclusion, compared with other models, the combined model of EMD-ARMA-LSTM adopted in this study can improve the prediction accuracy more effectively, make a more accurate and reasonable prediction of pulmonary tuberculosis, and provide a theoretical basis for the prediction of tuberculosis epidemic trend and the formulation of prevention and control policies.

The innovation of this paper is to build an EMD-ARMA-LSTM hybrid model based on the idea of decomposition before integration, and use the existing incidence data of tuberculosis to predict the incidence trend of this infectious disease in the next year, and achieve good results. In addition, we ensured the robustness of the model by changes in results over the next 3 months, 6 months, and 9 months as test period. The model is not only suitable for predicting an infectious disease such as tuberculosis, but can also be extended to other datasets such as hand, foot and mouth disease and influenza. Therefore, this study has certain applicability in the field of epidemiology, which can not only improve people's attention to infectious diseases through the prediction of future incidence, but also help relevant departments to formulate relevant prevention and control policies.

Conclusion

(1)
The reported incidence of tuberculosis in China from 2008 to 2018 showed a decreasing trend year by year, with obvious seasonality, with two obvious epidemic peaks in January and March each year.
(2)
The prediction performance of EMD-SARIMA model was better than that of SARIMA model, and that of EMD-LSTM model was better than that of LSTM model. This suggested that the prediction performance of single model based on EMD decomposition was better than that of undecomposed single model.
(3)
The predictive performance of EMD-ARMA-LSTM model was better than that of EMD-SARIMA model and EMD-LSTM model. This suggested that the prediction performance of the combined model using the advantages of different models was better than that of decomposition-based single model.

Availability of data and materials

The datasets generated and analysed during the current study are available in the [the National Public Health Science Data Center] repository, [https://www.phsciencedata.cn/].

References

Li YJL, Liu J, Zhang Zekun. Analysis on the epidemic trend of tuberculosis in Anhui province from 2005 to 2018 [J]. J Tropical Diseases Parasitology. 2019;17(01):5–9.
Google Scholar
ORGANIZATION W H. Global tuberculosis report 2021: supplementary material [J]. 2022.
Zhang L. New thinking in pulmonary tuberculosis diagnosis and drug-resistant tuberculosis treatment [J]. Clinical Focus. 2014;29(06):601–4.
Google Scholar
FENG H. Research on Time Series Prediction Method Based on Hybrid Model [D]; Tianjin University of Technology, 2019.
Jianfang G. Application comparison of ARIMA and LSTM algorithms [J]. Digital Technol Applic. 2022;40(01):58–60.
Google Scholar
Hibon M, Evgeniou T. To combine or not to combine: selecting among forecasts and their combinations [J]. Int J Forecast. 2005;21(1):15–24.
Article Google Scholar
NE. H, SR. L, MLC. W, et al. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis [J]. Proceedings of the Royal Society Mathematical, physical and engineering sciences, 1998, (1971): 454.
Yuhong MA. QIANG YARONG, MEI Y. A time series prediction method based on emirical mode decomposition [J]. 2020;56(01):27–34.
Google Scholar
An Q, Wu J, Meng J, et al. Using the hybrid EMD-BPNN model to predict the incidence of HIV in Dalian, Liaoning Province, China, 2004–2018 [J]. BMC Infect Dis. 2022;22(1):102.
Article PubMed PubMed Central Google Scholar
Zhongping W, Lili H, Gengqian D. Short-term power generation combination prediction based on EMD-LSTM-ARMA model [J]. Modern Electronics Technique. 2023;46(03):151–5.
Google Scholar
Xia J, Zheng W, Wang Z, et al. Short-term Prediction of Ship Motion Attitude Based on EMD-LSTM [J]. Computer & Digital Engineering. 2022;50(07):1434–8.
Google Scholar
Liu X. Review on the Development and Application of EMD Algorithm [J]. Changjiang Information & Communications. 2022;35(01):61–4.
Google Scholar
Lou HR, Wang X, Gao Y, et al. Comparison of ARIMA model, DNN model and LSTM model in predicting disease burden of occupational pneumoconiosis in Tianjin, China [J]. BMC Public Health. 2022;22(1):2167.
Article PubMed PubMed Central Google Scholar
MENGMENG Z. Study on the prediction effect of LSTM/BILSTM-ARMA model based on signal decomposition for influenza in Shanxi Province [D]; Shanxi Medical University, 2021.
Hochreiter S, Schmidhuber J. Long Short-Term Memory [J]. Neural Comput. 1997;9(8):1735–80.
Article CAS PubMed Google Scholar
HAN TIANQI, BO S. Prediction of Incidence of Measles Based on LSTM Neural Network [J]. Computer & Telecommunication, 2018, (05): 54–7.
Zhai M, Wang X, Hao R. Study on LSTM Model based on Python in Prediction of Influenza [J]. Chinese Journal of Health Statistics. 2022;39(02):162–6+71.
Google Scholar
Yue Z, Ding Y, Zhao H, Wang Z. Mechanics-Guided optimization of an LSTM network for Real-Time modeling of Temperature-Induced deflection of a Cable-Stayed bridge. Eng Struct. 2022;252:113619.
Youwei CAO, Shuanghong YAN, Haitao LIU, et al. Short-term Wind Power Forecasting Method Based on Noise-reduction Time-series Deep Learning Network [J]. Proceedings of the CSU-EPSA. 2020;32(1):6.
Google Scholar
Munir HS, Ren S, Mustafa M, et al. Attention based GRU-LSTM for software defect prediction [J]. PLoS ONE. 2021;16(3): e0247444.
Article CAS PubMed PubMed Central Google Scholar
Liu YANG, Shuxian L. Application of time series model in predicting the incidence of pulmonary tuberculosis [J]. Applied Preventive Medicine. 2022;28(04):320–3.
Google Scholar
Xuning Z. Prediction of Railway Transportation Volume of Commodity Vehicles Based on EMD-SARIMA [J]. Logistics Technology. 2022;41(07):87–91.
Google Scholar
LIPING L. Statistical Analysis of the distribution Characteristics of the incidence of Tuberculosis in China [D]; Lanzhou University of Finance and Economics, 2019.
Chunyan CHEN, Yixiong CHEN, Zhiyang ZHAO, et al. Application of SARIMA model and l STM model in predicting incidence of hand-foot-mouth disease in B Sao’an district of Shenzhen city [J]. J Shanxi Medical University. 2022;53(10):1302–7.
Google Scholar
Zhao Z, Zhai M, Li G, et al. Study on the prediction effect of a combined model of SARIMA and LSTM based on SSA for influenza in Shanxi Province, China [J]. BMC Infectious Diseases. 2023;23(1):71.
Article PubMed PubMed Central Google Scholar
Wang L, Cheng S, Mingting C. The fifth National Epidemiological Survey on Tuberculosis in 2010 [J]. Chinese Journal of Antituberculosis. 2012;34(08):485–508.
CAS Google Scholar
Fu Z, Zhou Y, Cheng C. Application of Time Series Analysis and Machine Learning in Predicting the Trend of Pulmonary Tuberculosis [J]. Chinese Journal of Health Statistics. 2020;37(02):190–5.
Google Scholar

Download references

Acknowledgements

Not applicable.

Funding

This study was supported by the National Natural Science Foundation of China Project (grant numbers:81973155). The funders had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Ruiqing Zhao and Jing Liu contributed equally to this work.

Authors and Affiliations

Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
Ruiqing Zhao, Jing Liu, Mengmeng Zhai, Hao Ren, Xuchun Wang, Yiting Li, Yu Cui, Yuchao Qiao, Jiahui Ren & Lixia Qiu
School of Public Health, Sun Yat-Sen University, Guangzhou, China
Zhiyang Zhao
Shanxi Provincial Peoples Hospital, Taiyuan City, Shanxi Province, China
Limin Chen

Authors

Ruiqing Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Jing Liu
View author publications
You can also search for this author in PubMed Google Scholar
Zhiyang Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Mengmeng Zhai
View author publications
You can also search for this author in PubMed Google Scholar
Hao Ren
View author publications
You can also search for this author in PubMed Google Scholar
Xuchun Wang
View author publications
You can also search for this author in PubMed Google Scholar
Yiting Li
View author publications
You can also search for this author in PubMed Google Scholar
Yu Cui
View author publications
You can also search for this author in PubMed Google Scholar
Yuchao Qiao
View author publications
You can also search for this author in PubMed Google Scholar
Jiahui Ren
View author publications
You can also search for this author in PubMed Google Scholar
Limin Chen
View author publications
You can also search for this author in PubMed Google Scholar
Lixia Qiu
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

R.Q.Z., J.L. analyzed and interpreted the data, and are major contributors in writing the manuscript; Z.Y.Z., M.M.Z. were responsible for preprocessing the data; H.R., X.C.W., Y.T.L., Y.C., Y.C.Q., J.H.R. were responsible for checking the results; L.X.Q., L.M.C. gave constructive suggestions for the manuscript. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Limin Chen or Lixia Qiu.

Ethics declarations

Ethics approval and consent to participate

This study did not involve any human trials. The data in this study were from the National Public Health Science Data Center, the data did not contain personal and health information that could be connected back to the original identifiers. The data used in this study was anonymized before its use.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Zhao, R., Liu, J., Zhao, Z. et al. A hybrid model for tuberculosis forecasting based on empirical mode decomposition in China. BMC Infect Dis 23, 665 (2023). https://doi.org/10.1186/s12879-023-08609-x

Download citation

Received: 01 March 2023
Accepted: 13 September 2023
Published: 07 October 2023
DOI: https://doi.org/10.1186/s12879-023-08609-x

A hybrid model for tuberculosis forecasting based on empirical mode decomposition in China

Abstract

Background

Methods

Results

Conclusion

Introduction

Material and methods

Data sources

Empirical modal decomposition (EMD)

Seasonal Autoregressive Integrated Moving Average (SARIMA)

Long-short term memory (LSTM) model

EMD-SARIMA combined model

EMD-LSTM combined model

EMD-ARMA-LSTM Combined Model

Model effect evaluation

Statistical analysis

Results

Time distribution of pulmonary tuberculosis in China

EMD

SARIMA Model

LSTM Model

Comparative analysis of models

Discussion

Conclusion

Availability of data and materials

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Infectious Diseases

Contact us