- Research article
- Open Access
Statistical methods for predicting tuberculosis incidence based on data from Guangxi, China
BMC Infectious Diseases volume 20, Article number: 300 (2020)
Tuberculosis (TB) remains a serious public health problem with a substantial financial burden in China. The incidence of TB in Guangxi province is much higher than the national level; however, no predictive study of TB in Guangxi has been carried out in recent years. It is therefore urgent to construct a model to predict the incidence of TB, which could support the prevention and control of TB.
Box-Jenkins methods have been successfully applied to predict the incidence of infectious diseases. In this study, based on an analysis of TB incidence in Guangxi from January 2012 to June 2019, we constructed a TB prediction model using Box-Jenkins methods, and used the root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) to assess the fitting performance and prediction accuracy of the model.
From January 2012 to June 2019, a total of 587,344 TB cases and 879 TB deaths were reported in Guangxi. Based on TB incidence from January 2012 to December 2018, the SARIMA((2),0,(2))(0,1,0)12 model was established. The AIC and SC of this model were 2.87 and 2.98; the fitting accuracy indexes RMSE, MAE and MAPE were 0.98, 0.77 and 5.8, respectively; and the prediction accuracy indexes RMSE, MAE and MAPE were 0.62, 0.45 and 3.77, respectively. Based on the SARIMA((2),0,(2))(0,1,0)12 model, we predicted the TB incidence in Guangxi from July 2019 to December 2020.
This study fills the gap in the prediction of TB incidence in Guangxi in recent years. The established SARIMA((2),0,(2))(0,1,0)12 model has high prediction accuracy and good prediction performance. The results suggest that the trend of TB incidence predicted by the model from July 2019 to December 2020 is similar to that of the previous two years, and that TB incidence will decrease slightly. The predicted results can provide a scientific reference for the prevention and control of TB in Guangxi, China.
Tuberculosis (TB) is a chronic respiratory infectious disease caused by the pathogen Mycobacterium tuberculosis. Infected people can spread TB germs when they cough or sneeze. If TB patients are not given timely and thorough treatment, the disease can pose a serious threat to their health and even leave them completely unable to work, and they may also infect others. At present, although great progress has been made around the world in the prevention and control of TB, many countries, especially low-income and middle-income ones, are still afflicted by TB and suffer huge economic losses. Moreover, TB remains one of the top 10 causes of death worldwide; it is estimated that there were 10.0 million new cases of TB globally in 2017, that 1.3 million deaths were directly attributable to TB, and that TB has killed more people than any other infectious disease in the past few decades [2, 3]. China is one of the countries with a high burden of TB: its number of TB patients ranks second in the world, accounting for a quarter of the world's patients, and about 250 thousand patients die of TB every year in China.
Guangxi is a province in southern China, located at latitude 20°54′ to 26°24′ N and longitude 104°26′ to 112°04′ E. It covers a total area of about 236,700 km², had a population of over 49.26 million in 2018, and is one of the Chinese provinces most affected by TB. From 2015 to 2017, the annual incidences of TB (per 100,000 population) in China were 63.42, 61 and 60.53, respectively, while the annual incidences of TB in Guangxi were 96.41, 86.27 and 87.86, respectively. The incidence of TB in Guangxi was thus much higher than the national level, so more attention should be paid to the prevention and control of TB in this area.
Understanding the regularity of infectious diseases, analyzing their epidemic situation with existing surveillance data, and then predicting future trends can provide a scientific reference for disease prevention and control. The Box-Jenkins method is a representative time series analysis and prediction method that can take into account trend changes, periodic changes, and random disturbances in a time series. It is very useful for modeling the temporal dependence structure of a time series. At present, this method has been widely used in the prediction of infectious diseases and has achieved successful prediction results. For instance, Tian CW et al. successfully forecasted monthly cases of hand-foot-mouth disease in China; Wang T et al. suggested that the ARIMA(3,1,1)(2,1,1)12 model was reliable with high validity and could be used to predict the incidence of hemorrhagic fever with renal syndrome in Zibo; Gharbi M et al. predicted dengue incidence in Guadeloupe based on time series analysis; López-Montenegro LE et al. predicted dengue cases in Colombia from 2018 to 2022 based on the Auto-Regressive Integrated Moving Average (ARIMA) model; and Zheng Y-L et al. and Liao Z et al. successfully forecasted TB incidence using SARIMA models [10,11,12,13,14,15,16,17].
The incidence of TB in Guangxi is very high, but few related prediction studies have been published so far. To support prevention and control, we carried out a prediction study. First, we briefly analyzed the trend of TB incidence in Guangxi over the years; then, based on the characteristics of the TB incidence data in Guangxi, China, we established the best SARIMA model for prediction. Finally, future TB incidence was predicted, which can provide a scientific reference for the prevention and control of TB in Guangxi.
The data on TB cases in Guangxi from January 2012 to June 2019 were obtained from the Guangxi Center for Disease Control and Prevention, China. Population data were obtained from the official website of the Guangxi Bureau of Statistics. Based on the population data and the reported number of TB cases, we calculated the monthly incidence of TB (per 100,000 population). The data used in this study are provided as Additional file 1.
SARIMA model descriptions
The Box-Jenkins method is a well-known time series prediction method proposed by Box and Jenkins in the early 1970s. It includes the ARIMA(p,d,q) model, the autoregressive integrated moving average model, where AR denotes autoregression, p is the number of autoregressive terms, MA denotes moving average, and q is the number of moving average terms [18, 19]. If the time series contains a seasonal cycle, a seasonal difference is often needed to establish a SARIMA model; the SARIMA model with s observations per period is denoted by SARIMA(p, d, q)(P, D, Q)s. Generally, the standard statistical methodology for constructing a SARIMA(p, d, q)(P, D, Q)s model includes four steps:
First step: stationarity test. Usually, the data set is divided into two subsets: a training set and a testing set. The training set needs to be a stationary time series. If the original training data are not stationary, ordinary or seasonal differencing is required, where d is the order of the ordinary difference and D is the order of the seasonal difference. The Augmented Dickey-Fuller (ADF) test can determine whether the time series is stationary; the significance level of the test is 0.05 (if the test Prob is less than 0.05, the data are stationary).
Second step: based on the stationary time series, plot the autocorrelation function (ACF) and partial autocorrelation function (PACF). From the analysis of the ACF and PACF, the possible values of p, q, P and Q can be determined; this process requires both skill and experience. Generally, more than one tentative model is chosen in this step.
Third step: estimate the parameters of all tentative SARIMA models and test them by the least squares method. Models that pass the parameter test are feasible. Their residuals are then checked diagnostically: if the residuals are approximately white noise according to the Box-Jenkins Q test (Prob > 0.05), the SARIMA model performs well. The best SARIMA model is then selected by the Akaike information criterion (AIC) and Schwarz criterion (SC); the preferred model is the one with the lowest AIC and SC values.
Fourth step: predict the TB incidence with the preferred SARIMA model, then calculate forecast accuracy indexes such as the root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE). The smaller the RMSE, MAE and MAPE, the better the fitting and prediction performance of the SARIMA model.
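The accuracy indexes named in the fourth step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code (the study used EViews, R and Matlab); the function names and sample values are hypothetical, and expressing MAPE in percent is an assumption that matches the magnitudes reported in this article.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error between two equal-length sequences."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n

def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be nonzero)."""
    n = len(observed)
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / n

# Hypothetical monthly incidences (per 100,000), for illustration only.
observed = [10.0, 12.0, 8.0]
predicted = [11.0, 12.0, 6.0]

print(rmse(observed, predicted), mae(observed, predicted), mape(observed, predicted))
```

All three indexes are zero for a perfect forecast. RMSE penalizes large errors more heavily than MAE, while MAPE is scale-free, which is why the three are usually reported together.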
Data processing and analysis
All analyses were performed using ArcGIS 10.4, EViews 7.2, R 3.6.2 and Matlab 2012b.
From January 2012 to June 2019, a total of 587,344 TB cases and 879 TB deaths were reported in Guangxi. Fig. 1 shows that TB incidence decreased year by year and displayed a certain seasonality: incidence in the second and third quarters was higher than in the first and fourth quarters.
We used R 3.6.2 to decompose the TB incidence data and found that the data have obvious seasonality, periodicity and randomness (see Fig. 2), so it is suitable to establish a SARIMA model for prediction analysis.
The data from January 2012 to June 2019 were divided into two parts: the part from January 2012 to December 2018 was used to construct the SARIMA(p,d,q)(P,D,Q)s model, and the part from January 2019 to June 2019 was used to test the prediction performance of the model.
The SARIMA(p,d,q)(P,D,Q)s method requires the data to be stationary; otherwise, neither backcasts nor forecasts of the series can be obtained. First, the ADF test was used to test the stationarity of the original series. The Prob value was 0.94, greater than 0.05, which showed that the series was not stationary. Because there was obvious seasonality in the TB incidence series in Guangxi (see Fig. 2), we applied a first-order seasonal difference with period 12 to the original series and ran the ADF test again on the seasonally differenced data. The Prob value was less than 0.01; therefore, after the first-order seasonal difference the data were stationary, giving d = 0, D = 1 and s = 12. The test results are shown in Table 1.
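A first-order seasonal difference with period 12 simply subtracts from each monthly value the value observed 12 months earlier. The sketch below illustrates this on a synthetic series; the function name and the data are hypothetical and are not taken from the study.

```python
def seasonal_difference(series, period=12):
    """Return series[t] - series[t - period] for every t >= period."""
    if len(series) <= period:
        raise ValueError("series must be longer than the seasonal period")
    return [series[t] - series[t - period] for t in range(period, len(series))]

# Synthetic three-year monthly series (illustration only): a slow linear
# decline plus a fixed pattern that repeats every 12 months.
pattern = [0, 1, 3, 4, 4, 3, 3, 2, 2, 1, 0, 0]
series = [20.0 - 0.05 * t + pattern[t % 12] for t in range(36)]

diffed = seasonal_difference(series, period=12)
print(len(diffed), diffed[:3])
```

In this toy example the repeating pattern cancels exactly and the linear decline collapses to a constant, which illustrates how a seasonal difference alone (D = 1 with d = 0) can be enough to reach stationarity. The first 12 observations are lost in the process.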
Second, the ACF and PACF graphs of the stationary data were drawn (see Fig. 3). Based on the analysis of the ACF and PACF graphs, we established eight tentative models: SARIMA(1,0,1)(0,1,0)12, SARIMA(1,0,(2))(0,1,0)12, SARIMA((2),0,1)(0,1,0)12, SARIMA((2),0,(2))(0,1,0)12, SARIMA(2,0,(2))(0,1,0)12, SARIMA(2,0,1)(0,1,0)12, SARIMA(1,0,2)(0,1,0)12, and SARIMA(2,0,2)(0,1,0)12. The least squares method was then used to test the parameters of the eight models, and the AIC and SC values of these models were calculated; the test results are shown in Table 2. Only the SARIMA((2),0,(2))(0,1,0)12 model, which had the lowest AIC and SC, passed the parameter test (all Prob values were less than 0.05).
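For reference, the selected model can be written in backshift-operator form. The sketch below assumes that the parenthesized order (2) denotes a sparse term, that is, only the lag-2 autoregressive and lag-2 moving-average coefficients are estimated. This reading, the symbols, the sign convention on the moving-average term, and the presence of a constant c are assumptions, not statements from the article.

```latex
(1 - \phi_2 B^{2})(1 - B^{12})\, x_t = c + (1 + \theta_2 B^{2})\, \varepsilon_t
\quad\Longleftrightarrow\quad
w_t = c + \phi_2 w_{t-2} + \varepsilon_t + \theta_2 \varepsilon_{t-2},
\qquad w_t = x_t - x_{t-12}
```

Here x_t is the monthly TB incidence, B is the backshift operator (B x_t = x_{t-1}), w_t is the seasonally differenced series, and ε_t is white noise. Under this form, each fitted value depends on observations up to 14 months earlier (12 for the seasonal difference plus 2 for the autoregressive lag).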
Finally, we carried out diagnostic checking of the residuals of the SARIMA((2),0,(2))(0,1,0)12 model using the Box-Jenkins Q test. The test Prob was greater than 0.05; therefore, according to these analyses, the SARIMA((2),0,(2))(0,1,0)12 model was feasible for the prediction of TB incidence in Guangxi.
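The residual check can be illustrated with the Ljung-Box form of the Q statistic, Q = n(n+2) Σ r_k² / (n − k), where r_k is the lag-k sample autocorrelation of the residuals. This is a sketch under the assumption that the Q test referred to here is the Ljung-Box variant (the article does not say which variant was used); the function name and the sample residuals are hypothetical. The p-value step is omitted: in practice Q is compared with a chi-square distribution whose degrees of freedom equal the number of lags minus the number of estimated ARMA coefficients.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic for lags 1..max_lag of a residual series."""
    n = len(residuals)
    if not 1 <= max_lag < n:
        raise ValueError("max_lag must be between 1 and len(residuals) - 1")
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)  # n times the sample variance
    total = 0.0
    for k in range(1, max_lag + 1):
        # Lag-k sample autocorrelation of the residuals.
        r_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total

# Hypothetical residuals, for illustration only.
residuals = [0.3, -0.5, 0.1, 0.4, -0.2, -0.1, 0.6, -0.4, 0.0, 0.2]
print(ljung_box_q(residuals, max_lag=3))
```

A small Q (large Prob) means the residual autocorrelations are jointly indistinguishable from zero, which is the white-noise condition the model is required to meet.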
We used the SARIMA((2),0,(2))(0,1,0)12 model to fit the TB incidence data from March 2013 to December 2018, and the RMSE, MAE and MAPE were 0.98, 0.77 and 5.8, respectively. We then used the model to predict the TB incidence from January 2019 to June 2019, and the RMSE, MAE and MAPE were 0.62, 0.45 and 3.77, respectively. Both the fitting and the prediction error values were very small, which indicated that the SARIMA((2),0,(2))(0,1,0)12 model fitted the data well and that its prediction accuracy was high. Based on this model, we predicted the TB incidence in Guangxi from July 2019 to December 2020. The predicted values are shown in Table 3, and the fitted and predicted incidences are compared with the observed incidence in Fig. 4.
Currently, the annual TB incidence in Guangxi is much higher than the national level, although it has been decreasing slightly each year. The potential achievement is diminished by an increasing large-scale transient population, the emergence of MDR-TB, and the comorbid conditions of AIDS and non-communicable diseases, which have led to a resurgence of TB in recent years [20,21,22]. Additionally, WHO initiated the End TB Strategy with the target of a 90% reduction in new TB cases by 2035 compared with 2015, and a milestone of reducing TB incidence by 50% by 2025 relative to 2015. To accelerate progress towards such a daunting task, corresponding measures and actions are expected at both the national and international levels. At the national level, every province should make efforts, especially provinces with high incidence such as Guangxi. Appropriate plans cannot be properly formulated without a clear view of the past, current and future levels of this disease; therefore, advanced detection and early response systems for epidemics have become an integral part of effective precautions against TB and of the reasonable allocation of available health resources.
In this study, the historical trend of TB incidence in Guangxi was carefully analyzed, and a prediction model of TB incidence in Guangxi was then established using the Box-Jenkins method. This method is one of the most widely used time series forecasting techniques because of its structured modeling basis and acceptable forecasting performance. Through analysis of the trend and the decomposition graph of the original TB incidence, we found that the data had obvious seasonality, trend and randomness, so it was suitable to establish a SARIMA model for prediction analysis. Monthly TB incidence from January 2012 to December 2018 was used to construct the SARIMA model, and TB incidence from January 2019 to June 2019 was used to test its predictive ability.
The SARIMA model requires the data to be stationary. Table 1 showed that the Prob value of the ADF test was 0.94, greater than 0.05, indicating that the original data were not stationary. Considering the seasonal variation of the TB incidence data, we applied a first-order seasonal difference with a period of 12 and then used the ADF test to check the stationarity of the seasonally differenced data. The Prob value of the test was less than 0.01 (see Table 1), which indicated that the differenced data were stationary and could be used to build a SARIMA model. Then, to determine p, q, P and Q in the SARIMA(p,0,q)(P,1,Q)12 model, the ACF and PACF graphs were drawn, and eight tentative models were established from their analysis. The parameters of these tentative models were tested and the performance of the models was compared by AIC and SC. Table 2 showed that the SARIMA((2),0,(2))(0,1,0)12 model had the smallest AIC and SC, that all the Prob values of its parameter test were less than 0.01, and that the Prob value of the Box-Jenkins Q test was greater than 0.05, which indicated that the SARIMA((2),0,(2))(0,1,0)12 model was feasible for predicting the TB incidence in Guangxi. When the SARIMA((2),0,(2))(0,1,0)12 model was used to fit the original TB incidence from January 2012 to December 2018, the RMSE (0.98), MAE (0.77) and MAPE (5.80) were very small; when it was used to predict TB incidence from January 2019 to June 2019, the RMSE (0.62), MAE (0.45) and MAPE (3.77) were also very small, which indicated that the model fitted the data well and that its prediction accuracy was very high. We predicted the TB incidence in Guangxi from July 2019 to December 2020 based on the SARIMA((2),0,(2))(0,1,0)12 model (see Table 3 and Fig. 4). The results suggested that the trend of predicted TB incidence was similar to the trend in the previous two years and that TB incidence will decrease slightly. The predicted results can provide a scientific reference for the prevention and control of TB in Guangxi, China.
The incidence of tuberculosis in Guangxi is high, but few prediction studies of the disease have been carried out in recent years, even though advanced detection and early response systems have become an integral part of effective precautions against TB and of the reasonable allocation of available health resources. In view of this, we used the Box-Jenkins method to establish the SARIMA((2),0,(2))(0,1,0)12 model for predicting the TB incidence in Guangxi. The RMSE, MAE and MAPE of the SARIMA((2),0,(2))(0,1,0)12 model were very small, which indicated that the model was successful, its prediction accuracy was high, and its prediction performance was good. Based on the SARIMA((2),0,(2))(0,1,0)12 model, we predicted the TB incidence in Guangxi from July 2019 to December 2020. The results suggested that TB incidence will decrease slightly and that its trend will be similar to that of previous years. The prediction results can help in reallocating resources to improve the control and prevention of TB in Guangxi, China.
Availability of data and materials
The data used in this study are available from the corresponding author on reasonable request and with permission of the Guangxi center for Disease Control and Prevention, China. The relevant data is provided as Additional file 1.
PACF: Partial autocorrelation function
AIC: Akaike information criterion
SARIMA: Seasonal autoregressive integrated moving average
ADF: Augmented Dickey-Fuller test
RMSE: Root mean square error
MAE: Mean absolute error
MAPE: Mean absolute percentage error
Zhao Y, Li M, Yuan S. Analysis of transmission and control of tuberculosis in mainland China, 2005-2016, based on the age structure mathematical model. Int J Environ Res Public Health. 2017;14:1192.
WHO. Global tuberculosis report 2018. http://www.who.int/tb/publications/global_report/en/. (Accessed on 4 Dec 2018).
Moosazadeh M, Khanjani N, Nasehi M, et al. Predicting the incidence of smear positive tuberculosis cases in Iran using time series analysis. Iran J Public Health. 2015;44:1526–34.
Tian CW, Wang H, Luo XM. Time-series modelling and forecasting of hand, foot and mouth disease cases in China from 2008 to 2018. Epidemiol Infect. 2019;147:e82.
Wang T, Zhou Y, Wang L, et al. Using an autoregressive integrated moving average model to predict the incidence of hemorrhagic fever with renal syndrome in Zibo, China, 2004-2014. Jpn J Infect Dis. 2016;69(4):279–84.
Gharbi M, Quenel P, Gustave J, et al. Time series analysis of dengue incidence in Guadeloupe, French West Indies: Forecasting models using climate variables as predictors. BMC Infect Dis. 2011;11:166.
López-Montenegro LE, Pulecio-Montoya AM, Marcillo-Hernández GA. Dengue cases in Colombia: mathematical forecasts for 2018-2022. MEDICC Rev. 2019;21(2–3):38–45.
Zheng Y-L, Zhang L-P, Zhang X-L, et al. Forecast model analysis for the morbidity of tuberculosis in Xinjiang, China. PLoS ONE. 2015;10(3):e0116832.
Liao Z, Zhang X, Zhang Y, et al. Seasonality and Trend Forecasting of Tuberculosis Incidence in Chongqing, China. Interdiscip Sci. 2019;11(1):77–85.
Carvajal TM, Viacrusis KM, Hernandez LFT, et al. Machine learning methods reveal the temporal pattern of dengue incidence using meteorological factors in metropolitan Manila, Philippines. BMC Infect Dis. 2018;18:183.
Mao Q, Zhang K, Yan W, et al. Forecasting the incidence of tuberculosis in China using the seasonal auto-regressive integrated moving average (SARIMA) model. J Infect Public Health. 2018;11(5):707–12.
Anokye R, Acheampong E, Owusu I, et al. Time series analysis of malaria in Kumasi: Using ARIMA models to forecast future incidence. Cogent Soc Sci. 2018;4(1):1461544.
Withanage GP, Viswakula SD, Yi SGN, et al. A forecasting model for dengue incidence in the district of Gampaha, Sri Lanka. Parasit Vectors. 2018;11(1):262.
Siregar FA, Makmur T, Saprin S. Forecasting dengue hemorrhagic fever cases using ARIMA model: a case study in Asahan district. In: IOP Conference Series Materials Science and Engineering; 2018. p. 300.
Tohidinik HR, Mohebali M, Mansournia MA, et al. Forecasting zoonotic cutaneous leishmaniasis using meteorological factors in eastern Fars province, Iran: a SARIMA analysis. Tropical Med Int Health. 2018;23(8):860–9.
Xu Q, Li R, Liu Y, et al. Forecasting the incidence of mumps in Zibo City based on a SARIMA model. Int J Environ Res Public Health. 2017;14(8):925.
Wang H, Tian CW, Wang WM, et al. Time-series analysis of tuberculosis from 2005 to 2017 in China. Epidemiol Infect. 2018;146(8):935–9.
Box GEP, Jenkins GM, Reinsel GC, et al. Time series analysis: forecasting and control, 5th edition. J Oper Res Soc. 2015;22(2):199–201.
Box GEP, Jenkins GM, Reinsel GC. Time series analysis: forecasting and control. 3rd ed. journal of time. 2010;31(4):303.
Moon MS, Kim SS, Moon H. (i) Tuberculosis of the spine: Current views in diagnosis, management, and setting a global standard. Orthopaedics Trauma. 2013;27(4):185–94.
Maitra A, Bates S, Shaik M, et al. Repurposing drugs for treatment of tuberculosis: a role for non-steroidal anti-inflammatory drugs. Br Med Bull. 2016;118(1):138–48.
Berlin L. Tuberculosis: resurgent disease, renewed liability. AJR Am J Roentgenol. 2008;190(6):1438–44.
We would like to express our gratitude to the peer reviewers for carefully reviewing our manuscript and for their useful comments.
This work was supported by the Natural science funding of Xinjiang Uygur Autonomous Region (2017D01C189), China.
Ethics approval and consent to participate
The research did not involve any direct participation by human subjects. The TB data were extracted from monthly reports maintained by Guangxi center for Disease Control and Prevention.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zheng, Y., Zhang, L., Wang, L. et al. Statistical methods for predicting tuberculosis incidence based on data from Guangxi, China. BMC Infect Dis 20, 300 (2020). https://doi.org/10.1186/s12879-020-05033-3
- SARIMA model