Research on the predictive effect of a combined model of ARIMA and neural networks on human brucellosis in Shanxi Province, China: a time series predictive analysis

Background Brucellosis is a major public health problem that seriously affects developing countries and could cause significant economic losses to the livestock industry and great harm to human health. Reasonable prediction of the incidence is of great significance in controlling brucellosis and taking preventive measures. Methods Our human brucellosis incidence data were extracted from Shanxi Provincial Center for Disease Control and Prevention. We used seasonal-trend decomposition using Loess (STL) and monthplot to analyse the seasonal characteristics of human brucellosis in Shanxi Province from 2007 to 2017. The autoregressive integrated moving average (ARIMA) model, a combined model of ARIMA and the back propagation neural network (ARIMA-BPNN), and a combined model of ARIMA and the Elman recurrent neural network (ARIMA-ERNN) were established separately to make predictions and identify the best model. Additionally, the mean squared error (MAE), mean absolute error (MSE) and mean absolute percentage error (MAPE) were used to evaluate the performance of the model. Results We observed that the time series of human brucellosis in Shanxi Province increased from 2007 to 2014 but decreased from 2015 to 2017. It had obvious seasonal characteristics, with the peak lasting from March to July every year. The best fitting and prediction effect was the ARIMA-ERNN model. Compared with those of the ARIMA model, the MAE, MSE and MAPE of the ARIMA-ERNN model decreased by 18.65, 31.48 and 64.35%, respectively, in fitting performance; in terms of prediction performance, the MAE, MSE and MAPE decreased by 60.19, 75.30 and 64.35%, respectively. Second, compared with those of ARIMA-BPNN, the MAE, MSE and MAPE of ARIMA-ERNN decreased by 9.60, 15.73 and 11.58%, respectively, in fitting performance; in terms of prediction performance, the MAE, MSE and MAPE decreased by 31.63, 45.79 and 29.59%, respectively. Conclusions The time series of human brucellosis in Shanxi Province from 2007 to 2017 showed obvious seasonal characteristics. The fitting and prediction performances of the ARIMA-ERNN model were better than those of the ARIMA-BPNN and ARIMA models. This will provide some theoretical support for the prediction of infectious diseases and will be beneficial to public health decision making.

benchmark to evaluate many new modelling methods [15]. The other category is a prediction model based on machine learning theory that the nonlinear mapping performance is strong [16], such as Back Propagation Neural Network (BPNN) [17], Multivariate Adaptive Regression Splines (MARS) [18], Random Forest (RF) [19], Multilayer Perceptron (MLP) networks [20], Support Vector Machines (SVM) [20], and Radial Basis Function (RBF) [21], and has been used to predict the incidence of infectious diseases. Nevertheless, a major limitation of models such as BPNN and SVR is that they are intrinsically static; that is, they do not account for the dynamic nature of infectious disease sequences [22]. These static models can learn information only about the current time; they do not take advantage of historical information. The Elman recurrent neural network (ERNN) model obtains previous information through the receiving layer and can combine current information with historical information [23]. This characteristic makes it one of the most powerful tools for the prediction of nonlinear time series [24]. Time series is considered to consist of both linear and nonlinear components [25,26]. Neither a single linear model nor a nonlinear model can capture the different patterns in the time series [27]. Both types of prediction models have problems with incomplete information extraction, and the prediction accuracy needs to be further improved.
In 1969, Bates J M and Granger C W J elaborated on the combined forecasting method, and their research results attracted great attention from researchers in related fields [28]. The combined model used the unique advantages of different models to analyse the characteristics of time series and achieve accurate prediction. In 2011, Khashei M et al. applied a combined model of ARIMA and an artificial neural network to time series prediction. The results showed that the combined model had better prediction performance than the single model [29]. In 2019, Li S et al. established a combined model of ARIMA and BPNN (ARIMA-BPNN) using coal consumption data in India. The research showed that the combined model had significantly higher prediction accuracy than the single model [30]. Currently, a combined model based on ARIMA and ERNN (ARIMA-ERNN) is mainly applied to air pollution prediction [31], spot price forecasting [32], error compensation [33] and other fields. Nevertheless, there have been no reports in the use of the combined model to predict the epidemic trend of human brucellosis. In this study, the ARIMA-ERNN model was established based on the monthly incidence data of human brucellosis from 2007 to 2017 in Shanxi Province and compared with the ARIMA-BPNN and ARIMA models to evaluate the fitting and predictive effects of the three models. This study will provide certain theoretical support for the prevention and control of human brucellosis in Shanxi Province and offer some reference for the prediction of infectious diseases.

Data sources
In this study, the reported cases of human brucellosis from January 2007 to December 2017 were obtained from Shanxi Provincial Center for Disease Control and Prevention. All cases were diagnosed under the '2007 Diagnostic Criteria of Brucellosis (WS269-2007)' [34,35]. Relevant demographic data were obtained from 'the Statistical Yearbook of Shanxi Province'. The human brucellosis cases from January 2007 to December 2017 were assembled as monthly counts. The monthly incidence data of human brucellosis from January 2007 to December 2016 were used to build the ARIMA model. The fitted data of the ARIMA model were used as the input of neural networks and were split into two sections: a training set and a verification set. The training set data from January 2007 to December 2015 were employed to construct the neural network, and the verification set data from January 2016 to December 2016 were used to verify the neural network. The monthly incidence data from January 2017 to December 2017 were used as the test set to test the prediction performance of the three models.

Analysis of seasonal characteristics
STL [36] can be used to decompose time series with seasonal characteristics into long-term trends, seasonal trends, and random effects as follows: where X t is the actual value of human brucellosis at time t and T t , S t and I t are the long-term trends, seasonal trends and random effects, respectively. Since STL is only suitable for the decomposition of the addition model, logarithmic or Box-Cox transformation is required for the multiplication model, and then the monthplot is used to identify the high-occurrence season of human brucellosis.

ARIMA model
ARIMA [37], a classic model in many time series analyses, is usually constructed as ARIMA (p, d, q) (P, D, Q) s as follows [23]: where Θ P , θ p , Φ Q and ϕ q are polynomials of order P, order p, order Q and order q, respectively. D and d represent the order of trend differencing and seasonal differencing, which are determined when the original time series is stable. p, q, P, Q and s represent the order of the autoregressive, moving average, seasonal autoregressive, seasonal moving average and seasonal periodicity, respectively, which are determined by the autocorrelation function (ACF) plot and the partial autocorrelation function (PACF) plot of the adjusted series. In this study, the monthly incidence of human brucellosis from January 2007 to December 2016 was used to build the ARIMA model, and the process included the following steps. First, the original series was smoothed with a differential method, and the Augmented Dickey-fuller (ADF) test was used to check the stationarity of the adjusted sequence. The white noise test method, also known as the Ljung-Box test, was used to determine whether the adjusted sequence was caused by random effects. If the p value was less than the significance level, the adjusted sequence was considered to be stationary and was not a random sequence. Second, the plots of the ACF and PACF of the adjusted sequence were used to provide a rough guide for reasonable models. Then, a test statistic was constructed to determine whether the residuals of candidate models were random effects, and maximum likelihood estimation (MLE) was used to perform the parameter test of the candidate models. At the same time, the Akaike information criterion (AIC), the Schwarz Bayesian information criterion (SBC) and the coefficient of decision (R 2 ) were used to select the optimal model. When the AIC and SBC values of the models are relatively close, the model with the largest R 2 is selected [38]. Finally, the incidence data from January 2017 to December 2017 were used to test the prediction effect of the optimal model.

Artificial neural networks (ANNs)
ANNs [39] are nonlinear adaptive systems consisting of a large number of neural units. They are mainly used to establish an appropriate model by adjusting the connection weight between neurons to meet the requirements to solve practical problems. According to the different information flow directions of the neural network operation process, they can be divided into two basic forms: feedforward (static) neural networks and feedback (dynamic) neural networks. Therefore, two representative models, BPNN [40] and ERNN [23], are respectively used in this paper to establish the combined model. BPNN is a classic multilayer feedforward neural network based on the error backpropagation algorithm and consists of an input layer, a hidden layer and an output layer. The neurons in the three layers are fully connected in order, while the neurons in the same layer are not connected, and the multilayer design enables them to mine more information and perform nonlinear mapping well. The essence of BPNN learning is to minimize the MSE between predicted and actual values by adjusting the connection weight between the input layer, the hidden layer and the output layer. The learning process is divided into two parts: forward propagation of information and backward feedback of errors. The information from the input layer through the hidden layer reaches the output layer, and the predicted value is obtained. When the error between the predicted and actual values does not satisfy the requirements, the error back propagation adjusts the connection weights of each layer and iterates the process until the requirements are met. The mathematical formulas of BPNN used in this study are shown as follows.
where n, h and o are the neuron numbers of the input layer, the hidden layer and the output layer, respectively. X i (t) is the input of the input layer at time t. S j (t) and Y k (t) are the outputs of the hidden and output layers, respectively. V ij (i = 1, 2, ⋯, n; j = 1, 2, ⋯, h) and W jk (k = 1, 2, ⋯, o) represent the connection weights of the input layer-the hidden layer and the hidden layer-the output layer, respectively. f 1 , f 2 are activation functions of BPNN. With repeated learning, the model prediction accuracy is maximized. To obtain the most effective model, it is often necessary to define the model during the training process. In this paper, the hidden layer of BPNN selects the tan-sigmoid function, the output layer selects the linear function, the training function is trainlm, and the performance index is MSE. The parameters of the network are set to 10,000 iterations, the learning rate is 0.01, and the error is 0.004. The number of hidden layer neurons is calculated using the following empirical formula, where a is a constant between 1 and 10: ERNN is a classical nonlinear local recursive network. In contrast to the feedforward neural network, the receiving layer is added to the hidden layer to achieve dynamic memory capabilities. ERNN consists of four parts: an input layer, a hidden layer, a receiving layer and an output layer. The input layer transmits signals. The hidden layer receives the input from the input layer and the feedback input of the receiving layer, and its self-joining mode has a strong sensitivity to time series data. The receiving layer stores the output value of the previous hidden layer and passes it to the current hidden layer by a one-step delay operator to achieve the purpose of dynamic memory. The output layer receives the output of the hidden layer, mainly the role of linear weighting. The learning process of ERNN is a process of learning and training sample data, obtaining dynamic characteristics between input and output parameters, and ultimately obtaining stable network parameters. In this paper, the training function is traingdx, the number of hidden layer neurons is also calculated by the above empirical formula, and other parameters are the same as the BP neural network. The mathematical formulas of ERNN used in this study are shown as follows: Similar to the BPNN, n, h and o are the neuron numbers of the input layer, the hidden layer and the output layer, respectively. X i (t) is the input of input layer at time t. S r (t − 1), S j (t) and Y k (t) are the outputs of the receiving, hidden and output layers, respectively. U rj (r = 1, 2, ⋯, h; j = 1, 2, ⋯, h), V ij (i = 1, 2, ⋯, n) and W jk (k = 1, 2, ⋯, o) represent the connection weights of the receiving layerthe hidden layer, the input layer-the output layer and the hidden layer-the output layer, respectively. g 1 , g 2 are activation functions of ERNN.

ARIMA-BPNN model and ARIMA-ERNN model
The ARIMA model is suitable for extracting the linear components of the original time series, but it loses nonlinear information in the residual [41]. The nonlinear mapping ability of ANNs can reduce the error of the ARIMA model, so artificial neural networks based on the optimal ARIMA model are constructed to improve the prediction accuracy of the model. The specific steps are as follows: First, the optimal ARIMA model was established based on the original series, the fitting value was obtained, and the error was calculated by the following formula: where y t is the actual value of the original series, L t ∧ is the fitting value of the optimal ARIMA, and e t is the error, also known as the residual. Since a first-order difference and a seasonal difference were performed in building the optimal ARIMA model, the incidence data of the first 13 months were lost in this step. Second, the data from February 2008 to December 2015 were used to build the BPNN or ERNN, and the data from January 2016 to December 2016 were used to verify the neural network. The input values were the fitting values of the optimal ARIMA model and the corresponding time information, and the actual values were taken as output. In this study, to avoid unnecessary results or difficult training processes causing algorithm convergence problems, we used the mapminmax function to normalize the input and output data [23]. Third, BPNN and ERNN continuously learned and trained the network through the input data set and the output data set, and we selected MSE as the evaluation index of network performance. When the MSE is the smallest, the corresponding BPNN and ERNN have the best fitting effect. Finally, the predicted values of the optimal ARIMA model from January 2017 to December 2017 were used as the input values of the combined model to obtain the output predicted values, and the inverse normalization method was used to restore the output predicted value of the combined models into meaningful data.

Indicators of model performance
Three performance indexes, MSE, MAE and MAPE, are used to assess the fitting and prediction effects of those models. The smaller the value, the better the model performance.
X k is the actual value at time k. X k ∧ is the predicted value of the model. N is the number of the incidence data.

Data analysis
STL was performed with the stl function of R statistical software version 3.1.2, the ARIMA model was built on the appropriate module of SAS Software version 9.2, and the combined model was built on MATLAB2014a.

Seasonal characteristics of human brucellosis
STL was used to study the time series of human brucellosis in Shanxi Province from 2007 to 2017, and the results are shown in Fig. 1. The grey bars of the figure represent the same magnitude and were used to compare the sizes of each part. The original data (data), seasonal trends (seasonal), long-term trends (trend) and random effects (remainder) are shown from top to bottom. Based on the seasonal part, human brucellosis in Shanxi Province showed obvious seasonality and periodicity, with a cycle of 1 year. The trend part revealed that the incidence increased from 2007 to 2014 and decreased from 2015 to 2017. However, the seasonal decomposition plot could not determine the peak season, which we solved by using the monthplot. We found that the long-term trend of reported cases was basically consistent in the same month of each year, and the data indicated that the months from March to July were high-risk months, of which the reported cases were the highest in May (Fig. 2).

ARIMA model
The monthly incidence data of human brucellosis from January 2007 to December 2016 in Shanxi Province were used to develop the ARIMA model (Fig. 3). We also observed an upward trend from 2007 to 2014 and a significant decline from 2015 to 2016. The original series became stationary after the first-order difference and a seasonal difference, and the adjusted sequence was not a random effect (Fig. 4 and Table 1). The ARIMA model could be built at this time. Since the periodic change of the original series was 1 year, the parameters d, D and s for the ARIMA model were set to 1, 1 and 12, respectively. The possible values for P, Q, p, and q were determined according to the plot of ACF and PACF of the adjusted sequence (Fig. 5). The residual sequence of those fitting models was a random sequence (Table 2). Therefore, several alternative models could be initially identified by the residual test: ARIMA (1,1,0) (1,1,0) 12 , ARIMA (0,1,1) (0,1,1) 12 , ARIMA (1,1,1) (1,1,0) 12 , and ARIMA (1,1,1) (0,1,1) 12 .
The MLE was used to estimate the parameters of the candidate model. According to the results of the parameter estimates and fitting index, we found that the parameters of the ARIMA (0, 1, 1) (0, 1, 1) 12 model were statistically significant and that the residual sequence of the model was a random sequence. In addition, the AIC and the SBC of this model were the smallest, and the R 2 was the largest (Table 2). Therefore, the ARIMA (0, 1, 1) (0, 1, 1) 12 model was the optimal model for prediction.

ARIMA-BPNN model and ARIMA-ERNN model
According to the above formula, the hidden layer neurons of the BPNN and ERNN were between 3 and 14. We tried different neuron numbers in the hidden layer ( Table 3) and found that when the numbers of hidden layer neuron in the BPNN and ERNN were 7 and 11, respectively, the performance of the two models was optimal; that is, the structure of BPNN was 2-7-1, and the structure of ERNN was 2-11-1. Finally, the predicted values of the ARIMA (0,1,1) (0,1,1) 12 model from January 2017 to December 2017 were used as the inputs of BPNN with a structure of 2-7-1 and ERNN with a structure of 2-11-1, respectively, and the output values were the predicted values of the combined models.

Comparison of the three models
The optimal ARIMA model, the ARIMA-BPNN model and the ARIMA-ERNN model were used to predict the incidence of human brucellosis in Shanxi Province from January 2017 to December 2017. The predicted values of the three models and the incidence of human brucellosis are shown in Fig. 6. The fitting and prediction performances of the three models were compared by MSE, MAE and MAPE ( Table 4). The combined model was better than the single ARIMA model, and the ARIMA-ERNN model was better than the ARIMA-BPNN model.
Compared with those of the ARIMA model, the MAE, MSE and MAPE of the ARIMA-ERNN model decreased

Discussion
Since 2000, with the rapid development of agriculture and the animal husbandry economy in Shanxi Province, human brucellosis has become one of the fastestgrowing infectious diseases in Shanxi Province [42]. The incidence of human brucellosis in Shanxi Province showed an upward trend from 2007 to 2014 and peaked in 2014 (Fig. 3). This may be related to the implementation of the Central Transfer Payment Brucellosis Prevention and Control Project since 2006 and the pilot project of human brucellosis prevention and treatment from 2008 to 2010. In the early stage, the persistence of risk factors for human brucellosis and increased awareness of brucellosis among residents and medical institutions led to an increase in epidemiology reporting. The incidence decreased yearly from 2015 to 2017, which may be due to the successful implementation of the above two initiatives [43]. The analysis of seasonal characteristics (Fig. 1) shows significant seasonal characteristics, mainly in the period from March to July. The main reason may be related to farming operations. In the spring, herders have close contact with livestock because of shearing. Summer is the peak season of delivery of livestock such as cattle and sheep, which greatly increases the chances of contact with pathogenic factors during this process [44]. Moreover, most human brucellosis infections occur in spring and summer, which are attributed to meteorological and temperature factors. As the temperature and relative humidity decrease in autumn and winter, the survival rate of pathogenic bacteria decreases, thereby reducing the chance of infection in humans [45]. Therefore, prevention and control measures for brucellosis should consider seasonal fluctuations, and some targeted interventions should be performed at the peak of the epidemic. This suggests that we should pay special attention to protection when we are in contact with cattle, sheep and other livestock and implement active monitoring measures.
Accurate prediction of epidemic trends is of great significance for the prevention and control of human brucellosis [19]. The occurrence of brucellosis is subject to many factors, and it is difficult to collect data on influencing factors. However, the time series prediction model can overcome the shortcomings of conventional mathematical-statistical methods in the face of this situation, and all the complex external factors are attributed to the time factor to predict the future incidence. The ARIMA model is one of the most commonly used methods in infectious disease prediction and has been proven to have high accuracy [38]. It does not require additional variables and is more practical when the data for other influencing factors are not available. In this paper, we used the optimal model ARIMA (0, 1, 1) (0, 1, 1) 12 as the basic model for evaluating the performance of other models, and the results showed that the predicted value of the optimal ARIMA (0, 1, 1) (0, 1, 1) 12 model was essentially consistent with the actual value, but there was still a certain gap. The possible reason is that the real-time series are generally a combination of linear and nonlinear relationships. The ARIMA model can extract the linear components of the time series, but it loses the nonlinear information in the residual. An artificial neural network is an emerging technology that can imitate the learning and reasoning process of the human brain and nervous system and has a nonlinear mapping ability. In this paper, we used human brucellosis data to compare the performance of the ARIMA-BPNN, ARIMA-ERNN and ARIMA models in fitting and prediction. The study found that compared with the ARIMA model, the MAE, MSE and MAPE of the ARIMA-ERNN and ARIMA-BPNN models had different  The reason may be that the BPNN regards the prediction process as static system modelling, while ERNN is based on the structure of the BPNN and uses the receiving layer to provide its function of mapping dynamic characteristics. Therefore, ERNN can better adapt to event changes and fit time series, thus achieving the highest prediction accuracy.
To the best of our knowledge, this is the only study to explore a combined model of ARIMA and ERNN for predicting the incidence of human brucellosis. Its  advantage is that the ARIMA-ERNN model combines the advantages of ARIMA in linear features and a neuron network in nonlinear features and enhances the capability of a single ARIMA while retaining the advantage of its simplicity in utilizing only incidence time series data as input. Second, based on the structure of the BPNN, the ERNN adds a corresponding receiving layer in the hidden layer to provide its dynamic memory and strong sensitivity to time series, which are more suitable for analysing human brucellosis. Third, the use of the ARIMA-ERNN model contributes to rational allocation of limited public health resources and the early prevention and control of human brucellosis. Nevertheless, there are also some limitations. First, the epidemic pattern and incidence of human brucellosis are different in different areas. Whether the ARIMA-ERNN model is suitable for other regions needs further study [47]. Second, the incidence of human brucellosis is vulnerable to many factors [19]. This study used only monthly incidence data, which may have impacted the performance of the models. Third, only two combinatorial models are established in this study, and the superiority of the ARIMA-ERNN model and other models remains to be verified. In the future, we will incorporate the influencing factors of human brucellosis into the prediction model and compare the ARIMA-ERNN with other models.

Conclusions
In this study, the time series of human brucellosis in Shanxi Province from 2007 to 2017 showed obvious seasonal characteristics and a trend of first increasing and then decreasing. The fitting and prediction performances of the ARIMA-ERNN model were better than those of the ARIMA-BPNN and ARIMA models, and the ARIMA-BPNN model was better than the ARIMA model. The ARIMA-ERNN model was more suitable for predicting the incidence of human brucellosis than the ARIMA and ARIMA-BPNN models.