
Zoonotic outbreak risk prediction with long short-term memory models: a case study with schistosomiasis, echinococcosis, and leptospirosis

Abstract

Background

Zoonotic infections, characterized by high pathogen diversity, wide geographic spread and great societal harm, have become a major global public health problem. Early and accurate prediction of their outbreaks is crucial for disease control. This study aimed to develop risk prediction models for zoonotic diseases based on time-series incidence data, using three zoonotic diseases in mainland China as case studies.

Methods

The incidence data for schistosomiasis, echinococcosis, and leptospirosis were downloaded from the Scientific Data Centre of the National Ministry of Health of China and were processed by interpolation, dynamic curve reconstruction and time series decomposition. Data were decomposed into three distinct components: the trend component, the seasonal component, and the residual component. The trend component was used as input to construct the Long Short-Term Memory (LSTM) prediction model, while the seasonal component was used to compare periods and amplitudes. Finally, the accuracy of the hybrid LSTM prediction model was comprehensively evaluated.

Results

This study employed trend series of incidence numbers and incidence rates of three zoonotic diseases for modeling. The predicted incidence numbers and incidence rates were very close to the real incidence data. Model evaluation revealed that the prediction error of the hybrid LSTM model was smaller than that of the single LSTM. Thus, these results demonstrate that using trend sequences as model input leads to better-fitting predictive models.

Conclusions

Our study successfully developed LSTM hybrid models for disease outbreak risk prediction using three zoonotic diseases as case studies. We demonstrate that the LSTM, when combined with time series decomposition, delivers more accurate results compared to conventional LSTM models using the raw data series. Disease outbreak trends can be predicted more accurately using hybrid models.


Background

Zoonotic diseases are infectious diseases that are transmitted between humans and animals [1], including COVID-19, which has devastated the world in recent years [2], swine flu [3], avian influenza [4], dengue fever [5], Zika [6], Rift Valley fever [7], monkeypox [8], and Ebola [9], among others. Zoonoses, characterized by high pathogen diversity, high disease complexity, wide geographic spread and great societal harm, have become a major global public health problem [1]. Studies estimate that approximately 60% of global infections affecting human health originate from animals, underscoring the critical importance of disease prevention and management [10]. Schistosomiasis, often overshadowed by other tropical infectious diseases, is the second most widespread category of human parasitic infections worldwide, following only malaria [11]. Schistosomes can induce a variety of systemic ailments, including intestinal and liver-spleen abnormalities, as well as urinary and reproductive issues, and in some cases severe complications and cancer over time [12]. While some provinces and cities in southern China have made progress in eliminating or interrupting the transmission of Schistosoma haematobium due to improved living conditions, the risk factors that can trigger schistosomiasis outbreaks still persist [11]. Another prevalent zoonotic disease in China is echinococcosis, also known as hydatid disease. China accounts for 91% of alveolar (multilocular) echinococcosis cases and 40% of cystic echinococcosis cases worldwide [13, 14]. Leptospirosis is yet another globally distributed zoonotic disease, common in tropical regions of Southeast Asia and eastern sub-Saharan Africa [15]. This disease can reach an annual incidence rate of up to 6.85%, resulting in 1 million clinical infections [16].
Although China classified leptospirosis as a legally reportable category B infectious disease in 1955, its incidence has been steadily rising in certain provinces, including Fujian and Yunnan [17,18,19]. Currently, there are no effective therapeutic drugs for most zoonotic diseases. Thus, the early and accurate prediction of their outbreaks plays an important role in vaccine development, preparation of biosafety measures, transmission prevention and disease control.

Traditional methods for identifying disease outbreak risks have primarily relied on pathogen biology, immunology, molecular biology, and imaging algorithms. The limitations of these conventional techniques include the potential for human errors in manual identification, redundancy in discriminatory operations, and a susceptibility to failing to detect initial disease outbreaks [11]. The incidence data of infectious diseases encapsulates both static information, such as onset duration and recurrence intervals, and dynamic information, such as incidence time series [20]. Early targeted preventive and control measures are invaluable for effective public health management. Such proactive measures not only prevent infections but also hold significant promise for advancing clinical diagnosis and treatment for people. In recent years, tremendous progress has been made in building disease risk prediction models through the use of artificial intelligence (AI) technology, which has profound clinical significance in effectively controlling the spread of infectious diseases [21]. AI serves as a valuable complement to traditional disease transmission risk identification methods, such as those rooted in pathogen biology. The construction of machine learning models to predict disease incidence rate trends is highly desirable. As an example, Manuel et al. [22] harnessed physical examination indicators derived from customized health surveys to develop the Cardiovascular Disease Population Risk Prediction Tool (CVDPoRT). This tool offers an accurate and efficient means of predicting cardiovascular disease risk, eliminating the need for cumbersome clinical indicator measurements.

The Long Short-Term Memory (LSTM) model, a widely used time series prediction model, stands out for its ability to retain essential historical information while discarding the rest through its 'forget gate'. In contrast to conventional Recurrent Neural Networks (RNNs), which often struggle to model inputs with long-range temporal correlations, LSTM excels at encapsulating the dynamics of time series data within the neural network. Consequently, LSTM is highly suitable for time series data modeling and holds significant potential for predicting disease outbreak risks [23, 24]. In clinical practice, LSTM is often used in combination with other models to construct hybrid models that enhance predictive accuracy. For example, Korsakov et al. [25] used LSTM and logistic regression (LR) to construct a prediction algorithm that extracts features such as vital signs, diagnoses, and medications from electronic health record data. Although the raw data contained missing or erroneous values, the area under the receiver operating characteristic curve (AUC) for cardiovascular risk prediction was as high as 0.78 to 0.79, indicating the high predictive accuracy of this LSTM hybrid model. Mehdipour Ghazi et al. [26] also used LSTM to model six volumetric magnetic resonance imaging (MRI) biomarkers of Alzheimer's disease, and the LSTM model achieved the best predictive performance even when up to 74% of the data were missing. There are many other examples of such hybrid LSTM models for disease prediction [21, 24]. LSTM has a wide range of applications, not only for textual datasets but also for speech datasets, image datasets, etc., and places few requirements on the type of time series data.

In developing predictive models such as LSTM models, graphic visualizations often suffer from the inclusion of noise present in real data, posing a challenge in discerning genuine trends from inherent seasonal variations [27]. Consequently, it has become common practice to decompose time series data into distinct components, which can effectively eliminate superfluous noise while preserving essential elements that represent the underlying sources of variation [28,29,30,31,32]. This decomposition process not only aids in the removal of irrelevant noise but also serves to extract more meaningful signals, ultimately contributing to enhanced model accuracy.

The zoonotic diseases discussed above, including schistosomiasis, echinococcosis, and leptospirosis, remain a significant threat to human health, as indicated by statistics from the Scientific Statistics Center of the Ministry of Health of China. To date, no predictive modeling of the outbreak risk of schistosomiasis, echinococcosis and leptospirosis has been reported. Consequently, our study aims to analyze the outbreak risk of these diseases using monthly incidence data. In this article, we use the LSTM model to investigate disease outbreak risk by examining time series features of infectious diseases.

Methods

Data sources

The incidence time series data used in this study were sourced from the Scientific Data Center of the National Ministry of Health of China. We focused on the analysis of three most prevalent zoonotic diseases: schistosomiasis, echinococcosis, and leptospirosis. The dataset encompassed monthly and annual incidence statistics for each year, spanning from 2004 to 2019.

Time series decomposition

In this study, we postulated that the incidence time series can be decomposed into three distinct components: the trend component (\({\widehat{T}}_{t}\)), the seasonal component (\({S}_{t}\)) and the random (residual) component (\({R}_{t}\)). Assuming a fixed period and amplitude for the seasonal component, we opted for the additive decomposition method, a classical approach outlined in formula (1).

The additive decomposition was executed in three primary steps. First, \({\widehat{T}}_{t}\) was calculated using the moving average method: a sliding window averaged each sub-series consisting of the current observation, \({n}_{1}\) observations to its left, and \({n}_{2}\) observations to its right, as in formula (2), with the window length (order) \(m\) given by formula (3). The average at time \(t\) represented \({\widehat{T}}_{t}\); this moving average of order \(m\) (\(m\)-MA) smooths out some of the randomness in the sequence while preserving the trend component. Second, the detrended values were obtained by subtracting the trend component \({\widehat{T}}_{t}\) from the original time series \({y}_{t}\), and \({S}_{t}\) was then derived by averaging these detrended values at the same position in each seasonal cycle. Third, \({\widehat{T}}_{t}\) and \({S}_{t}\) were subtracted from the original series, resulting in the residual component \({R}_{t}\), as expressed in formula (4). All the aforementioned computational steps were integrated into an R-based decomposition function.

$${y}_{t}={S}_{t}+{\widehat{T}}_{t}+{R}_{t}$$
(1)
$${\widehat{T}}_{t}=\frac{1}{m}\sum_{j=-{n}_{1}}^{{n}_{2}}{y}_{t+j}$$
(2)
$$m={n}_{1}+{n}_{2}+1$$
(3)
$${R}_{t}={y}_{t}-{\widehat{T}}_{t}-{S}_{t}$$
(4)
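The decomposition in formulas (1) to (4) can be illustrated with a minimal Python sketch. It is not the R function used in the study: the function names, the zero-centring of the seasonal pattern, and the handling of the series edges (left as `None` where the window does not fit) are assumptions made for illustration.

```python
from statistics import mean


def moving_average_trend(y, n1, n2):
    """Formula (2): mean of y[t - n1 .. t + n2]; None where the window does not fit."""
    m = n1 + n2 + 1  # formula (3): order of the moving average
    trend = [None] * len(y)
    for t in range(n1, len(y) - n2):
        trend[t] = sum(y[t - n1 : t + n2 + 1]) / m
    return trend


def additive_decompose(y, period, n1, n2):
    """Split y into trend, seasonal and residual parts so that y = T + S + R."""
    trend = moving_average_trend(y, n1, n2)
    detrended = [None if tr is None else v - tr for v, tr in zip(y, trend)]

    # Seasonal pattern: average the detrended values at each position of the cycle.
    pattern = []
    for k in range(period):
        values = [d for i, d in enumerate(detrended) if d is not None and i % period == k]
        pattern.append(mean(values))
    # Assumption: centre the pattern so that it sums to zero over one cycle.
    offset = mean(pattern)
    pattern = [p - offset for p in pattern]
    seasonal = [pattern[i % period] for i in range(len(y))]

    # Formula (4): what remains after removing the trend and seasonal parts.
    residual = [None if tr is None else v - tr - s for v, tr, s in zip(y, trend, seasonal)]
    return trend, seasonal, residual
```

For example, `additive_decompose(series, period=4, n1=1, n2=2)` uses a window of order \(m=4\); the choice of \({n}_{1}\) and \({n}_{2}\) here is hypothetical and only needs to satisfy formula (3).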

Constructing LSTM predictive models

The LSTM model, a well-established type of neural network, is particularly suited for time series forecasting. In this study, we used the trend sequence as input data, which was divided into training and test sets according to temporal order. Before feeding the data into the model, we normalized the input values by scaling them to a fixed interval, typically ranging from 0 to 1.
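The preprocessing described above can be sketched as follows. This is an illustrative outline rather than the study's code: fitting the scaling bounds on the training portion only and the `look_back` windowing step are assumptions, and all function names are hypothetical.

```python
def temporal_split(series, n_train):
    """Split by time order: the first n_train points train, the rest test."""
    return series[:n_train], series[n_train:]


def fit_min_max(train):
    """Return the (min, max) bounds used for scaling, taken from the training data."""
    lo, hi = min(train), max(train)
    if hi == lo:
        raise ValueError("constant series cannot be min-max scaled")
    return lo, hi


def scale(values, lo, hi):
    """Map values so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]


def make_windows(values, look_back):
    """Assumed supervised framing: look_back past values predict the next value."""
    inputs = [values[i : i + look_back] for i in range(len(values) - look_back)]
    targets = [values[i + look_back] for i in range(len(values) - look_back)]
    return inputs, targets
```

Because the bounds come from the training set in this sketch, scaled test values can fall slightly outside the interval from 0 to 1 when the test period contains new extremes.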

The model construction began with the creation of a sequential model, which involves a linear stacking of multiple network layers. Within this architecture, we incorporated an LSTM layer, consisting of 50 neurons in the hidden layer. To prevent overfitting during model training, we introduced a dropout layer. This layer randomly sets input units to zero, with the dropout rate adjustable between 0 and 1. This was implemented using the Keras API (application programming interface) of the TensorFlow platform.

To optimize the model's performance, we adopted a hyperparameter tuning strategy using grid search and cross-validation methods, specifically utilizing the GridSearchCV method from the sklearn library. Table 1 below provides a summary of the individual hyperparameter settings for the LSTM models, encompassing batch size, epoch and optimizer. The best performing model configuration was determined based on the parameters’ performance, and an optimal prediction was obtained with no external forcing covariates. Therefore, this model is not an autoregressive model.

Table 1 Settings for individual hyperparameters in LSTM models

In our efforts to refine the training procedure and minimize the loss function E(X), we employed the Adam and Adadelta optimization algorithms. These algorithms were instrumental in ensuring the model’s rapid convergence and accurate learning during the model training process. \(E(X)\) is derived from internal parameters, including weights \((W)\) and bias (\(b\)), and is used to compute the deviation between the actual and predicted values within the test dataset. Consequently, optimizing these weights \((W)\) and bias \((b)\) is crucial for effectively training the neural network model [33].

The Adam algorithm, also known as the adaptive moment estimation method, automatically calculates an adaptive learning rate \((\eta )\) for each parameter. It achieves this by concurrently maintaining an exponentially decaying average of past squared gradients (the second moment) and an exponentially decaying average of past gradients (the first moment, \(M(t)\)).
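As a rough illustration of this update rule, the sketch below applies one Adam step to a list of scalar parameters. It follows the commonly published form of the algorithm with bias correction; the default values shown are the usual textbook ones and are not taken from the study, which relied on the Keras implementation.

```python
import math


def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count, m and v are the running moments."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g       # decaying mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g   # decaying mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)            # bias correction for early steps
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `lr` regardless of the gradient's scale.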

To address the problem of a continually decaying learning rate during the parameter search process, we also employed the Adadelta optimizer. Rather than accumulating all past squared gradients, this optimizer restricts the accumulation to a window of fixed size, implemented as a moving average at time \(t\) computed from the previous average and the current squared gradient. In addition, the Adadelta optimizer effectively resolves the gradient vanishing problem, contributing to improved training stability and convergence.
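The moving-average idea behind Adadelta can be sketched in the same style. This follows the commonly published update rule; the decay rate `rho` and the constant `eps` are typical illustrative values, not settings reported in the study.

```python
import math


def adadelta_step(params, grads, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One Adadelta update using decaying averages of squared gradients and updates."""
    new_params, new_sq_grad, new_sq_delta = [], [], []
    for p, g, eg, ed in zip(params, grads, avg_sq_grad, avg_sq_delta):
        eg = rho * eg + (1 - rho) * g * g                      # moving average of g^2
        delta = -math.sqrt(ed + eps) / math.sqrt(eg + eps) * g  # scaled update
        ed = rho * ed + (1 - rho) * delta * delta              # moving average of delta^2
        new_params.append(p + delta)
        new_sq_grad.append(eg)
        new_sq_delta.append(ed)
    return new_params, new_sq_grad, new_sq_delta
```

Note that no global learning rate appears in this form: the step size is set by the ratio of the two running averages.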

Model evaluation metrics

To gauge the accuracy of our model predictions, we employed five classical assessment metrics: Root Mean Square Error (RMSE, equation S1), Mean Absolute Error (MAE, equation S2), Mean Square Error (MSE, equation S3), the coefficient of determination (R2-score, equation S4), and Mean Absolute Percentage Error (MAPE, equation S5) [34].

These metrics were used to assess how well the model's predictions aligned with the actual data. Here is a brief description of each metric. First, RMSE is the square root of the MSE and can be read as the standard deviation of the model's residuals, indicating how closely the observed data cluster around the predicted values. Second, MSE calculates the mean squared difference between the actual and predicted values, providing insight into the magnitude of error in the models. Third, MAE represents the average of the absolute errors between the predicted and actual values, offering a different perspective on the errors in the models. Fourth, the R2-score quantifies the proportion of the variance in the dependent variable that can be explained by the independent variables. It evaluates the fit of the model by comparing the squared differences between the actual and predicted values (numerator) with the squared differences between the actual values and their mean (denominator). Lastly, MAPE expresses the average absolute error as a percentage of the actual values and is commonly used to measure the accuracy of time series forecasting. These evaluation metrics collectively provide a comprehensive assessment of the model's predictive accuracy and its ability to capture the underlying patterns in the data.
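For reference, the five metrics can be written in a few lines of Python. Equations S1 to S5 are in the supplementary material and are not reproduced here, so the sketch below assumes the standard textbook definitions; the function names are illustrative.

```python
import math


def mse(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, in the same units as the data."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2_score(actual, predicted):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mape(actual, predicted):
    """Mean absolute percentage error; requires all actual values to be non-zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```

Because MSE squares each error, a single large miss (such as an outbreak peak) dominates it far more than it dominates MAE, which is relevant when comparing the metrics across diseases.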

Model construction procedure

The construction of the decomposition-based model (Fig. 1) involved two key functions: the decomposition function (Fig. 1B) and the LSTM function (Fig. 1D). All input data were sourced from the China Public Health Sciences Data Center. The model construction process commenced with the decomposition function, which separated the incidence data into three primary components, mainly relying on the m-order moving average method, as previously detailed. The decomposition-based model was implemented in Python and R. The process began with data decomposition (Fig. 1B) and culminated in the generation of risk prediction sequences (Fig. 1D). While the trend series was split into training and test sets, the study also compared the amplitude and period of the seasonal series. As depicted in Fig. 1B, the seasonal, trend, and random components are illustrated on the left side, while the graph of the three sequences is displayed on the right. Figure 1C shows the comparison of the period and amplitude of the seasonal component, while Fig. 1D delineates the operational steps of the LSTM function. The LSTM function, instrumental in forgetting and retaining input information, introduced an update layer containing forget and remember gates. These gates selectively filtered the input time series, an integral aspect of LSTM's architecture. As a variant of the Recurrent Neural Network (RNN) model, LSTM integrated updated information into the layer while preserving historical data. The information within the layer was updated through the incorporation of an activation function. In this study, incidence data served as the input dataset, and a significant portion of the data was dedicated to the machine learning process.
Only when superior training results were achieved could the validation set data be used for prediction, ultimately yielding the LSTM risk prediction trends as the final output.
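The gating behaviour described in this section can be made concrete with a single forward step of a standard LSTM cell. The sketch below uses the usual textbook equations with forget, input ("remember") and output gates; it is not the Keras layer used in the study, and the parameter layout (a dictionary of weight matrices and biases per gate) is a hypothetical choice for readability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _affine(W, U, b, x, h):
    """Compute W @ x + U @ h + b for plain Python lists."""
    return [
        sum(w * xi for w, xi in zip(w_row, x)) + sum(u * hi for u, hi in zip(u_row, h)) + b_i
        for w_row, u_row, b_i in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One time step: params maps each gate name to a (W, U, b) tuple."""
    f = [sigmoid(z) for z in _affine(*params["forget"], x, h_prev)]       # what to discard
    i = [sigmoid(z) for z in _affine(*params["input"], x, h_prev)]        # what to store
    o = [sigmoid(z) for z in _affine(*params["output"], x, h_prev)]       # what to expose
    g = [math.tanh(z) for z in _affine(*params["candidate"], x, h_prev)]  # new content
    # Cell state: keep a gated share of the old memory and add gated new content.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c
```

A forget gate value near 1 carries the previous cell state forward almost unchanged, which is how the cell preserves historical information across many time steps.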

Fig. 1

Flowchart of LSTM-based risk prediction modeling. (A) The raw data, sourced from the Scientific Data Center of the National Ministry of Health of China, were processed in Python and R; (B) The seasonal, trend, and random components are shown on the left, while the graph of the three sequences is presented on the right; (C) This stage involves a comparison of the seasonal component's period and amplitude; (D) This section illustrates the operational process of the LSTM function

Results

Epidemiological profile

From 2004 to 2019, there were multiple cases of schistosomiasis, echinococcosis, and leptospirosis every year in China (Table S1 and Fig. 2). These three diseases reached their highest incidence rates in different periods, with schistosomiasis peaking in 2015 ~ 2016, echinococcosis in 2017 ~ 2018, and leptospirosis in 2005 ~ 2006, respectively. Conversely, the lowest incidence rates were recorded in 2004 and 2018 ~ 2019. The annual incidence rates for these diseases were as follows: schistosomiasis had a rate of 0.365, echinococcosis had a rate of 0.228, and leptospirosis had a rate of 0.044. Schistosomiasis exhibited a maximum of 34,143 cases in 2015 and a minimum of 113 cases in 2019, with the remaining incidence cases falling between 1,186 and 5,699. Echinococcosis recorded a peak of 5,485 cases in 2017 and a low of 602 cases in 2016, with other incidence cases ranging from 1,287 to 4,777. As for leptospirosis, its incidence primarily fluctuated between 201 and 868 cases, with a peak reaching 1,415. Notably, leptospirosis exhibited the lowest incidence among the three diseases.

Fig. 2

Line plots showing the number of schistosomiasis, echinococcosis, and leptospirosis cases (A) and incidence rates (B) across the nation, 2004 to 2019

Analysis of the mortality profiles of these three zoonotic diseases between 2004 and 2019 (Table S2) reveals a diverse and unpredictable pattern in mortality rates over the 16-year period for each disease. Among them, leptospirosis stood out as the most frequently lethal, with annual deaths ranging from tens down to single digits, except in 2017, when no deaths were recorded. In contrast, schistosomiasis and echinococcosis exhibited uneven temporal distributions of fatalities, with annual deaths remaining in single digits. Schistosomiasis and echinococcosis caused very few deaths in the period 2004 to 2019, while leptospirosis caused as many as 55 deaths, despite fewer recorded cases. Interestingly, all three diseases demonstrated a substantial reduction in mortality rates over time.

It is evident from Fig. 2 that the incidence ratios align closely with the actual data. Schistosomiasis stands out with the highest incidence number. Figure 2 effectively illustrates the severity of all three diseases consistently, highlighting the similarity in incidence between schistosomiasis and echinococcosis. Moreover, it is noteworthy that the incidence rate due to leptospirosis has been steadily decreasing year after year, although the decline is not as pronounced as observed for the other diseases.

Results of time series decomposition

Fundamentally, time series decomposition serves as a noise reduction technique [35]. In our analysis, the actual incidence data for leptospirosis, schistosomiasis, and echinococcosis were decomposed into three primary components: trend, seasonal, and random (residual). This decomposition was conducted with a period parameter frequency set to 4. The trend component represents the continuous and overarching development trend, showing how the sequence evolves over time. In contrast, the seasonal component depicts the regular cyclic changes in periodicity [36]. The random (residual) component captures the random and irregular fluctuations that remain once the original sequence has been decomposed, with the trend and seasonal components removed.

The observed values for the three diseases, as shown in Fig. 3, exhibit significant differences. The trend curves closely resemble the original data but appear smoother, indicating the elimination of noise interference. In particular, the seasonal components display clear regularities in amplitude and period, exhibiting slower fluctuations over time. This indicates a discernible pattern in the occurrence of these diseases. The random components, in turn, capture the irregular signals that remain once the decomposition is complete. These components provide insights into the unpredictable variations within the data.

Fig. 3

Decomposition results of schistosomiasis, echinococcosis, and leptospirosis incidence cases (A, C, E) and incidence rates (B, D, F) at the national level

Through the decomposition method, the original time series data was partitioned into three distinct components, yielding the seasonal component (Fig. 4). Consequently, the seasonal component sequences for incidence numbers (Fig. 4A) and incidence rates (Fig. 4B) were derived and are presented below.

Fig. 4

Seasonal component results of schistosomiasis, echinococcosis, and leptospirosis cases (A) and incidence rates (B) at the national level

In Fig. 4, each cycle comprises four data points, following the specified frequency parameter of 4. The patterns of fluctuations for schistosomiasis, echinococcosis, and leptospirosis are evident. Schistosomiasis exhibited fluctuations starting from a low negative point, reaching a positive peak, and gradually returning to the negative region. Echinococcosis also commenced with a negative point, swiftly hitting its lowest point, and then experienced a sudden peak. Notably, echinococcosis is the only one among the three diseases to display both peaks and troughs. Leptospirosis displayed pronounced troughs, while schistosomiasis primarily featured peaks. All seasonal sequences cycled back and forth in a consistent four-month pattern.

The outbreak number forecasting effects of the LSTM model

The results of the LSTM model predictions for these zoonotic diseases are presented in Fig. 5, which compares predictions based on the actual data (Fig. 5A, C, and E) with predictions based on the trend series (Fig. 5B, D, and F), focusing on incidence numbers. The curves representing the actual unprocessed data exhibit noticeable noise and some bias, whereas the trend curves appear smoother. Each of these zoonotic diseases shows a distinct trend, primarily characterized by the time periods in which peak incidences occur. For example, in schistosomiasis (Fig. 5A and B), the peak incidence is predominantly observed in the latter part of the trend. Conversely, the peak in echinococcosis occurs more sporadically (Fig. 5C and D). The trend curves highlight the cyclical nature of peak incidences in leptospirosis (Fig. 5E and F).

Fig. 5

LSTM prediction results for raw data (A, C, E) and trend sequences (B, D, F) for incidence numbers of schistosomiasis (A, B), echinococcosis (C, D), and leptospirosis (E, F). The red line in the graph represents the trend curve of the raw data, while the blue line represents the predicted results. There are approximately 190 data samples; the first 115 data samples constitute the training set, and the last 77 data samples constitute the test set

The incidence forecasting effects of the LSTM model

The LSTM prediction of time-series incidence (Fig. 6) closely resembles the actual incidence data (Fig. 5). The dynamic trajectory of the incidence rate closely mirrors the trends observed in the incidence numbers displayed in Fig. 5. Furthermore, the trend curve for incidence rates appears smoother, indicating a minor difference between predicted and actual values.

Fig. 6

LSTM prediction results for raw data (A, C, E) and trend sequences (B, D, F) for schistosomiasis (A, B), echinococcosis (C, D), and leptospirosis (E, F) incidence rates. The red line in the graph represents the trend curve of the raw data, while the blue line represents the predicted results. There are approximately 190 data samples; the first 115 data samples constitute the training set, and the last 77 data samples constitute the test set

Model evaluation

To assess the disparities between LSTM-predicted trend sequences and actual data, a comprehensive evaluation employing a range of assessment metrics was conducted, and the results are reported in Tables S3-S5. These metrics encompassed RMSE, MAE, MSE and R2-score. Notably, the raw data exhibited significantly higher levels of error than the trend sequences. When the incidence rate, which provides a more accurate depiction, was used instead of the incidence number, we observed a remarkable reduction in error.

In the LSTM model predictions, schistosomiasis produced considerably more error than the other two diseases (Table S3). However, the error associated with the trend sequences consistently remained lower than that of the original data, for both the incidence rate and the number of cases. Furthermore, while the error values given by the four evaluation metrics (RMSE, MAE, MSE, and R2-score) varied noticeably, the actual data consistently exhibited significantly higher error levels than the trend series. It is worth noting that the MSE loss function itself is highly sensitive to outliers and can produce exceptionally large error values when outliers are present [37, 38]. The trends depicted in Figs. 5 and 6 show that the incidence of schistosomiasis reaches peaks that are substantially higher than its values in other periods. This contributes to the notably high MSE values among the four metrics. Similarly, owing to the reduced error and enhanced predictive accuracy of the trend series, the evaluation results for echinococcosis (Table S4) and leptospirosis (Table S5) followed a parallel pattern. Notably, the MSE results indicated lower prediction error for echinococcosis and leptospirosis, attributable to their relatively steadier trends compared to schistosomiasis. The predicted values and main error results are available in Tables S6-S11.

The percentage of the variance of the original time series predicted by the LSTM hybrid model was assessed in this study (Tables S12-S17). The percentage of variance of the original series accounted for by the three components of the decomposed series ranged from about 70 ~ 90% for the trend series, about 50% for the seasonal series (except for echinococcosis), and about 70% for the random series (except for echinococcosis and schistosomiasis). The incidence numbers and incidence rates showed similar predictive performances (except for schistosomiasis), and the percentage of variance of the original time series predicted by the LSTM model reached about 70%. The percentage of variance in the trend series was slightly above 70%, indicating that the trend series variance accounted for a higher proportion of the total variance.

Discussion

As shown in Table S1 and Fig. 2, schistosomiasis and echinococcosis exhibited varying trends from 2004 to 2019. Schistosomiasis peaked around 2015, followed by a sudden decline. In contrast, echinococcosis displayed irregular increases in incidence, with a gradual decline post-2019. Conversely, leptospirosis showed a consistent decline over this 16-year period. The incidence rates for these three diseases varied between 2004 and 2019, with fatalities accounting for a relatively modest fraction. These trends were further substantiated by the decomposition results displayed in Fig. 3. Around 2015, schistosomiasis incidence reached a peak that was significantly higher than in previous years. Although the number of cases in other years appeared to stabilize, the seasonal components displayed similar patterns (Fig. 4). Tables S3-S5 highlight that the evaluation indicators, especially the amplitude of schistosomiasis, exhibited higher values compared to the other two diseases. The application of LSTM was notably more effective after mitigating some noise through sequence decomposition, underscoring the impact of data contiguity on model fitting. Evaluation metrics, including RMSE, MAE, MSE, and R2-score (Tables S3-S5), indicate that the predictions for echinococcosis and leptospirosis were slightly more accurate than those for schistosomiasis. Because the MSE loss function itself is extremely sensitive to outliers, the MSE error can be quite large in the presence of outliers. The peak case numbers and peak incidence rate for schistosomiasis are significantly larger than the other values for the disease, which can lead to MSE values much higher than the RMSE, MAE, and R2-score values. This shows that evaluation metrics may behave differently from experimental expectations when the data contain extreme values.
For example, the MAPE error of the predicted value of schistosomiasis incidence is greater than the error of the original value. Furthermore, the use of the incidence rate, which represents the percentage of incidence relative to the total population, offers a more precise data description. The comparison between incidence rates and incidence numbers as evaluation indices further reinforces the model’s validity.

LSTM models may require longer training times than other simple neural networks due to the complexity of storing and forgetting historical information, but their fitting effectiveness is closely linked to data completeness and contiguity. Gap-filling interpolation and dynamic curve reconstruction could maintain the continuity and integrity of the data. Researchers involved in disease risk prediction should, therefore, ensure proper organization of disease data to enhance model fitting [39]. Parameter tuning plays a significant role in model construction and requires necessary adjustments according to the onset cycle or characteristics of different diseases. However, uniform parameter control may yield varying results in model fitting, making it an uncontrollable factor. In summary, different datasets may exhibit varying fitness levels. For example, due to different transmission patterns, predictive disease models for humans and animals may differ.

Despite these challenges, simplifying the model’s complexity can greatly enhance disease onset prediction. Zoonotic diseases are influenced by a variety of factors, such as environmental changes, host behavior, and pathogen evolution, resulting in complex temporal patterns. The LSTM model captures these intricate dynamics by learning from historical data sequences, leading to improved prediction accuracy compared to traditional statistical methods. This predictive capability is crucial for public health officials to implement timely interventions and allocate resources to reduce disease outbreaks. The use of trend series modeling in this study also opens up research possibilities for combining with spatial models, rather than focusing on temporal patterns alone. Techniques like decomposition, as integrated in this study, reduce the random nonlinearity of the original data, minimize model complexity, enhance robustness, and improve fitting effectiveness [40, 41]. Although only a simple additive decomposition was used in this study, it reduced model complexity, lowered computational requirements, and enhanced prediction efficiency compared to more complex decomposition models like Singular Spectrum Analysis (SSA) for long data series. Particularly for datasets with limited samples, minimizing model complexity is crucial for assessing model accuracy. Therefore, this study employs LSTM in conjunction with additive decomposition models. As a deep learning model, LSTM has been successfully integrated into veterinary clinical illness diagnosis and risk prediction, offering new avenues for disease prevention [42, 43].

In conclusion, by harnessing the power of deep learning and temporal data analytics, LSTM-based hybrid models can help advance disease prevention strategies and promote global health security for both human and animal populations. Future research should continue to explore innovative methods to further improve disease outbreak risk prediction models and expand their application in predictive epidemiology and public health decision-making.
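The additive decomposition discussed above can be sketched as follows. This is a textbook classical decomposition (centred moving-average trend plus averaged seasonal cycle), written under the assumption that the series follows observed = trend + seasonal + residual; it is not the authors' implementation. The helper `make_windows` and the look-back length are likewise hypothetical and only show how a trend series could be turned into input and target pairs for a sequence model.

```python
def additive_decompose(series, period):
    """Classical additive decomposition: series = trend + seasonal + residual.

    The trend is a centred moving average, so the first and last
    period // 2 points have no trend estimate and are returned as None.
    """
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full cycles of data")
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # 2 x period moving average: the two end points get half weight.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period

    # Average the detrended values at each position of the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period
    cycle = [s - offset for s in raw]  # centre so one cycle sums to zero

    seasonal = [cycle[t % period] for t in range(n)]
    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


def make_windows(values, lookback):
    """Turn a series into (input window, next value) training pairs."""
    return [
        (values[i:i + lookback], values[i + lookback])
        for i in range(len(values) - lookback)
    ]


# Synthetic example: linear trend plus a repeating four-step seasonal pattern.
pattern = [5, -3, -4, 2]
series = [100 + 2 * t + pattern[t % 4] for t in range(16)]
trend, seasonal, residual = additive_decompose(series, period=4)

# Only the trend component is windowed, mirroring its use as model input.
trend_values = [v for v in trend if v is not None]
pairs = make_windows(trend_values, lookback=3)
print(len(pairs), pairs[0])
```

Because the trend is much smoother than the raw series, the windows passed to the model contain less noise, which is consistent with the lower model complexity described in the text.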

Conclusion

In summary, we demonstrate that LSTM, when combined with time series decomposition, delivers more accurate results than conventional LSTM models using the raw data series. This approach allows for more precise targeting of disease trends. Our study has revealed the general incidence patterns of leptospirosis, echinococcosis, and schistosomiasis from 2004 to 2019. Moreover, when applied to two additional zoonotic diseases, our model proved its versatility across different data sources. This highlights the ability of time series prediction models, such as LSTM, to forecast disease incidence and trends across different disease types. These results lay a theoretical foundation for the epidemiological prevention and control of infectious diseases. Future research can further explore this promising area.

Availability of data and materials

The data that support the findings of this study are available from the Scientific Data Center of the National Ministry of Health of China.

Data availability

The incidence time series data used in this study were sourced from the Scientific Data Center of the National Ministry of Health of China (https://www.phsciencedata.cn/Share/).

References

  1. Morse SS, Mazet JA, Woolhouse M, Parrish CR, Carroll D, Karesh WB, et al. Prediction and prevention of the next pandemic zoonosis. Lancet. 2012;380(9857):1956–65.

  2. Safiabadi Tali SH, LeBlanc JJ, Sadiq Z, Oyewunmi OD, Camargo C, Nikpour B, et al. Tools and Techniques for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)/COVID-19 detection. Clin Microbiol Rev. 2021;34(3):e00228-e320.

  3. Myers KP, Olsen CW, Gray GC. Cases of swine influenza in humans: a review of the literature. Clin Infect Dis. 2007;44(8):1084–8.

  4. Chmielewski R, Swayne DE. Avian influenza: public health and food safety concerns. Annu Rev Food Sci T. 2011;2:37–57.

  5. Wilder-Smith A. Dengue during the COVID-19 pandemic. J Travel Med. 2021;28(8):taab183.

  6. Brady OJ, Hay SI. The first local cases of Zika virus in Europe. Lancet. 2019;394:1991–2.

  7. Ganaie SS, Schwarz MM, McMillen CM, Price DA, Feng AX, Albe JR, et al. Lrp1 is a host entry factor for Rift Valley fever virus. Cell. 2021;184:5163–78.

  8. York A. The bodily distribution of monkeypox virus. Nat Rev Microbiol. 2022;20:703.

  9. Jacob ST, Crozier I, Fischer WA II, Hewlett A, Kraft CS, de La Vega MA, et al. Ebola virus disease. Nat Rev Dis Primers. 2020;6:13.

  10. Beard R, Wentz E, Scotch M. A systematic review of spatial decision support systems in public health informatics supporting the identification of high risk areas for zoonotic disease outbreaks. Int J Health Geogr. 2018;17(1):38.

  11. Shi L, Zhang JF, Li W, Yang K. Development of new technologies for risk identification of schistosomiasis transmission in China. Pathogens. 2022;11(2):224.

  12. Li EY, Gurarie D, Lo NC, Zhu X, King CH. Improving public health control of schistosomiasis with a modified WHO strategy: a model-based comparison study. Lancet Glob Health. 2019;7(10):e1414–22.

  13. Budke CM, Deplazes P, Torgerson PR. Global socioeconomic impact of cystic echinococcosis. Emerg Infect Dis. 2006;12(2):296–303.

  14. Torgerson PR, Macpherson CNL. The socioeconomic burden of parasitic zoonoses: global trends. Vet Parasitol. 2011;182(1):79–95.

  15. Rajapakse S. Leptospirosis: clinical aspects. Clin Med. 2022;22(1):14–7.

  16. Costa F, Hagan JE, Calcagno J, Kane M, Torgerson P, Martinez-Silveira MS, et al. Global morbidity and mortality of leptospirosis: A systematic review. Plos Neglect Trop D. 2015;9(9):e0003898.

  17. Wang Y, Zeng L, Yang H, Xu J, Zhang X, Guo X, et al. High prevalence of pathogenic Leptospira in wild and domesticated animals in an endemic area of China. Asian Pac J Trop Med. 2011;4(11):841–5.

  18. Zhang H, Zhang C, Zhu Y, Mehmood K, Liu J, McDonough SP, et al. Leptospirosis trends in China, 2007–2018: A retrospective observational study. Transbound Emerg Dis. 2020;67(3):1119–28.

  19. Dhewantara PW, Mamun AA, Zhang WY, Yin WW, Ding F, Guo D, et al. Epidemiological shift and geographical heterogeneity in the burden of leptospirosis in China. Infect Dis Poverty. 2018;7(3):10–23.

  20. Shuang K, Li R, Gu M, Loo J, Su S. Major-minor long short-term memory for word-level language model. Ieee T Neur Net Lear. 2020;31(10):3932–46.

  21. Zandavi SM, Rashidi TH, Vafaee F. Dynamic hybrid model to forecast the spread of COVID-19 using LSTM and behavioral models under uncertainty. Ieee T Cybernetics. 2022;52(11):11977–89.

  22. Manuel DG, Tuna M, Bennett C, Hennessy D, Rosella L, Sanmartin C, et al. Development and validation of a cardiovascular disease risk-prediction model using population health surveys: the Cardiovascular Disease Population Risk Tool (CVDPoRT). Can Med Assoc J. 2018;190(29):E871–82.

  23. Anwar MY, Lewnard JA, Parikh S, Pitzer VE. Time series analysis of malaria in Afghanistan: using ARIMA models to predict future trends in incidence. Malaria J. 2016;15(1):566.

  24. Lombardo E, Rabe M, Xiong Y, Nierer L, Cusumano D, Placidi L, et al. Evaluation of real-time tumor contour prediction using LSTM networks for MR-guided radiotherapy. Radiother Oncol. 2023;182:109555.

  25. Korsakov I, Gusev A, Kuznetsova T, et al. Deep and machine learning models to improve risk prediction of cardiovascular disease using data extraction from electronic health records. Eur Heart J. 2019;40:1213.

  26. Mehdipour Ghazi M, Nielsen M, Pai A, et al. Training recurrent neural networks robust to incomplete data: Application to Alzheimer’s disease progression modeling. Med Image Anal. 2019;53:39–46.

  27. Donatelli RE, Park JA, Mathews SM, Lee SJ. Time series analysis. Am J Orthod Dentofac. 2022;161(4):605–8.

  28. Rojo J, Rivero R, Romero-Morte J, Fernández-González F, Pérez-Badia R. Modeling pollen time series using seasonal-trend decomposition procedure based on LOESS smoothing. Int J Biometeorol. 2017;61(2):335–48.

  29. Brockwell PJ, Davis RA. Introduction to time series and forecasting. Biometrics. 1998;54(3):1204.

  30. Currie KI, Brailsford G, Nichol S, Gomez A, Sparks R, Lassey KR, et al. Tropospheric 14CO2 at Wellington, New Zealand: the world’s longest record. Biogeochemistry. 2011;104:5–22.

  31. Petropavlovskikh I, Evans R, McConville G, Manney GL, Rieder HE. The influence of the North Atlantic Oscillation and El Niño-Southern Oscillation on mean and extreme values of column ozone over the United States. Atmos Chem Phys. 2015;15:1585–98.

  32. Gu J, Liang L, Song H, Kong Y, Ma R, Hou Y, et al. A method for hand-foot-mouth disease prediction using GeoDetector and LSTM model in Guangxi, China. Sci Rep-UK. 2019;9(1):17928.

  33. Cao J, Zhao D, Tian C, Jin T, Song F. Adopting improved Adam optimizer to train dendritic neuron model for water quality prediction. Math Biosci Eng. 2023;20(5):9489–510.

  34. Khazraei SM, Amiri-Simkooei AR. On the application of monte carlo singular spectrum analysis to GPS position time series. J Geodesy. 2019;93(9):1401–18.

  35. Li S, Sun Y, Han Y, Alfarraj O, Tolba A, Sharma PK. A novel joint Time-Frequency Spectrum resources sustainable risk prediction algorithm based on TFBRL-network for the electromagnetic environment. Sustainability. 2023;15(6):4777.

  36. Enevoldsen J, Simpson GL, Vistisen ST. Using generalized additive models to decompose time series and waveforms, and dissect heart-lung interaction physiology. J Clin Monit Comput. 2023;37(1):165–77.

  37. Tsukiyama S, Hasan MM, Fujii S, Kurata H. LSTM-PHV: prediction of human-virus protein-protein interactions by LSTM with word2vec. Brief Bioinform. 2021;22(6):1–9.

  38. Olsen F, Schillaci C, Ibrahim M, Lipani A. Borough-level COVID-19 forecasting in London using deep learning techniques and a novel MSE-Moran’s I loss function. Results Phys. 2022;35:105374.

  39. Zhou Y, Jia E, Shi H, Liu Z, Sheng Y, Pan M, et al. Prediction of time-series transcriptomic gene expression based on long short-term memory with empirical mode decomposition. Int J Mol Sci. 2022;23(14):7532.

  40. Wang H, Zhang Y, Liang J, Liu L. DAFA-BiLSTM: Deep Autoregression Feature Augmented Bidirectional LSTM network for time series prediction. Neural Netw. 2023;157:240–56.

  41. Green MA. Use of machine learning approaches to compare the contribution of different types of data for predicting an individual’s risk of ill health: an observational study. Lancet. 2018;392(2):40–61.

  42. Ali F, El-Sappagh S, Islam MSR, Kwak D, Ali A, Imran M, et al. A smart healthcare monitoring system for heart disease prediction based on ensemble deep learning and feature fusion. Inform Fusion. 2020;63:208–22.

  43. Nie A, Zehnder A, Page RL, Zhang Y, Pineda AL, Rivas MA, et al. DeepTag: inferring diagnoses from veterinary clinical notes. NPJ Digit Med. 2018;1:60.

Funding

This research is supported by the earmarked fund for Bama County Program for Talents in Science and Technology, Guangxi, China (20210034); Innovation Program of Chinese Academy of Agricultural Sciences and Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences.

Author information

Contributions

GJ and HC conceived, designed, and supervised the study. CC, XZ, JL, XW and ZC contributed to the collection and curation of data. ZH and JZ contributed to curation, interpretation of data, and electronic data capture development. CC did the statistical analysis and wrote the first draft of the paper. ZH and XZ critically revised the manuscript. HC and GJ obtained funding for the study. All authors approved the final version of the manuscript. All authors had full access to all the data in the study and verified the data. HC and GJ had final responsibility for the decision to submit for publication.

Corresponding authors

Correspondence to Hailan Chen or Gengjie Jia.

Ethics declarations

Ethics approval and consent to participate

This research did not require ethical approval because the data were obtained from a public dataset, and all procedures and software were freely available without the need for animal testing.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Chen, C., He, Z., Zhao, J. et al. Zoonotic outbreak risk prediction with long short-term memory models: a case study with schistosomiasis, echinococcosis, and leptospirosis. BMC Infect Dis 24, 1062 (2024). https://doi.org/10.1186/s12879-024-09892-y

  • DOI: https://doi.org/10.1186/s12879-024-09892-y
