Assessing the relationship between malaria incidence levels and meteorological factors using cluster-integrated regression


This paper introduces a novel approach to modeling malaria incidence in Nigeria by integrating clustering strategies with regression modeling and leveraging meteorological data. By decomposing the datasets into multiple subsets using clustering techniques, we increase the number of explanatory variables and elucidate the role of weather in predicting different ranges of incidence data. Our clustering-integrated regression models, accompanied by optimal barriers, provide insights into the complex relationship between malaria incidence and well-established influencing weather factors such as rainfall and temperature.

We explore two models. The first model incorporates lagged incidence and individual-specific effects. The second model focuses solely on weather components. The choice of model depends on the decision-maker’s priorities; the first model is recommended when higher predictive accuracy is the priority. Moreover, our findings reveal significant variability in malaria incidence that is specific to certain geographic clusters and goes beyond what can be explained by observed weather variables alone.

Notably, rainfall and temperature exhibit varying marginal effects across incidence clusters, indicating their differential impact on malaria transmission. High rainfall correlates with lower incidence, possibly due to its role in flushing mosquito breeding sites. Temperature, on the other hand, could not predict high-incidence cases, suggesting that factors other than temperature contribute to high case counts.

Our study addresses the demand for comprehensive modeling of malaria incidence, particularly in regions like Nigeria where the disease remains prevalent. By integrating clustering techniques with regression analysis, we offer a nuanced understanding of how predetermined weather factors influence malaria transmission. This approach aids public health authorities in implementing targeted interventions. Our research underscores the importance of considering local contextual factors in malaria control efforts and highlights the potential of weather-based forecasting for proactive disease management.


Malaria persists as a significant health concern globally, particularly in sub-Saharan Africa, posing a substantial threat to nearly half of the world’s inhabitants [1,2,3]. Annually, it contributes to a minimum of one million fatalities, with more than \(90\%\) of these occurrences transpiring in Africa [4, 5].

Malaria has a significant socioeconomic impact, making it not only a public health concern but also a major contributor to poverty and underdevelopment [6, 7]. Malaria vaccinations are likely insufficient, and despite more than 60 years of research, an effective vaccine that outperforms naturally acquired immunity has yet to be developed [8,9,10,11]. This underscores the importance of malaria research for socioeconomic advancement in Africa and globally. In Nigeria, where malaria transmission is widespread, 97% of the population is at risk. In 2019, Nigeria was one of six countries that accounted for 55% of global malaria cases [12]. Factors such as poor sanitation, overcrowded living conditions, and high population density contribute to the prevalence of malaria in Nigeria [13]. Malaria is actively transmitted across all 36 states of Nigeria [14]. Despite efforts to increase coverage, the proportion of Insecticide-Treated Net (ITN) usage remains low in many endemic regions [15,16,17], and the non-significant negative relationship between malaria transmission and ITN coverage is concerning. This appears to be due to the barriers to ITN or Long-Lasting Insecticidal Net (LLIN) utilization which include heat, adverse reactions to the chemicals, unpleasant odors, and cost [18, 19].

Moreover, meteorological factors play a crucial role in driving malaria transmission, influencing the life cycles of both vectors and parasites [20]. Mosquito breeding, which relies on stagnant water accumulation, is directly affected by precipitation levels. While adequate rainfall can create breeding sites, intense rainfall can also impact mosquito habitats [21,22,23]. Additionally, ambient temperature is another key factor, as temperatures above 20\(^{\circ }\)C are essential for the development of the Plasmodium parasite. This is particularly true for Plasmodium falciparum, the most common malaria parasite in tropical regions [24].

Numerous studies have explored the intricate connection between malaria incidence and various meteorological factors [25,26,27,28,29]. For instance, Gunda et al. [28] conducted an assessment of malaria incidence and its correlation with weather variables in three rural districts in Sub-Saharan Africa (SSA), highlighting a significant association with precipitation and mean temperature at specific lag periods. Similarly, Akinbobola and Omotosho [29] examined the relationship between weather variables and reported malaria cases in two stations from different geopolitical zones in Nigeria. Their study revealed a notable increase in malaria cases associated with changes in weather variables. Specifically, they found that rainfall and humidity had positive associations with the incidence of malaria, while maximum temperature exhibited both inverse and direct relationships, depending on the region under consideration. Further, the study in [30] found that malaria incidence in Nigeria is significantly influenced by environmental factors such as rainfall, temperature, and proximity to water. The research found that higher rates of malaria are associated with increased rainfall and temperatures. The spatio-temporal study identified specific hotspots, facilitating the development of targeted intervention strategies. Additionally, the study indicates that malaria is more prevalent in the northern regions and rural areas compared to the southern and urban regions. Moreover, recent research in Uganda has demonstrated strong associations between malaria incidence and climate variables, indicating a positive correlation with rainfall as well as average temperature [29]. Similarly, a decade-long investigation of regional and temporal patterns of malaria incidence in Mozambique found a higher risk when maximum temperatures exceeded \(28^{\circ }\)C and humidity reached 95% [31]. 
In addition, regions such as Ethiopia and Senegal have shown similar spatial relationships between climatic variability, such as rainfall, and malaria occurrence [32]. These findings underscore the significant influence of climatic factors on malaria prevalence within endemic environments.

Driven by the evident seasonal fluctuations in malaria prevalence, particularly with a notable percentage of cases occurring during wet seasons [33,34,35] (see also Fig. 1b), this study investigates the relationship between panel malaria incidence data from Nigeria and meteorological variables. Our main objective is to discover the incidence ranges that are most efficiently predicted by meteorological factors, in contrast to the traditional approach of predicting weather components based on incidence levels. To achieve this, we develop a clustering-integrated multiple regression model for monthly panel data, incorporating meteorological factors such as rainfall and average temperature. By categorizing incidence data into distinct clusters using a clustering method [36,37,38,39], we improve model fitting and identify clusters where meteorological factors inadequately predict incidence. We vary the clustering barriers to minimize the mean squared error and thereby optimize model performance. This approach addresses challenges posed by limited data availability in developing countries, particularly the lack of data on additional confounding factors, by proposing a clustering strategy to enhance modeling accuracy. Additionally, utilizing panel data from different states across Nigeria enriches our analysis.

Fig. 1 a Monthly malaria density data. b Normalized malaria data. c Average rainfall data. d Malaria cases versus average rainfall pattern

Data and study area

The dataset used in this study comprises monthly reports of malaria cases over five years (2014-2018), obtained from the National Malaria Elimination Programme, Federal Ministry of Health, Nigeria. The data cover all six geo-political zones in Nigeria, with two states representing each zone. This gives a total of twelve states, which collectively represent one-third of Nigeria’s 36 states: Anambra, Ebonyi, Bauchi, Gombe, Bayelsa, Delta, Katsina, Kebbi, Nassarawa, Niger, Ogun, and Ondo. The reported malaria cases are visualized in Fig. 1.

To address the disparity in data magnitude between meteorological and incidence data, which may stem from population variations among regions, we employ normalized density data to ensure interpretability and numerical stability. Population data required for normalization are sourced from the demographic statistics bulletin of Nigeria.

The corresponding mean monthly rainfall and temperature data were obtained for each of the aforementioned states from World Weather Online. Taken together, these states cover areas of both low and high temperature and rainfall. The climate data are plotted in Figs. 2 and 3.

Fig. 2 Rainfall and temperature data for different states

Fig. 3 Rainfall and temperature data for different states

In Nigeria, the climate exhibits significant diversity, ranging from tropical conditions in the south to semi-arid conditions in the far north. Precipitation patterns vary accordingly, with annual rainfall below 500 mm (20 in) in the extreme northeast, increasing to 1,000 to 1,500 mm (40 to 60 in) in the central region, and exceeding 2,000 mm (80 in) in the south, particularly in the far southeast. Temperatures also display notable fluctuations across different climatic zones.

In the northern regions, winters are warm and dry, with daytime temperatures soaring to uncomfortable levels of up to 40\(^{\circ }\)C (104\(^{\circ }\)F), while nights are generally cool. In hilly areas of the north, temperatures can drop to freezing (0\(^{\circ }\)C or 32\(^{\circ }\)F). From February onwards, temperatures rise across inland areas, reaching scorching levels from March to May, with temperatures often surpassing 40\(^{\circ }\)C (104\(^{\circ }\)F) in the center-north.

Conversely, in the southern regions, temperature increases are more moderate due to the proximity to the ocean and the onset of rain showers earlier in the year. Rainfall intensifies and becomes more frequent, gradually spreading northwards until it affects the entire country by June. This information is sourced from the Nigerian Meteorological Agency.

Additionally, analysis of the average monthly rainfall trends in Fig. 1c illustrates that the rainy season typically commences between March and April, although variations occur from year to year and from state to state. Rainfall typically peaks in August, after which the rainy season tapers off from September to October, with the dry season prevailing from November onwards. Notably, Fig. 1d highlights a two-month lag between the peak of rainfall and the peak in malaria cases.


Cross-correlations and autocorrelations

We determine the time delays between malaria incidence and meteorological factors prior to their integration into the modeling process. Spearman’s correlation coefficient is employed for this purpose because it is invariant under monotonic transformations of the data. Using a dataset comprising 60 data points, we fix the last 12 malaria incidence data points and shift 48-month windows of meteorological parameters backward in time for all states. This approach allows us to explore correlations over a full one-year period shift. For each fixed window, correlation coefficients are computed, and the maximum correlation is identified. The time lags associated with these maximal correlation coefficients are summarized in Table 1.

Table 1 Maximum correlation coefficients and associated time lags following Spearman’s cross-correlation analysis between malaria incidence and meteorological variables in 12 selected states in Nigeria
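The lag search described above can be sketched in Python as follows. This is a minimal illustration, not the MATLAB code used in the study. It assumes that the final `window` months of incidence are held fixed, that an equally long weather window is shifted back by 0 to `max_lag` months, and that the signed maximum of the coefficient is selected; the function name `best_lag` and its arguments are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr


def best_lag(incidence, weather, window=48, max_lag=12):
    """Return (lag, rho) for the lag with the largest Spearman coefficient.

    Assumption: the last `window` months of incidence stay fixed while a
    window of the same length is slid backward over the weather series.
    """
    incidence = np.asarray(incidence, dtype=float)
    weather = np.asarray(weather, dtype=float)
    n = len(incidence)
    if len(weather) != n or n < window + max_lag:
        raise ValueError("series must have equal length >= window + max_lag")

    fixed = incidence[n - window:]
    best = (0, -np.inf)
    for lag in range(max_lag + 1):
        # Weather window that ends `lag` months before the end of the series.
        shifted = weather[n - window - lag:n - lag]
        rho, _ = spearmanr(fixed, shifted)
        if rho > best[1]:
            best = (lag, rho)
    return best
```

Replacing `rho > best[1]` with a comparison of absolute values would instead select the strongest association regardless of sign, which may be closer to how negative maxima arise for temperature.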

Furthermore, auto-correlation analysis of malaria cases is conducted by consolidating spatiotemporal data into a single time series, considering the relatively minor variations in normalized data. Lags up to 6 months from the present are selected, resulting in each covariate augmenting the dataset by 12 times 60 observations. Subsequently, Spearman-rank correlation coefficients are computed, as illustrated in Table 2.
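A pooled autocorrelation of this kind might be computed as in the sketch below. It assumes a states-by-months array of normalized cases and, as a simplification that may differ from the study, drops the first `max_lag` months of every state so that all lags are compared on the same observations. The function name is hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr


def pooled_lag_correlations(cases, max_lag=6):
    """Spearman correlation between current and lagged cases, pooled over states.

    `cases` has shape (n_states, n_months).
    """
    cases = np.asarray(cases, dtype=float)
    n_months = cases.shape[1]
    current = cases[:, max_lag:].ravel()
    result = {}
    for lag in range(1, max_lag + 1):
        # Same months as `current`, observed `lag` months earlier.
        lagged = cases[:, max_lag - lag:n_months - lag].ravel()
        rho, _ = spearmanr(current, lagged)
        result[lag] = rho
    return result
```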

Panel regression

As mentioned in the Introduction, this study focuses on modeling malaria incidence in Nigeria using rainfall and temperature data obtained from various states across different periods. The data used in this study is categorized as panel data, encompassing both cross-sectional (across states) and time-series dimensions, as it provides insights into individual behavior over time.

Two models are commonly employed for panel data analysis: the fixed effects model and the random effects model. Consider the multiple linear regression model for individual \(i = 1, \ldots ,N\), observed at various time points \(t = 1, \ldots , T\):

$$\begin{aligned} y_{it} = \alpha + x^{\prime }_{it} \beta + z^{\prime }_i \gamma + c_i + u_{it} \end{aligned}$$

Here, \(y_{it}\) represents the dependent variable, \(x^{\prime }_{it}\) denotes a K-dimensional row vector of time-varying explanatory variables, \(z^{\prime }_i\) signifies an M-dimensional row vector of time-invariant explanatory variables (excluding the constant term), \(\alpha\) stands for the intercept, \(\beta\) represents a K-dimensional column vector of parameters, \(\gamma\) denotes an M-dimensional column vector of parameters, \(c_i\) denotes an individual-specific effect, and \(u_{it}\) signifies the idiosyncratic error term.

The fixed effects model accommodates individual-specific effects (\(c_i\)) that may be correlated with the regressors x. In contrast, the random effects model assumes that these individual-specific effects (\(c_i\)) are distributed independently of the regressors. The choice between the fixed and random effects models is made using the Hausman test [40, 41]. This test evaluates whether a significant difference exists between the fixed and random effects estimators. Specifically, the test statistic is computed solely for the time-varying regressors. If the Hausman test yields an insignificant result, the random effects model is employed. Otherwise, the fixed effects model is preferred [42].
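For illustration, the standard form of the Hausman statistic, \(H=(\hat{\beta }_{FE}-\hat{\beta }_{RE})^{\prime }[V_{FE}-V_{RE}]^{-1}(\hat{\beta }_{FE}-\hat{\beta }_{RE})\), can be evaluated as in the sketch below once both estimators and their covariance matrices are available for the time-varying regressors. The function name and inputs are hypothetical, and this is not the STATA routine used in the study.

```python
import numpy as np
from scipy.stats import chi2


def hausman(b_fe, b_re, cov_fe, cov_re):
    """Return (statistic, degrees of freedom, p-value) of the Hausman test.

    Inputs are the fixed and random effects coefficient vectors for the
    time-varying regressors and their estimated covariance matrices.
    """
    diff = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    cov_diff = np.asarray(cov_fe, dtype=float) - np.asarray(cov_re, dtype=float)
    stat = float(diff @ np.linalg.solve(cov_diff, diff))
    dof = diff.size
    # Under the null hypothesis the statistic is chi-squared with `dof` df.
    p_value = float(chi2.sf(stat, dof))
    return stat, dof, p_value
```

A p-value above the chosen significance level corresponds to the insignificant outcome described above, in which case the random effects model is retained.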

Given the limited availability of incidence data for modeling, utilizing panel data offers specific advantages in uncovering the correlations between malaria incidence and meteorological factors. This stems from several established benefits of employing panel data compared to relying solely on time series or cross-sectional data [40, 43].

One advantage lies in incorporating individual-specific components within the model, which enables addressing heterogeneity across individuals. Integrating this component elucidates correlations among observations over time that are not solely attributable to dynamic trends, thereby mitigating unexplained variability. Additionally, incorporating individual-specific effects helps mitigate the issue of omitted variable bias.

Another advantage, particularly in the time series analysis, is leveraging the available data across individuals to compensate for shorter series lengths, obviating the need for extensive longitudinal data. Consequently, constructing an accurate model becomes feasible by identifying commonalities among individuals.

Conversely, compared to cross-sectional data, panel data’s temporal dimension enhances estimation precision through additional temporal data points.

Incidence-weather relation (without clustering strategy)

Let i and j represent the state and time indices respectively, where \(i\in \{1,\cdots ,S=12\}\) and \(j\in \{1,\cdots ,N\}\). Our strategy for modeling the monthly malaria cases in the 12 chosen Nigerian states involves directly linking collected variables. These variables consist of current (lag-0) reported cases \(C =(c_{ij})\), cases reported in the preceding six months (lag-1, \(\cdots\), lag-6) from the current period \(C_{-1} = (c_{i,j-1}),\cdots ,C_{-6} =(c_{i,j-6})\), lagged monthly rainfall \(R =\mathbbm {1}_{S}\otimes (r_{j-lag(i)})\), and lagged monthly temperature \(T =\mathbbm {1}_{S}\otimes (t_{j-lag(i)})\); where lag(i) corresponds to the cross-correlation outcome in Table 1. The symbols \(\mathbbm {1}_{S}\) and \(\otimes\) denote the column vector of size S with entries being 1, and the Kronecker product between two matrices respectively. The total number of observations is the length of the entire time window minus the maximum autoregressive lag. Let \(\beta _0\) denote the intercept and \(\beta _{\text {ind}}=(\beta _1,\cdots ,\beta _{S-1})\) represent the individual-specific effects (reduced by one term to prevent linear dependence with the intercept). Further, \(\beta _{-i}\) (for \(i=1,\cdots ,6\)) signify the marginal effects of the lagged incidence cases while \(\beta _R\), \(\beta _T\) and \(\varepsilon =(\varepsilon _{ij})\) represent the marginal effect of rainfall, marginal effect of temperature and the idiosyncratic error respectively. The direct relationship among these covariates is represented by:

$$\begin{aligned} C = \beta _0\mathbbm {1}_{S\times N} + \mathbbm {1}_{N}^{\top }\otimes [\beta _{\text {ind}} \,0]^{\top } + \beta _RR+\beta _TT+\varepsilon . \end{aligned}$$

Incidence-weather relation (with clustering strategy)

Clustering is applied to the response data, while the associated explanatory variables are categorized based on the levels of the response data. This technique allows for the selective use of certain explanatory variables to predict a specific response variable, particularly when the number of explanatory variables is limited. The primary objective of clustering is to accurately allocate explanatory variables in scenarios where they may not predict a particular response variable effectively. Therefore, unlike conventional regression approaches, clustering-integrated regression aims to identify the ranges of the response variable that are well predicted by the available explanatory variables. By incorporating additional explanatory variables, this approach can enhance model fitting.

The clustering concept involves categorizing the incidence data into M clusters \((\Omega _{k})_{k=1}^M\) separated by barriers \(\theta :=(\theta _{k})_{k=1}^{M-1}\). In closed form, the clusters are defined as \(\Omega _k=\{c:\max \{0,\theta _{k-1}\}\le c<\min \{\theta _k,\max _{i,j}c_{ij}\}\}\). Let \(\delta _k(C;\theta ):=(\mathbbm {1}_{\Omega _k}(c_{ij}))\), where \(\mathbbm {1}_{\Omega _k}\) denotes the characteristic function of \(\Omega _k\), which takes the value 1 if \(c_{ij}\) belongs to \(\Omega _k\) and 0 otherwise. Define \(R^k=R^k(\theta ):=\delta _k(C;\theta )\circ R\) and \(T^k=T^k(\theta ):=\delta _k(C;\theta )\circ T\), where the Hadamard product \(\circ\) represents element-wise multiplication between matrices. These matrices retain the original entries of \(R\) and \(T\) where the corresponding incidence cases belong to the respective cluster and are 0 elsewhere. This decomposition ensures that \(\sum _kR^k=R\) and \(\sum _kT^k=T\). Incorporating clustering with \(M=3\) clusters, and writing \(R^{(i)}\) and \(T^{(i)}\) for \(R^i\) and \(T^i\), the model without clustering is modified as follows:

$$\begin{aligned} C = \beta _0\mathbbm {1}_{S\times N} + \mathbbm {1}_{N}^{\top }\otimes [\beta _{\text {ind}} \,0]^{\top } + \sum _{i=1}^3 \beta ^i_RR^{(i)}+\sum _{i=1}^3 \beta ^i_TT^{(i)}+\varepsilon . \end{aligned}$$
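The cluster-wise decomposition of a covariate can be illustrated with a short NumPy sketch. It assumes that the incidence and covariate matrices share the same shape and that the barriers are sorted in ascending order. Unlike the closed-form definition, this simplified version assigns the largest incidence value to the top cluster. The function names are hypothetical.

```python
import numpy as np


def cluster_masks(cases, barriers):
    """Return one 0/1 matrix per cluster for ascending `barriers`."""
    # np.digitize gives label k when barriers[k-1] <= c < barriers[k].
    labels = np.digitize(cases, barriers)
    return [(labels == k).astype(float) for k in range(len(barriers) + 1)]


def split_by_cluster(cases, covariate, barriers):
    """Element-wise (Hadamard) product of each cluster mask with the covariate."""
    return [mask * covariate for mask in cluster_masks(cases, barriers)]


# Hypothetical toy data: 2 states, 3 months.
cases = np.array([[0.10, 0.50, 0.90],
                  [0.45, 0.30, 0.70]])
rain = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
parts = split_by_cluster(cases, rain, barriers=[0.4124, 0.6990])
```

Summing the entries of `parts` recovers `rain`, which mirrors the property \(\sum _kR^k=R\).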

In theory, the number of specified clusters is not limited to a small number, as better fitting can be achieved with more explanatory factors. However, concerns about complexity and interpretability may arise when adopting a large number of clusters. For example, if \(R^{(2)}\) is deemed insignificant, it implies that rainfall fails to predict response cases within the range specified by the middle cluster \(\Omega _2\). This approach allows such cases to remain “unexplained by rainfall”.

The pooled estimator \(\hat{\beta }\) changes with the lower and upper barriers \(\theta =(\theta _{\text {l}},\theta _{\text {u}})\), as do \(R^k\) and \(T^k\). Our objective is to determine the optimal barriers such that the squared error between the data \(C=(c_{ij})\) and the model approximation \(C[\hat{\beta }](\theta )\) is minimized. Mathematically, this translates to the optimization problem:

$$\begin{aligned} \min _{\theta }\quad & \sum _{i,j}\,(c_{ij}[\hat{\beta }](\theta )-c_{ij})^2\\ \text {subject to}\quad & \min _{i,j}c_{ij}\le \theta _{\text {l}}\le \theta _{\text {u}}\le \max _{i,j}c_{ij}. \end{aligned}$$

This problem can be solved using optimization techniques such as brute-force or particle swarm optimization methods [44].
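A brute-force version of this search might look like the following sketch. It is a simplification under stated assumptions: the regression contains only an intercept and the cluster-wise rainfall and temperature terms (no lagged incidence or individual-specific effects), it is fitted by ordinary least squares, and the barriers are scanned on a regular grid. All function names are hypothetical.

```python
import numpy as np


def sse_for_barriers(cases, rain, temp, lower, upper):
    """Sum of squared errors of the 3-cluster OLS fit for given barriers."""
    # Label 0, 1 or 2 depending on how many barriers the value reaches.
    labels = ((cases >= lower).astype(int) + (cases >= upper).astype(int)).ravel()
    columns = [np.ones(cases.size)]
    for k in range(3):
        mask = (labels == k).astype(float)
        columns += [mask * rain.ravel(), mask * temp.ravel()]
    X = np.column_stack(columns)
    y = cases.ravel()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ beta
    return float(residual @ residual)


def grid_search_barriers(cases, rain, temp, n_grid=25):
    """Return (sse, lower, upper) minimizing the SSE over a grid with lower <= upper."""
    grid = np.linspace(cases.min(), cases.max(), n_grid)
    best = (np.inf, None, None)
    for i, lower in enumerate(grid):
        for upper in grid[i:]:
            sse = sse_for_barriers(cases, rain, temp, lower, upper)
            if sse < best[0]:
                best = (sse, lower, upper)
    return best
```

The grid contains the degenerate choice in which both barriers equal the minimum incidence, which reproduces the model without clustering, so the returned error can never exceed that of the unclustered fit.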

Data management and statistical analysis

The malaria incidence data, being population-driven, underwent normalization to account for population variations across states. Specifically, the number of cases was standardized per 100,000 inhabitants based on the 2006 population census. This normalization aimed to ensure comparability of malaria incidence rates across states, facilitating appropriate comparison with weather components unaffected by population dynamics. Because the datasets are skewed, and given that the applicability of linear regression necessitates that all datasets are independently and identically distributed (i.i.d.), we further transform the incidence and weather datasets, which also enhances compatibility with dummy or coded variables. The incidence data were transformed using the Johnson SU technique [45]. The transformation is defined as follows:

$$\begin{aligned} T(\textbf{x}) = \text {shift} + \text {grad} \cdot \sinh ^{-1}\left( \frac{\textbf{x}}{\text {div}}\right) \end{aligned}$$


  • \(\text {shift}\) is the shift parameter,

  • \(\text {grad}\) is the gradient parameter,

  • \(\text {div}\) is the divisor parameter,

  • \(\textbf{x}\) represents the data points.

For the incidence data transformation, we use the parameters [shift, grad, div] = [0, 0.1, 0.000003]. For the rainfall and temperature data, we employed a logarithmic transformation defined as follows:

$$\begin{aligned} \text {T}(\textbf{x}) = \log _{10}\left( \text {shift} + \frac{\textbf{x}}{\text {div}}\right) \end{aligned}$$


  • \(\textbf{x}\) represents the data points.

  • \(\text {shift}\) is a shift parameter added to the data to ensure all values are positive before applying the logarithm.

  • \(\text {div}\) is a divisor that scales the data.

We used the parameter values for [shift, div] as [23, 2e-2] and [10, 1] for rainfall and temperature data transformation respectively. The visualisation of the data transformation is presented in Fig. 4.
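The two transformations translate directly into NumPy, as in the sketch below. The parameter values are those stated in the text; the function and constant names are hypothetical.

```python
import numpy as np


def johnson_su(x, shift, grad, div):
    """shift + grad * asinh(x / div), applied element-wise."""
    return shift + grad * np.arcsinh(np.asarray(x, dtype=float) / div)


def shifted_log10(x, shift, div):
    """log10(shift + x / div), applied element-wise."""
    return np.log10(shift + np.asarray(x, dtype=float) / div)


# Parameter values as reported in the text.
INCIDENCE_PARAMS = dict(shift=0.0, grad=0.1, div=0.000003)
RAINFALL_PARAMS = dict(shift=23.0, div=2e-2)
TEMPERATURE_PARAMS = dict(shift=10.0, div=1.0)
```

For example, `shifted_log10(rain, **RAINFALL_PARAMS)` would transform an array of rainfall values named `rain`. Both functions are strictly increasing on the relevant domain, so the ordering of observations is preserved.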

Fig. 4 Distribution of the incidence, rainfall, and temperature: a, c and e show the original data; b, d and f show the transformed data, respectively

The primary statistical analysis, involving panel regression modeling (see Panel regression section), was conducted using STATA software. This analysis focused on exploring the relationship between malaria incidence levels and weather variables while accounting for panel data structure and individual-specific effects. The preliminary analyses, such as cross-correlation and autocorrelation assessments (see Cross-correlations and autocorrelations section), were performed using MATLAB software. These initial analyses helped identify correlations and patterns in the data, providing insights into the relationship between variables before proceeding with more advanced modeling techniques in STATA. Additionally, the PSO clustering technique used for clustering the incidence data was performed in MATLAB.


Incidence-weather cross-correlations and incidence-specific autocorrelation

We observed that the correlations between malaria incidence and rainfall predominantly exhibit initial positive coefficients, which gradually transition to negative coefficients as the lag duration increases (see Fig. 5a-c). Conversely, the correlations between malaria incidence and temperature demonstrate an opposite trend (see Fig. 5d-f). The time lags associated with the highest correlation coefficients between malaria cases and rainfall typically range from 1 to 3 months, with a majority occurring at a 2-month lag (see Table 1). This conforms with previous studies that have found approximately a two-month lag between the peaks of rainfall and malaria incidence [46,47,48]. However, for the correlations between malaria incidence and temperature, the maximal correlation coefficients exhibit a wide variability, ranging from 0 to 7 months, and encompass both negative and positive values (see Table 1). This fluctuating pattern of cross-correlations may be attributed to the monthly data collection frequency, which represents a relatively large time scale for measurement.

Fig. 5 Example cross-correlation results for 3 states: a Gombe incidence-rain cross-correlation. b Delta incidence-rain cross-correlation. c Kebbi incidence-rain cross-correlation. d Gombe incidence-temperature cross-correlation. e Delta incidence-temperature cross-correlation. f Kebbi incidence-temperature cross-correlation

In our study, lag values obtained from the cross-correlation analysis for rainfall are used to construct appropriate variables in the regression models, while those for temperature are excluded due to their unstable nature.

Table 2 Spearman autocorrelation matrix

Table 2 shows the case-specific autocorrelations computed with the Spearman-rank correlation coefficient. This information is useful for understanding the temporal patterns and the potential predictive power of past malaria incidence for current cases. The results in Table 2 suggest a positive correlation between malaria incidence in the current month and the incidence in previous months, with a strength that varies with the lag. The coefficient of 0.63888 at lag 2 suggests a relatively strong positive correlation between malaria incidence in the current month and the incidence two months earlier. The coefficient of 0.34347 at lag 4 suggests a weaker positive correlation with the incidence four months earlier, and the coefficient of 0.19333 at lag 6 indicates a relatively weak positive correlation with the incidence six months earlier.


The model featuring cluster-specific effects yields improved outcomes (a higher \(R^2\) value and a lower RMSE) with arbitrarily chosen cluster barriers than the model without the clustering strategy [49]. Nevertheless, we investigate whether clustering by tertiles is optimal, or whether a different set of cluster barriers achieves a lower mean squared error. To this end, the optimal lower barrier \(b_l\) and upper barrier \(b_u\) (denoted \(\theta _{\text {l}}\) and \(\theta _{\text {u}}\) in the optimization problem above) are sought using particle swarm optimization (PSO) [39, 44]. PSO is a metaheuristic algorithm that does not rely on gradient information and uses a stochastic search to converge toward optimal solutions. The PSO variant employed in this study iteratively updates the players’ positions (barriers) and velocities, with parameters tuned to the following values: self-confidence (1.0), global-best position attraction (1.0), inertia (1.0), and constriction factor (0.3). The algorithm used a relatively large swarm of 100 players to balance exploration and exploitation. Further details on the implementation can be found in [39].

The optimal barrier values are (\(b_l=0.4124,b_u=0.6990\)). The surface plot of the MSE with respect to the barriers \(b_l\) and \(b_u\) is presented in Fig. 6.
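To make the PSO procedure more concrete, the sketch below implements a generic constricted PSO over two barriers using the parameter values quoted above. It is an assumption about how the four parameters combine in the velocity update, not the implementation of [39]. Keeping the barriers ordered by sorting is also an assumption, and `toy_objective` is a hypothetical stand-in for the regression MSE.

```python
import numpy as np


def toy_objective(lower, upper):
    """Hypothetical smooth objective with its minimum at (0.4, 0.7)."""
    return (lower - 0.4) ** 2 + (upper - 0.7) ** 2


def pso_barriers(objective, low, high, n_players=100, n_iter=100,
                 self_conf=1.0, global_attr=1.0, inertia=1.0,
                 constriction=0.3, seed=0):
    """Return (best position, best value) found by a constricted PSO."""
    rng = np.random.default_rng(seed)
    pos = np.sort(rng.uniform(low, high, size=(n_players, 2)), axis=1)
    vel = np.zeros_like(pos)
    best_pos = pos.copy()
    best_val = np.array([objective(l, u) for l, u in pos])
    g = int(np.argmin(best_val))
    g_pos, g_val = best_pos[g].copy(), best_val[g]

    for _ in range(n_iter):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        vel = constriction * (inertia * vel
                              + self_conf * r1 * (best_pos - pos)
                              + global_attr * r2 * (g_pos - pos))
        # Clip to the feasible box and sort so that lower <= upper.
        pos = np.sort(np.clip(pos + vel, low, high), axis=1)
        val = np.array([objective(l, u) for l, u in pos])
        improved = val < best_val
        best_pos[improved] = pos[improved]
        best_val[improved] = val[improved]
        g = int(np.argmin(best_val))
        if best_val[g] < g_val:
            g_pos, g_val = best_pos[g].copy(), best_val[g]
    return g_pos, float(g_val)
```

In an actual application, `objective` would refit the clustering-integrated regression for each candidate pair of barriers and return its mean squared error.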

Fig. 6 Computation of optimal barriers (\(b_l, b_u) \approx (0.4124, 0.6990)\) for the clustering. Black (\(+\)) denotes the optimal barriers determined by the PSO computation. The figures show the evolution of the positions of 100 players (\(\times\)) converging to an optimal solution: a 5th iteration. b 10th iteration. c 20th iteration. d 40th iteration

The malaria data and the meteorological data after clustering with optimal barriers are given in Fig. 7.

Fig. 7 Malaria data and the meteorological data after clustering with optimal barriers

Panel regression models

We investigated variable selection for model specification, considering criteria such as fit, complexity, insignificance, negative marginal effects, and multicollinearity stemming from certain variables. For fit and complexity evaluation, we aimed to minimize the Bayesian Information Criterion (BIC) [43, 50]. The BIC incorporates a likelihood function L and penalizes the number of parameters (k) more heavily compared to the Akaike Information Criterion (AIC) [51], particularly for large observation sizes, by including a term proportional to \(\log (N)\), where N represents the sample size. Our objective was to reduce BIC by eliminating certain variables and addressing issues of insignificance and multicollinearity as well.
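The two criteria can be written out explicitly, using the standard definitions \(\text {AIC}=2k-2\log L\) and \(\text {BIC}=k\log (N)-2\log L\). The sketch below is illustrative only, and the function names are hypothetical.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion for k estimated parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_likelihood
```

Because \(\log (N)>2\) once \(N\ge 8\), the BIC penalty per parameter exceeds the AIC penalty for all but very small samples.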

Significance testing was performed using the standard t-test, while multicollinearity assessment involved computing the Inverse Variance Inflation Factor (1/VIF) for all explanatory variables except the constant term. A 1/VIF value below the threshold of 0.1 indicates multicollinearity associated with the tested variable [52]. Additionally, we monitored the p-value of the F-statistic, which indicates whether the overall set of variables is jointly significant; a p-value smaller than the significance level \(\alpha =0.05\) suggests significance. This approach not only evaluates the model’s adequacy beyond a constant term model but also helps diagnose multicollinearity.
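The 1/VIF diagnostic can be sketched as follows, using the identity \(1/\text {VIF}_j=1-R_j^2\), where \(R_j^2\) comes from regressing the j-th explanatory variable on all the others plus a constant. This is an illustrative NumPy version rather than the STATA command used in the study, and the function name is hypothetical.

```python
import numpy as np


def inverse_vif(X):
    """Return 1/VIF for each column of X (constant column not included in X)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        residual = y - others @ beta
        total = np.sum((y - y.mean()) ** 2)
        # Residual share of the variance equals 1 - R_j^2 = 1/VIF_j.
        out[j] = (residual @ residual) / total
    return out
```

Values below 0.1 would flag the corresponding variable under the threshold quoted above.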

In this study, two different models were examined: the first model (Model 1) featured incidence clustering with optimal barriers, while the second model (Model 2) additionally incorporated lag incidence cases and individual-specific effects. Including individual-specific effects means the model must be specified as either a fixed-effect or a random-effect model. We opted for a random-effect model to account for variability not explicitly addressed by the model variables, a choice also supported by a Durbin-Wu-Hausman test.

The model including lag incidence cases and individual-specific effects exhibited the highest adjusted \(R^2\) value (0.8036) and the lowest BIC value (-2222.5), outperforming the model based solely on incidence clustering, which had \(R^2\) and BIC values of 0.7125 and -1804.9, respectively. Model 2 highlights the significance of individual-specific effects on malaria incidence in five states (Anambra, Ebonyi, Katsina, Nassarawa, and Ogun), indicating meaningful variability in malaria incidence specific to these states beyond what can be explained by the observed explanatory variables (rainfall and temperature).

We then checked whether the marginal effects are consistent with our autocorrelation study. Table 2 shows that cases in the past six months positively predict present cases, with the weakest autocorrelations found for cases from 4 to 6 months earlier. The case-specific autocorrelation supports the model specification in which lag-1 to lag-3 incidence are significant predictors of present incidence, whereas lag-2 incidence was omitted due to negative marginal effects that may have resulted from certain model specifications.

In both models, all marginal effects corresponding to the rainfall and temperature matrices are positive, except for the effect of rainfall in the upper cluster, which is negative. This suggests that higher rainfall is associated with lower incidence in the upper cluster. In Model 2, both rainfall and temperature have the highest marginal effect in the lower cluster and the lowest in the upper cluster. The same pattern holds for the marginal effects of rainfall in Model 1, whereas temperature in Model 1 significantly predicts incidence only in the lower cluster. In neither model could temperature significantly predict cases in the upper cluster.

We also checked that the residuals of the models follow a normal distribution, which is crucial for ensuring the validity, adequacy, and reliability of the associated inferences. Fig. 8 shows that the residuals of both models are approximately normal, with Model 2 conforming slightly better than Model 1. The fit of Model 2 to the data is shown in Fig. 9.

Fig. 8 Residual plots for (a) Model 1 and (b) Model 2

Fig. 9 Fitting result (blue lines) for Model 2


We employed a clustering approach to partition datasets into multiple subsets, not only enriching the explanatory variables but also accurately incorporating the role of weather in predicting specific ranges of incidence data. The clustering-integrated regression models were complemented by optimal barriers. Given that varying the clustering barriers returns different modeling results, we aimed to identify optimal barriers that minimize the mean squared error.
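The idea of choosing barriers that minimize the mean squared error can be sketched as follows. This is a simplified illustration under stated assumptions: two barriers split observations into lower, medium, and upper incidence clusters, each weather variable gets a separate coefficient per cluster, and an exhaustive grid search over incidence quantiles stands in for whatever optimizer the study used (not stated in this section). All function names and the synthetic data are hypothetical.

```python
import itertools
import numpy as np

def clustered_design(y, W, b1, b2):
    """Constant plus weather columns split by incidence cluster (lower, medium, upper)."""
    masks = [y < b1, (y >= b1) & (y < b2), y >= b2]
    blocks = [W * m[:, None] for m in masks]  # zero out rows outside each cluster
    return np.column_stack([np.ones(len(y))] + blocks), masks

def mse_for_barriers(y, W, b1, b2, min_size=5):
    """Fit ordinary least squares on the clustered design and return its MSE."""
    X, masks = clustered_design(y, W, b1, b2)
    if min(int(m.sum()) for m in masks) < min_size:
        return np.inf  # reject barriers that leave a cluster nearly empty
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((y - X @ coef) ** 2))

def search_barriers(y, W, n_grid=15):
    """Grid search over pairs of incidence quantiles; returns (mse, b1, b2)."""
    grid = np.quantile(y, np.linspace(0.1, 0.9, n_grid))  # ascending candidates
    pairs = itertools.combinations(grid, 2)  # each pair satisfies b1 <= b2
    scored = ((mse_for_barriers(y, W, b1, b2), b1, b2) for b1, b2 in pairs)
    return min(scored, key=lambda t: t[0])

# Synthetic data (assumed): incidence driven by two weather variables plus noise.
rng = np.random.default_rng(1)
n = 300
W = rng.uniform(0.0, 1.0, size=(n, 2))  # columns: rainfall, temperature (rescaled)
y = 0.2 + 0.5 * W[:, 0] + 0.3 * W[:, 1] + 0.1 * rng.normal(size=n)

best_mse, b1, b2 = search_barriers(y, W)

# Baseline without clustering, for comparison.
X0 = np.column_stack([np.ones(n), W])
c0, *_ = np.linalg.lstsq(X0, y, rcond=None)
plain_mse = float(np.mean((y - X0 @ c0) ** 2))
print(best_mse, plain_mse, b1, b2)
```

Because the unclustered regression is a special case of the clustered one (equal coefficients in every cluster), the clustered fit can never have a larger in-sample MSE; the barrier search only decides where the split is most informative.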

In this study, insights from cross-correlations and autocorrelations between weather factors (rainfall and temperature) and malaria incidence were utilized to incorporate suitable variables in the regression models.

Two models were considered: clustering-integrated models with and without lagged incidence and individual-specific effects. The selection of the model, along with its implications (marginal effects), hinges on the decision-maker’s priorities. When \(R^2\) and BIC are of paramount importance, we advocate the clustering-integrated model with lagged incidence cases and individual-specific effects. Notably, the significance of certain individual-specific effects suggests substantial variability in malaria incidence outcomes specific to these five cities, beyond the explanatory capacity of the observed variables (rainfall and temperature). Indeed, factors such as mosquito breeding site availability and human behaviors (e.g., healthcare-seeking practices, bed net usage) can influence these effects [53, 54].

In both models, all marginal effects related to the rainfall and temperature matrices are positive, except for rainfall’s effect in the upper cluster, which is negative. This suggests that higher rainfall correlates with lower incidence in the upper cluster, reflecting an intricate and context-dependent relationship. Indeed, heavy rainfall is known to flush out mosquito breeding sites [21, 22]. The clustering-integrated model comprising weather components alone is preferable when weather takes precedence over lagged incidence cases or when data on individual-specific effects are lacking.

In Model 2, both rainfall and temperature exert the highest marginal effect in the lower cluster and the lowest in the upper cluster. This pattern is consistent with Model 1’s marginal effects for rainfall, whereas temperature in Model 1 significantly predicts incidence only in the lower cluster. Notably, temperature fails to significantly predict cases in the upper cluster in either model, which suggests that rainfall can predict incidence consistently throughout the year, while temperature can predict only low-to-medium incidence scenarios. Thus, while temperature and rainfall may influence disease incidence under certain conditions, their predictive power varies with the severity of the outbreak and other contextual factors.

The increasing demand for confounding factors to explain various incidence levels is mitigated by incidence clustering. This approach supports the notion of considering specific hypothetical factors for predicting malaria incidence and conventional regression modeling with limited explanatory variables [38, 39, 55]. The localization of accurately predicted incidence via weather components bears significant implications for public health authorities, not only informing the extent of prediction through marginal effects but also facilitating proactive measures amidst impending weather changes.

Ultimately, the present study highlights the importance of compiling data on additional confounding factors (e.g., other weather components, bed net availability and usage, presence of stagnant water bodies), which not only introduce more explanatory variables but also enhance the reliability of the analysis.

Availability of data and materials

All the data sources have been mentioned. The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.


  1. Guinovart C, Navia MM, Tanner M, Alonso PL. Malaria: burden of disease. Curr Mol Med. 2006;6(2):137–40.

  2. Bonner P, et al. Parasite burden and severity of malaria in Tanzanian children. N Engl J Med. 2014;370(19):1799–808.

  3. Beier JC, Killeen GF, Githure JI. Entomologic inoculation rates and Plasmodium falciparum malaria prevalence in Africa. Am J Trop Med Hyg. 1999;61(1):109–13.

  4. Emmanuel OE, Amzat J. Problems of malaria menace and behavioural intervention for its management in sub-Saharan Africa. J Hum Ecol. 2007;21(2):155–62.

  5. World Health Organization. World malaria report, 2015. Geneva: WHO; 2015.

  6. Sachs J, Malaney P. The economic and social burden of malaria. Nature. 2002;415(6872):680.

  7. Scott N, Ataide R, Wilson DP, Hellard M, Price RN, Simpson JA, Fowkes FJ. Implications of population-level immunity for the emergence of artemisinin-resistant malaria: a mathematical model. Malar J. 2018;17(1):279.

  8. Sherman IW. The elusive malaria vaccine: miracle or mirage? Washington, DC: ASM Press; 2009.

  9. Matuschewski K. Vaccines against malaria still a long way to go. FEBS J. 2017;284(16):2560–8.

  10. El-Moamly AA, El-Sweify MA. Malaria vaccines: the 60-year journey of hope and final success-lessons learned and future prospects. Trop Med Health. 2023;51(1):29.

  11. Stanisic DI, Good MF. Malaria vaccines: progress to date. BioDrugs. 2023;37(6):737–56.

  12. WHO. World malaria report. Geneva; 2020. Accessed 9 Aug 2023.

  13. De Silva PM, Marshall JM. Factors contributing to urban malaria transmission in sub-Saharan Africa: a systematic review. J Trop Med. 2012;2012(1):819563.

  14. Okunlola OA, Oyeyemi OT, Lukman AF. Modeling the relationship between malaria prevalence and insecticide-treated bed net coverage in Nigeria using a Bayesian spatial generalized linear mixed model with a Leroux prior. Epidemiol Health. 2021;43:e2021041.

  15. Govella NJ, Okumu FO, Killeen GF. Insecticide-treated nets can reduce malaria transmission by mosquitoes which feed outdoors. Am J Trop Med Hyg. 2010;82(3):415–9.

  16. Killeen GF, et al. Preventing childhood malaria in Africa by protecting adults from mosquitoes with insecticide-treated nets. PLoS Med. 2007;4(7):e229.

  17. Killeen GF, Smith TA. Exploring the contributions of bed nets, cattle, insecticides and excitorepellency to malaria control: a deterministic model of mosquito host-seeking behaviour and mortality. Trans R Soc Trop Med Hyg. 2007;101(9):867–80.

  18. Konlan KD, Kossi Vivor N, Gegefe I, Hayford L. Factors associated with ownership and utilization of insecticide treated nets among children under five years in sub-Saharan Africa. BMC Public Health. 2022;22(1):940.

  19. Israel OK, Fawole OI, Adebowale AS, Ajayi IO, Yusuf OB, Oladimeji A, Ajumobi O. Caregivers’ knowledge and utilization of long-lasting insecticidal nets among under-five children in Osun State, Southwest, Nigeria. Malar J. 2018;17:1–9.

  20. Arab A, Jackson MC, Kongoli C. Modelling the effects of weather and climate on malaria distributions in West Africa. Malar J. 2014;13(1):126.

  21. Eikenberry SE, Gumel AB. Mathematical modeling of climate change and malaria transmission dynamics: a historical review. J Math Biol. 2018;77(4):857–933.

  22. Chaves LF, et al. Indian Ocean dipole and rainfall drive a Moran effect in East Africa malaria transmission. J Infect Dis. 2012;205(12):1885–91.

  23. Parham PE, Michael E. Modeling the effects of weather and climate change on malaria transmission. Environ Health Perspect. 2009;118(5):620–6.

  24. Kurup R, Deonarine G, Ansari AA. Malaria trend and effect of rainfall and temperature within Regions 7 and 8, Guyana. Int J Mosq Res. 2017;4(6):48–55.

  25. Devi NP, Jauhari RK. Climatic variables and malaria incidence in Dehradun, Uttaranchal, India. J Vector-Borne Dis. 2006;43(1):21.

  26. Evans OP, Adenomon MO. Modeling the prevalence of malaria in Niger State: An application of Poisson regression and negative binomial regression models. Int J Phys Sci. 2014;2:61–8.

  27. Segun OE, Shohaimi S, Nallapan M, Lamidi-Sarumoh AA, Salari N. Statistical Modelling of the Effects of Weather Factors on Malaria Occurrence in Abuja, Nigeria. Int J Environ Res Public Health. 2020;17(10):3474.

  28. Gunda R, Chimbari MJ, Shamu S, Sartorius B, Mukaratirwa S. Malaria incidence trends and their association with climatic variables in rural Gwanda, Zimbabwe, 2005–2015. Malar J. 2017;16(1):1–3.

  29. Akinbobola A, Omotosho JB. Predicting Malaria occurrence in Southwest and North Central Nigeria using Meteorological parameters. Int J Biometeorol. 2013;57:721–8.

  30. Okunlola OA, Oyeyemi OT. Spatio-temporal analysis of association between incidence of malaria and environmental predictors of malaria transmission in Nigeria. Sci Rep. 2019;9(1):17500.

  31. Zacarias OP, Andersson M. Spatial and temporal patterns of malaria incidence in Mozambique. Malar J. 2011;10(1):189.

  32. Alemu A, et al. Climatic variables and malaria transmission dynamics in Jimma town, South West Ethiopia. Parasites Vectors. 2011;4(1):30.

  33. Roca-Feltrer A, Schellenberg JR, Smith L, Carneiro I. A simple method for defining malaria seasonality. Malar J. 2009;8(1):1–4.

  34. Ibrahim OR, Lugga AS, Ibrahim N, Aladesua O, Ibrahim LM, Suleiman BA, Suleiman BM. Impact of climatic variables on childhood severe malaria in a tertiary health facility in northern Nigeria. Sudan J Paediatr. 2021;21(2):173–81.

  35. Samdi LM, Ajayi JA, Oguche S, Ayanlade A. Seasonal variation of malaria parasite density in paediatric population of Northeastern Nigeria. Glob J Health Sci. 2012;4(2):103–9.

  36. West BT, Welch KB, Galecki AT. Linear mixed models: a practical guide using statistical software. Boca Raton: Chapman & Hall/CRC; 2007.

  37. Strand S, Cadwallader C, Firth D. Using statistical regression methods in education research. Southampton: The ReStore team, National Centre for Research Methods; 2011.

  38. Ganegoda NC, et al. Interrelationship between daily COVID-19 cases and average temperature as well as relative humidity in Germany. Sci Rep. 2021;11(1):11302.

  39. Wijaya KP, et al. Learning from panel data of dengue incidence and meteorological factors in Jakarta, Indonesia. Stoch Env Res Risk A. 2021;35:437–56.

  40. Sheytanova T. The accuracy of the Hausman Test in panel data: A Monte Carlo study. Sweden: Örebro University; 2015. Accessed 13 Dec 2023.

  41. Baltagi BH, Liu L. Random effects, fixed effects and Hausman’s test for the generalized mixed regressive spatial autoregressive panel data model. Econ Rev. 2016;35(4):638–58.

  42. Schmidheiny K, Universität Basel. Panel data: fixed and random effects. Short Guides Microeconometrics. 2011;7(1):2–7.

  43. Frees EW. Longitudinal and panel data: analysis and applications in the social sciences. New York: Cambridge University Press; 2004.

  44. Hu X, Eberhart R. Solving constrained nonlinear optimization problems with particle swarm optimization. In: Proceedings of the sixth world multiconference on systemics, cybernetics and informatics. Winter Garden: International Institute of Informatics and Systemics (IIIS); 2002.

  45. Friebel L, Friebelová J. Transformation of an empirical distribution to normal distribution by the use of Johnson system of translation and symmetrical quantile method. Acta Univ Bohemiae Meridionales. 2006;9(1):75–9.

  46. Krefis AC, Schwarz NG, Krüger A, Fobil J, Nkrumah B, Acquah S, Loag W, Sarpong N, Adu-Sarkodie Y, Ranft U, May J. Modeling the relationship between precipitation and malaria incidence in children from a holoendemic area in Ghana. Am J Trop Med Hyg. 2011;84(2):285.

  47. Dlamini SN, Fall IS, Mabaso SD. Bayesian Geostatistical Modeling to Assess Malaria Seasonality and Monthly Incidence Risk in Eswatini. J Epidemiol Global Health. 2022;12(3):340–61.

  48. Briët OJ, Vounatsou P, Gunawardena DM, Galappaththy GN, Amerasinghe PH. Temporal correlation between malaria and rainfall in Sri Lanka. Malar J. 2008;7:1–4.

  49. Senvar O, Sennaroglu B. Comparing performances of clements, box-cox, Johnson methods with weibull distributions for assessing process capability. J Ind Eng Manag. 2016;9(3):634–56.

  50. Raftery AE. Bayesian model selection in social research. Sociol Methodol. 1995;25:111–63.

  51. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Parzen E, Tanabe K, Kitagawa G, editors. Selected papers of Hirotugu Akaike. New York: Springer; 1998.

  52. Mansfield ER, Helms BP. Detecting multicollinearity. Am Stat. 1982;36:158–60.

  53. Singh R, Musa J, Singh S, Ebere UV. Knowledge, attitude and practices on malaria among the rural communities in Aliero, Northern Nigeria. J Fam Med Prim Care. 2014;3(1):39–44.

  54. Fatunla OAT, Olatunya OS, Ogundare EO, Fatunla TO, Babatola AO, Adeniyi AT, Oyelami OA. Malaria prevention practices and malaria prevalence among children living in a rural community in Southwest Nigeria. J Infect Dev Ctries. 2022;16(2):352–61.

  55. Kuhn K, Campbell-Lendrum D, Haines A, Cox J, Corvalán C, Anker M. Using climate to predict infectious disease epidemics. Geneva: World Health Organization; 2005. pp. 16–20.


This work was supported by the Research Council of Finland.

Author information

M.A. drafted the work and performed the computations; K.K.W.H.E. interpreted the data and conducted the preliminary analysis. All authors reviewed earlier drafts and approved the final version.

Corresponding author

Correspondence to Miracle Amadi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors have given their consent for the publication of this manuscript.

Competing interests

The authors declare no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Amadi, M., Erandi, K. Assessing the relationship between malaria incidence levels and meteorological factors using cluster-integrated regression. BMC Infect Dis 24, 664 (2024).
