
A comparison of five epidemiological models for transmission of SARS-CoV-2 in India



Many popular disease transmission models have helped nations respond to the COVID-19 pandemic by informing decisions about pandemic planning, resource allocation, implementation of social distancing measures, lockdowns, and other non-pharmaceutical interventions. We study how five epidemiological models forecast and assess the course of the pandemic in India: a baseline curve-fitting model, an extended SIR (eSIR) model, two extended SEIR (SAPHIRE and SEIR-fansy) models, and a semi-mechanistic Bayesian hierarchical model (ICM).


Using COVID-19 case-recovery-death count data reported in India from March 15 to October 15 to train the models, we generate predictions from each of the five models from October 16 to December 31. To compare prediction accuracy with respect to reported cumulative and active case counts and reported cumulative death counts, we compute the symmetric mean absolute prediction error (SMAPE) for each of the five models. For reported cumulative cases and deaths, we compute Pearson’s and Lin’s correlation coefficients to investigate how well the projected and observed reported counts agree. We also present underreporting factors when available, and comment on uncertainty of projections from each model.


For active case counts, SMAPE values are 35.14% (SEIR-fansy) and 37.96% (eSIR). For cumulative case counts, SMAPE values are 6.89% (baseline), 6.59% (eSIR), 2.25% (SAPHIRE) and 2.29% (SEIR-fansy). For cumulative death counts, the SMAPE values are 4.74% (SEIR-fansy), 8.94% (eSIR) and 0.77% (ICM). Three models (SAPHIRE, SEIR-fansy and ICM) return total (sum of reported and unreported) cumulative case counts as well. We compute underreporting factors as of October 31 and note that, for cumulative cases, the SEIR-fansy model yields an underreporting factor of 7.25 and the ICM model yields 4.54. For total (sum of reported and unreported) cumulative deaths, the SEIR-fansy model reports an underreporting factor of 2.97. On October 31, we observe 8.18 million cumulative reported cases, while the projections (in millions) are 8.71 (95% credible interval: 8.63–8.80) from the baseline model, 8.35 (7.19–9.60) from eSIR, 8.17 (7.90–8.52) from SAPHIRE and 8.51 (8.18–8.85) from SEIR-fansy. Cumulative case projections from the eSIR model have the highest uncertainty in terms of width of 95% credible intervals, followed by those from SAPHIRE, the baseline model and finally SEIR-fansy.


In this comparative paper, we describe five different models used to study the transmission dynamics of the SARS-CoV-2 virus in India. While simulation studies are the gold standard for comparing the accuracy of models, here we were uniquely poised to compare the projected case counts against observed data over a test period. The largest variability across models is observed in predicting the “total” number of infections, including reported and unreported cases (on which we have no validation data). The degree of under-reporting has been a major concern in India and is characterized in this report. Overall, the SEIR-fansy model appeared to be a good choice, offering a publicly available R package together with the desired flexibility and accuracy.



Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1]. At the time of revising this paper (March 24, 2021), roughly 124 million cases have been reported worldwide. The disease was first identified in Wuhan, Hubei Province, China in December 2019 [2]. Since then, more than 2.74 million lives have been lost as a direct consequence of the disease. Notable outbreaks were recorded in the United States of America, Brazil and India, which remains a crucial battleground against the outbreak. The Indian government imposed very strict lockdown measures early in the course of the pandemic in order to reduce the spread of the virus. These measures have not been as effective as intended [3]: India now reports the largest number of confirmed cases in Asia and the third highest number of confirmed cases in the world after the United States and Brazil [4], with the number of confirmed cases crossing the 10 million mark on December 18, 2020. On March 24, 2020, the Government of India ordered a 21-day nationwide lockdown, later extending it until May 3. This was followed by two-week extensions starting May 3 and 17 with substantial relaxations. From June 1, the government started ‘unlocking’ most regions of the country in five unlock phases. In order to formulate and implement policy geared toward containment and mitigation, it is important to recognize the presence of highly variable contagion patterns across different Indian states [5]. India saw a decay in the virus curve in September 2020, with the daily number of cases going below 10,000. At the time of revising the paper, the daily incidence curve is sharply rising again, as India faces its second wave. There is rising interest in studying the potential trajectories that the infection can take in India in order to improve policy decisions.

A spectrum of models for projecting infectious disease spread has become widely popular in the wake of the pandemic. Some popular models include the ones developed at the Institute for Health Metrics and Evaluation (IHME) [6] (University of Washington, Seattle) and at Imperial College London [7]. The IHME COVID-19 project initially relied on an extendable nonlinear mixed effects model for fitting parametrized curves to COVID-19 data, before moving to a compartmental model to analyze the pandemic and generate projections. The Imperial College model (henceforth referred to as ICM) works backwards from observed death counts to estimate transmission that occurred several weeks earlier, allowing for the time lag between infection and death. It introduces a Bayesian mechanistic model that links the infection cycle to observed deaths and infers the total population infected (attack rates) as well as the time-varying reproduction number R(t). With the onset of the pandemic, there has been renewed interest in multi-compartment models, which have played a central role in modeling infectious disease dynamics since the twentieth century [8]. The simplest compartmental models include the standard SIR [9] model, which has been extended [10] to incorporate various types of time-varying quarantine protocols, including government-level macro isolation policies and community-level micro inspection measures. Further extensions include one which adds a spatial component to this temporal model by making use of a cellular automata structure [11]. Larger compartmental models include those which incorporate different states of transition between susceptible, exposed, infected and removed (SEIR) compartments, which were used in the early days of the pandemic in Wuhan, China [12]. The SEIR compartmental model has been further extended to the SAPHIRE model [13], which accounts for the infectiousness of asymptomatic [14] and pre-symptomatic [15] individuals in the population (both of which are crucial transmission features of COVID-19), time-varying ascertainment rates, transmission rates and population movement.

Researchers and policymakers are relying on these models to plan and implement public health policies at the national and local levels. New models are emerging rapidly. Models often have conflicting messages, and it is hard to distinguish a good model from an unreliable one. Different models operate under different assumptions and provide different deliverables. In light of this, it is important to investigate and compare the findings of various models on a given test dataset. While some work has been done to reconcile results from different models of disease transmission that can be fit to emerging data [16], more comparisons are needed to investigate how differences between competing models might lead to differing projections on the same dataset. In the context of India, such head-to-head comparisons across models are largely unavailable.

We consider five models of different genres, starting from the simplest baseline model. The baseline model we investigate relies on curve-fitting methods, with the cumulative number of infected cases modeled as an exponential process [17]. Next, we consider the extended SIR (eSIR) model [10], which uses a Bayesian hierarchical model to generate projections of the proportions of infected and removed people at future time points. The SAPHIRE [13] model has been demonstrated to reconstruct the full-spectrum dynamics of COVID-19 in Wuhan between January and March 2020 across five periods defined by events and interventions. Using this, we study the evolution of the pandemic in India over nine well-defined lockdown and unlock periods, each with distinct transmission and ascertainment features. Another model, SEIR-fansy [18], modifies the SEIR model to account for the high false negative rate and symptom-based administration of COVID-19 tests. Finally, we study the ICM model, which utilizes a semi-mechanistic Bayesian hierarchical model based on renewal equations that model infections as a latent process and links deaths to infections with the help of survival analysis. Each of the models mentioned above has had appreciable success in analyzing and projecting the trajectory of the pandemic in different countries [19,20,21].

In order to fairly compare and contrast the models mentioned above, we study their respective treatment of the different lockdown and unlock periods declared by the Government of India. Additionally, we compare their projections based on reported data, with special emphasis on how the models deal with (if they do, at all) under-reporting and under-detection of COVID-cases, which has been a major point of discussion in the scientific community, particularly for India [22]. We also compare the uncertainty associated with the projections across the models which is often overlooked in the literature.

The rest of the paper is organized as follows. In Section 2 we provide an overview of the various models considered in our analysis. The supplement has detailed discussion on the formulation, assumptions and estimation methods utilized by each of the models. We present the numerical findings of our comparative investigation of the models in Section 3 by comparing projected COVID-counts (i.e., case and death counts associated with COVID-19) and (wherever possible) parameter estimates which help understand transmission dynamics of the pandemic. Next, in Section 4 we discuss sensitivity analyses and note applications of the models studied in the context of data from countries other than India. Finally, we discuss the implications of our findings in Section 5.


Overview of models

In this section, we discuss the assumptions and formulation of each of the five classes of models described above. Table 1 provides an overview of the models compared in this article.

Table 1 Overview of models studied

Baseline model


The baseline model we investigate aims to predict the evolution of the COVID-19 pandemic by means of a regression-based predictive model [17]. More specifically, the model relies on a least-squares regression analysis of the daily cumulative count of infected cases. In particular, the growth rate of the infection is modeled as an exponentially decaying process. Figure 1 provides a schematic overview of this model.

Fig. 1

Schematic overview of the baseline model


The baseline model assumes that the following simple differential equation governs the evolution of a disease in a fixed population:

$$ \frac{dI(t)}{dt}=\uplambda I(t) $$

where I(t) is defined as the number of infected people at time t and λ is the growth rate of infection. Unlike the other models described in subsequent sections, the baseline model analyses and projects only the cumulative number of infections, and not counts/proportions associated with other compartments like deaths and recoveries. The model uses reported field data of the infections in India over a specific time period. The growth rate can be numerically approximated from Eq. (1) above as

$$ \hat{\uplambda_{\mathrm{t}}}=\frac{I_t-{I}_{t-2}}{2\cdotp {I}_t} $$

Having estimated the growth rate, the model uses a least-squares method to fit an exponential time-varying curve to \( \hat{\uplambda_t} \), obtained from Eq. (2) above. Since all the other methods involve Bayesian estimation and use posterior distributions to obtain estimates and associated credible intervals, we place a non-informative prior on the random error in the above curve-fitting method [27] to ensure comparable results. Specifically, we consider a uniform prior for the log of the error variance. Using projected values of \( \hat{\lambda_t} \), we extrapolate the number of infections which will occur in the future. The baseline model described above has been implemented in R [28] using standard packages for exponential curve fitting.
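As a rough illustration of the two steps above, the following Python sketch estimates the growth rate from a cumulative series with the finite difference of Eq. (2) and then fits an exponential curve to it by ordinary least squares on the log scale. The functional form a·exp(−k·t), the log-linear fit, the forward update I(t+1) = I(t)·(1 + λ(t)), the synthetic data and all function names are assumptions made for illustration. The prior on the error variance is omitted, and this is not the R implementation used in the paper.

```python
import math

def growth_rates(cumulative):
    """Finite-difference estimate (I_t - I_{t-2}) / (2 * I_t) for t >= 2."""
    return [
        (cumulative[t] - cumulative[t - 2]) / (2.0 * cumulative[t])
        for t in range(2, len(cumulative))
    ]

def fit_exponential(times, rates):
    """Least-squares fit of log(rate) = log(a) - k * t; returns (a, k)."""
    logs = [math.log(r) for r in rates]
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(logs) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return math.exp(intercept), -slope

def project(last_count, last_day, a, k, horizon):
    """Extrapolate cumulative counts with I_{t+1} = I_t * (1 + lambda_t)."""
    counts, current = [], last_count
    for day in range(last_day, last_day + horizon):
        current *= 1.0 + a * math.exp(-k * day)
        counts.append(current)
    return counts

# Synthetic cumulative counts (not real data) with a decaying daily growth rate.
series = [100.0]
for day in range(60):
    series.append(series[-1] * (1.0 + 0.2 * math.exp(-0.03 * day)))

rates = growth_rates(series)
days = list(range(2, len(series)))
a_hat, k_hat = fit_exponential(days, rates)
forecast = project(series[-1], len(series) - 1, a_hat, k_hat, horizon=14)
```

Fitting on the log scale keeps the sketch dependency-free. A nonlinear least-squares fit on the original scale would weight the observations differently and may be closer to what a standard curve-fitting package does.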

Extended SIR (eSIR) model


We use an extension of the standard susceptible-infected-removed (SIR) compartmental model known as the extended SIR (eSIR) model [10]. To implement the eSIR model, a Bayesian hierarchical framework is used to model time series data on the proportion of individuals in the infected and removed compartments. Markov chain Monte Carlo (MCMC) methods are used to implement this model, which provides not only posterior estimation of parameters and prevalence values associated with all three compartments of the SIR model, but also predicted proportions of the infected and the removed people at future time points. Figure 2 is a diagrammatic representation of the eSIR model.

Fig. 2

The eSIR model with a latent SIR model on the unobserved proportions. Reproduced from Wang et al., 2020 [10]


The eSIR model assumes that the true underlying probabilities of the three compartments follow a latent Markov transition process, and it requires the observed daily proportions of infected and removed cases as input.

The observed proportions of infected and removed cases on day t are denoted by \( {Y}_t^I \) and \( {Y}_t^R \), respectively. Further, we denote the true underlying probabilities of the S, I, and R compartments on day t by \( {\theta}_t^S \), \( {\theta}_t^I \), and \( {\theta}_t^R \), respectively, and assume that for any t, \( {\theta}_t^S+{\theta}_t^I+{\theta}_t^R=1 \). Assuming a usual SIR model on the true proportions we have the following set of differential equations

$$ \frac{d{\theta}_t^S}{dt}=-\beta {\theta}_t^S{\theta}_t^I, $$
$$ \frac{d{\theta}_t^I}{dt}=\beta {\theta}_t^S{\theta}_t^I-\gamma {\theta}_t^I, $$
$$ \frac{d{\theta}_t^R}{dt}=\gamma {\theta}_t^I, $$

where β > 0 denotes the disease transmission rate, and γ > 0 denotes the removal rate. The basic reproduction number R0 = β/γ indicates the expected number of cases generated by one infected case in the absence of any intervention and assuming that the whole population is susceptible. We assume a Beta-Dirichlet state space model for the observed infected and removed proportions, which are conditionally independently distributed as

$$ {Y}_t^I\mid {\boldsymbol{\theta}}_{\boldsymbol{t}},\boldsymbol{\tau} \sim Beta\left({\lambda}^I{\theta}_t^I,{\lambda}^I\left(1-{\theta}_t^I\right)\right) $$
$$ {Y}_t^R\mid {\boldsymbol{\theta}}_{\boldsymbol{t}},\boldsymbol{\tau} \sim Beta\left({\lambda}^R{\theta}_t^R,{\lambda}^R\left(1-{\theta}_t^R\right)\right). $$

Further, the Markov process associated with the latent proportions is built as:

$$ {\boldsymbol{\theta}}_{\boldsymbol{t}}\mid {\boldsymbol{\theta}}_{\boldsymbol{t}-\mathbf{1}},\boldsymbol{\tau} \sim \mathrm{D} irichlet\left(\kappa f\left({\boldsymbol{\theta}}_{\boldsymbol{t}-\mathbf{1}},\beta, \gamma \right)\right) $$

where θt denotes the vector of the underlying population probabilities of the three compartments, whose mean is modeled as an unknown function of the probability vector from the previous time point, along with the transition parameters. \( \boldsymbol{\tau} =\left(\beta, \gamma, {\boldsymbol{\theta}}_{\mathbf{0}}^T,\boldsymbol{\lambda}, \kappa \right) \) denotes the whole set of parameters where λI, λR and κ are parameters controlling variability of the observation and latent process, respectively. The function f(·) is then solved as the mean transition probability determined by the SIR dynamic system, using a fourth order Runge-Kutta approximation [29].
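The mean function f(·) can be sketched as a single fourth-order Runge-Kutta step of the SIR system on proportions. The step size of one day, the parameter values and the function names below are illustrative assumptions, not settings from the paper.

```python
def sir_derivative(theta, beta, gamma):
    """Right-hand side of the SIR equations on (theta_S, theta_I, theta_R)."""
    s, i, _ = theta
    return (-beta * s * i, beta * s * i - gamma * i, gamma * i)

def rk4_step(theta, beta, gamma, h=1.0):
    """One fourth-order Runge-Kutta step of size h (in days)."""
    def shifted(base, slope, scale):
        return tuple(b + scale * k for b, k in zip(base, slope))

    k1 = sir_derivative(theta, beta, gamma)
    k2 = sir_derivative(shifted(theta, k1, h / 2), beta, gamma)
    k3 = sir_derivative(shifted(theta, k2, h / 2), beta, gamma)
    k4 = sir_derivative(shifted(theta, k3, h), beta, gamma)
    return tuple(
        x + h / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(theta, k1, k2, k3, k4)
    )

theta0 = (0.999, 0.001, 0.0)     # hypothetical starting proportions
beta, gamma = 0.3, 0.1           # hypothetical rates, so R0 = beta / gamma = 3
theta1 = rk4_step(theta0, beta, gamma)
```

Because the three derivatives sum to zero, the step preserves the constraint that the proportions add up to one, which is what allows the result to serve as the mean of a Dirichlet distribution.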

Priors and MCMC algorithm

The prior on the initial vector of latent probabilities is set as \( {\boldsymbol{\theta}}_{\mathbf{0}}\sim \mathrm{Dirichlet}\left(1-{Y}_1^I-{Y}_1^R,{Y}_1^I,{Y}_1^R\right) \), \( {\theta}_0^S=1-{\theta}_0^I-{\theta}_0^R \). The prior distribution of the basic reproduction number is lognormal such that E(R0) = 3.28 [30] (this value was also confirmed by calculating the average time-varying R(t) from January 30 to March 24, 2020, using the package developed by [31]). The prior distribution of the removal rate is also lognormal such that E(γ) = 0.5436. We use the proportion of death within the removed compartment as 0.0184 so that the initial infection fatality ratio is 0.01 [32]. For the variability parameters, the default choice is to set large variances in both observed and latent processes, which may be adjusted over the course of the epidemic as more data become available: \( \kappa, {\lambda}^I,{\lambda}^R\ \overset{iid}{\sim }\ \mathrm{Gamma}\left(2,{10}^{-4}\right). \)

Denoting t0 as the last date of data availability, and assuming that the forecast spans over the period [t0 + 1, T], the eSIR algorithm is as follows.

  • Step 0. Take M draws from the posterior \( \left[{\boldsymbol{\theta}}_{\mathbf{1}:{\boldsymbol{t}}_{\mathbf{0}}},\boldsymbol{\tau} |{\boldsymbol{Y}}_{\mathbf{1}:{\boldsymbol{t}}_{\mathbf{0}}}\right] \).

  • Step 1. For each solution path m ∈ {1, …, M}, iterate between the following two steps via MCMC.

    i. Draw \( {\boldsymbol{\theta}}_{\boldsymbol{t}}^{\left(\boldsymbol{m}\right)} \) from \( \left[\left.{\boldsymbol{\theta}}_{\boldsymbol{t}}\right|{\boldsymbol{\theta}}_{t-1}^{\left(m-1\right)},{\boldsymbol{\tau}}^{(m)}\right],t\in \left\{{t}_0+1,\dots, T\right\} \).

    ii. Draw \( {\boldsymbol{Y}}_{\boldsymbol{t}}^{\left(\boldsymbol{m}\right)} \) from \( \left[\left.{\boldsymbol{Y}}_{\boldsymbol{t}}\right|{\boldsymbol{\theta}}_t^{(m)},{\boldsymbol{\tau}}^{(m)}\right],t\in \left\{{t}_0+1,\dots, T\right\} \).
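The forecasting steps can be sketched in Python as a forward simulation, given a set of posterior draws. This is a minimal illustration that uses only the standard library, not the rjags implementation. The posterior draws are hypothetical stand-ins generated on the spot, all function and key names are invented for the example, and each path is assumed to be propagated forward from its own previous latent state.

```python
import random

def sir_mean(theta, beta, gamma, h=1.0):
    """f(theta, beta, gamma): one RK4 step of the SIR system on proportions."""
    def deriv(state):
        s, i, _ = state
        return (-beta * s * i, beta * s * i - gamma * i, gamma * i)

    def shift(state, slope, scale):
        return tuple(x + scale * k for x, k in zip(state, slope))

    k1 = deriv(theta)
    k2 = deriv(shift(theta, k1, h / 2))
    k3 = deriv(shift(theta, k2, h / 2))
    k4 = deriv(shift(theta, k3, h))
    return tuple(x + h / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(theta, k1, k2, k3, k4))

def draw_dirichlet(alphas, rng):
    """Dirichlet draw via normalised Gamma variates."""
    gammas = [rng.gammavariate(max(a, 1e-8), 1.0) for a in alphas]
    total = sum(gammas)
    return tuple(g / total for g in gammas)

def clip(p, eps=1e-9):
    """Keep a proportion strictly inside (0, 1) so Beta parameters stay positive."""
    return min(max(p, eps), 1.0 - eps)

def forecast_path(draw, horizon, rng):
    """Steps i and ii for one posterior draw: latent theta_t, then observed Y_t."""
    theta, path = draw["theta_t0"], []
    for _ in range(horizon):
        mean = sir_mean(theta, draw["beta"], draw["gamma"])
        theta = draw_dirichlet([draw["kappa"] * m for m in mean], rng)
        p_i, p_r = clip(theta[1]), clip(theta[2])
        y_i = rng.betavariate(draw["lambda_I"] * p_i, draw["lambda_I"] * (1 - p_i))
        y_r = rng.betavariate(draw["lambda_R"] * p_r, draw["lambda_R"] * (1 - p_r))
        path.append((y_i, y_r))
    return path

rng = random.Random(1)
# Hypothetical stand-ins for M = 200 posterior draws of (theta_{t0}, tau).
posterior_draws = [
    {"theta_t0": (0.95, 0.03, 0.02), "beta": rng.uniform(0.2, 0.3),
     "gamma": rng.uniform(0.08, 0.12), "kappa": 2e4,
     "lambda_I": 2e4, "lambda_R": 2e4}
    for _ in range(200)
]
paths = [forecast_path(d, horizon=30, rng=rng) for d in posterior_draws]

# Approximate predictive median and 95% interval for Y^I on the last forecast day.
final_infected = sorted(p[-1][0] for p in paths)
median = final_infected[len(final_infected) // 2]
lower, upper = final_infected[4], final_infected[194]
```

Summarising the simulated paths by their empirical quantiles is one simple way to obtain the kind of credible intervals reported for the projections.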


We implement the proposed algorithm using the R package rjags [33], and the differential equations are solved via the fourth-order Runge–Kutta approximation. To ensure the quality of the MCMC procedure, we fix the adaptation number (the number of MCMC samples discarded by JAGS in order to tune parameters, which in turn improves the speed or de-correlation of sampling) at \( {10}^4 \), thin the chain by keeping one draw from every 10 random draws to further reduce autocorrelation, and set a burn-in period of \( {10}^5 \) draws under \( 2\times {10}^5 \) iterations for four parallel chains. This implementation provides not only posterior estimation of parameters and prevalence of all three compartments in the SIR model, but also predicted proportions of the infected and the removed people at future time point(s). The R package for implementing this general model for understanding disease dynamics is publicly available at



SAPHIRE model


This model [13] extends the classic SEIR model to estimate COVID-related transmission parameters, in addition to projecting COVID-19 case counts, while accounting for pre-symptomatic infectiousness, time-varying ascertainment rates (i.e. reporting rates), transmission rates and population movements. Figure 3 provides a schematic diagram of the compartments and transitions conceptualized in this model. The model includes seven compartments: susceptible (S), exposed (E), pre-symptomatic infectious (P), reported infectious (I), unreported infectious (A), isolation in hospital (H) and removed (R). Compared with the classic SEIR model, SAPHIRE explicitly models population movement and introduces two additional compartments (A and H) to account for the fact that only reported cases would seek medical care and thus be quarantined by hospitalization. The model described and implemented here relies on the same methodology and arguments as presented by [13]. The only difference is that while the original model analyzed data from China over the period from December 2019 to March 2020 (which constituted the initial days of the pandemic in China), we analyze data from India. Additionally, the original manuscript adjusted the model to account for population movement. Since data on population movement are not consistently available over time and across regions in India, we make no such adjustment. We further note that the SAPHIRE model returns reported and unreported cumulative COVID-case counts, in addition to cumulative counts of the removed compartment. As such, for the purpose of comparisons, the SAPHIRE model is used only to study cumulative COVID-case counts (reported and unreported). The R package for implementing this general model for understanding disease dynamics is publicly available at

Fig. 3

The SAPHIRE model includes seven compartments: susceptible (S), exposed (E), pre-symptomatic infectious (P), reported infectious (I), unreported infectious (A), isolation in hospital (H) and removed (R)


The dynamics of the 7 compartments described above at time t are described by the set of ordinary differential equations

$$ \frac{dS}{dt}=n-\frac{bS\left(\alpha P+\alpha A+I\right)}{N}-\frac{nS}{N}, $$
$$ \frac{dE}{dt}=\frac{bS\left(\alpha P+\upalpha A+I\right)}{N}-\frac{E}{D_e}-\frac{nE}{N}, $$
$$ \frac{dP}{dt}=\frac{E}{D_e}-\frac{P}{D_P}-\frac{nP}{N}, $$
$$ \frac{dA}{dt}=\frac{\left(1-r\right)P}{D_P}-\frac{A}{D_i}-\frac{nA}{N}, $$
$$ \frac{dI}{dt}=\frac{rP}{D_P}-\frac{I}{D_i}-\frac{I}{D_q}, $$
$$ \frac{dH}{dt}=\frac{I}{D_q}-\frac{H}{D_h}, $$
$$ \frac{dR}{dt}=\frac{A+I}{D_i}+\frac{H}{D_h}-\frac{nR}{N}, $$

in which b is the transmission rate for reported cases (defined as the number of individuals that a reported case can infect per day), α is the ratio of the transmission rate of unreported cases to that of reported cases, r is the ascertainment rate, De is the latent period, Dp is the pre-symptomatic infectious period, Di is the symptomatic infectious period, Dq is the duration from illness onset to isolation and Dh is the isolation period in the hospital. Further, we set \( N=1.34\times {10}^9 \) as the population size for India and set n = 0 to indicate no incoming or outgoing travelers.
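The seven equations above translate directly into code. The sketch below integrates them with a simple forward Euler scheme. The Euler scheme, the step size, the values of b and r and the initial state are assumptions chosen purely for illustration, and the durations are the fixed values given under "Initial states and parameter settings".

```python
def saphire_derivatives(state, b, r, alpha=0.55, De=2.9, Dp=2.3, Di=2.9,
                        Dq=7.0, Dh=17.0, N=1.34e9, n=0.0):
    """Right-hand side of the SAPHIRE system for state (S, E, P, A, I, H, R)."""
    S, E, P, A, I, H, R = state
    infection = b * S * (alpha * P + alpha * A + I) / N
    return (
        n - infection - n * S / N,               # dS/dt
        infection - E / De - n * E / N,          # dE/dt
        E / De - P / Dp - n * P / N,             # dP/dt
        (1 - r) * P / Dp - A / Di - n * A / N,   # dA/dt
        r * P / Dp - I / Di - I / Dq,            # dI/dt
        I / Dq - H / Dh,                         # dH/dt
        (A + I) / Di + H / Dh - n * R / N,       # dR/dt
    )

def simulate(state, days, b, r, steps_per_day=10):
    """Forward Euler integration; returns the state at the end of each day."""
    h = 1.0 / steps_per_day
    trajectory = [state]
    for _ in range(days):
        for _ in range(steps_per_day):
            slope = saphire_derivatives(state, b, r)
            state = tuple(x + h * dx for x, dx in zip(state, slope))
        trajectory.append(state)
    return trajectory

# Hypothetical initial state and parameters, not estimates from the paper.
N = 1.34e9
E0, P0, A0, I0, H0, R0_removed = 1000.0, 500.0, 900.0, 100.0, 50.0, 0.0
state0 = (N - (E0 + P0 + A0 + I0 + H0 + R0_removed), E0, P0, A0, I0, H0, R0_removed)
trajectory = simulate(state0, days=30, b=0.5, r=0.1)
```

With n = 0 the derivatives sum to zero, so the total population is conserved along the trajectory, which provides a quick check on any transcription of the equations.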

Under this setup, the reproductive number R (as presented in the original manuscript) may be expressed as

$$ R=\alpha b{\left({D}_{\mathrm{p}}^{-1}+\frac{n}{N}\right)}^{-1}+\left(1-r\right)\alpha b{\left({D}_{\mathrm{i}}^{-1}+\frac{n}{N}\right)}^{-1}+ rb{\left({D}_{\mathrm{i}}^{-1}+{D}_{\mathrm{q}}^{-1}\right)}^{-1}, $$

in which the three terms represent infections contributed by pre-symptomatic individuals, unreported cases and reported cases, respectively. The model adjusts the infectious periods of each type of case by taking isolation of patients who test positive \( \left(\mathrm{by}\ \mathrm{means}\ \mathrm{of}\ {D}_q^{-1}\right) \) into account.
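For concreteness, the expression for R can be evaluated term by term. In the sketch below the values of b and r are hypothetical, the function name is invented for the example, and the remaining defaults are the fixed parameter values stated in this section.

```python
def reproductive_number(b, r, alpha=0.55, Dp=2.3, Di=2.9, Dq=7.0,
                        n=0.0, N=1.34e9):
    """Sum of the pre-symptomatic, unreported and reported contributions to R."""
    presymptomatic = alpha * b / (1.0 / Dp + n / N)
    unreported = (1.0 - r) * alpha * b / (1.0 / Di + n / N)
    reported = r * b / (1.0 / Di + 1.0 / Dq)
    return presymptomatic + unreported + reported

# Hypothetical transmission and ascertainment rates for a single period.
R_example = reproductive_number(b=0.5, r=0.1)
```

Shortening Dq (faster isolation of reported cases) reduces only the third term, which is how the formula reflects the isolation of patients who test positive.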

Initial states and parameter settings

We set α = 0.55, assuming lower transmissibility for unreported cases [34]. Compartment P contains both reported and unreported cases in the pre-symptomatic phase. We set the transmissibility of P to be the same as that of unreported cases, because it has previously been reported that the majority of cases are unreported [34]. We assume an incubation period of 5.2 days and a pre-symptomatic infectious period Dp = 2.3 days [35, 36]. The latent period was De = 2.9 days. Since pre-symptomatic infectiousness was estimated to account for 44% of the total infections from reported cases [35], we set the mean of the total infectious period as (Dp + Di) = Dp/0.44 = 5.2 days, assuming constant infectiousness across the pre-symptomatic and symptomatic phases of reported cases [37]; thus the mean symptomatic infectious period was Di = 2.9 days. We set a long isolation period of Dh = 17 days, based on a study investigating hospitalisation of COVID-19 patients in the state of Karnataka [38]. The duration from the onset of symptoms to isolation was estimated to be Dq = 7 days [23, 39], taken as the median time from onset to confirmed diagnosis. On the basis of the parameter settings above, the initial state of the model is specified on March 15. The initial number of reported symptomatic cases I(0) is specified as the number of reported cases who experienced symptom onset during 12–14 March. The initial ascertainment rate is assumed to be r0 = 0.10 [40], and thus the initial number of unreported cases is \( A(0)={r}_0^{-1}\left(1-{r}_0\right)I(0) \). P1(0) and E1(0) denote the numbers of reported cases in which individuals experienced symptom onset during 15–16 March and 17–19 March, respectively. Then, the initial numbers of exposed and pre-symptomatic individuals are set as \( E(0)={r}_0^{-1}{E}_1(0) \) and \( P(0)={r}_0^{-1}{P}_1(0) \), respectively. The initial number of hospitalized cases H(0) is set as half of the cumulative reported cases on 8 March, since Dq = 7 days and there would be more severe cases among the reported cases in the early phase of the epidemic.
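The arithmetic behind these settings is short enough to spell out. In the sketch below the onset counts are hypothetical placeholders rather than values from the data, and the variable names are invented for the example.

```python
r0 = 0.10                        # initial ascertainment rate
Dp = 2.3                         # pre-symptomatic infectious period (days)
total_infectious = Dp / 0.44     # Dp + Di, roughly 5.2 days
Di = total_infectious - Dp       # roughly 2.9 days

# Hypothetical reported counts, NOT taken from the paper's data.
I0_reported = 40                 # symptom onset during 12-14 March
P1 = 30                          # symptom onset during 15-16 March
E1 = 50                          # symptom onset during 17-19 March
cumulative_8_march = 39          # cumulative reported cases on 8 March

A0 = (1 - r0) / r0 * I0_reported   # unreported infectious
P0 = P1 / r0                       # pre-symptomatic infectious
E0 = E1 / r0                       # exposed
H0 = 0.5 * cumulative_8_march      # isolated in hospital
```

With an ascertainment rate of 0.10, every reported case in I, P or E stands for ten infections in total, which is why the initial latent compartments are an order of magnitude larger than the reported counts.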

Likelihood and MCMC algorithm

Considering the time-varying strength of control measures implemented in India over the trajectory of the pandemic, we chose to break the training period into ten sequential blocks: pre-lockdown (March 15–24), lockdown phases 1, 2, 3, and 4 (March 25 – April 14, April 15 – May 3, May 4–17, and May 18–31 respectively) followed by unlock phases 1, 2, 3, 4 and 5 (June 1–30, July 1–31, August 1–31, September 1–30 and October 1–15 respectively). In other words, the model allows b and r to vary across periods, taking the values bi and ri in the ith period for i = 1, 2, 3, …, 10. The observed number of reported cases in which individuals experience symptom onset on day t, denoted by xt, is assumed to follow a Poisson distribution with rate \( {\uplambda}_t=r{P}_{t-1}{D}_p^{-1} \), with Pt denoting the expected number of pre-symptomatic individuals on day t. The following likelihood equation is used to fit the model using observed data from March 15 (T0) to October 15 (T1).

$$ L\left({b}_1,{b}_2,\dots, {b}_{10},{r}_1,{r}_2,\dots, {r}_{10}\right)=\prod \limits_{t={T}_0}^{T_1}\frac{{\mathrm{e}}^{-{\uplambda}_{\mathrm{t}}}{\lambda}_t^{x_t}}{x_t!}, $$

and the model is used to predict COVID-counts from October 16 to December 31. A non-informative prior of U(0, 2) is used for b1, b2, …, b10. For r1, an informative prior of Beta(10, 90) is used based on the findings of [40]. We reparameterise r2, …, r10 as

$$ \mathrm{logit}\left({r}_i\right)=\mathrm{logit}\left({r}_{i-1}\right)+{\updelta}_i\ \mathrm{for}\ i=2,3,\dots, 10 $$

where logit(t) = log(t/(1 − t)) is the standard logit function. In the MCMC, δi ∼ N(0, 1) for i = 2, 3, …, 10. A burn-in period of 100,000 iterations is fixed, with a total of 200,000 iterations being run.
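A minimal sketch of the two ingredients above, the logit random walk for the ascertainment rates and the Poisson log-likelihood, is given below. The toy onset series, the presymptomatic counts and the increments δ are hypothetical, the function names are invented for the example, and the sum starts on the second day only because the toy series has no P value before its first day.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def ascertainment_rates(r1, deltas):
    """r_i from logit(r_i) = logit(r_{i-1}) + delta_i, starting at r_1."""
    rates = [r1]
    for delta in deltas:
        rates.append(inv_logit(logit(rates[-1]) + delta))
    return rates

def poisson_log_likelihood(onsets, presymptomatic, r_by_day, Dp=2.3):
    """Sum over days t >= 1 of log Poisson(x_t | rate = r_t * P_{t-1} / Dp)."""
    total = 0.0
    for t in range(1, len(onsets)):
        rate = r_by_day[t] * presymptomatic[t - 1] / Dp
        total += onsets[t] * math.log(rate) - rate - math.lgamma(onsets[t] + 1)
    return total

# Ten periods: r_1 = 0.10 and nine hypothetical increments delta_2, ..., delta_10.
rates = ascertainment_rates(0.10, [0.2, -0.1, 0.0, 0.3, 0.1, 0.0, -0.2, 0.1, 0.0])

# Hypothetical three-day toy series; in practice P_t comes from the ODE solution.
onsets = [18, 20, 25]
presymptomatic = [500.0, 600.0, 700.0]
loglik_low_r = poisson_log_likelihood(onsets, presymptomatic, [0.10] * 3)
loglik_high_r = poisson_log_likelihood(onsets, presymptomatic, [0.50] * 3)
```

Working on the logit scale keeps every ri inside (0, 1) regardless of the sampled increments, which is the practical reason for the reparameterisation.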

SEIR-fansy model


One of the problems with applying a standard SIR model in the context of the COVID-19 pandemic is the presence of a long incubation period. As a result, extensions of the SIR model such as the SEIR model are more applicable. In the previous subsection, we saw an extension which includes the ‘pre-symptomatic infectious’ compartment (people who are infected at time t and contributing to the spread of the virus, but do not show any symptoms yet). In the SEIR-fansy model, we use an alternate formulation by defining an ‘untested infectious’ compartment for infected people who are spreading infection but are not tested after the incubation period. This compartment is necessary because a large proportion of infected people are not tested (some of them are asymptomatic or mildly symptomatic, but in a country like India other factors such as access to care and stigma can also prevent someone from getting tested or diagnosed). We assume that after the ‘exposed’ compartment, a person enters either the ‘untested infectious’ compartment or the ‘tested infectious’ compartment. To incorporate the possible effect of misclassifications due to imperfect testing, we include a compartment for false negatives (infected people who are tested but reported as negative). As a result, after being tested, an infected person enters either the ‘false negative’ compartment or the ‘tested positive’ compartment (infected people who are tested and reported to be positive). We keep separate compartments for the recovered and deceased persons coming from the untested and false negative compartments, which are ‘recovered unreported’ and ‘deceased unreported’ respectively. For the ‘tested positive’ compartment, the recovered and the death compartments are denoted by ‘recovered reported’ and ‘deceased reported’ respectively. Thus, we divide the entire population into ten main compartments: S (Susceptible), E (Exposed), T (Tested), U (Untested), P (Tested positive), F (Tested False Negative), RR (Reported Recovered), RU (Unreported Recovered), DR (Reported Deaths) and DU (Unreported Deaths). This model is implemented using the R package SEIRfansy [26].


Like most compartmental models, this model assumes exponential times for the duration of an individual staying in a compartment. For simplicity, we approximate this continuous-time process by a discrete-time modeling process. The main parameters of this model are β (rate of transmission of infection by false negative individuals), αp (scaling factor that measures the rate of spread of infection by patients who test positive for COVID-19 relative to infected patients who return false negative test results), αu (scaling factor for the rate of spread of infection by untested individuals), De (incubation period in days), Dr (mean days till recovery for positive individuals), Dt (mean number of days for the test result to come after a person is tested), μc (death rate due to COVID-19, defined as the inverse of the average number of days from the onset of disease to death due to COVID-19, multiplied by the probability of death of an infected individual due to COVID-19), λ and μ (natural birth and death rates respectively, assumed to be equal for the sake of simplicity), r (probability of being tested for infectious individuals), f (false negative probability of the RT-PCR test), \( {\beta}_1 \) and \( {\beta}_2^{-1} \) (scaling factors for the rate of recovery for undetected and false negative individuals respectively), and \( {\delta}_1 \) and \( {\delta}_2^{-1} \) (scaling factors for the death rate for undetected and false negative individuals respectively). The number of individuals at time point t in each compartment is governed by the system of differential equations given by Eqs. (8a) – (8i). To simplify this model, we assume that testing is instantaneous. In other words, we assume there is no time difference from the onset of the disease after the incubation period to getting test results. This is a reasonable assumption to make, as the time for testing is about 1–2 days, which is much less than the mean duration of stay in the other compartments. Further, once a person shows symptoms of a COVID-19-like illness, they are sent to get tested almost immediately. Figure 4 provides a schematic overview of the model.

Fig. 4

Schematic diagram for the SEIR-fansy model with imperfect testing and misclassification. The model has ten compartments: S (Susceptible), E (Exposed), T (Tested), U (Untested), P (Tested positive), F (Tested False Negative), RR (Reported Recovered), RU (Unreported Recovered), DR (Reported Deaths) and DU (Unreported Deaths). Reproduced from Bhaduri, Kundu et al., 2020 [18]

The following differential equations summarize the transmission dynamics being modeled.

$$ \frac{\partial S}{\partial t}=-\beta \frac{S(t)}{N}\left({\alpha}_PP(t)+{\alpha}_UU(t)+F(t)\right)+\lambda N-\mu S(t), $$
$$ \frac{\partial E}{\partial t}=\beta \frac{\mathrm{S}(t)}{N}\left({\alpha}_PP(t)+{\alpha}_UU(t)+F(t)\right)-\frac{E(t)}{D_e}-\mu E(t), $$
$$ \frac{\partial U}{\partial t}=\left(1-r\right)\frac{E(t)}{D_e}-\frac{U(t)}{\beta_1{D}_r}-{\delta}_1{\mu}_cU(t)-\mu U(t), $$
$$ \frac{\partial P}{\partial t}=\left(1-f\right)r\frac{E(t)}{D_e}-\frac{P(t)}{D_r}-{\mu}_cP(t)-\mu P(t), $$
$$ \frac{\partial F}{\partial t}= fr\frac{E(t)}{D_e}-\frac{\beta_2F(t)}{D_r}-\frac{\mu_cF(t)}{\delta_2}-\mu F(t), $$
$$ \frac{\partial RU}{\partial t}=\frac{U(t)}{\beta_1{D}_r}+\frac{\beta_2F(t)}{D_{\mathrm{r}}}-\mu RU(t), $$
$$ \frac{\partial RR}{\partial t}=\frac{P(t)}{D_r}-\mu RR(t), $$
$$ \frac{\partial DU}{\partial t}={\delta}_1{\mu}_cU(t)+\frac{\mu_cF(t)}{\delta_2}, $$
$$ \frac{\partial DR}{\partial t}={\mu}_cP(t). $$
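
The discrete-time approximation of the system above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the forward-Euler update with a one-day step, the `Params` container, the function names and every numeric value are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Params:
    beta: float     # transmission rate for false negative individuals
    alpha_p: float  # relative infectiousness of tested-positive individuals
    alpha_u: float  # relative infectiousness of untested individuals
    D_e: float      # incubation period (days)
    D_r: float      # mean days to recovery for positives
    mu_c: float     # COVID-19 death rate
    lam: float      # natural birth rate
    mu: float       # natural death rate
    r: float        # probability of being tested
    f: float        # false negative probability of the test
    beta_1: float   # recovery scaling, untested
    beta_2: float   # recovery scaling, false negatives
    delta_1: float  # death-rate scaling, untested
    delta_2: float  # death-rate scaling, false negatives


def step(s, p, N):
    """Advance the nine compartments by one day (forward Euler, dt = 1)."""
    force = p.beta * s["S"] / N * (p.alpha_p * s["P"] + p.alpha_u * s["U"] + s["F"])
    leave_E = s["E"] / p.D_e
    rec_U = s["U"] / (p.beta_1 * p.D_r)
    rec_P = s["P"] / p.D_r
    rec_F = p.beta_2 * s["F"] / p.D_r
    die_U = p.delta_1 * p.mu_c * s["U"]
    die_P = p.mu_c * s["P"]
    die_F = p.mu_c * s["F"] / p.delta_2
    return {
        "S": s["S"] - force + p.lam * N - p.mu * s["S"],
        "E": s["E"] + force - leave_E - p.mu * s["E"],
        "U": s["U"] + (1 - p.r) * leave_E - rec_U - die_U - p.mu * s["U"],
        "P": s["P"] + (1 - p.f) * p.r * leave_E - rec_P - die_P - p.mu * s["P"],
        "F": s["F"] + p.f * p.r * leave_E - rec_F - die_F - p.mu * s["F"],
        "RU": s["RU"] + rec_U + rec_F - p.mu * s["RU"],
        "RR": s["RR"] + rec_P - p.mu * s["RR"],
        "DU": s["DU"] + die_U + die_F,
        "DR": s["DR"] + die_P,
    }


def simulate(initial, p, N, days):
    """Return the list of daily states, starting with the initial state."""
    path = [dict(initial)]
    for _ in range(days):
        path.append(step(path[-1], p, N))
    return path


# Illustrative values only; none of these numbers come from the paper.
N = 1_000_000.0
example_params = Params(beta=0.3, alpha_p=0.5, alpha_u=0.7, D_e=5.1, D_r=17.8,
                        mu_c=0.001, lam=0.0, mu=0.0, r=0.2, f=0.15,
                        beta_1=0.6, beta_2=0.7, delta_1=0.3, delta_2=0.7)
initial_state = {"S": N - 100.0, "E": 100.0, "U": 0.0, "P": 0.0, "F": 0.0,
                 "RU": 0.0, "RR": 0.0, "DU": 0.0, "DR": 0.0}
trajectory = simulate(initial_state, example_params, N, days=30)
print(round(trajectory[-1]["P"], 2))
```

With the natural birth and death rates set to zero, as in this example, every flow out of one compartment enters another, so the total across the nine compartments stays equal to N. That property is a convenient sanity check on any implementation of the recurrences.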

Using the Next Generation Matrix Method [41], we calculate the basic reproduction number

$$ {R}_0=\frac{\beta {S}_0}{\mu {D}_e+1}\left(\frac{\alpha_U\left(1-r\right)}{\frac{1}{\beta_1{D}_r}+{\delta}_1{\mu}_c+\mu }+\frac{\alpha_Pr\left(1-f\right)}{\frac{1}{D_r}+{\mu}_c+\mu }+\frac{rf}{\frac{\beta_2}{D_r}+\frac{\mu_c}{\delta_2}+\mu}\right), $$

where S0 = λ/μ = 1 since we assume that natural birth and death rates are equal within this short period of time. Supplementary Table S1 describes the parameters in greater detail.
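
As a worked illustration, the expression for R0 above can be evaluated directly. The sketch below mirrors the formula term by term; the function name, argument names and the numeric inputs are assumptions made for this example and are not estimates from the paper.

```python
def basic_reproduction_number(beta, alpha_p, alpha_u, D_e, D_r, mu_c, mu,
                              r, f, beta_1, beta_2, delta_1, delta_2, S0=1.0):
    """Evaluate the next-generation-matrix expression for R0 given in the text."""
    exit_U = 1.0 / (beta_1 * D_r) + delta_1 * mu_c + mu   # total exit rate from U
    exit_P = 1.0 / D_r + mu_c + mu                        # total exit rate from P
    exit_F = beta_2 / D_r + mu_c / delta_2 + mu           # total exit rate from F
    contribution = (alpha_u * (1.0 - r) / exit_U
                    + alpha_p * r * (1.0 - f) / exit_P
                    + r * f / exit_F)
    return beta * S0 / (mu * D_e + 1.0) * contribution


# Illustrative inputs only (not values reported in the paper).
print(basic_reproduction_number(beta=0.3, alpha_p=0.5, alpha_u=0.7, D_e=5.1,
                                D_r=17.8, mu_c=0.001, mu=0.0, r=0.2, f=0.15,
                                beta_1=0.6, beta_2=0.7, delta_1=0.3, delta_2=0.7))
```

Each of the three terms in the bracket is the share of exposed individuals entering a compartment (U, P or F), multiplied by that compartment's relative infectiousness and its mean duration of stay. In the limiting case with no testing, no deaths and unit scaling factors, the expression reduces to β·Dr.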

Likelihood assumptions and estimation

Parameters are estimated using Bayesian estimation techniques and MCMC methods (namely, the Metropolis-Hastings method [42] with a Gaussian proposal distribution). First, we approximate the above set of differential equations by a discrete-time system using daily differences. Starting from an initial value for each compartment on day 1, we use the discrete-time recurrence relations to obtain the counts in each compartment on subsequent days. To proceed with the MCMC-based estimation, we specify the likelihood explicitly. We assume that, conditional on the parameters, the number of new confirmed cases on day t depends only on the number of exposed individuals on the previous day. We further use multinomial modeling to incorporate the data on recovered and deceased cases. The joint conditional distribution is

$$ \begin{aligned} P\left[{P}_{new}(t),{RR}_{new}(t),{DR}_{new}(t)\mid E\left(t-1\right),P\left(t-1\right)\right] &= P\left[{P}_{new}(t)\mid E\left(t-1\right),P\left(t-1\right)\right]\cdot P\left[{RR}_{new}(t),{DR}_{new}(t)\mid E\left(t-1\right),P\left(t-1\right)\right] \\ &= P\left[{P}_{new}(t)\mid E\left(t-1\right)\right]\cdot P\left[{RR}_{new}(t),{DR}_{new}(t)\mid P\left(t-1\right)\right]. \end{aligned} $$

A multinomial distribution-like structure is then defined

$$ {P}_{new}(t)\mid E\left(t-1\right)\sim Bin\left(E\left(t-1\right),r\left(1-f\right)/{D}_e\right) $$
$$ {RR}_{new}(t),{DR}_{new}(t)\mid P\left(t-1\right)\sim Mult\left(P\left(t-1\right),\left({D}_r^{-1},{\mu}_c,1-{D}_r^{-1}-{\mu}_c\right)\right) $$

Note: the expected values of E(t − 1) and P(t − 1) are obtained by solving the discrete-time recurrence relations derived from Eqs. (8a) – (8i).
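
The binomial and multinomial structure above translates into a daily log-likelihood contribution. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and rounding the expected values of E(t − 1) and P(t − 1) to integers so that they can serve as counts is an assumption of this example, not a step stated in the text.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    if k < 0 or k > n:
        return -math.inf
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def log_multinom_pmf(counts, probs):
    """Log of the multinomial probability mass for counts with cell probabilities probs."""
    if any(c < 0 for c in counts):
        return -math.inf
    n = sum(counts)
    out = math.lgamma(n + 1)
    for c, q in zip(counts, probs):
        out += c * math.log(q) - math.lgamma(c + 1)
    return out


def daily_log_likelihood(p_new, rr_new, dr_new, E_prev, P_prev, r, f, D_e, D_r, mu_c):
    """Log-likelihood of one day's new positives, recoveries and deaths."""
    n_E = int(round(E_prev))  # assumption: round the expected counts to integers
    n_P = int(round(P_prev))
    ll_cases = log_binom_pmf(p_new, n_E, r * (1.0 - f) / D_e)
    remaining = n_P - rr_new - dr_new  # positives who neither recover nor die
    probs = (1.0 / D_r, mu_c, 1.0 - 1.0 / D_r - mu_c)
    ll_outcomes = log_multinom_pmf((rr_new, dr_new, remaining), probs)
    return ll_cases + ll_outcomes


# Illustrative numbers only.
print(daily_log_likelihood(p_new=12, rr_new=30, dr_new=2, E_prev=1000.4,
                           P_prev=500.2, r=0.2, f=0.15, D_e=5.1, D_r=17.8, mu_c=0.002))
```

Summing this quantity over the days in a period gives the log-likelihood for that period's parameters. Observed counts that exceed the available pool (for example, more new positives than exposed individuals) yield a log-likelihood of negative infinity, so such parameter values would be rejected by a sampler.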

Prior assumptions and MCMC

For the parameter r, we assume a U(0, 1) prior, while for β, we assume an improper non-informative flat prior with the set of positive real numbers as support. After specifying the likelihood and the prior distributions of the parameters, we draw samples from the posterior distribution of the parameters using the Metropolis-Hastings algorithm with a Gaussian proposal distribution. We run the algorithm for 200,000 iterations with a burn-in period of 100,000. Finally, the means of the posterior draws retained after burn-in are taken as the final estimates of β and r for the different time periods. As in the case of the SAPHIRE model, we again break the training period into ten sequential blocks: pre-lockdown (March 15–24), lockdown phases 1, 2, 3, and 4 (March 25 – April 14, April 15 – May 3, May 4–17, and May 18–31 respectively) followed by unlock phases 1, 2, 3, 4 and 5 (June 1–30, July 1–31, August 1–31, September 1–30 and October 1–15 respectively).
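
To make the sampling scheme concrete, here is a minimal random-walk Metropolis-Hastings sketch with a Gaussian proposal, a flat prior on β > 0 and a U(0, 1) prior on r. It is not the authors' code: `toy_log_lik` is a hypothetical stand-in for the model likelihood, the proposal standard deviations are arbitrary, and the run is far shorter than the 200,000 iterations described above so that the example finishes quickly.

```python
import math
import random


def log_prior(beta, r):
    """Flat prior on beta > 0 and U(0, 1) prior on r (log density up to a constant)."""
    if beta <= 0.0 or not (0.0 < r < 1.0):
        return -math.inf
    return 0.0


def toy_log_lik(beta, r):
    """Hypothetical stand-in for the model log-likelihood, peaked at (0.4, 0.2)."""
    return -0.5 * ((beta - 0.4) / 0.05) ** 2 - 0.5 * ((r - 0.2) / 0.03) ** 2


def metropolis_hastings(log_lik, init, step_sd, n_iter, burn_in, seed=0):
    """Random-walk Metropolis-Hastings; returns the draws kept after burn-in."""
    rng = random.Random(seed)
    current = list(init)
    current_lp = log_prior(*current) + log_lik(*current)
    kept = []
    for i in range(n_iter):
        proposal = [x + rng.gauss(0.0, s) for x, s in zip(current, step_sd)]
        prior = log_prior(*proposal)
        if prior > -math.inf:
            proposal_lp = prior + log_lik(*proposal)
            # Symmetric proposal, so the acceptance ratio is the posterior ratio.
            if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
                current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            kept.append(tuple(current))
    return kept


draws = metropolis_hastings(toy_log_lik, init=(0.5, 0.5), step_sd=(0.05, 0.03),
                            n_iter=20_000, burn_in=10_000, seed=1)
beta_hat = sum(d[0] for d in draws) / len(draws)
r_hat = sum(d[1] for d in draws) / len(draws)
print(round(beta_hat, 3), round(r_hat, 3))
```

Proposals that fall outside the prior support are rejected outright, which is how the positivity constraint on β and the unit-interval constraint on r are enforced in this sketch. In the setting described in the text, one such chain would be run for each of the ten time blocks.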

Imperial College London model (ICM)


We examine a Bayesian semi-mechanistic model for estimating the transmission intensity of SARS-CoV-2 [7]. The model defines a renewal equation using the time-varying reproduction number Rt to generate new infections. Because many SARS-CoV-2 infections are asymptomatic and reported case data are unreliable, especially in the early part of the epidemic in India, the model relies on observed death data and works backwards to infer the true number of infections. The latent daily infections are modeled as the product of Rt with a discrete convolution of the previous infections, weighted using an infection-to-transmission distribution specific to SARS-CoV-2. We implement this Bayesian semi-mechanistic model for COVID-19 data from India in order to estimate the reproduction number over time, along with plausible upper and lower bounds (95% Bayesian credible intervals (CrI)) of the daily infections and the daily number of infectious people. We parametrize Rt with a fixed effect and a random effect for each week over the course of the epidemic for each state. The fixed effect accounts for the variations in Rt across India as a whole, whereas the random effect allows for variations among different states. The weekly effects are encoded as a random walk, where at each successive step the effect has an equal chance of moving upwards or downwards from its current value. The model is implemented using epidemia [43], a general purpose R package for semi-mechanistic Bayesian modelling of epidemics. Figure 5 represents a schematic overview of the model.

Fig. 5 Schematic overview of ICM


The true number of infected individuals, i, is modelled using a discrete renewal process. We specify a generation distribution [44] g with density g(τ) as g ∼ Gamma(6.5, 0.62). Given the generation distribution, the number of infections it, m on a given day t in state m is given by the discrete convolution:

$$ {i}_{t,m}={S}_{t,m}{R}_{t,m}\sum \limits_{\uptau =0}^{t-1}{i}_{\uptau, m}{g}_{t-\uptau}, $$
$$ {S}_{t,m}=1-\frac{\sum_{j=0}^{t-1}{i}_{j,m}}{N_m}, $$

where the generation distribution is discretized by \( {g}_s={\int}_{s-0.5}^{s+0.5}g\left(\uptau \right)d\uptau \) for s = 2, 3, …, and \( {g}_1={\int}_0^{1.5}g\left(\uptau \right)d\uptau \). The population of state m is denoted by Nm. We include the adjustment factor St, m to account for the number of susceptible individuals left in the population.
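
The discretization and the renewal recursion above can be sketched as follows. This is an illustrative example, not the epidemia implementation. It assumes that Gamma(6.5, 0.62) is specified by its mean and coefficient of variation (the text does not state the parameterization), approximates the integrals with a simple midpoint rule, and uses hypothetical seed infections, Rt values and population size.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of a Gamma distribution with the given shape and rate."""
    if x <= 0.0:
        return 0.0
    return math.exp(shape * math.log(rate) + (shape - 1.0) * math.log(x)
                    - rate * x - math.lgamma(shape))


def midpoint_integral(func, a, b, n=400):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (k + 0.5) * h) for k in range(n))


def discretize_gamma(mean, cv, max_lag):
    """Return [0, g_1, ..., g_max_lag]; index 0 is an unused placeholder."""
    shape = 1.0 / cv ** 2          # assumption: mean / coefficient-of-variation form
    rate = shape / mean
    pdf = lambda x: gamma_pdf(x, shape, rate)
    g = [0.0, midpoint_integral(pdf, 0.0, 1.5)]
    for s in range(2, max_lag + 1):
        g.append(midpoint_integral(pdf, s - 0.5, s + 0.5))
    return g


def renewal_infections(seed, R, N, g):
    """Apply i_t = S_t * R_t * sum_{tau < t} i_tau * g_{t - tau} after the seeded days."""
    infections = list(seed)
    for t in range(len(seed), len(R)):
        conv = sum(infections[tau] * g[t - tau]
                   for tau in range(t) if t - tau < len(g))
        susceptible = 1.0 - sum(infections) / N   # S_t uses infections before day t
        infections.append(susceptible * R[t] * conv)
    return infections


g = discretize_gamma(mean=6.5, cv=0.62, max_lag=60)
# Hypothetical inputs: six seeded days, constant R_t = 1.5, population of one million.
path = renewal_infections(seed=[10.0] * 6, R=[1.5] * 40, N=1_000_000.0, g=g)
print(round(sum(g), 4), round(path[-1], 2))
```

The discretized weights sum to approximately one when `max_lag` is large enough to cover the tail of the distribution, which is a quick check on the discretization. Entries of `R` for the seeded days are not used by the recursion.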

We define daily deaths, Dt, m, for days t ∈ {1, …, n} and states m ∈ {1, …, M}. These daily deaths are modelled using a positive real-valued function dt, m = E[Dt, m] that represents the expected number of deaths attributed to COVID-19. The daily deaths Dt, m are assumed to follow a negative binomial distribution with mean dt, m and variance \( {d}_{t,m}+{d}_{t,m}^2/{\uppsi}_1 \), where ψ1 follows a positive half-normal distribution, i.e.,

$$ {D}_{t,m}\sim \mathrm{NB}\ \left({d}_{t,m},{d}_{t,m}+{d}_{t,m}^2/{\uppsi}_1\right),\kern1em t=1,\dots, n, $$
$$ {\uppsi}_1\sim {N}^{+}\left(0,5\right). $$

We link our observed deaths mechanistically to transmission [7]. We use a previously estimated COVID-19 infection fatality ratio (IFR, probability of death given infection) of 0.1% [45, 46] together with a distribution of times from infection to death π. To incorporate the uncertainty inherent in this estimate, we modify the IFR for every state to have additional noise around the mean, denoted by \( \mathrm{if}{\mathrm{r}}_{\mathrm{m}}^{\ast } \). Specifically, we assume

$$ \mathrm{if}{\mathrm{r}}_{\mathrm{m}}^{\ast}\sim \mathrm{if}\mathrm{r}\cdotp N\left(1,0.1\right), $$

where \( \mathrm{if}{\mathrm{r}}_{\mathrm{m}}^{\ast } \) represents the noise-added analog of the IFR. Using estimated epidemiological information from previous studies, we assume the distribution of times from infection to death, π, to be the convolution of an infection-to-onset distribution [47] and an onset-to-death distribution [32]:

$$ \uppi \sim \mathrm{Gamma}\left(5.1,0.86\right)+\mathrm{Gamma}\left(17.8,0.45\right). $$

The expected number of deaths dt, m, on a given day t, for state m is given by the following discrete sum

$$ {d}_{t,m}=\mathrm{if}{\mathrm{r}}_{\mathrm{m}}^{\ast}\sum \limits_{\uptau =0}^{t-1}{i}_{\uptau, m}{\uppi}_{t-\uptau}, $$

where iτ, m is the number of new infections on day τ in state m and where, similar to the generation distribution, π is discretized via \( {\uppi}_s={\int}_{s-0.5}^{s+0.5}\uppi \left(\uptau \right)d\uptau \) for s = 2, 3, …, and \( {\uppi}_1={\int}_0^{1.5}\uppi \left(\uptau \right)\mathrm{d}\uptau \), where π(τ) is the density of π.

We parametrize Rt, m with a random effect for each week of the epidemic as follows

$$ {R}_{t,m}={R}_0\cdotp f\left(-{\upepsilon}_{w\left(t,m\right)}-{\upepsilon}_{m,w\left(t,m\right)}^{state}\right), $$

where f(x) = 2 exp(x)/(1 + exp(x)) is twice the inverse logit function, and \( {\epsilon}_{w\left(t,m\right)} \) and \( {\epsilon}_{m,w\left(t,m\right)}^{state} \) follow weekly random walk processes that capture variation in Rt, m between successive weeks. \( {\epsilon}_{w\left(t,m\right)} \) is a fixed effect estimated across all the states and \( {\epsilon}_{m,w\left(t,m\right)}^{state} \) is the random effect specific to each state in India. The prior distribution for R0 [30] was chosen to be

$$ {R}_0\sim N\left(3.28,0.5\right). $$
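
The link between the weekly effects and Rt, m can be illustrated with a short sketch. The function names, the random-walk step size and the number of weeks are assumptions made for this example; only the form f(x) = 2 exp(x)/(1 + exp(x)) and the sign convention come from the text.

```python
import math
import random


def twice_inverse_logit(x):
    """f(x) = 2 exp(x) / (1 + exp(x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 2.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return 2.0 * e / (1.0 + e)


def reproduction_number(R0, eps_national, eps_state):
    """R_{t,m} = R0 * f(-eps_national - eps_state) for the week containing day t."""
    return R0 * twice_inverse_logit(-eps_national - eps_state)


def weekly_random_walk(n_weeks, step_sd, rng):
    """Gaussian random walk starting at zero (step_sd is a hypothetical choice)."""
    values = [0.0]
    for _ in range(n_weeks - 1):
        values.append(values[-1] + rng.gauss(0.0, step_sd))
    return values


rng = random.Random(0)
national = weekly_random_walk(n_weeks=10, step_sd=0.2, rng=rng)
state = weekly_random_walk(n_weeks=10, step_sd=0.1, rng=rng)
weekly_R = [reproduction_number(3.28, a, b) for a, b in zip(national, state)]
print([round(v, 2) for v in weekly_R])
```

Because f takes values strictly between 0 and 2 and equals 1 at zero, Rt, m equals R0 when both effects are zero and is bounded above by 2·R0. Positive effects push Rt, m below R0 because of the minus signs inside f.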

We assume that seeding of new infections begins 30 days before the day after a state has cumulatively observed 10 deaths. From this date, we seed our model with 6 sequential days of an equal number of infections: \( {i}_1=\dots ={i}_6\sim \mathrm{Exponential}\left({\uptau}^{-1}\right) \), where \( \uptau \sim \mathrm{Exponential}(0.03) \). These seed infections are inferred in our Bayesian posterior distribution. Fitting was done with the R package epidemia [43], which uses Stan [48], a probabilistic programming language, with an adaptive Hamiltonian Monte Carlo (HMC) sampler.

Comparing models and evaluating performance

Having established differences in the formulation of the different models, we compare their respective projections and inferences. In order to do so, we use the same data sources [49, 50] for all five models. Well-defined time points are used to denote training (March 15 to October 15) and test (October 16 to December 31) periods.

Using the parameter values specified above along with data from the training period as inputs, we compare the projections of the five models with observed data from the test period. In order to do so, we use the symmetric mean absolute prediction error (SMAPE) and mean squared relative prediction error (MSRPE) metrics as measures of accuracy. Given observed time-varying data \( {\left\{{O}_t\right\}}_{t=1}^T \) and an analogous time-series dataset of projections \( {\left\{{P}_t\right\}}_{t=1}^T \), the SMAPE metric is defined as

$$ SMAPE(T)=\frac{100}{T}\cdotp \sum \limits_{t=1}^{t=T}\frac{\left|{P}_t-{O}_t\right|}{\left(\left|{P}_t\right|+\left|{O}_t\right|\right)/2}, $$

where |x| denotes the absolute value of x. The metric MSRPE is defined as

$$ MSRPE(T)={\left[{T}^{-1}\sum \limits_{t=1}^T{\left(1-\frac{P_t}{O_t}\right)}^2\right]}^{1/2}. $$

With this definition each summand is at most 2, so 0 ≤ SMAPE ≤ 200; smaller values of both MSRPE and SMAPE indicate a more accurate fit. For active reported cases (the cases that are active on a given day, i.e., cumulative reported cases minus cumulative reported recoveries and deaths), we compute and compare the metrics defined above for projections from the eSIR and SEIR-fansy models, as no other model returns relevant projections. For cumulative reported cases we obtain projections from all models apart from ICM (which yields total, i.e., sum of reported and unreported, cumulative cases). For cumulative reported deaths we compare projections from eSIR, SEIR-fansy and ICM, since the baseline and SAPHIRE models do not yield relevant projections. Supplementary Table S2 gives an overview of output from each of the models we consider and Table 2 reports the values of accuracy metrics described above.
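
Both accuracy metrics follow directly from their definitions. The sketch below is a minimal Python rendering; the function names and the toy series are hypothetical and are not data from this study.

```python
import math


def smape(projected, observed):
    """Symmetric mean absolute prediction error, in percent."""
    T = len(observed)
    return 100.0 / T * sum(abs(p - o) / ((abs(p) + abs(o)) / 2.0)
                           for p, o in zip(projected, observed))


def msrpe(projected, observed):
    """Mean squared relative prediction error (root of the mean squared relative error)."""
    T = len(observed)
    return math.sqrt(sum((1.0 - p / o) ** 2 for p, o in zip(projected, observed)) / T)


# Toy cumulative counts in millions, for illustration only.
observed = [8.0, 8.5, 9.0, 9.5]
projected = [8.2, 8.8, 9.4, 10.0]
print(round(smape(projected, observed), 2), round(msrpe(projected, observed), 4))
```

SMAPE normalizes each error by the average magnitude of the projection and the observation, whereas MSRPE normalizes by the observation alone, so MSRPE is undefined when an observed count is zero.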

Table 2 Comparison of estimated time-varying Rt and prediction accuracy of the models under consideration

Further, we compare (when possible) the estimated time-varying reproduction number R(t) over the different lockdown and unlock stages in India. Specifically, for each lockdown stage, we report the median R(t) value along with the associated 95% credible interval (CrI). The values are presented in Table 2.

Since we are interested in comparing relative performances of the models (specifically, their projections), we define another metric – the relative mean squared prediction error (Rel-MSPE). Given time series data on observed cumulative cases (or deaths) \( {\left\{{O}_t\right\}}_{t=1}^T \), projections from a model A \( {\left\{{P}_t^A\right\}}_{t=1}^T \), and projections from some other model B, \( {\left\{{P}_t^B\right\}}_{t=1}^T \), the Rel-MSPE of model B with respect to model A is defined as

$$ Rel- MSPE\left(B:A\right)={\left[\sum \limits_{t=1}^T{\left(\frac{O_t-{P}_t^A}{O_t-{P}_t^B}\right)}^2\right]}^{1/2} $$

Higher values of Rel-MSPE(B:A) indicate better performance of model B over model A. Since the baseline model yields projections of cumulative reported cases, we compute Rel-MSPE for the other models with respect to the baseline model for reported cumulative cases. Projections from ICM represent total (i.e., sum of reported and unreported) cumulative cases and are left out of this comparison of reported counts. For cumulative reported deaths, we compute Rel-MSPE of the SEIR-fansy and ICM models relative to the eSIR model. In addition to comparing the accuracy of fits that arise from the different models, we also investigate if projections from the different models are correlated with observed data. We use the standard Pearson’s correlation coefficient and Lin’s concordance correlation coefficient [51] as summary measures to study said correlation. Higher values of these correlation metrics indicate better concordance of model projections and the observed data from the test period. Rel-MSPE and correlation metrics are presented in Table 3. Since we have projections for total (sum of reported and unreported cases) for active cases from SEIR-fansy, for cumulative cases from SAPHIRE, SEIR-fansy and ICM, and for cumulative deaths from SEIR-fansy, we present the projected totals along with 95% credible intervals and associated underreporting factors on three specific dates – October 31, November 30 and December 31 in Table 4. The table also includes projected cumulative reported counts (which are available from all models under investigation apart from ICM) with 95% credible intervals for the three dates mentioned above.
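
The relative metric and the two correlation measures can be sketched as below. The Rel-MSPE function follows the formula exactly as written above (a sum rather than a mean inside the square root). Lin's concordance correlation coefficient is computed in its usual form, 2·s_xy / (s_x² + s_y² + (x̄ − ȳ)²) with 1/n variances, which is an assumption about the variant used in the paper. All names and numbers are illustrative.

```python
import math


def rel_mspe(observed, proj_a, proj_b):
    """Rel-MSPE(B:A) as written in the text; undefined if model B matches an observation exactly."""
    return math.sqrt(sum(((o - a) / (o - b)) ** 2
                         for o, a, b in zip(observed, proj_a, proj_b)))


def _moments(x, y):
    """Means, 1/n variances and 1/n covariance of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, sxx, syy, sxy


def pearson(x, y):
    """Pearson's correlation coefficient."""
    _, _, sxx, syy, sxy = _moments(x, y)
    return sxy / math.sqrt(sxx * syy)


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient."""
    mx, my, sxx, syy, sxy = _moments(x, y)
    return 2.0 * sxy / (sxx + syy + (mx - my) ** 2)


# Toy series: the projection tracks the trend perfectly but is shifted upwards.
observed = [1.0, 2.0, 3.0]
projected = [2.0, 3.0, 4.0]
print(round(pearson(observed, projected), 3), round(lin_ccc(observed, projected), 3))
```

The toy example shows why both measures are reported: a projection with a constant offset has a Pearson correlation of 1 but a visibly lower concordance coefficient, since Lin's measure also penalizes departures from the 45-degree line.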

Table 3 Comparison of relative performance and correlation with observed data of projections of the models under consideration from October 16 till December 31, 2020
Table 4 Projected counts of reported cumulative cases and total (sum of reported and unreported) counts of cases and deaths (cumulative) from the models under comparison

Data source

The data on confirmed cases, recovered cases and deaths for India and the 20 states of interest are taken from COVID-19 India [49] and the JHU CSSE COVID-19 GitHub repository [50]. In addition to this and other similar articles concerning the spread of this disease in India, we have created an interactive dashboard [52] summarizing COVID-19 data and forecasts for India and its states (generated with the eSIR model discussed in this paper). While the models are trained using data from March 15 to October 15, 2020, their performances are compared by examining their respective projections from October 16 to December 31, 2020.


Estimation of the reproduction number

From Table 2, we compare the mean of the time-varying effective reproduction number R(t) over the four phases of lockdown and the subsequent unlock phases in India. The eSIR model returns a mean value of 2.08 (95% credible interval: 1.41–2.12) over the entire training period. Factoring in different levels of government interventions which modified transmission dynamics during lockdown, we get period-specific estimates that start at 2.12 (1.44–2.16) in lockdown phase 1, drop to 1.48 (1.00–1.51) in lockdown phase 2, and then decline steadily over the subsequent lockdown and unlock phases. The mean values returned by the SAPHIRE model are 2.54 (2.41–2.74) during phase 1 of the lockdown, 1.60 (1.36–2.17) for phase 2, 1.69 (1.46–1.97) for phase 3 and 1.54 (1.29–2.00) for the fourth and final lockdown phase. The estimated values for subsequent unlock phases are quite close to each other, starting from 1.27 (1.19–1.32) in unlock phase 1 and dropping to 1.09 (0.91–1.69) in the fifth unlock phase. Under the SEIR-fansy model, the mean R(t) drops from 5.03 (5.01–5.04) during the first phase of lockdown to 1.90 (1.89–1.91) during the second lockdown phase, before rising again to 2.33 (2.30–2.36) during lockdown phase 4. The estimated mean drops steadily from 1.80 (1.79–1.81) during unlock phase 2 to 0.86 (0.85–0.87) during unlock phase 5. The ICM-based mean values fluctuate: 1.77 (1.58–1.96) during the first lockdown phase, 1.22 (1.18–1.27) during the second, 1.33 (1.28–1.38) during the third and 1.41 (1.35–1.47) during the fourth phase of lockdown. Estimates from ICM during unlock phases behave like those from the SEIR-fansy model: in unlock phase 2 the estimated mean is 1.11 (1.08–1.14) and in unlock phase 5 the mean is 0.83 (0.82–0.84). In terms of agreement of reported values, SAPHIRE, SEIR-fansy and ICM report the highest mean R for phase 1 of the lockdown.
Values from SAPHIRE, SEIR-fansy and ICM drop in the intermediate lockdown phases and then rise. Values during the unlock period increase from phase 1 to phase 2, followed by a steady decline. SAPHIRE, SEIR-fansy and ICM report the lowest value of R for unlock phase 5.

Estimation of reported case counts

From Figs. 6 and 9, we note that the eSIR model overestimates the count of active cases, a behaviour which gets worse with time. While the observed counts decrease steadily in the test period, the eSIR model fails to capture this behaviour and returns projections which rise over time. In comparison, the SEIR-fansy model is able to replicate the decreasing behaviour but yields projections which are higher than observed counts. In terms of prediction accuracy, the SEIR-fansy model has an SMAPE value of 35.14% and an MSRPE value of 1.11. For the eSIR model, the corresponding values are 37.96% (SMAPE) and 2.28 (MSRPE).

Fig. 6 Comparison of projected and observed reported active cases from October 16 to December 31 for India, using training data from March 15 to October 15, 2020

Fig. 7 Comparison of projected and observed reported cumulative cases from October 16 to December 31 for India, using training data from March 15 to October 15, 2020

Fig. 8 Comparison of projected and observed reported cumulative deaths from October 16 to December 31 for India, using training data from March 15 to October 15, 2020

Fig. 9 Scatter plot and marginal densities of projected and observed reported active cases from October 16 to December 31 for India, using training data from March 15 to October 15, 2020

From Figs. 7 and 10 we note that while the SAPHIRE model underestimates the count of cumulative cases, the baseline, eSIR and SEIR-fansy models overestimate the count. Table 2 reveals that SAPHIRE performs the best in terms of the SMAPE metric with a value of 2.25%, followed closely by SEIR-fansy (2.29%). The eSIR and baseline models perform poorly in comparison, yielding 6.59% and 6.89% respectively. The SEIR-fansy model performs best in terms of MSRPE with a value of 0.05, followed closely by SAPHIRE (0.06). Table 3 further reveals a similar relative performance through Rel-MSPE values (all Rel-MSPE figures reported here are relative to the baseline model). The SEIR-fansy model performs the best with a Rel-MSPE value of 3.27, followed by SAPHIRE (3.01), and finally, the eSIR model (1.72). All four sets of projections are highly correlated with the observed time series, with all model projections having a Pearson’s correlation coefficient of nearly 1 with the observed data. Lin’s concordance coefficient yields an ordering (from worst to best) of the eSIR model (0.48), followed by the baseline model (0.51), the SAPHIRE model (0.74) and finally, the SEIR-fansy model (0.89).

Fig. 10 Scatter plot and marginal densities of projected and observed cumulative cases from October 16 to December 31 for India, using training data from March 15 to October 15, 2020

Estimation of reported death counts

From Figs. 8 and 11, we note that the eSIR and SEIR-fansy models almost always overestimate, whereas the ICM model slightly underestimates, the confirmed cumulative death counts. From Table 2 and Table 3, the SMAPE and MSRPE values, along with comparison of projections with observed data, reveal that the ICM model is most accurate (SMAPE: 0.77%, MSRPE: 0.020), followed by SEIR-fansy (SMAPE: 4.74%, MSRPE: 0.12) and then the eSIR model (SMAPE: 8.94%, MSRPE: 0.25). Relative to the eSIR model, the Rel-MSPE values reveal that the SEIR-fansy model performs best (Rel-MSPE: 6.96), followed by ICM (Rel-MSPE: 3.64). Judging by values of Pearson’s correlation coefficient, all three sets of projections are highly correlated with the observed data. Lin’s concordance coefficient yields an ordering (from best to worst) of ICM (0.96), followed by SEIR-fansy (0.62) and finally eSIR (0.34).

Fig. 11 Scatter plot and marginal densities of projected and observed cumulative deaths from October 16 to December 31 for India, using training data from March 15 to October 15, 2020

Estimation of unreported case and death counts

From Table 4, we note that the SEIR-fansy model yields underreporting factors of about 10 for active cases on October 31, November 30 and December 31. Further, we observe that the SAPHIRE model projects the highest count of total cumulative cases on these three dates, followed by SEIR-fansy and then ICM. SAPHIRE returns underreporting factors of approximately 65, while SEIR-fansy and ICM return underreporting factors of approximately 7 and 4 respectively. For cumulative deaths, SEIR-fansy estimates underreporting factors approximately equal to 3.

Uncertainty quantification of estimates and predictions

From Fig. 12 we observe that the widths of the 95% credible intervals associated with projections from the models vary significantly. While the eSIR model consistently returns the widest intervals, SEIR-fansy has the narrowest intervals. For cumulative case counts, the ordering (best to worst, i.e., narrowest to widest) starts with SEIR-fansy, followed by the baseline, then SAPHIRE and finally the eSIR model. For cumulative deaths, the ordering (best to worst) starts with SEIR-fansy, followed by ICM and finally eSIR. From Table 4, we compare projections of reported cumulative cases for each model (apart from ICM, which returns projections of cumulative total cases and not cumulative reported cases) and their associated prediction intervals on October 31, November 30 and December 31, 2020. On October 31, we observe 8.18 million cumulative reported cases. The corresponding projections (in millions) are 8.71 (95% credible interval: 8.63–8.80) from the baseline model, 8.35 (7.19–9.60) from eSIR, 8.17 (7.90–8.52) from SAPHIRE and 8.51 (8.18–8.85) from SEIR-fansy. We do not present our projections for November 30 and December 31, 2020 here in the interest of conciseness.

Fig. 12 Boxplots showing width of 95% credible interval associated with projected active cases, cumulative cases and cumulative deaths from October 16 to December 31 for India, using training data from March 15 to October 15, 2020

Sensitivity analyses and performance in other countries

Sensitivity analyses for some of the discussed models have been carried out in several other publications. In the interest of conciseness, we refer to said publications and comment on what parameters are central to estimation and generating projections for the models examined here. We also include information on how these models have performed in the context of data from other countries.


The sensitivity of the model results to initial parameter choices and to under-reporting and clustering issues within the data has been discussed in the context of India in prior literature [53]. The range of scenarios considered earlier includes 10-fold underreporting of cases, clustering of cases in metropolitan areas, and prior mean of R0 ranging from 2 to 4 (see Supplementary Table S3). Even though the posterior estimates and predictions changed in scale to some extent across these scenarios, the broad conclusions did not change significantly. The exact predicted case counts are undeniably sensitive to the choice of priors, but with new data coming in over a longer time frame, as seen in the results from this work, the model is capable of washing out the prior effects in the posterior outcomes.

The eSIR model has been successfully implemented and utilized in the context of COVID-19 across different geographical locations, including China [24, 25, 54], Poland [55], Italy [24], Bangladesh and Pakistan [56]. These countries cover a broad range in terms of socio-economic status, health infrastructure and pandemic management strategies. In each of these cases the eSIR model was seen to be successfully capturing the patterns of growth of the pandemic via estimated parameters, as well as efficiently forecasting future case counts via predictive modeling.


We conducted a sensitivity analysis (results not shown) by setting the initial parameters 20% lower or higher than the specified values in the SAPHIRE model. The estimated R and ascertainment rates were robust to misspecification of the duration from the onset of symptoms to isolation and of the relative transmissibility of unreported versus reported cases. R estimates were positively correlated with the specified latent and infectious periods, and the estimated ascertainment rates were positively correlated with the specified ascertainment rate in the initial state. This finding is consistent with sensitivity analyses of the SAPHIRE model implemented in Wuhan [13]. The underreporting factors, in turn, were negatively associated with initial ascertainment. The estimated underreporting factor on October 31 (see Table 4) decreases dramatically from 117 to 0.07 as the initial ascertainment rate increases from 0.07 to 0.14; an initial ascertainment rate of 0.10 provides the best fit and is the one presented in this article.

The SAPHIRE model was originally developed in the context of data from China and was successfully able to delineate the transmission dynamics of COVID-19 in Wuhan [13] and in South Africa [57].


In this paper, we fix most parameters in our model and examine transmission dynamics only through β and r. It is therefore necessary to carry out a sensitivity analysis over various combinations of the parameters that were previously fixed. The sensitivity analyses are described in detail in [18], and the basic findings are summarized as follows. The predictions for the reported active cases (P) remain the same for all parameter choices. The estimates of R0 differ mainly in the first period, although some variation is noted for the second period as well. However, the estimated R values are almost the same for the later stages of the pandemic across the different models. For the untested cases, some of the settings of our analysis show substantial deviations from the true numbers. The total number of active cases (which includes both the unreported and the reported cases) also varies substantially with different parameter values. Consequently, the estimation of unreported cases is sensitive to the choice of parameter values. In particular, different values of E0 have the most impact in our sensitivity analysis, while different choices of De have the least impact.

The SEIR-fansy model has not been run for different countries, but it has been implemented for most Indian states separately [18] which showed that the model was able to capture the transmission dynamics of COVID-19 in most states of India quite efficiently. For instance, this model was able to match the sero-survey results of Delhi quite well [45]. For other states, the predicted reported cases came out to be quite close to the observed reported cases (with observed cases lying within the credible interval of projections).


The parameters critical to the estimation and projection methods include the infection-to-death distribution [32], infection fatality ratio [45, 46], generation distribution [44], prior for R0 [7, 30] and seeding [7]. Researchers have performed sensitivity analysis for various choices of infection-to-death distribution and found the resultant projections to be robust under changes [7]. We used a range of values for the mean of our IFR prior: 1%, 0.4% and 0.1%. We found that the model fits and estimated Rt are more or less the same for all three choices, but our estimates of total infections change. This implies that the estimated ascertainment of cases (positive results) will be affected. Sensitivity analyses with respect to the choice of the generation distribution were performed by other researchers [7], who found the models to be robust to various choices: the generation distribution has a very minimal effect on the estimation of the time-varying reproduction number and total infections by the model. We used the R0 prior suggested in both [7, 30]. We ran sensitivity analyses on a few other choices and found that the prior choice affected the inferred Rt values for only the first few days, with subsequent dynamics the same irrespective of the choice. Finally, as discussed in [7], we validated our seeding scheme through an importance sampling leave-one-out cross validation scheme [58, 59].

Different versions of the ICM model have been applied to 11 European countries in [7]. On a subregional basis, the model has been used in the USA [60], Brazil [20, 61] and Italy [21]. At a local level, the model is used to produce daily estimates for all local areas and regions in the UK [62, 63]. It is also used by the Scottish government [64] and the New York State government [65].


In this comparative paper we have described five different models of various stochastic structures that have been used for modeling SARS-CoV-2 disease transmission in various countries across the world. We applied them to a case study modeling the full disease transmission of the coronavirus in India. While simulation studies are the gold standard for comparing the accuracy of models, here we were uniquely poised to compare the projected case counts and death counts against observed data from a test period. We learned several things from these models. While the estimation of the reproduction number is relatively robust across the models, the predictions of active and cumulative numbers of cases and cumulative deaths show variation across models. Our findings in terms of estimates of R(t) are reflective of the national and state-level implementations of four lockdown phases [66], which are summarized in Supplementary Table S4. The largest variability across models is observed in predicting the “total” number of infections including reported and unreported cases. The degree of underreporting has been a major concern in India and other countries [67]. We note from Table 4 that the underreporting factor from SAPHIRE is much higher than those reported by SEIR-fansy and ICM. This may be attributed to the fact that SEIR-fansy and ICM both fit daily reported deaths with a pre-specified death rate (which is higher than that for unreported cases), whereas SAPHIRE does not include daily reported death counts in the likelihood function. Additionally, SEIR-fansy also considers the false positive/negative rates of tests and the selection bias in testing, which also contribute to more accurate unreported case projections along with untested infectious case counts. With a comprehensive exposition and a single beta-testing case study, we hope this paper will be useful for understanding the mathematical nuances of the models and the differences in their deliverables.

There are several limitations to this work. First and foremost, all model estimates are based on a scenario in which we assumed no change in either interventions or the behavior of people during the forecast period. This assumption does not hold, as policies varied tremendously across Indian states in the post-lockdown phase. We did observe regional lockdowns enacted during the forecast period, and none of our models tried to capture this variability. Second, the five models we compare are a subset of the vast body of work in this area, which includes models that incorporate age-specific contact networks and spatiotemporal variation [11, 68]. Third, we have not tested the ability of the models to predict the oscillatory growth and decay of the virus incidence curve, in particular the second wave. Finally, an extensive simulation study would be the best way to assess the models under different scenarios, but we have restricted our attention to India.

Availability of data and materials

Please visit



Imperial College Model


Markov Chain-Monte Carlo


Mean squared relative prediction error


Relative mean squared prediction error






Symmetric mean absolute prediction error


  1. Mayo Clinic. Coronavirus disease 2019 (COVID-19)—Symptoms and causes [Internet]. 2020 [cited 2020 May 21]. Available from:

  2. Wikipedia. Coronavirus disease 2019. [cited 2020 Aug 3]. Available from:

  3. Aiyar S. Covid-19 has exposed India’s failure to deliver even the most basic obligations to its people [Internet]: CNN; 2020. [cited 2020 Aug 3]. Available from:

  4. Kulkarni S. India becomes third worst affected country by coronavirus, overtakes Russia [Internet]. Deccan Herald. [cited 2020 Aug 3]. Available from:

  5. Basu D, Salvatore M, Ray D, Kleinsasser M, Purkayastha S, Bhattacharyya R, et al. A Comprehensive Public Health Evaluation of Lockdown as a Non-pharmaceutical Intervention on COVID-19 Spread in India: National Trends Masking State Level Variations [Internet]. Epidemiology. 2020; [cited 2020 Aug 3]. Available from:

  6. IHME COVID-19 health service utilization forecasting team, Murray CJ. Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months [Internet]. Infect Dis (except HIV/AIDS). 2020; [cited 2020 Aug 18]. Available from:

  7. Imperial College COVID-19 Response Team, Flaxman S, Mishra S, Gandy A, Unwin HJT, Mellan TA, et al. Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. Nature. 2020; [cited 2020 Aug 7]; Available from:

  8. Tang L, Zhou Y, Wang L, Purkayastha S, Zhang L, He J, et al. A Review of Multi-Compartment Infectious Disease Models. Int Stat Rev. 2020;88:462–513.

  9. Kermack WO, McKendrick AG. Contributions to the mathematical theory of epidemics—I. Bull Math Biol. 1991;53(1–2):33–55.

  10. Song PX, Wang L, Zhou Y, He J, Zhu B, Wang F, et al. An epidemiological forecast model and software assessing interventions on COVID-19 epidemic in China. medRxiv. 2020; Available from:

  11. Zhou Y, Wang L, Zhang L, Shi L, Yang K, He J, et al. A Spatiotemporal Epidemiological Prediction Model to Inform County-Level COVID-19 Risk in the United States. Harv Data Sci Rev. 2020; [cited 2020 Aug 3]; Available from:

  12. Wu JT, Leung K, Leung GM. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. Lancet. 2020;395(10225):689–97.

  13. Hao X, Cheng S, Wu D, Wu T, Lin X, Wang C. Reconstruction of the full transmission dynamics of COVID-19 in Wuhan. Nature. 2020; [cited 2020 Aug 18]; Available from:

  14. Bai Y, Yao L, Wei T, Tian F, Jin D-Y, Chen L, et al. Presumed asymptomatic carrier transmission of COVID-19. JAMA. 2020;323(14):1406–7.

  15. Tong Z-D, Tang A, Li K-F, Li P, Wang H-L, Yi J-P, et al. Potential Presymptomatic transmission of SARS-CoV-2, Zhejiang Province, China, 2020. Emerg Infect Dis. 2020;26(5):1052–4.

  16. Bertozzi AL, Franco E, Mohler G, Short MB, Sledge D. The challenges of modeling and forecasting the spread of COVID-19. Proc Natl Acad Sci. 2020;2:202006520.

  17. Bhardwaj R. A predictive model for the evolution of COVID-19. Trans Indian Natl Acad Eng. 2020;5(2):133–40.

  18. Bhaduri R, Kundu R, Purkayastha S, Kleinsasser M, Beesley LJ, Mukherjee B. Extending the susceptible-exposed-infected-removed (SEIR) model to handle the high false negative rate and symptom-based administration of COVID-19 diagnostic tests: SEIR-fansy [Internet]. Epidemiology. 2020; [cited 2021 Feb 20]. Available from:

  19. Unwin HJT, Mishra S, Bradley VC, Gandy A, Mellan TA, Coupland H, et al. State-level tracking of COVID-19 in the United States [Internet]. Public Glob Health. 2020; [cited 2020 Sep 16]. Available from:

  20. Mellan TA, Hoeltgebaum HH, Mishra S, Whittaker C, Schnekenberg RP, Gandy A, et al. Subnational analysis of the COVID-19 epidemic in Brazil [Internet]. Epidemiology. 2020; [cited 2020 Sep 16]. Available from:

  21. Vollmer MAC, Mishra S, Unwin HJT, Gandy A, Mellan TA, Bradley V, et al. A sub-national analysis of the rate of transmission of COVID-19 in Italy [Internet]. Public Glob Health. 2020; [cited 2020 Sep 16]. Available from:

  22. Lau H, Khosrawipour T, Kocbach P, Ichii H, Bania J, Khosrawipour V. Evaluating the massive underreporting and undertesting of COVID-19 cases in multiple global epicenters. Pulmonology. 2021;27(2):110–15.

  23. Wang D, Hu B, Hu C, Zhu F, Liu X, Zhang J, et al. Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus–infected pneumonia in Wuhan, China. JAMA. 2020;323(11):1061–9.

  24. Wangping J, Ke H, Yang S, Wenzhe C, Shengshu W, Shanshan Y, et al. Extended SIR prediction of the epidemics trend of COVID-19 in Italy and compared with Hunan, China. Front Med. 2020;7:169.

  25. Wang L, Zhou Y, He J, Zhu B, Wang F, Tang L, et al. An epidemiological forecast model and software assessing interventions on COVID-19 epidemic in China [Internet]. Infect Dis (except HIV/AIDS). 2020; [cited 2021 Mar 19]. Available from:

  26. Bhaduri R, Kundu R, Purkayastha S, Beesley LJ, Kleinsasser M, Mukherjee B. SEIRfansy: extended susceptible-exposed-infected-recovery model [Internet]. 2020. Available from:

  27. Gelman A. Bayesian data analysis. 3rd ed. Boca Raton: CRC Press; 2014. p. 661. (Chapman & Hall/CRC texts in statistical science)

  28. R Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna: R Foundation for Statistical Computing; 2017. Available from:

  29. Butcher JC. Numerical methods for ordinary differential equations. 2nd ed. Chichester; Hoboken: Wiley; 2008. p. 463.

  30. Liu Y, Gayle AA, Wilder-Smith A, Rocklöv J. The reproductive number of COVID-19 is higher compared to SARS coronavirus. J Travel Med. 2020;27(2):taaa021.

  31. Cori A, Ferguson NM, Fraser C, Cauchemez S. A new framework and software to estimate time-varying reproduction numbers during epidemics. Am J Epidemiol. 2013;178(9):1505–12.

  32. Verity R, Okell LC, Dorigatti I, Winskill P, Whittaker C, Imai N, et al. Estimates of the severity of coronavirus disease 2019: a model-based analysis. Lancet Infect Dis. 2020;20(6):669–77.

  33. Plummer M. rjags: Bayesian graphical models using MCMC. R package version 4-10. 2019.

  34. Li R, Pei S, Chen B, Song Y, Zhang T, Yang W, et al. Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2). Science. 2020;368(6490):489–93.

  35. He X, Lau EHY, Wu P, Deng X, Wang J, Hao X, et al. Temporal dynamics in viral shedding and transmissibility of COVID-19. Nat Med. 2020;26(5):672–5.

  36. Li Q, Guan X, Wu P, Wang X, Zhou L, Tong Y, et al. Early transmission dynamics in Wuhan, China, of novel coronavirus–infected pneumonia. N Engl J Med. 2020;382(13):1199–207.

  37. Ferretti L, Wymant C, Kendall M, Zhao L, Nurtay A, Abeler-Dörner L, et al. Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing. Science. 2020;368(6491):eabb6936.

  38. Mishra V, Burma A, Das S, Parivallal M, Amudhan S, Rao G. COVID-19-hospitalized patients in Karnataka: survival and stay characteristics. Indian J Public Health. 2020;64(6):221.

  39. Garg S, Kim L, Whitaker M, O’Halloran A, Cummings C, Holstein R, et al. Hospitalization rates and characteristics of patients hospitalized with laboratory-confirmed coronavirus disease 2019 — COVID-NET, 14 states, march 1–30, 2020. MMWR Morb Mortal Wkly Rep. 2020;69(15):458–64.

  40. Rahmandad H, Lim TY, Sterman J. Estimating the Global Spread of COVID-19. SSRN Electron J. 2020; [cited 2021 Mar 18]; Available from:

  41. Diekmann O, Heesterbeek JAP, Roberts MG. The construction of next-generation matrices for compartmental epidemic models. J R Soc Interface. 2010;7(47):873–85.

  42. Robert CP, Casella G. Monte Carlo statistical methods [Internet]. New York: Springer New York; 2004. [cited 2020 Aug 14]. (Springer Texts in Statistics). Available from:

  43. Scott J, Gandy A, Mishra S, Unwin J, Flaxman S, Bhatt S. epidemia: Modeling of Epidemics using Hierarchical Bayesian Models [Internet]. 2020. Available from:

  44. Bi Q, Wu Y, Mei S, Ye C, Zou X, Zhang Z, et al. Epidemiology and transmission of COVID-19 in 391 cases and 1286 of their close contacts in Shenzhen, China: a retrospective cohort study. Lancet Infect Dis. 2020;20(8):911–9.

  45. Bhattacharyya R, Bhaduri R, Kundu R, Salvatore M, Mukherjee B. Reconciling epidemiological models with misclassified case-counts for SARS-CoV-2 with seroprevalence surveys: A case study in Delhi, India [Internet]. Infect Dis (except HIV/AIDS). 2020; Aug [cited 2021 Mar 19]. Available from:

  46. Murhekar MV, Bhatnagar T, Selvaraju S, Saravanakumar V, Thangaraj JWV, Shah N, et al. SARS-CoV-2 antibody seroprevalence in India, August–September, 2020: findings from the second nationwide household serosurvey. Lancet Glob Health. 2021;9(3):e257–66.

  47. Walker PGT, Whittaker C, Watson OJ, Baguelin M, Winskill P, Hamlet A, et al. The impact of COVID-19 and strategies for mitigation and suppression in low- and middle-income countries. Science. 2020;369(6502):413–22. Epub 2020 Jun 12.

  48. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, et al. Stan: A Probabilistic Programming Language. J Stat Softw. 2017;76(1) [cited 2020 Aug 29]. Available from:

  49. India C-19. Coronavirus Outbreak in India [Internet]. 2020 [cited 2020 May 21]. Available from:

  50. Johns Hopkins University. COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) [Internet]. 2020 [cited 2020 May 21]. Available from:

  51. Lin LI-K. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45(1):255–68.

  52. Group C-I-19 S. COVID-19 Outbreak in India [Internet]. 2020 [cited 2020 May 21]. Available from:

  53. Ray D, Salvatore M, Bhattacharyya R, Wang L, Du J, Mohammed S, et al. Predictions, Role of Interventions and Effects of a Historic National Lockdown in India’s Response to the COVID-19 Pandemic: Data Science Call to Arms. Harv Data Sci Rev. 2020; Available from:

  54. Enrique Amaro J, Dudouet J, Nicolás OJ. Global analysis of the COVID-19 pandemic using simple epidemiological models. Appl Math Model. 2021;90:995–1008.

  55. Orzechowska M, Bednarek AK. Forecasting COVID-19 pandemic in Poland according to government regulations and people behavior [Internet]. Infect Dis (except HIV/AIDS). 2020; [cited 2021 Mar 19]. Available from:

  56. Singh BC, Alom Z, Rahman MM, Baowaly MK, Azim MA. COVID-19 Pandemic Outbreak in the Subcontinent: A data-driven analysis. ArXiv200809803 Cs. 2020; [cited 2021 Mar 19]; Available from:

  57. Gu X, Mukherjee B, Das S, Datta J. COVID-19 prediction in South Africa: estimating the unascertained cases- the hidden part of the epidemiological iceberg [Internet]. Epidemiology. 2020; [cited 2021 Mar 21]. Available from:

  58. Vehtari A, Gelman A, Gabry J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat Comput. 2017;27(5):1413–32.

  59. Bürkner P-C, Gabry J, Vehtari A. Approximate leave-future-out cross-validation for Bayesian time series models. J Stat Comput Simul. 2020;90(14):2499–523.

  60. Unwin HJT, Mishra S, Bradley VC, Gandy A, Mellan TA, Coupland H, et al. State-level tracking of COVID-19 in the United States. Nat Commun. 2020;11(1):6189.

  61. Candido DS, Claro IM, de Jesus JG, Souza WM, Moreira FRR, Dellicour S, et al. Evolution and epidemic spread of SARS-CoV-2 in Brazil. Science. 2020;369(6508):1255–60.

  62. Mishra S, Scott J, Zhu H, Ferguson NM, Bhatt S, Flaxman S, et al. A COVID-19 Model for Local Authorities of the United Kingdom [Internet]. Infect Dis (except HIV/AIDS). 2020; [cited 2021 Mar 20]. Available from:

  63. Gandy A, Mishra S. ImperialCollegeLondon/covid19local: Website Release for Wednesday 1tth Mar 2021, new doi for the week [Internet]. Zenodo. 2021; [cited 2021 Mar 20]. Available from:

  64. Scottish Government. Coronavirus (COVID-19): modelling the epidemic [Internet]. Available from:

  65. Cuomo AM. American crisis; 2020.

  66. Salvatore M, Basu D, Ray D, Kleinsasser M, Purkayastha S, Bhattacharyya R, et al. Comprehensive public health evaluation of lockdown as a non-pharmaceutical intervention on COVID-19 spread in India: national trends masking state-level variations. BMJ Open. 2020;10(12):e041778.

  67. Rahmandad H, Lim TY, Sterman J. Estimating COVID-19 under-reporting across 86 nations: implications for projections and control [Internet]. Epidemiology. 2020; [cited 2020 Sep 16]. Available from:

  68. Balabdaoui F, Mohr D. Age-stratified discrete compartment model of the COVID-19 epidemic with application to Switzerland. Sci Rep. 2020;10(1):21306.


The authors would like to thank the Center for Precision Health Data Sciences at the University of Michigan School of Public Health, The University of Michigan Rogel Cancer Center and the Michigan Institute of Data Science for internal funding that supported this research. The authors are grateful to Professors Eric Fearon, Aubree Gordon and Parikshit Ghosh for useful conversations that helped formulate the ideas in this manuscript.


The authors would like to thank the Center for Precision Health Data Sciences at the University of Michigan School of Public Health, The University of Michigan Rogel Cancer Center and the Michigan Institute of Data Science. The funding bodies provided internal funding that supported this project and funded computational resources used to analyse and draw inferences from the data.

Author information

Authors and Affiliations



SP drafted the main paper and prepared all numerical items (Tables and Figures). RB1 and MS (eSIR), XG (SAPHIRE), RK and RB2 (SEIR-fansy) and SM (ICM) implemented the different models. DR helped with planning analysis and writing strategies to address reviewer concerns in the revised version. BM designed the study, revised the draft, provided strategic guidance and oversaw the analysis and the writing. All authors participated in writing and reviewing this manuscript. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Bhramar Mukherjee.

Ethics declarations

Ethics approval and consent to participate

Not applicable (uses publicly available data).

Consent for publication

Not Applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Supplementary Table S1.

Summary of initial values and parameter settings for application of the SEIR-fansy model in the context of COVID-19 data from India. Unless mentioned otherwise, we use these parameter settings for all other models when applicable. Supplementary Table S2. Overview of projected COVID-counts for each model considered. Supplementary Table S3. Comparison of estimated projections and posterior estimates of model parameters across different sensitivity analysis scenarios under 21-day lockdown with moderate return, using observed data till April 14. Prior SD for R0 is 1.0. Reproduced from Ray et al., 2020 [53]. Supplementary Table S4. National and state-level lockdown measures implemented over the course of the COVID-19 pandemic in India. Reproduced from Salvatore et al., 2021 [66].

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Purkayastha, S., Bhattacharyya, R., Bhaduri, R. et al. A comparison of five epidemiological models for transmission of SARS-CoV-2 in India. BMC Infect Dis 21, 533 (2021).
