The capture-recapture method estimates the total number of cases of a disease after matching cases reported in at least two sources.
Description of the three data sources
The mandatory HIV case reporting (DOVIH)
The mandatory HIV case reporting system was implemented in 2003 by the French Institute for Public Health Surveillance (InVS) to follow the epidemic trends of HIV and to describe the characteristics of HIV infections in newly diagnosed individuals. For adults, mandatory HIV notifications are initiated by microbiologists and then completed by clinicians. For children under 13 years of age, case reporting is performed only by paediatricians. All HIV-positive cases are notified using an assigned unique anonymous code that allows for the detection of duplicates. To take reporting delays into account, all notifications received through March 31st, 2010 were selected for the study.
The ANRS French Perinatal Cohort (ANRS-EPF CO1/CO10/CO11)
Since 1984, the French Perinatal Cohort, supported by the French National Agency for AIDS Research (ANRS), has prospectively collected data on HIV-infected pregnant women and their children in approximately one hundred centres throughout France. The coverage of the cohort was estimated at 70% of cases throughout France. The objectives of this cohort study are to identify factors associated with HIV MTCT, to evaluate tolerance to ART prophylaxis, and to assess the prognosis of paediatric HIV infection. Informed consent was obtained from all of the mothers. In 2005, the inclusion criteria were extended, with parental consent, to all children <13 years of age diagnosed with HIV and born to mothers who were not included in the EPF. For these children, data were collected retrospectively for 2003 and 2004 and prospectively since 2005. Duplicates were deleted. The cases were selected from a database that was updated in April 2008.
The HIV laboratory surveillance (LaboVIH)
Since 2001, the InVS has conducted national surveillance of HIV testing activity in France. The number of HIV tests performed and the number of new confirmed HIV-positive diagnoses are collected from 4,200 French microbiological laboratories each year. The participation rate of this laboratory surveillance is approximately 85%.
Laboratories that reported at least one new HIV diagnosis in children less than 13 years of age from 2003 to 2006 were asked to complete a questionnaire to collect individual information for each paediatric diagnosis. Duplicate notifications were deleted.
Imputation of the variable “country of birth” in the source LaboVIH
We wanted to estimate the total number of new HIV diagnoses according to the place of birth: “born in France” or “born abroad”. This binary variable was not collected in the LaboVIH source but was collected in the DOVIH and EPF sources. Therefore, we were able to obtain the place of birth for the cases in LaboVIH that matched either of the two other sources (DOVIH and the EPF). The variable was missing in two cases in DOVIH and was unavailable for 66/216 cases overall (30.6%). We estimated the missing values through a multiple imputation (MI) method, in which the distribution of the observed data is used to estimate a set of plausible values for the missing observations. Multiple data sets were created, and an estimate was calculated for each imputed data set. The estimates were then combined to calculate overall estimates, variances and confidence intervals.
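The combination step described above can be sketched with Rubin's rules. This is a minimal illustration, not the authors' code: the function name `pool_rubin` and the input lists are hypothetical, and the degrees-of-freedom formula is Rubin's classical large-sample version, which may differ from the small-sample adjustment applied by a given statistical package.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets (Rubin's rules).

    estimates: point estimate from each imputed data set
    variances: squared standard error from each imputed data set
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    if between > 0:
        # Classical Rubin degrees of freedom for the t reference distribution
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    else:
        df = float("inf")
    return q_bar, total, df


# Hypothetical values for three imputed data sets
print(pool_rubin([10.0, 12.0, 14.0], [1.0, 1.0, 1.0]))
```

The total variance grows with the between-imputation spread, which is how the uncertainty due to the missing “country of birth” values enters the final confidence intervals.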
The applied MI method was multiple imputation by chained equations using STATA's user-written program ice (STATA ® 11.0, Stata Corporation, College Station, Texas, USA) [12, 13]. The variables “age” (continuous), “region of diagnosis” (categorical) and “year of diagnosis” (categorical) contained no missing values and were used as predictors in the imputation model. One hundred imputed databases were generated.
The reliability of the estimates depended on the following underlying assumptions: (1) identification of all and only true common cases, (2) closed population, (3) independence between sources and (4) capture homogeneity. Two sources are independent if the probability of a case being reported in one source does not depend on its probability of being reported in the other source. For analyses involving three or more sources, the independence assumption is not crucial because interaction terms can be incorporated into regression models to adjust for source dependencies; however, in these cases, the absence of the highest-order interaction has to be assumed. Homogeneity of capture is fulfilled when the probability of a case being reported in a source is the same for all cases or, more simply, when the probability of registration does not depend on the characteristics of the case (e.g., age, sex, place of birth). This probability may vary from one source to another or be constant overall.
Dependence between sources was first assessed by comparing the estimates provided by each pair of sources [14, 15] and calculating the odds ratio (95% CI) between the two sources, as proposed by Wittes.
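As a rough illustration of this pairwise assessment, the sketch below computes a two-source estimate and an odds ratio between two sources among the cases identified by the third. Several choices here are assumptions rather than details given in the text: the two-source estimator is Chapman's nearly unbiased form, the confidence interval is a Woolf-type interval on the log scale, and all function names and counts are hypothetical.

```python
import math


def two_source_estimate(both, only_a, only_b):
    """Chapman's nearly unbiased two-source estimate of the total number of cases."""
    n_a = both + only_a  # cases reported in source A
    n_b = both + only_b  # cases reported in source B
    return (n_a + 1) * (n_b + 1) / (both + 1) - 1


def odds_ratio_within_third(n11, n10, n01, n00, z=1.96):
    """Odds ratio between sources A and B among cases found in a third source.

    n11: in A and B, n10: in A only, n01: in B only,
    n00: in neither A nor B (i.e. found by the third source alone).
    Returns (odds ratio, lower limit, upper limit).
    """
    odds_ratio = (n11 * n00) / (n10 * n01)
    se_log = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts
print(two_source_estimate(both=20, only_a=30, only_b=10))
print(odds_ratio_within_third(n11=10, n10=5, n01=5, n00=10))
```

An odds ratio well above 1 suggests positive dependence between the two sources (a two-source estimate that is too low), and a value well below 1 suggests negative dependence.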
A preliminary three-source analysis was performed by fitting eight log-linear models to the data arranged in a $2^3$ contingency table, according to the presence or absence of each case in each source. The dependent variable for each model was the logarithm of the number of cases in each of the 7 non-empty cells of the contingency table. These preliminary analyses assumed homogeneity of capture within each source and were performed using STATA’s user-written program “recap”, a STATA module providing standard three-source capture-recapture analyses without covariates. The confidence interval estimates for the population size were computed according to a goodness-of-fit based method proposed by Regal and Hook.
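Of the eight models, the one containing all three two-way interactions between sources (and no three-way interaction) has a well-known closed-form estimate of the unobserved cell, which makes it a convenient illustration. The sketch below is not the “recap” program; the function names and counts are hypothetical, and the other seven models would normally be fitted with Poisson regression software.

```python
def missing_cell_all_two_way(counts):
    """Estimate the number of cases missed by all three sources.

    counts maps (a, b, c) presence indicators (1 = reported, 0 = not reported)
    to the observed number of cases. Under the log-linear model with every
    two-way interaction and no three-way interaction, the unobserved cell is
    n111 * n100 * n010 * n001 / (n110 * n101 * n011).
    """
    numerator = (counts[(1, 1, 1)] * counts[(1, 0, 0)]
                 * counts[(0, 1, 0)] * counts[(0, 0, 1)])
    denominator = counts[(1, 1, 0)] * counts[(1, 0, 1)] * counts[(0, 1, 1)]
    return numerator / denominator


def population_size(counts):
    """Observed cases plus the estimated number of unobserved cases."""
    return sum(counts.values()) + missing_cell_all_two_way(counts)


# Hypothetical counts for the 7 observable cells
observed = {
    (1, 1, 1): 2, (1, 1, 0): 2, (1, 0, 1): 3, (0, 1, 1): 4,
    (1, 0, 0): 3, (0, 1, 0): 4, (0, 0, 1): 5,
}
print(missing_cell_all_two_way(observed), population_size(observed))
```

Because this model uses all 7 observed cells to estimate 7 parameters, it fits the data exactly and leaves no degrees of freedom for a goodness-of-fit test.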
Three variables of potential heterogeneous catchability were considered: place of birth (born in France; born abroad), region of diagnosis (Paris area; other regions), and year of diagnosis (2003 to 2006). The data were then arranged in a $2^3 \times 2 \times 2 \times 4$ contingency table. Log-linear models were fitted via the STATA ‘glm’ command, specifying a logarithmic link and a Poisson distribution. Stratified analyses were performed according to the three variables of heterogeneous catchability. The log-linear models included two-way interactions between sources, between sources and each variable of catchability, and between the variables of catchability, when applicable. Log-linear modelling was jointly performed for the 100 imputed data sets using the STATA 11.0 analysis module “mi estimate”, applying Rubin’s rules.
Population size estimates, calculated as a sum of exponentiated regression coefficients, were obtained through commands specific to MI. Their respective variances were estimated using the delta method. The confidence intervals (CI) were computed using Student’s t-statistics with degrees of freedom specific to each coefficient, depending both on the number of imputations and on the proportion of missing values.
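The delta-method step can be sketched as follows for an estimate of the form sum of exp(beta_k). The gradient of this sum with respect to each coefficient is exp(beta_k), so the approximate variance is the quadratic form of that gradient with the coefficient covariance matrix. The function name and the numeric inputs are hypothetical, and the covariance matrix is assumed to be the pooled one returned by the MI analysis.

```python
import math


def sum_exp_with_delta_variance(beta, cov):
    """Estimate sum(exp(beta_k)) and its delta-method variance.

    beta: list of regression coefficients entering the sum
    cov:  their covariance matrix as a list of rows
    """
    gradient = [math.exp(b) for b in beta]  # d/d(beta_k) of the sum is exp(beta_k)
    estimate = sum(gradient)
    k = len(beta)
    var = sum(gradient[i] * cov[i][j] * gradient[j]
              for i in range(k) for j in range(k))
    return estimate, var


# Hypothetical coefficients and covariance matrix
est, var = sum_exp_with_delta_variance([0.0, 0.0], [[0.04, 0.01], [0.01, 0.09]])
print(est, var, math.sqrt(var))
```

A confidence interval would then take the form estimate ± t × sqrt(variance), with the t quantile chosen from the MI-specific degrees of freedom.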
Classically, in capture-recapture studies, the choice of the final model is based on the likelihood ratio test statistic ($G^2$), the Akaike Information Criterion (AIC) and the Bayesian Information Criterion adapted by Draper (DIC), which are functions of the likelihood ratio statistic [18, 19]. AIC and DIC criteria were derived for each imputed data set according to the following formulas: $\mathrm{AIC} = G^2 - 2\,(\mathrm{df})$ and $\mathrm{DIC} = G^2 - \ln(N_{\mathrm{obs}}/2\pi) \cdot (\mathrm{df})$, where df is the number of degrees of freedom associated with any model.
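The two criteria can be computed directly from the likelihood ratio statistic, as in the short sketch below. The function names are hypothetical, and the sample-size term N in the DIC formula is taken to be the number of observed cases (`n_obs`), which is an assumption based on the usual form of Draper's criterion in capture-recapture work.

```python
import math


def aic(g2, df):
    """Akaike Information Criterion expressed via the likelihood ratio statistic."""
    return g2 - 2 * df


def dic(g2, df, n_obs):
    """Draper's adaptation of the Bayesian Information Criterion.

    n_obs is assumed to be the number of observed cases.
    """
    return g2 - math.log(n_obs / (2 * math.pi)) * df


# Hypothetical values for one model fitted to one imputed data set
print(aic(g2=5.0, df=2), dic(g2=5.0, df=2, n_obs=200))
```

With both criteria, lower values indicate a better trade-off between fit and parsimony, and the DIC penalises additional parameters more heavily as the number of observed cases grows.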
The naïve approach that averages the likelihood ratio statistic over the imputed data sets does not provide accurate p-values. The pooled likelihood ratio test statistic and its corresponding p-value were calculated using the Meng and Rubin approach, recently illustrated by Marshall et al. Each log-linear model was constrained to the regression coefficients obtained from the joint analysis (i.e., the average over the 100 imputed data sets, according to Rubin’s rules). The AIC and DIC estimates were the average of the 100 AICs and DICs. We selected the most parsimonious model among the models with a goodness-of-fit p-value >0.05, and with the lowest AIC and DIC values. We also considered the relevance of including variables of heterogeneous catchability in the model, both from an epidemiological and a public health point of view.
The completeness for each source was estimated by dividing the number of new HIV diagnoses reported in each source by the total number estimated by the final log-linear model. The completeness was also calculated for each stratum of “place of birth”, “year of diagnosis” and “region of diagnosis”.
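The completeness calculation is a simple ratio, sketched here with hypothetical counts. The source labels are those used in the text, but the numbers and the function name are placeholders, not study results.

```python
def completeness(reported_by_source, estimated_total):
    """Share of the estimated total number of diagnoses captured by each source."""
    return {source: n / estimated_total
            for source, n in reported_by_source.items()}


# Placeholder counts, not the study's data
example = completeness({"DOVIH": 50, "EPF": 25, "LaboVIH": 40}, estimated_total=100)
for source, share in example.items():
    print(f"{source}: {share:.1%}")
```

The same function can be applied within each stratum by passing the stratum-specific counts and the stratum-specific estimated total.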
The annual rate of new HIV diagnoses was the estimated number of new HIV diagnoses divided by the size of the population of children under 13 years old living in mainland France up to December 2007. The rate was also calculated according to the place of birth, using the number of children less than 13 years of age born in France or abroad.
Access to the three databases was authorised by the French Commission Nationale de l'Informatique et des Libertés (CNIL). No ethical approval was required for this research.