The BCPP study enrolled participants from 2013 to 2018 . Approximately 20% of households within 30 communities were randomly selected and eligible participants were enrolled. Each community had an average population size of 5855 for a total trial population of 175,664 . Individuals aged 16 or older completed questionnaires and had blood drawn . We collected questionnaire data in two broad categories: (1) Socio-demographics and community environment: participant and community variables, residency and mobility, education, employment, (2) HIV risk behaviour: HIV testing history and sexual behaviour (see  for protocols, questionnaires and data). Negative participants were followed up with yearly HIV tests.
HIV-1 full genome sequencing
HIV samples from all positively diagnosed participants were sequenced, regardless of antiretroviral therapy (ART) status and viral load. The majority of people with known HIV were already on ART with viral suppression defined as viral load ≤ 400 copies/mL; in these instances, integrated virus was sequenced from viral RNA and proviral DNA templates. Next-generation sequencing (NGS) was performed by the BioPolymers Facility at Harvard Medical School  and through collaboration with the PANGEA HIV consortium [26,27,28] using Illumina platforms MiSeq and HiSeq, as previously described [29,30,31]. In brief, the nucleic acids were reverse-transcribed and PCR amplified. Amplicons were pooled in equimolar amounts for Illumina library preparation. Sequence assembly was performed de novo using SPAdes version 2.4.0. The HIV-1 reference strain, HXB2 (NC_001802), was used for sequence alignment and a consensus sequence was generated using Abacas version 1.3.1 and MUMmer version 3.2. Next, sequence reads were mapped against the consensus sequence using SMALT version 0.5.0 .
HIV consensus sequences were subtyped using COMET , and only HIV-1 subtype C sequences were included in our analysis (accounting for > 99% of BCPP sequences). Based on the NGS reads, we were provided with nucleotide frequency files for each patient, detailing the relative frequency of each nucleotide at each site in the alignment. HIV sequences and basic demographic and clinical data are available upon request to the PANGEA consortium .
We stratified participants with HIV-1 into three groups as follows: 2995 with previously diagnosed HIV (henceforth referred to as “known cases”), 601 persons with newly diagnosed HIV at enrolment (“new cases”), and 147 persons who seroconverted to HIV-positive during follow-up (“incident cases”). As study participants who tested negative at enrolment were then tested yearly, incident cases were known to have been infected for ≤ 1 year when they were diagnosed. Baseline characteristics and descriptive statistics for all participants have been previously described .
We compared responses to the questionnaire across our three groups. We included variables known to be associated with HIV infection and variables known to be associated with undiagnosed infection. We employed logistic regression to compare the characteristics of new cases versus known cases to identify predictors associated with undiagnosed infection. For each variable of interest, we performed univariate analyses comparing the two groups (new cases versus known cases). A single multivariable analysis was fit including demographic and behavioural predictors that were significant (p < 0.05) in the univariate analyses. We did not adjust for missing data in the multivariable analysis, and the proportion of complete cases in the data was 53.9%. Demographic predictors evaluated included sex, age, marital status, number of children, religious affiliation, education, and employment status. The behavioural predictors assessed were previous number of HIV tests, sexual activity (yes/no), number of partners, partner concurrency, condom use, condom use frequency in the past year, number of nights spent away from home and partner’s HIV status. Logistic regression was adjusted by community (n = 30) with a random effect using a model with robust standard errors in R. All variables were analysed as categorical variables except for age, which was analysed as a continuous predictor in the primary analysis. A sensitivity analysis was run with age as a categorical variable with five age ranges: 16–24, 25–34, 35–44, 45–54 and 55–64 years old. Finally, we reran the analysis disaggregated by sex, in order to disentangle differential predictors for men and women.
We sought to use viral genetic data to determine whether the time from infection to diagnosis varied significantly between the three groups (new, incident, and known cases). First, we compared the assigned stage of infection based on within-host genetic diversity (< or ≥ 1 year) for each group, using a χ2 test. We then compared the recency probability distributions between the three groups using a Kruskal–Wallis test. These comparisons were conducted for participants for whom deep sequencing nucleotide frequency data were available (n = 1867). Finally, we compared the length of terminal branch lengths across the groups, and across each HIV-1 gene (gag, pol, env), in a pairwise manner, using Kolmogorov–Smirnov (KS) tests. These comparisons were conducted for participants for whom consensus genetic sequences were available for at least one gene (n = 2872). This dataset included 2339 known cases, 399 new cases and 134 incident cases. To account for multiple non-independent comparisons (across different genes), we used Bonferroni’s correction to assess statistical significance where appropriate. All statistical analyses were conducted in R (version 3.6.0) .
For participants for whom viral genetic data were available (n = 2872), we constructed phylogenies separately for each HIV gene region: gag, pol and env. Sequences were available for 2339 known cases, 399 new cases and 134 incident cases. We compared the viral characteristics between those two groups (know vs. new cases), and a third: those diagnosed with incident infections during the BCPP trial (n = 147). Because this latter group were negative at the start of the trial, then tested yearly, we knew their infections were < 1 year when they were diagnosed. We wanted to use our two reference groups (known older cases, and incident cases) to evaluate how long those diagnosed at the start of trial were likely to have been infected before they were diagnosed.
Maximum likelihood phylogenies were reconstructed using RaxML  under a GTR model with four gamma rates. Phylogenies included sequences from an additional 3277 patients from Botswana clinics, to serve as local controls. Final phylogeny sizes were: gag (n = 5631), pol (n = 6084) and env (n = 5840). We time-resolved the phylogenies using the treedater package, available in R , using sample times as tip dates. For each tip, we extracted terminal branch lengths (measured in time) using the “Analyses of Phylogenetics and Evolution” (APE)  and phytools  R packages. We compared the distributions of terminal branch lengths across our three groups (known cases, new cases, incident cases) using Kolmogorov–Smirnov tests.
Inference of stage of HIV infection
For each study participant for whom deep sequencing nucleotide frequency data were available (n = 1867), we calculated the probability (0–1) of their infection being recent (< 1 year) based on within-host genetic diversity, demographic (age, sex) and clinical (treatment status, viral load) predictors using an xgboost gradient boosting  machine learning algorithm. The machine learning classifier is trained on a dataset of known recent (< 1 year) and chronic (≥ 1 year) infections to classify stage of HIV infection for individuals in each of our three groups. This analysis was conducted on all individuals for whom NGS coverage was sufficient to derive site-specific nucleotide frequency distributions (n = 1867). We selected the threshold that optimized for accuracy (the highest number of overall correct classifications). The algorithm has been previously developed for and trained on the BCPP dataset on participants with known duration of infection . For each participant in the present study, we used the algorithm to predict the probability of recency. We carefully excluded individuals comprised in the present study from the dataset used to train the classifier.