Evaluation of HIV testing algorithms in Ethiopia: the role of the tie-breaker algorithm and weakly reacting test lines in contributing to a high rate of false positive HIV diagnoses

Background: In Ethiopia a tiebreaker algorithm using 3 rapid diagnostic tests (RDTs) in series is used to diagnose HIV. Discordant results between the first 2 RDTs are resolved by a third ‘tiebreaker’ RDT. Médecins Sans Frontières uses an alternate serial algorithm of 2 RDTs followed by a confirmation test for all double positive RDT results. The primary objective was to compare the performance of the tiebreaker algorithm with a serial algorithm, and to evaluate the addition of a confirmation test to both algorithms. A secondary objective looked at the positive predictive value (PPV) of weakly reactive test lines.
Methods: The study was conducted in two HIV testing sites in Ethiopia. Study participants were recruited sequentially until 200 positive samples were reached. Each sample was re-tested in the laboratory on the 3 RDTs and on a simple to use confirmation test, the Orgenics Immunocomb Combfirm® (OIC). The gold standard test was the Western Blot, with indeterminate results resolved by PCR testing.
Results: 2620 subjects were included with a HIV prevalence of 7.7%. Each of the 3 RDTs had an individual specificity of at least 99%. The serial algorithm with 2 RDTs had a single false positive result (1 out of 204) to give a PPV of 99.5% (95% CI 97.3%-100%). The tiebreaker algorithm resulted in 16 false positive results (PPV 92.7%, 95% CI 88.4%-95.8%). Adding the OIC confirmation test to either algorithm eliminated the false positives. All the false positives had at least one weakly reactive test line in the algorithm. The PPV of RDTs with weakly reactive test lines was significantly lower than that of RDTs with strongly positive test lines.
Conclusion: The risk of false positive HIV diagnosis in a tiebreaker algorithm is significant. We recommend abandoning the tiebreaker algorithm in favour of WHO recommended serial or parallel algorithms, interpreting weakly reactive test lines as indeterminate results requiring further testing except in the setting of blood transfusion, and most importantly, adding a confirmation test to the RDT algorithm. It is now time to focus research efforts on how best to translate this knowledge into practice at the field level.
Trial registration: Clinical trial registration number NCT01716299


Background
Ethiopia has a HIV prevalence of 1.3% (95% CI 1.2%-1.5%) in the adult population [1]. Considerable progress has been made over the last decade in scaling up access to testing across the country. In 2011, the number of people accessing HIV testing reached close to 10 million [2]. Scale up of HIV testing is possible due to diagnostic algorithms that employ HIV rapid diagnostic tests (RDTs). In Ethiopia, a tiebreaker regimen consisting of 3 RDTs in series is the national algorithm chosen after a thorough evaluation period. It uses HIV (1 + 2) Antibody Colloidal Gold (KHB, Shanghai Kehua Bio-engineering Co Ltd, China) as a screening test, followed by HIV 1/2 STAT-PAK® (Chembio Diagnostics, USA) if positive. Where the result of STAT-PAK® is discordant with KHB, a third test, Unigold™ HIV (Trinity Biotech, Ireland), is used as a tiebreaker to determine the result.
Rapid diagnostic tests are essential tools to screen for HIV, and are designed for use with confirmatory tests such as Western Blot to diagnose infection. However given resource constraints, WHO has developed testing guidelines that use 2-3 RDTs together in an algorithm to diagnose HIV [3]. While these algorithms allow decentralisation of HIV testing and scale up of access, they can come with the compromise of false positive results. This risk is described in a number of different studies [4][5][6][7][8] and has been shown to vary geographically and over time [9]. The risk of false positive results is linked to cross-reactivity, whereby non-HIV antibodies or protein in the blood falsely react with the antigens of the HIV RDTs. Where both RDTs cross react, a false positive diagnosis will result.
A number of studies have shown a link between false positive results and a weaker than usual test line on the RDT [4,5,8,10,11,12,13]. However current manufacturers' recommendations are that any colour of the test line is interpreted as positive, which can result in misdiagnosis of HIV if a second test is also positive.
Médecins Sans Frontières-Operational Centre Amsterdam (MSF) opened a project in Humera to support the Tigray Regional Health Bureau with diagnosing and treating visceral leishmaniasis (VL) in 1997. HIV testing activities started the following year, with anti-retroviral treatment (ART) available from 2004. In 2008, MSF handed over the project to the Health Bureau. MSF has worked in Abdurafi providing diagnosis and treatment for VL and HIV since 2004. In both sites, health care staff and patients identified concerns about possible misclassification of HIV infection. In response, MSF introduced an alternate algorithm in Abdurafi, using two RDTs in series, followed by a confirmation test. The confirmation test used, the Orgenics Immunocomb® II, HIV 1&2 Combfirm (OIC), separately detects the p24, p31, gp41, gp120, and gp36 antibodies. It is performed by peripheral laboratory staff and produces results in two hours at a cost of $5 per test. The OIC in the first 15 months of use in Abdurafi identified 7.1% of positive results on a serial algorithm of Determine HIV-1/2 (Abbott Laboratories) and Unigold™ to be false positive reactions [14]. However the OIC is not part of the national algorithm, and there is little published on its performance characteristics.
We designed a study using standard WHO methodology to evaluate the performance of different RDT algorithms, including the addition of the OIC confirmation test. We included a secondary objective on the predictive value of weak positive test lines. Two other objectives, looking at the potential association between VL infection and false positive results and at a novel method of confirmation testing, are reported separately.

Setting
The study sites were an MSF-supported health centre in Abdurafi and a zonal hospital in Humera. The populations included residents as well as high numbers of migrant workers who are present seasonally. Testing took place in the designated counselling and testing (CT) centres in each site, in addition to the antenatal clinic, hospital and outpatient department in Humera.

Inclusion criteria
All clients, aged ≥ 5 years, presenting to be tested for HIV in the study sites were invited to participate in the study. Study participants underwent informed consent procedures and had a written consent form signed by the participant or the guardian.

Sample size
A sample size of 200 serial algorithm positive and 200 algorithm negative participants was chosen based on the WHO guidance for evaluation of RDTs [15]. To achieve the sample, all KHB positive samples were included along with every 10th KHB negative sample until a minimum of 200 positive samples were reached.

Information collected
Information recorded included age, sex, village, residency status (migrant worker, settler, resident, commercial sex worker, other), reason for testing, and time of most recent risk exposure. Clinical information recorded included CD4 count and presence/absence of active VL infection.

Testing
Initial testing was done on whole blood using the KHB/STAT-PAK®/Unigold™ tiebreaker algorithm in Humera, and on plasma using the KHB/STAT-PAK® serial algorithm in Abdurafi. In the laboratory, samples were tested with 3 RDTs (KHB, STAT-PAK®, and Unigold™) on whole blood and plasma. Laboratory technicians were blinded to the CT results. All tests were performed according to the manufacturers' instructions. Invalid tests, where the control line did not appear, were discarded and repeated on new test devices according to the manufacturers' instructions.
Tests were interpreted as positive when there was any colouration of the test line. Positive results were further classified as weak positive when the test line was significantly thinner and weaker than normal. A photo card was developed to guide interpretation.
All samples underwent testing by OIC and Western Blot with technicians blinded to earlier results. OIC was performed at each study site while the Western Blots were all performed at the MSF laboratory in Abdurafi. Interpretation of OIC was based on the stricter criteria employed in the Agence Nationale d'Accréditation et d'Evaluation en Santé guidelines [16]. Three to four spots were considered positive, two as indeterminate, and zero or one spot as negative. Gp36 was not included in determining the number of reactions.
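The spot-counting rule described above can be written as a short sketch. It assumes that the four counted spots are p24, p31, gp41 and gp120 (the antigens listed earlier, minus gp36); the function and constant names are illustrative and are not part of the study protocol.

```python
# Spots counted towards the OIC result. gp36 is deliberately left out,
# following the interpretation rule described in the text.
COUNTED_SPOTS = frozenset({"p24", "p31", "gp41", "gp120"})


def interpret_oic(reactive_spots):
    """Classify an OIC strip from the names of its reactive spots."""
    n_reactive = len(COUNTED_SPOTS & set(reactive_spots))
    if n_reactive >= 3:
        return "positive"
    if n_reactive == 2:
        return "indeterminate"
    return "negative"
```

Under this rule, a strip reactive only for p24 and gp36 is read as negative, because gp36 does not count towards the total.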
Western Blot (WB) testing was performed using MP Diagnostics HIV Blot 2.2 at the MSF Abdurafi laboratory. Interpretation of results was based on American Red Cross recommendations [17].
Samples indeterminate on WB or OIC were repeated and if still indeterminate, underwent DNA PCR examination using Roche Amplicor DNA v1.5 on dry blood spots (DBS) at the Ethiopian Health and Nutrition Research Institute (EHNRI) laboratory based in Addis Ababa, allowing diagnosis of subtypes A-D. The Global Clinical and Viral Laboratory in South Africa provided quality control for PCR and confirmed results using Cobas® AmpliPrep/Cobas® TaqMan® HIV-1 Qual test which detects HIV subtypes A to H. Where results between the two labs were discrepant, the result from South Africa was used.
The final gold standard result was that of the Western Blot, and where Western Blot was indeterminate, the PCR result.
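As an illustration only, the reference-standard rule can be sketched as follows. The function names, the string labels and the use of `None` for a missing result are assumptions made for this sketch.

```python
def resolve_pcr(pcr_ehnri, pcr_quality_control=None):
    """Return one PCR result, preferring the quality-control laboratory
    when the two laboratories disagree."""
    if pcr_quality_control is not None and pcr_quality_control != pcr_ehnri:
        return pcr_quality_control
    return pcr_ehnri


def reference_result(western_blot, pcr=None):
    """Gold standard: the Western Blot result, or PCR when the blot is indeterminate."""
    if western_blot in ("positive", "negative"):
        return western_blot
    return pcr  # None means no reference result is available
```

A sample with an indeterminate Western Blot and no PCR result has no reference result, which corresponds to the exclusion of a sample with missing WB and PCR results.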

Quality control
All staff underwent training on the study standard operating procedures by the MSF laboratory supervisor, and received regular monitoring and supervision. As previously described all RDT results were controlled at the laboratory, and discrepant results repeated.

Analysis
The RDT result on plasma was considered the final result for the purposes of the analysis. As the samples received all 3 RDTs in the laboratory, each sample regardless of the initial algorithm was evaluated for both the serial and tiebreaker algorithms. It was also possible to calculate an alternate algorithm to give results for a serial KHB/Unigold™ algorithm, and a KHB/Unigold™/STAT-PAK® tiebreaker algorithm. Discordant test results (one RDT positive, the other negative) were classified as negative, and indeterminate OIC results as positive for the calculation of predictive values and sensitivity/specificity.
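A minimal sketch of how the two algorithm types can be derived from the same three laboratory RDT results is given below. It is not the study's analysis code: each RDT result is assumed to be a boolean (True for a reactive test line), and the function names are illustrative.

```python
def serial_algorithm(first, second):
    """Two RDTs in series: the second test is read only if the first is reactive."""
    if not first:
        return "negative"
    return "positive" if second else "discordant"


def tiebreaker_algorithm(first, second, third):
    """Serial algorithm in which a third RDT decides discordant results."""
    result = serial_algorithm(first, second)
    if result == "discordant":
        return "positive" if third else "negative"
    return result


def for_analysis(result):
    """Discordant serial results are counted as negative in the analysis."""
    return "negative" if result == "discordant" else result
```

With KHB, STAT-PAK® and Unigold™ results in that order, `tiebreaker_algorithm(khb, statpak, unigold)` corresponds to the national algorithm, and swapping the last two arguments gives the alternate KHB/Unigold™/STAT-PAK® ordering.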
Predictive values and sensitivity and specificity were estimated from the 2 × 2 table of observed results after weighting based on the sampling proportion of the KHB positive and negative samples. Confidence intervals for each of the test parameters were calculated using exact binomial intervals. PPVs of RDT algorithms from the main sample were compared using an analogue of McNemar's statistic (Z score) derived from a marginal logistic regression model [18]. Where the logistic regression model was unable to give a score a bias-corrected bootstrap was employed. Fisher's exact test was used to compare categorical variables for the false positive analysis with the Mann-Whitney test for continuous variables.
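The weighting step can be illustrated with a short sketch. It assumes that each KHB negative sample stands for ten screened individuals (the inverse of the 1-in-10 sampling fraction) and that each KHB positive sample has a weight of one. The record layout and function name are hypothetical, and the counts in the usage example are invented for illustration.

```python
def weighted_accuracy(records, negative_weight=10.0):
    """Estimate test parameters from a weighted 2 x 2 table.

    records: iterable of (screen_positive, algorithm_positive, truly_positive)
    booleans, one tuple per tested sample.
    """
    tp = fp = fn = tn = 0.0
    for screen_positive, algorithm_positive, truly_positive in records:
        weight = 1.0 if screen_positive else negative_weight
        if algorithm_positive and truly_positive:
            tp += weight
        elif algorithm_positive:
            fp += weight
        elif truly_positive:
            fn += weight
        else:
            tn += weight

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Invented counts: 20 true positives, 1 false positive, 3 screen-positive
# samples that the algorithm calls negative, and 10 sampled screen negatives.
example = (
    [(True, True, True)] * 20
    + [(True, True, False)]
    + [(True, False, False)] * 3
    + [(False, False, False)] * 10
)
estimates = weighted_accuracy(example)
```

Because every algorithm positive is also screen positive, the weighting leaves the PPV unchanged and mainly affects the specificity and the negative predictive value.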
Statistical analysis was done using Stata version 12 (StataCorp, Texas, USA).

Ethical review
The study received approval from the MSF Ethics Review Board, the EHNRI Research and Ethical Clearance Committee, and the National Research Ethics Review Committee, Ministry of Science and Technology in Ethiopia.

Results
2622 individuals were screened from December 2010 to July 2011, 1297 (59.2%) in Humera and 895 (40.8%) in Abdurafi. 430 individuals were eligible for analysis, representing all KHB positives and 10% of KHB negatives. One sample was excluded due to missing WB and PCR results, and another was excluded due to a duplicate identity number. This resulted in a total sample of 428. HIV prevalence was 7.7% (203/2620). A description of the demographics of the study participants is found in Table 1.
The WB identified 201 positives, 166 negatives and 61 indeterminates (59 of which were negative on PCR and 2 positive). The OIC identified 198 positives, 223 negatives and 7 indeterminates (5 positive on PCR and 2 negative). There were no positive HIV-2 results on either the OIC or WB.
Test performance measures of the individual RDTs are in Table 2. Each RDT had a specificity of 99% or greater.

Serial Algorithms
The KHB/STAT-PAK® serial algorithm had one false positive result and no false negatives. There were 22 discordant results, all of which were resolved negative by WB and PCR.
The addition of the OIC confirmatory test to the KHB/STAT-PAK® algorithm eliminated the false positive result. There was no statistically significant difference in PPV between the serial KHB/STAT-PAK® algorithm and the corresponding serial OIC algorithm (p = 0.33).
An alternate serial algorithm with KHB/Unigold™ resulted in 16 false positives, 0 false negatives, and 9 discordants, 7 (77.8%) of which were resolved to be negative. The addition of the OIC test to the algorithm significantly improved the PPV (p = 0.004). Details are in Tables 3 and 4.

Tiebreaker algorithm
The KHB/STAT-PAK®/Unigold™ tiebreaker algorithm yielded 16 false positive results and 0 false negatives as shown in Table 3. Addition of the OIC test to the algorithm eliminated the false positive results and added 6 indeterminate results, 5 of which were resolved positive. The PPV of the OIC algorithm was significantly improved compared to the tiebreaker alone (p < 0.001).
An alternate tiebreaker algorithm of KHB/Unigold™/STAT-PAK® also resulted in 16 false positives and no false negatives. Addition of the OIC test eliminated the false positives and added 6 indeterminate results. Five of the indeterminates were resolved positive. The OIC test significantly improved the PPV compared with the alternate tiebreaker algorithm alone (p < 0.001).
Compared to the serial KHB/STAT-PAK® algorithm, the tiebreaker did significantly worse (p = 0.004). Looking at the alternate algorithm, the difference between the serial KHB/Unigold™ and the corresponding tiebreaker was not significant (p = 0.16).
Details are in Table 4. The kappa statistic for inter-reader agreement between the CT and the laboratory in Humera for weak versus strong positives on whole blood was 0.85 (p < 0.001) for KHB and 0.32 (p < 0.001) for STAT-PAK®. The kappa statistic for inter-reader agreement between the CT and the laboratory in Abdurafi for weak versus strong positives on plasma was 0.79 (p < 0.001) for KHB and 0.49 (p < 0.001) for STAT-PAK®.
The PPV of a single weak positive RDT result versus that of a strong positive is found in Table 6.
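For reference, the kappa statistic for two readers and two categories (weak versus strong) can be computed from a 2 × 2 agreement table as sketched below. The argument names and the counts in the example are hypothetical and are not study data.

```python
def cohens_kappa(both_weak, weak_strong, strong_weak, both_strong):
    """Cohen's kappa for two readers classifying lines as weak or strong.

    weak_strong: reader 1 read the line as weak, reader 2 as strong.
    strong_weak: reader 1 read the line as strong, reader 2 as weak.
    """
    n = both_weak + weak_strong + strong_weak + both_strong
    observed = (both_weak + both_strong) / n
    # Agreement expected by chance, from each reader's marginal totals.
    expected = (
        (both_weak + weak_strong) * (both_weak + strong_weak)
        + (strong_weak + both_strong) * (weak_strong + both_strong)
    ) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical counts: observed agreement 0.70, chance agreement 0.50.
kappa = cohens_kappa(both_weak=20, weak_strong=5, strong_weak=10, both_strong=15)
```

Kappa corrects the observed agreement for the agreement expected by chance, so a value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance.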

Description of false positives
A total of 16 individuals were identified on the tiebreaker algorithm as false positives (FP). Table 7 contains a breakdown of the characteristics of the false positive samples.

Discussion
Our results demonstrate that the current Ethiopian algorithm, a 3 RDT tiebreaker algorithm, has a high proportion of false positives (7.3%, corresponding to a PPV of 92.7%) in our study population with a HIV prevalence of 7.7%. There were 16 individuals falsely identified as HIV infected. Altering the algorithm such that the final tiebreaker RDT would be STAT-PAK® rather than Unigold™ did not improve the performance. All 3 RDTs exceeded the WHO criteria for specificity (>98%), yet the algorithm did not achieve the target PPV of >99% [3]. This suggests that it is the choice of algorithm rather than a poorly performing RDT that is responsible for the high percentage of false positives.
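The general relationship between specificity, prevalence and PPV helps to explain why a specificity above 98% does not by itself guarantee a PPV above 99%. The sketch below applies Bayes' rule to a single test or to a whole algorithm; the input values in the example are illustrative assumptions and are not estimates from this study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV from sensitivity, specificity and prevalence (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Illustrative inputs: 100% sensitivity, 99% specificity, 7.7% prevalence.
ppv_single = positive_predictive_value(1.0, 0.99, 0.077)

# The same performance at a prevalence of 1.3%.
ppv_low_prevalence = positive_predictive_value(1.0, 0.99, 0.013)
```

With these inputs the PPV is about 89% at a prevalence of 7.7% and falls further at 1.3%, which is why the specificity of the algorithm as a whole, and not only that of its component tests, determines whether the PPV target is met.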
Similarly poor performance of the tiebreaker has been reported elsewhere. A study from the Rakai cohort in Uganda reports a false positive proportion of 43.7% (129/295) with a tiebreaker algorithm of Determine/STAT-PAK®/Unigold™ at a HIV prevalence of 11.2% [4]. A separate Ugandan study in a higher prevalence population looked at the performance of the tiebreaker when two out of three tests were positive; 14 of 29 (48.3%) were confirmed negative on DNA PCR [19]. In a large study conducted in Lusaka, Zambia and Kigali, Rwanda, samples with 2 out of 3 RDTs positive were found to be negative for HIV infection in 17 out of 37 (46%) of cases [20].
Many authorities mistakenly assume the tiebreaker is a WHO recommended algorithm. In fact, the WHO recommends serial or parallel testing with 2 RDTs for high prevalence populations (>5%), and 3 tests in series for low prevalence populations (<5%). Positive results should only be given if all 3 tests are positive. Those with 2 out of 3 tests positive are advised that they need further testing [3]. The discordant result between two RDTs that triggers the use of the third RDT is an indicator that cross reactivity is occurring. In our sample, 100% of discordant results on the serial KHB/STAT-PAK® algorithm were resolved as negative, as were 77.8% of the discordants on the alternate KHB/Unigold™ algorithm. In the setting of discordant results between 2 RDTs, a confirmatory test is needed because a third RDT will be vulnerable to similar cross-reactivity. The KHB/STAT-PAK® serial algorithm yielded a single false positive result. When the serial algorithm was changed to KHB/Unigold™, the results were markedly worse and similar to the tiebreaker regimen due to the poor performance of Unigold™. The addition of the OIC confirmation test to either the serial or the tiebreaker algorithm eliminated all the false positive results. In 3 of the 4 algorithms tested, it significantly improved the PPV compared to no confirmation test. The exception was the serial KHB/STAT-PAK® algorithm, which performed adequately without a confirmation test. However, the addition of the OIC did identify one individual who otherwise would have been falsely labelled as HIV positive.
It is important to state that these results were obtained with a stricter interpretation of OIC, as described in the methodology section. The results confirmed this choice; there were no misclassifications of positive or negative results on OIC compared to the gold standard. A drawback to the OIC is that, similar to other serological confirmation tests, there were indeterminate tests for which a result could not be given on the same day. There were 6 algorithm positive results that were indeterminate on OIC, 5 of which were subsequently resolved as positive. Finally, the OIC is one type of confirmation test that is suitable for use in peripheral laboratories. There is an urgent research need to develop and evaluate other simple confirmation tests, which are affordable, cold-chain independent, and can be performed at peripheral level.

Description of false positives
There were 5 characteristics that were significantly associated with false positive results on univariate analysis: age, male sex, active VL infection, the Abdurafi study site and the time of enrolment in the study. It was not possible to do multivariate analysis in order to explore these associations further due to small sample size. It is therefore difficult to conclude on the clinical significance of these findings.
The most common band present on the false positive samples was p24, as 61.1% of FPs were positive for p24 by WB and 16.7% by OIC. This suggests that much of the cross-reactivity responsible for the falsely positive RDTs may be due to p24 and is consistent with previous findings from Ethiopia [21]. Antibody to p24 antigen is one of the earliest antibodies to appear, which raises the possibility that our false positives were seroconverting. However, the fact that all of these cases had negative PCR testing indicates that this is cross-reactivity rather than early seroconversion.

Weakly reactive test lines
All of the false positive results in this review resulted from weak positive RDT results. Excluding weak positives from the analysis results in a 100% PPV for all the algorithms studied. This contrasts sharply with the PPV of the weak positive tiebreaker algorithm currently in use in Ethiopia, which was found to be just 23.8% (95% CI 8.2-47.2). All 3 RDTs demonstrated weak positivity, though STAT-PAK® had fewer weak lines than Unigold™ and KHB. To our knowledge, this is the first report of KHB weak positives. There is evidence of Determine, STAT-PAK®, and Unigold™ showing this phenomenon in multiple countries, which suggests this is a class effect rather than one specific to a particular RDT or geographic location [4,5,8,10,11,12,13]. One report suggests that weak positive results are more frequent with plasma than with whole blood [11]. Our results do not support this finding. The poor specificity of weak positive test lines is consistent with that found by other researchers and is felt to reflect the occurrence of cross reactivity. Our results further reinforce the recommendation of a growing body of researchers that weak positive test lines should be interpreted as indeterminate, and require further testing before giving a result [4,5,8,11,12]. The exception is in the setting of blood transfusions, where any colouration of the test line should be read as positive. Table 8 provides a summary of the data used to support these recommendations. Given this body of evidence, it is now time to focus research efforts on how to implement this change in field conditions. The test algorithm will need adaptation to incorporate the strength of the test line. A key feasibility issue will be the subjective nature of interpretation of the lines. We used a reference card with photographs and were able to achieve good agreement, as evidenced by the kappa levels of 0.79 and 0.85 for KHB.
Agreement for STAT-PAK® was less consistent; however, the number of weak positives was small. Several authors have reported good agreement between field staff interpretation of the test line strength and that of the reference laboratory [4,22]. However, this needs to be evaluated outside of study conditions, particularly given the different cadres of staff involved in testing. Bench aids to guide interpretation and a simple training package need to be developed. Finally, to avoid losing individuals to follow up, it will be important to have timely access to confirmation testing as well as good counselling to explain the need for follow up testing. While our study did not have any false positives without weak positives on one or more RDTs, false positive results are documented to occur with strongly reacting test lines [5,12]. Re-classifying weak positives as inconclusive will therefore not eliminate the risk of false positive reactions.
The strengths of this study are its use of a standard WHO design for evaluation of RDTs, as well as the use of DNA PCR to resolve indeterminate OIC and Western Blot results. The latter avoids misclassification of severely immunosuppressed patients or seroconvertors as false positives. The identification of the weak positive test lines was strengthened by the use of a photo card. A limitation is that we were unable to confirm all of the negative results, and instead tested a random sample of 10% of the negatives. We adjusted for this in the analysis using statistical methods. In total, 16 individuals were misclassified as HIV positive using the national algorithm. The consequences of receiving a diagnosis of HIV can be devastating to an individual and family. Four of these individuals had CD4 counts less than 350, and 7 had counts less than 500. This suggests that without confirmation testing, these individuals could have been started on ART using new guidelines for ARV initiation. The programmatic consequences of following these individuals in HIV clinics, along with the costs for ancillary laboratory tests are significant [13,23].
Our study suggests that these risks and costs can be eliminated through 3 measures. The first is to abandon the tiebreaker regimen in favour of a WHO approved serial or parallel algorithm. The second is to consider weak positive results as inconclusive and resolve their status through further testing. The third, and most important, is to introduce a confirmation test to the RDT algorithm, using a test that can be performed in peripheral laboratories.
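One possible way of combining these three measures in a testing workflow is sketched below. It is an illustration of the recommendations and not a validated algorithm; the intensity labels ("none", "weak", "strong"), the function names and the wording of the returned results are assumptions.

```python
def read_test_line(intensity, blood_transfusion=False):
    """Interpret one RDT test line from its intensity: 'none', 'weak' or 'strong'."""
    if intensity == "none":
        return "negative"
    if intensity == "weak" and not blood_transfusion:
        # Weak lines are treated as indeterminate outside blood transfusion.
        return "indeterminate"
    return "positive"


def serial_with_confirmation(first_intensity, second_intensity, confirm):
    """Serial two-RDT algorithm with a confirmation test for double positives.

    confirm: a callable returning the confirmation test result. It is called
    only when both RDTs are read as clearly positive.
    """
    first = read_test_line(first_intensity)
    if first == "negative":
        return "negative"
    second = read_test_line(second_intensity)
    if first == "positive" and second == "positive":
        return confirm()
    # Weak lines and discordant results are not reported as positive.
    return "further testing required"
```

In this sketch no result is reported as positive on RDTs alone: double strong positives go to the confirmation test, and any weak or discordant combination is referred for further testing.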

Conclusion
The risk of false positive HIV diagnosis with the tiebreaker algorithm is significant. False diagnosis of HIV has major consequences for individuals and for health systems. The OIC test improves the diagnostic accuracy of the RDT algorithm and shows good agreement with the gold standard. Weak positive reactions on RDTs are associated with false positive HIV results, and require further testing prior to giving a HIV diagnosis. It is now time to focus research efforts on how best to translate this knowledge into practice at the field level ensuring feasibility in the variety of settings where HIV testing takes place.