 Research Article
 Open Access
Comparison of predictive models for hepatitis C coinfection among HIV patients in Cambodia
BMC Infectious Diseases volume 20, Article number: 209 (2020)
Abstract
Background
Hepatitis C virus (HCV) infection is a major global health problem. WHO guidelines recommend screening all people living with HIV for hepatitis C. Considering the limited resources for health in low- and middle-income countries, targeted HCV screening is potentially a more feasible screening strategy for many HIV cohorts. Hence there is an interest in developing clinician-friendly tools for selecting subgroups of HIV patients for whom HCV testing should be prioritized. Several statistical methods have been developed to predict a binary outcome. Multiple studies have compared the performance of different predictive models, but results were inconsistent.
Methods
A cross-sectional HCV diagnostic study was conducted in the HIV cohort of Sihanouk Hospital Center of Hope in Phnom Penh, Cambodia. We compared the performance of logistic regression, Spiegelhalter-Knill-Jones and CART to predict hepatitis C coinfection in this cohort. We estimated the number of HCV coinfections that would be missed. To correct for over-optimism, the leave-one-out bootstrap estimator was used for estimating this quantity.
Results
Logistic regression misses the fewest HCV coinfections (8%), but would still refer 98% of HIV patients for HCV testing. Spiegelhalter-Knill-Jones (SKJ) and CART miss 12% and 29% of HCV coinfections, respectively, but would only refer about 30% for HCV testing.
Conclusions
In our dataset, logistic regression has the highest log-likelihood and the smallest proportion of HCV coinfections missed, but Spiegelhalter-Knill-Jones has the highest area under the ROC curve. The likelihood ratios estimated by Spiegelhalter-Knill-Jones might be easier for clinicians to interpret than the odds ratios estimated by logistic regression or the decision tree from CART. CART is the most flexible method, and no model has to be specified regarding the presence of interactions and the form of the relationship between outcome and predictor variables.
Background
Hepatitis C virus (HCV) infection is a major global health problem. Worldwide, 71 million people are chronically infected, and HCV-attributable mortality kept rising over the last 20 years to 495,000 annual deaths in 2015 [1]. Until recently, treating HCV was complex, unaffordable, poorly successful and not considered for programming in low- and middle-income countries (LMIC). This changed with the advent of affordable generic HCV direct-acting antivirals. The new global HCV cascade targets (90% of infected diagnosed and 80% of diagnosed treated by 2030) reflect this paradigm shift [2]. To allow timely scale-up of treatment, efficient HCV testing strategies will thus be crucial. Less than 15% of those living with hepatitis C know their status, with even lower proportions in LMIC [3]. WHO guidelines recommend screening all people living with HIV for hepatitis C. For the general population, the recommendation is tailored according to prevalence: universal screening if prevalence is above 2 or 5%, and targeted screening if lower [4]. Considering the limited resources for health in LMIC, and recent data indicative of low-to-intermediate HCV/HIV coinfection rates among HIV populations without a specific risk profile [5, 6], targeted HCV screening is potentially a more feasible and cost-effective screening strategy for many HIV cohorts in LMIC (except for HIV populations with a higher risk profile, such as men who have sex with men and people who use drugs), especially in this initial phase of HCV care scale-up. Simple tools or scores to guide targeted screening, other than birth-cohort screening, do not exist. However, HCV screening based on older age as the sole criterion might be too restrictive for LMIC, where drivers of generalized HCV exposure were often removed much later or only partially [7].
Hence there is an interest in developing other, more sensitive, but clinician-friendly tools for selecting subgroups of HIV patients for whom HCV testing should be prioritized, i.e. in predicting active HCV coinfection (defined as HCV-RNA detected). When developing a predictive model, multiple items might be of prognostic value. Since these items are typically correlated, the predictive model should take this dependency into account. Logistic regression [8] is widely used when the outcome is a binary variable. However, several other approaches have been developed, e.g. classification and regression trees (CART) [9] and the Spiegelhalter-Knill-Jones (SKJ) approach [10]. Several studies have compared the performance of different predictive models, but results were inconsistent [11, 12]. While the SKJ method requires all predictors to be categorical, the logistic regression model and CART are able to incorporate continuous predictors too. Another advantage of CART is that it does not require a predefined underlying relationship between the predictors and the outcome. The goal of this paper is to compare the performance of these three methods to predict HCV coinfection in a cohort of Cambodian HIV-infected patients.
Methods
Data source
We compared the performance of the predictive models on a dataset from a cross-sectional HCV diagnostic study conducted in the HIV cohort of Sihanouk Hospital Center of Hope (SHCH) in Phnom Penh, Cambodia (clinicaltrials.gov NCT02361541) [5]. The information on potential predictors (by history-taking, physical examination and laboratory testing) was collected prospectively following a prespecified study protocol, while the results of HCV diagnostic testing were still unknown. In total, 3045 adult HIV patients were enrolled, of whom 106 had a current HCV coinfection (i.e. HCV-RNA detected). We built the predictive models including the following items: age (years), gender (female/male), platelet count (×10^{9} cells/L), aspartate aminotransferase (AST, IU/L), alanine aminotransferase (ALT, IU/L), AST-to-platelet ratio index (APRI), having diabetes mellitus (yes/no), any of the following symptoms: fatigue, myalgia/arthralgia, anorexia/weight loss (yes/no), presenting generalized pruritus without obvious skin lesions (yes/no), having a household member and/or partner with liver disease (yes/no), and poor CD4 recovery on ART, i.e. CD4 below 200 after 3 years or more on ART (yes/no).
Performance of predictive model
In this setting, we wanted to select a subset of HIV patients at higher risk of HCV coinfection for whom HCV testing should be prioritized. In the absence of a well-established threshold for HCV testing, we considered the harm/benefit of testing and not testing (at patient and public health level). We intended a lower threshold than the WHO-recommended threshold (2–5% depending on resource availability) for HCV testing in the general population [4], because HIV populations in resource-constrained settings remain at higher risk of advanced HCV disease as they have often started antiretroviral therapy late or with less optimal regimens. A 1% probability threshold for the decision rule (i.e. giving false negatives much more weight than false positives) seems low enough, as the risk score, if easily applicable, can be repeated yearly. Hepatitis C treatment is in most cases not urgent. Hence our aim was to build a prediction model where the probability of HCV coinfection in the group classified as negative is smaller than 1%. To compare the performance of the prediction models obtained with the different methods, we estimated the log-likelihood, the area under the ROC curve, the number of HCV coinfections that would be missed, the sensitivity, the specificity, and the positive and negative predictive values. To correct for over-optimism, the leave-one-out bootstrap estimator [13] was used. Furthermore, we compared the proportion of participants who would be referred for HCV testing.
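The leave-one-out bootstrap idea can be sketched as follows: a model is refitted on each bootstrap sample and evaluated only on the subjects that were not drawn into that sample, and the losses are averaged per subject. This is a minimal illustration under stated assumptions, not the analysis code of the study: the data are synthetic, the classifier is a toy one-variable threshold rule, the loss is plain misclassification rather than the number of missed coinfections, and the names fit_threshold_rule and loo_bootstrap_error are hypothetical.

```python
import numpy as np

def fit_threshold_rule(x, y):
    """Toy classifier (an assumption, not one of the paper's models):
    predict positive when x exceeds the midpoint of the two class means."""
    return (x[y == 1].mean() + x[y == 0].mean()) / 2.0

def loo_bootstrap_error(x, y, n_boot=200, seed=0):
    """Leave-one-out bootstrap estimate of the misclassification rate."""
    rng = np.random.default_rng(seed)
    n = len(y)
    loss_sum = np.zeros(n)  # accumulated loss per subject
    n_out = np.zeros(n)     # number of bootstrap samples that left the subject out
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # bootstrap sample, with replacement
        if y[idx].min() == y[idx].max():
            continue                            # skip samples containing one class only
        cut = fit_threshold_rule(x[idx], y[idx])
        out = np.ones(n, dtype=bool)
        out[idx] = False                        # subjects not drawn into this sample
        pred = (x[out] > cut).astype(int)
        loss_sum[out] += (pred != y[out])
        n_out[out] += 1
    used = n_out > 0                            # subjects left out at least once
    return float(np.mean(loss_sum[used] / n_out[used]))

# Synthetic data for illustration only.
data_rng = np.random.default_rng(1)
y = (data_rng.random(300) < 0.2).astype(int)
x = data_rng.normal(loc=y * 1.5, scale=1.0)
err = loo_bootstrap_error(x, y)
print(round(err, 3))
```

Because each prediction is made for a subject the model has not seen, the estimate avoids the optimism of evaluating a model on the data used to fit it.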
Logistic regression
The logistic regression model is
\[ \log \frac{P(Y=1)}{1-P(Y=1)} = \alpha + \beta_{1}x_{1} + \cdots + \beta_{p}x_{p} \]
where Y=1 denotes HCV coinfection and \(x_{1},\dots,x_{p}\) are the different predictors. The coefficients \(\beta_{i}\) represent the adjusted log odds ratio (OR) for each difference of one unit in \(x_{i}\). The intercept, \(\alpha\), is the log odds when all predictors are equal to zero. The logistic regression model can include continuous, binary and categorical predictors. Missingness was added as a factor level to variables for which there are missing values.
A logistic regression model was fitted with all candidate predictors as independent variables. Because of sparse data, Firth correction was applied. The predictor score was calculated by rounding \(\hat {\beta }_{1}x_{1}+\cdots +\hat {\beta }_{p}x_{p}\). A cutoff was chosen as the minimal value such that in the group of subjects with this score, the proportion of subjects with HCV coinfection was larger than 1%. All subjects with a score of at least this cutoff were classified as needing HCV testing.
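The score and cutoff rule described above can be sketched in a few lines. The coefficients and counts below are made up for illustration, and integer_score and select_cutoff are hypothetical names, so this is a sketch of the rule as worded in the text rather than the code used in the study.

```python
from collections import defaultdict

def integer_score(coefs, values):
    """Round the linear predictor (without intercept) to the nearest integer.
    Note: Python's round() rounds exact halves to the nearest even integer."""
    return round(sum(b * x for b, x in zip(coefs, values)))

def select_cutoff(scores, outcomes, threshold=0.01):
    """Smallest score whose own group has an observed outcome proportion
    above the threshold; None if no score group qualifies."""
    n = defaultdict(int)
    positives = defaultdict(int)
    for s, y in zip(scores, outcomes):
        n[s] += 1
        positives[s] += y
    eligible = [s for s in n if positives[s] / n[s] > threshold]
    return min(eligible) if eligible else None

# Hypothetical coefficients (log odds ratios) for age in years and male gender.
example_score = integer_score([0.04, -0.6], [50, 1])

# Hypothetical cohort: 1/200 positive at score -3, 2/100 at -2, 5/50 at -1.
scores = [-3] * 200 + [-2] * 100 + [-1] * 50
outcomes = [1] + [0] * 199 + [1, 1] + [0] * 98 + [1] * 5 + [0] * 45
cutoff = select_cutoff(scores, outcomes)
referred = sum(s >= cutoff for s in scores)
print(example_score, cutoff, referred)
```

In this toy cohort the group with score -3 has an observed proportion of 0.5%, so the cutoff falls at -2 and everyone scoring -2 or higher is referred.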
Classification and regression trees
Classification and regression trees (CART) use recursive binary partitions to divide the predictor space into a set of subregions [9]. More specifically, the covariate space of the root node is split into two child nodes, based on the predictor and cutoff that yield the largest decrease in impurity (i.e. less heterogeneity in outcome within each node). Next, one of these child nodes is split into two more nodes. This procedure is repeated under the following conditions: a node has to contain at least 20 observations to be considered for splitting and a terminal leaf has to contain at least 7 observations. Since this process likely overfitted the data, the tree was pruned to a smaller subtree. A penalty is added to the error of the tree, relative to the size of the tree. A sequence of trees was fitted, each with a different cost-complexity parameter (i.e. penalty for the size of the tree). The smallest tree whose error lies within one standard error of the minimal error over the sequence of trees was selected. The weight for false negatives was chosen so that the proportion of true HCV coinfections in the group classified as negative is smaller than 1%. For each split a surrogate variable is identified which approximates the split using another predictor variable. Any observation which is missing the split variable is then classified using the surrogate variable [14].
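The tree-selection step (smallest tree within one standard error of the minimal error) can be illustrated as below. The sequence of trees and its error estimates are invented for the example, the errors are assumed to come from cross-validation, and one_se_rule is a hypothetical helper, not a function of any tree-fitting package.

```python
def one_se_rule(trees):
    """trees: list of dicts with keys 'cp', 'n_leaves', 'cv_error', 'cv_se'.
    Return the smallest tree whose error is within one standard error
    of the minimal error over the sequence."""
    best = min(trees, key=lambda t: t["cv_error"])
    limit = best["cv_error"] + best["cv_se"]
    candidates = [t for t in trees if t["cv_error"] <= limit]
    return min(candidates, key=lambda t: t["n_leaves"])

# Hypothetical pruning sequence: a larger cost-complexity penalty (cp)
# yields a smaller tree.
sequence = [
    {"cp": 0.050, "n_leaves": 1,  "cv_error": 1.00, "cv_se": 0.05},
    {"cp": 0.020, "n_leaves": 3,  "cv_error": 0.80, "cv_se": 0.05},
    {"cp": 0.010, "n_leaves": 6,  "cv_error": 0.74, "cv_se": 0.05},
    {"cp": 0.005, "n_leaves": 11, "cv_error": 0.72, "cv_se": 0.05},
    {"cp": 0.001, "n_leaves": 20, "cv_error": 0.75, "cv_se": 0.05},
]
chosen = one_se_rule(sequence)
print(chosen["n_leaves"], chosen["cp"])
```

Here the minimal error (0.72) belongs to the tree with 11 leaves, but the tree with 6 leaves is within one standard error of it and is therefore preferred as the simpler rule.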
Spiegelhalter-Knill-Jones
The Spiegelhalter-Knill-Jones (SKJ) approach adapted by Berkley et al. [10, 15] estimates likelihood ratios. Because the SKJ approach requires binary predictor variables, the continuous candidate predictors were dichotomized using the cutoff which maximizes the Youden index. In a first step, unadjusted likelihood ratios (LR) for all candidate predictors are estimated, and the predictors with an unadjusted LR ≥2 or ≤0.5 are included in a next step, in the multivariable logistic regression model:
\[ \log \frac{P(Y=1)}{1-P(Y=1)} = \alpha + \beta_{1}w_{1} + \cdots + \beta_{p}w_{p} \]
where \(w_{i}\) is the crude log positive/negative LR for positive/negative test results respectively. The adjusted likelihood ratios (aLR) are then given by
\[ \mathrm{aLR}_{i} = \exp(\beta_{i}w_{i}) \]
where β_{i} is the shrinkage factor from crude LR to adjusted LR. The predictors with an aLR ≥1.5 or ≤0.67 were selected for the final predictive model. The aLRs were transformed to their natural logarithm, and rounded to the nearest integer to calculate the score (relative weight) of each predictor. By summing the scores of all predictors presented by a patient the total predictor score for each patient was obtained. A value of 0 was assigned to missing data, assuming that a missing value is not predictive. A cutoff was chosen as the minimal value such that in the group of subjects with this score, the proportion of subjects with HCV coinfection was larger than 1%. All subjects with a score of at least this cutoff were classified as needing HCV testing.
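As a rough illustration of the SKJ scoring steps, the sketch below derives crude log likelihood ratios from a 2x2 table, shrinks them, rounds the log adjusted LR to an integer weight and sums the weights, with missing findings contributing 0. The 2x2 counts, the shrinkage factors and the predictor names are hypothetical, the shrinkage factors would in practice come from the multivariable logistic regression, and the selection thresholds on LR and aLR are omitted for brevity.

```python
import math

def crude_log_lrs(tp, fn, fp, tn):
    """Crude log LR+ and log LR- of a binary predictor from a 2x2 table."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return math.log(sens / (1 - spec)), math.log((1 - sens) / spec)

def skj_weights(log_lr_pos, log_lr_neg, beta):
    """Integer weights: ln(aLR) = beta * w, rounded to the nearest integer."""
    return round(beta * log_lr_pos), round(beta * log_lr_neg)

def total_score(weights, findings):
    """findings maps predictor -> True, False or None; a missing value
    (None or absent) contributes 0 to the score."""
    total = 0
    for name, (w_pos, w_neg) in weights.items():
        value = findings.get(name)
        if value is None:
            continue
        total += w_pos if value else w_neg
    return total

# Hypothetical 2x2 counts (tp, fn, fp, tn) and shrinkage factors.
weights = {
    "predictor_a": skj_weights(*crude_log_lrs(60, 40, 300, 2600), beta=0.8),
    "predictor_b": skj_weights(*crude_log_lrs(30, 70, 100, 2800), beta=0.9),
}
print(weights)
print(total_score(weights, {"predictor_a": True, "predictor_b": None}))
```

Because every weight is a small integer, the total score can be added up by hand at the bedside, which is the practical appeal of the method.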
Statistical analysis was performed in Stata 15.1 [16] and R 3.5.0 [17].
Results
A total of 3045 ambulatory HIV patients of Sihanouk Hospital Center of Hope were included. Their median age was 43 years (interquartile range (IQR): 36–48), 43% were male patients, 98% were on antiretroviral therapy (ART), and 1% (N=31) reported past or current sex work, being homosexual, or a history of injecting drug use. In this cohort, 106 patients had a detectable HCV-RNA (our outcome of interest), but none among the above-mentioned 31 HIV patients with a higher risk profile. The distribution of the candidate predictors in the cohort and the missing values are further specified in Table 1.
Predictive models: logistic regression, CART, Spiegelhalter-Knill-Jones
The adjusted odds ratios from the logistic regression model are shown in Table 2. Higher age, ALT and APRI, and having a partner or household member with liver disease, increase the probability of HCV coinfection, while higher platelet levels and being male decrease the probability of HCV coinfection. The number of observed HCV coinfections for each score is shown in Table 3. A score of −2 is the lowest score for which the proportion of subjects who are HCV coinfected is larger than 1%. Thus all subjects with a prediction score of −2 or higher would be referred for HCV screening.
In CART, to ensure that the proportion of true HCV coinfections in the group who are classified as negative is smaller than 1%, the selected weight for false negatives was 58. The predictors used in the tree (Fig. 1) are: age, gender, platelets, AST, ALT, APRI, any of fatigue, myalgia/arthralgia, anorexia/weight loss and generalized pruritus. Of the 106 subjects with HCV coinfections, 105 would be referred for HCV screening, compared to 839 of the 2939 subjects without HCV coinfection.
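The counts reported above for CART are enough to work out the apparent screening measures. The sketch below does so with a hypothetical helper, screening_metrics; these are apparent values on the full cohort, not corrected for over-optimism, so they need not match the leave-one-out bootstrap estimates.

```python
def screening_metrics(tp, fn, fp, tn):
    """Standard measures for a referral rule from the 2x2 counts."""
    n = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # proportion of true coinfections among those classified as negative
        "missed_among_negatives": fn / (tn + fn),
        "proportion_referred": (tp + fp) / n,
    }

# Counts from the text: 105 of 106 coinfected and 839 of 2939
# non-coinfected subjects would be referred.
cart = screening_metrics(tp=105, fn=1, fp=839, tn=2939 - 839)
for name, value in cart.items():
    print(f"{name}: {value:.4f}")
```

With one coinfection among the 2101 subjects classified as negative, the apparent proportion missed in that group is well below the 1% target, while about 31% of the cohort would be referred.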
The unadjusted and adjusted likelihood ratios of the candidate predictors resulting from the Spiegelhalter-Knill-Jones method are reported in Table 4. The predictors retained for the score were: age ≥50 years, platelets <200×10^{9} cells/L, AST ≥30 IU/L, APRI ≥0.45, diabetes mellitus, generalized pruritus and household member and/or partner with liver disease (Table 4). The number of observed HCV coinfections for each score is shown in Table 3. A score of 0 is the lowest score for which the proportion of subjects who are HCV coinfected is larger than 1%. Thus all subjects with a prediction score of 0 or higher would be referred for HCV screening.
Predictive performance of the different models
The predictive performance of the different models is shown in Table 5. Logistic regression obtains the highest log-likelihood and misses the fewest HCV coinfections, but would still refer 98% of HIV patients for HCV testing. Spiegelhalter-Knill-Jones has a higher area under the ROC curve and misses fewer HCV coinfections than CART but has a lower specificity and positive predictive value. Both methods would refer about 30% for HCV testing. This would yield a high cost reduction compared to testing all HIV patients for HCV.
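For an integer score, the area under the ROC curve equals the probability that a randomly chosen coinfected subject has a higher score than a randomly chosen non-coinfected subject, with ties counted as one half. A minimal sketch on made-up scores follows; auc_from_scores is a hypothetical name and the pairwise loop is chosen for clarity, not speed.

```python
def auc_from_scores(scores, outcomes):
    """AUC as the proportion of (positive, negative) pairs in which the
    positive subject scores higher; ties count as half a pair."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up scores and outcomes (1 = coinfected) for six subjects.
auc = auc_from_scores([2, 1, 1, 0, 0, -1], [1, 1, 0, 0, 1, 0])
print(round(auc, 3))
```

Unlike the proportion referred or the proportion missed, this measure does not depend on the chosen cutoff, which is why a method can rank first on AUC without missing the fewest coinfections.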
Discussion
In our dataset, logistic regression has the highest log-likelihood and the smallest proportion of HCV coinfections missed, but refers more subjects for HCV screening. Depending on the specific setting, a balance needs to be struck between the number of HCV coinfections missed and the number of HCV tests to perform. In general, for a triage test (like a clinical scoring system), a higher sensitivity is preferred, and the specificity is determined by the resources available. A limitation of our study is that our goal was not to compare the predictive performance of logistic regression, CART and SKJ in general, but only in this specific case of predicting HCV coinfection in the study population of Cambodian HIV-infected patients. Our findings may not be generalizable to other outcomes. Also, the generalizability of the derived models for our outcome (HCV coinfection) could not be ascertained; this would require further external validation.
When the aim is to predict a binary outcome, logistic regression is widely used. The association of each predictor with the outcome is expressed as an adjusted odds ratio, which might be difficult for clinicians to interpret. However, if the goal is to build a prediction model, the interpretation of the relationship between predictor and response is probably not of interest. Furthermore, for classification, the score needs to be calculated, which is not very user-friendly, although an app could be developed that calculates this score based on the values of the predictor variables. The usefulness of a clinical prediction rule is also determined by its ease of use. The SKJ method estimates adjusted likelihood ratios, positive or negative if key predictors are present or absent, and clinicians prefer this more nuanced information to odds ratios. Moreover, the score can be easily calculated as a sum of integers. CART also results in a decision tool that can be easily applied in clinical practice. However, the relationship between predictor and response is harder to interpret than with logistic regression or SKJ.
In logistic regression, missing values were considered as an extra level of the covariate factor. However this approach is known to be biased, even when missingness is completely at random. Other methods to handle missing data are available, like multiple imputation, but all of them depend on untestable assumptions. They are also more complex and would yield a score not feasible to apply in clinical practice. On the other hand, missing values are naturally handled by SKJ making the assumption that a missing value is not predictive of the outcome (the score corresponds to 0 and does not affect the prediction in confirmation or exclusion). Using CART, for subjects with a missing value for a splitting variable a surrogate split is used.
The SKJ method corrects for confounding, but does not allow interactions between predictors, and the shrinkage used is similar for a negative or a positive test result, i.e. LR+ and LR−. Interactions can be included in the logistic regression model, but they have to be specified. In practice often only two-way interactions are included, if any. Because of the way it is built, CART naturally includes higher-order interactions, derived from the data. In that sense CART is the most flexible method, and no model has to be specified.
The performance of CART can be improved by using random forests or boosted trees [13]. Both methods aggregate information from multiple decision trees; random forests, for instance, grow each tree on a different bootstrap sample. Although their predictive performance typically surpasses that of a single tree, neither random forests nor boosted trees yield a simple decision rule. Hence we did not consider them in this paper, since our aim was to develop a prediction rule that can be easily applied in clinical practice.
Conclusions
When the goal is to predict a binary outcome, logistic regression is often chosen as the method to build a prediction score. However, other methods like SKJ and CART may perform better and should be considered. More research is needed on how to select the best prediction method in a given setting.
Availability of data and materials
The data supporting the findings of this study are retained at the Institute of Tropical Medicine, Antwerp and will not be made openly accessible due to ethical and privacy concerns. Data can however be made available after approval of a motivated and written request to the Institute of Tropical Medicine at ITMresearchdataaccess@itg.be.
Abbreviations
aLR: adjusted likelihood ratio
ALT: alanine aminotransferase
APRI: AST-to-platelet ratio index
ART: antiretroviral therapy
AST: aspartate aminotransferase
CART: classification and regression trees
HCV: hepatitis C virus
IQR: interquartile range
LMIC: low- and middle-income countries
LR: likelihood ratio
OR: odds ratio
SHCH: Sihanouk Hospital Center of Hope
SKJ: Spiegelhalter-Knill-Jones
References
1. Polaris Observatory HCV Collaborators. Global prevalence and genotype distribution of hepatitis C virus infection in 2015: a modelling study. Lancet Gastroenterol Hepatol. 2017; 2:161–76.
2. WHO. Global health sector strategy on viral hepatitis 2016–2021: towards ending viral hepatitis. 2016. http://apps.who.int/iris/bitstream/10665/246177/1/WHOHIV2016.06eng.pdf?ua=1. Accessed 12 Sept 2019.
3. WHO. Global Hepatitis report. 2017. http://apps.who.int/iris/bitstream/10665/255016/1/9789241565455eng.pdf?ua=1.
4. WHO. Guidelines on Hepatitis B and C testing. 2017. http://apps.who.int/iris/bitstream/10665/254621/1/9789241549981eng.pdf?ua=1.
5. De Weggheleire A, An S, De Baetselier I, Soeung P, Keath H, So V, et al. A cross-sectional study of hepatitis C among people living with HIV in Cambodia: Prevalence, risk factors, and potential for targeted screening. PLoS One. 2017; 12:e0183530.
6. Loarec A, Molfino L, Walter K, Muyindike W, Carnimeo V, Andrieux-Meyer I, et al. Low hepatitis C virus prevalence among human immunodeficiency virus+ individuals in Sub-Saharan Africa. J Hepatol. 2017; 66:S270–1.
7. Thursz M, Fontanet A. HCV transmission in industrialized countries and resource-constrained areas. Nat Rev Gastroenterol Hepatol. 2014; 11(1):28–35.
8. Agresti A. Categorical Data Analysis. 2nd ed. Hoboken: Wiley; 2002.
9. Breiman L, Friedman J, Stone CJ, Olshen R. Classification and Regression Trees. Monterey: Wadsworth and Brooks; 1984.
10. Spiegelhalter DJ, Knill-Jones RP. Statistical and knowledge-based approaches to clinical decision support systems, with an application to gastroenterology. J R Stat Soc Ser A. 1984; 147:35–77.
11. Austin PC. A comparison of regression trees, logistic regression, generalized additive models, and multivariate adaptive regression splines for predicting AMI mortality. Stat Med. 2007; 26(15):2937–57.
12. Mansiaux Y, Carrat F. Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infectio. BMC Med Res Methodol. 2014; 14:99.
13. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. New York: Springer; 2009.
14. Therneau T, Atkinson B. rpart: Recursive Partitioning and Regression Trees. 2018. R package version 4.1-13.
15. Berkley J, Ross A, Mwangi I, Osier F, Mohammed M, Shebbe M, et al. Prognostic indicators of early and late death in children admitted to district hospital in Kenya: cohort study. BMJ. 2003; 326(7385):361.
16. StataCorp. Stata Statistical Software: Release 15. College Station: StataCorp LLC; 2017.
17. R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2018. https://www.R-project.org/.
Acknowledgements
The authors of this paper thank Dr An Sokkab and Dr Thai Sopheak, and the team of clinicians, laboratory technicians and data managers of Sihanouk Hospital Center of Hope in Phnom Penh, Cambodia, for their contribution to the collection of the data used for this study.
Funding
This work was supported by a research grant from the Flemish Government - Department of Economy, Science & Innovation.
Author information
Contributions
JB, ADW, JVG and LL contributed to the conceptualization of this study. JB performed the analysis and wrote the manuscript. ADW, JVG and LL critically revised the manuscript and the final content. JB, ADW, JVG and LL approved the final manuscript.
Corresponding author
Correspondence to Jozefien Buyze.
Ethics declarations
Ethics approval and consent to participate
The epidemiological study from which the data were used, was approved by the Institutional Review Board of ITM Antwerp, the Ethics Committee of the Antwerp University Hospital (Belgium) and the Cambodian National Ethics Committee for Health Research.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Buyze, J., Weggheleire, A.D., van Griensven, J. et al. Comparison of predictive models for hepatitis C coinfection among HIV patients in Cambodia. BMC Infect Dis 20, 209 (2020). https://doi.org/10.1186/s12879-020-4909-z
Keywords
 HCV-HIV coinfection
 Spiegelhalter-Knill-Jones
 Logistic regression
 CART