Risk of using logistic regression to illustrate exposure-response relationship of infectious diseases
© Ren et al.; licensee BioMed Central Ltd. 2014
- Received: 8 May 2014
- Accepted: 25 September 2014
- Published: 4 October 2014
In most biological settings, and especially in infectious disease, the exposure-response relationship is shaped by many interrelated factors rather than by independent ones. Little is known about the suitability of the three ways in which exposure has been entered into logistic regression models (raw exposure in ordinary logistic regression, categorical exposure, and logarithmically transformed exposure) to assess the likelihood of an infectious disease as a function of a risk or exposure. This study aims to examine and compare these approaches.
We created a simulated human immunodeficiency virus (HIV) population: dynamic infection data for 100,000 individuals with 1% initial prevalence and 2% infectivity. Using the Monte Carlo method (repeated random sampling to obtain numerical results), we examined the linearity between log odds and exposure, and the practical suitability, of the three modelling approaches.
Across the population prevalences examined, the relationship between log odds and raw exposure was not linear. Logarithmic transformation of exposures improved the linearity to a certain extent, and categorical exposures satisfied the linearity assumption (which is important for modelling). When the population prevalence was low (assumed < 10%), the performances of the three models were significantly different. Compared with ordinary logistic regression, the logarithmic transformation approach gave more accurate estimates except at the two inflection points, where the likelihood of infection changed from increasing slowly to increasing sharply, and then to increasing slowly again. The categorical-exposure approach gave estimates closer to the real values, but the measurement was coarse because of categorization.
It is not suitable to directly use ordinary logistic regression to explore the exposure-response relationship of HIV as an infectious disease. This study provides some recommendations for practical implementations including: 1) utilize categorical exposure if a large sample size and low population prevalence are provided; 2) utilize a logarithmic transformed exposure if the sample size is insufficient or the population prevalence is too high (such as 30%).
- Logistic regression
- Measurement error
- Infectious disease
- Exposure-response relationship
- Computer simulation
Our motivating example arises from a study of the exposure-response relationship between the number of sex partners and prevalent HIV infection among men who have sex with men (MSM), leading to methodological challenges. Data were collected on 1,072 MSM from a retrospective epidemiological survey in Shanghai during the period between April 2008 and September 2009, including the binary response variable Y (current HIV status), a quantitative risk factor X (number of sex partners in the past 6 months), and other covariates Z (social-economic factors, pattern of sex partners and condom use).
Because X is a skewed count, the linearity assumption between the log odds [logit Pr(Y = 1|X)] and the continuous covariate (X) might not be met if logistic regression is used to analyze these data. For this reason, two approaches have been used in previous studies: 1) grouping X into a categorical variable [4, 5], and 2) applying the logarithmic transformation ln(X + 1). However, little is known about the accuracy and suitability of these approaches under these circumstances.
Logistic regression is widely used to assess the likelihood of an infectious disease as a function of a risk or exposure factor (and covariates), and thereby to illustrate the exposure-response relationship [4–8]. However, logistic regression assumes a linear relationship between the independent variables and the log odds. It does not require the dependent and independent variables themselves to be linearly related, but it does require the independent variables to be linearly related to the log odds.
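Written out for a single exposure X, the assumption can be stated as follows. The symbols β0 (intercept) and β1 (slope) are generic notation chosen here for illustration.

```latex
\operatorname{logit}\Pr(Y=1\mid X)
  =\ln\frac{\Pr(Y=1\mid X)}{1-\Pr(Y=1\mid X)}
  =\beta_0+\beta_1 X,
\qquad
\Pr(Y=1\mid X)=\frac{1}{1+e^{-(\beta_0+\beta_1 X)}}
```

The probability itself is an S-shaped function of X; it is only the log odds that the model constrains to be a straight line in X.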
Do studies of infectious disease satisfy the linearity assumption? For infectious diseases, the exposures of individuals are not independent of one another, and the exposure-response relationship forms an intricate network rather than a set of independent factors. Exposure data usually have a skewed distribution and heterogeneous variance [9–13]. In our motivating example, the exposure approximately followed a negative binomial distribution. Thus, it is critical to examine the linearity assumption before applying logistic regression to analyze the exposure-response relationship.
However, it is still unclear which approach is better calibrated and minimizes the error between the predicted values and the real data. Therefore, this study aims to compare the suitability of these approaches, which have been widely used to study the exposure-response relationship of infectious disease, using simulated and real-world data.
Simulation of infected population
We used PROC IML in SAS 9.3 to create a dynamic HIV infection model in a closed population of 100,000 MSM under certain conditions (Additional file 1). The initial prevalence and the infectivity rate in the population were set to 1% and 2%, respectively [4, 14]. The infectivity rate here may be somewhat higher than that in the real-world MSM population because we wanted to reduce the SAS run time. During an incubation period, the number of persons (infected or not) whom each individual could contact was assumed to follow a negative binomial distribution (p = 0.1, r = 1), because we assumed that the chance of exposure was not equal for everyone (as is the case for HIV infection). The model was stopped when it reached a targeted prevalence (10%, 20%, 30% or 40%). Immunization, treatment and intervention were not considered in this simulation, so the disease status was permanent once a person became infected. The simulation process is described below:
1) We set up a closed population of 100,000 individuals, 1% of whom were HIV positive.
2) We randomly assigned a number of sex partners (exposure) to each individual from a negative binomial distribution (p = 0.1, r = 1), so that the distribution of sex partners resembled that shown in Figure 1.
3) A dynamic HIV infection process then started. Each person randomly selected his own sex partners according to the number of sex partners assigned in step 2. For example, a person assigned 0 sex partners was not allowed to find any sex partner, whereas a person assigned 2 sex partners could select two sex partners who were available. One generation was complete when all individuals had reached their preset number of sex partners; a new generation then started. A person was considered HIV positive once he had had ≥50 sex contacts with HIV-positive persons.
4) The model stopped when it reached the preset targeted prevalence (10%, 20%, 30% or 40%).
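The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' SAS program (Additional file 1): partner availability is ignored, contacts are counted only for the person who selects them, and the demo uses a smaller population (2,000) and a lower infection threshold (5 contacts instead of 50) to keep the run time short. The names `draw_partners` and `simulate` are hypothetical.

```python
import math
import random

def draw_partners(rng, p=0.1):
    """Negative binomial draw with r = 1: failures before the first success."""
    u = 1.0 - rng.random()                      # uniform on (0, 1]
    return int(math.log(u) / math.log(1.0 - p))

def simulate(n=2000, init_prev=0.01, target_prev=0.10, threshold=5,
             seed=1, max_generations=10_000):
    rng = random.Random(seed)
    # Closed population, preset partner counts, initial infections
    partners = [draw_partners(rng) for _ in range(n)]
    infected = [False] * n
    for i in rng.sample(range(n), round(n * init_prev)):
        infected[i] = True
    hits = [0] * n            # cumulative contacts with HIV-positive persons
    n_infected = sum(infected)
    generations = 0
    # Run generations until the target prevalence is reached
    while n_infected < target_prev * n and generations < max_generations:
        status = infected[:]  # infection status at the start of the generation
        for i in range(n):
            if infected[i]:
                continue      # infection is permanent, nothing to update
            for _ in range(partners[i]):
                j = rng.randrange(n - 1)
                if j >= i:
                    j += 1    # shift the index so a person never selects himself
                if status[j]:
                    hits[i] += 1
            if hits[i] >= threshold:
                infected[i] = True
                n_infected += 1
        generations += 1
    return partners, infected, generations

partners, infected, generations = simulate()
prevalence = sum(infected) / len(infected)
```

Passing `n=100_000` and `threshold=50` would match the population size and contact rule described in the text, at the cost of a much longer run in pure Python.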
The outcome was a binary dependent variable Y (infected or not infected), and the continuous covariate X was defined as the number of exposures (contacts with other persons) during an incubation period. We did not define X as the number of actual exposures (contacts with infected persons only) because, in the real world, we might not know whether the contacted persons are infected. Based on these simulated data, the population exposure-response relationship between Y and X could be drawn directly, and the linearity assumption between the log odds and the exposure variable could be examined.
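One simple way to examine the linearity assumption is to compute the empirical log odds at several exposure levels and compare the slopes between neighbouring levels; under a linear logit the slopes would be roughly constant. The sketch below uses hypothetical counts (not results from this study) and a 0.5 continuity correction, which is an assumption added here so that empty cells stay finite.

```python
import math

def empirical_log_odds(cases, total):
    """Log odds with a 0.5 continuity correction (an assumption of this sketch)."""
    return math.log((cases + 0.5) / (total - cases + 0.5))

# Hypothetical counts: (number of partners, infected persons, all persons)
table = [(0, 2, 1000), (10, 15, 350), (20, 40, 120), (30, 24, 42), (40, 9, 15)]

logits = [(x, empirical_log_odds(d, n)) for x, d, n in table]

# Slope of the log odds between neighbouring exposure levels
slopes = [(l2 - l1) / (x2 - x1)
          for (x1, l1), (x2, l2) in zip(logits, logits[1:])]
```

In these made-up counts the slope shrinks steadily as exposure grows, which is the kind of pattern that would signal a violation of linearity for the raw exposure.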
Logistic regression models with maximum likelihood estimation
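We fitted three logistic regression models to the simulated data, as follows.
Model A (raw exposure): logit Pr(Y = 1|X) = β0 + β1X
Model B (categorical exposure): logit Pr(Y = 1|X_c) = β0 + β1X_c
Model C (logarithmic transformation): logit Pr(Y = 1|X) = β0 + β1ln(X + 1)
where X_c was treated as a dummy/nominal variable.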
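As a rough illustration of the three approaches (raw exposure, categorical exposure, and ln(X + 1)), the following pure-Python sketch fits each to hypothetical grouped data generated from an S-shaped curve. The data, the category cut points and the helper names (`sigmoid`, `fit_logit`, `category`) are assumptions for illustration; the study itself used PROC LOGISTIC in SAS. For the categorical model with no other covariates, the maximum likelihood log odds of each category is simply its observed log odds, so no iteration is needed.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logit(groups, iters=25):
    """Newton-Raphson MLE of logit P = b0 + b1*x from (x, cases, total) groups."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, d, n in groups:
            p = sigmoid(b0 + b1 * x)
            w = n * p * (1.0 - p)
            g0 += d - n * p                 # score for the intercept
            g1 += (d - n * p) * x           # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: inverse information times score
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical grouped data: exposure x, infected d, persons n
data = []
for x in range(0, 61):
    n = round(1000 * 0.9 ** x)
    p_true = 0.6 / (1.0 + math.exp(-(x - 20) / 4.0))
    data.append((x, round(n * p_true), n))

# Raw exposure and ln(X + 1) exposure
a0, a1 = fit_logit(data)
c0, c1 = fit_logit([(math.log(x + 1), d, n) for x, d, n in data])

# Categorical exposure with hypothetical cut points 0-10, 11-20, 21-30, 31-40, >= 41
cuts = [0, 11, 21, 31, 41]

def category(x):
    return sum(x >= c for c in cuts) - 1

model_b = {}
for k in range(len(cuts)):
    d_k = sum(d for x, d, n in data if category(x) == k)
    n_k = sum(n for x, d, n in data if category(x) == k)
    model_b[k] = math.log(d_k / (n_k - d_k))   # observed log odds of category k

# Predicted probability of infection at 20 partners under each approach
p_raw = sigmoid(a0 + a1 * 20)
p_log = sigmoid(c0 + c1 * math.log(21))
p_cat = sigmoid(model_b[category(20)])
```

Comparing `p_raw`, `p_log` and `p_cat` with the generating curve at several exposure levels gives a small-scale version of the comparison described in the text.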
Monte Carlo experiment for comparison of models
We drew 3000 random samples, each comprising 10% of the population. To estimate the population parameter (β), all three models (A, B and C) were fitted to each sample, so that each model yielded 3000 sample statistics (β). PROC LOGISTIC, PROC SQL and %MACRO in SAS 9.3 were used for this experiment (Additional file 2).
The means and standard deviations of the sample statistics (β) were recorded, and each model was used to score the predicted probability (the likelihood of HIV infection given a certain number of sex partners). A model was considered better if 1) its predicted values were closer to the true values, and 2) its standard deviation was smaller.
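The repeated-sampling procedure can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' SAS macro (Additional file 2): the population is hypothetical (20,000 persons whose infection status follows a known logit in ln(X + 1)), only the log-transformed model is refitted, and 200 replicates are drawn instead of 3000 to keep the run short. All names are hypothetical.

```python
import math
import random
import statistics
from collections import Counter

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logit(groups, iters=25):
    """Newton-Raphson MLE of logit P = b0 + b1*x from (x, cases, total) groups."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, d, n in groups:
            p = sigmoid(b0 + b1 * x)
            w = n * p * (1.0 - p)
            g0 += d - n * p
            g1 += (d - n * p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

rng = random.Random(2014)
TRUE_B0, TRUE_B1 = -4.0, 1.0   # hypothetical generating coefficients

# Hypothetical population: geometric exposure (negative binomial, r = 1, p = 0.1)
population = []
for _ in range(20_000):
    x = int(math.log(1.0 - rng.random()) / math.log(0.9))
    y = rng.random() < sigmoid(TRUE_B0 + TRUE_B1 * math.log(x + 1))
    population.append((x, y))

def one_replicate(fraction=0.10):
    """Draw one random sample and return the fitted slope for ln(x + 1)."""
    sample = rng.sample(population, round(len(population) * fraction))
    totals = Counter(x for x, _ in sample)
    cases = Counter(x for x, y in sample if y)
    groups = [(math.log(x + 1), cases[x], n) for x, n in totals.items()]
    return fit_logit(groups)[1]

betas = [one_replicate() for _ in range(200)]
mean_beta = statistics.mean(betas)   # closeness to the true value
sd_beta = statistics.stdev(betas)    # spread of the sample statistics
```

The two summary values correspond to the two criteria in the text: how close the average estimate is to the true value, and how small the standard deviation is.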
Real-world data for validation
This study also used real-world data to validate the simulation findings. A total of 1,072 MSM were recruited in Shanghai through the snowball sampling method during the period between April 2008 and September 2009.
The survey questionnaire was based on that used in the National Sentinel Surveillance Program since 1995, with modifications based on Chinese MSM community feedback. Local Centers for Disease Control and Prevention (CDC) staff who conducted the surveys by interview were given intensive training and a detailed protocol. Interview settings had at least one private interview/counseling room, a testing room, and a waiting room, and were usually located within a hospital or local CDC facility. Blood samples were collected from each subject and tested for HIV in the laboratory of Shanghai CDC. Counseling was provided before and after testing. Participants who tested negative were notified by local CDC staff, whereas those who tested positive were referred to the National AIDS Program and/or a local hospital or clinic.
The primary aim of this analysis was to examine how the prevalence of HIV (outcome) changed as the number of sex partners (exposure) increased, controlling for other confounders (such as age, marital status, race, highest education level achieved, monthly income, self-identified sexual orientation, condom use, commercial sex behavior and sexual activity with a female).
The real-world data in this study were the Shanghai component of the national cross-sectional survey of 61 cities in China. The national survey was reviewed and approved by the Institutional Review Board of the National Center for AIDS/STD Control and Prevention, China CDC. All participants provided informed consent, when they were recruited, that information from surveys and blood tests could be used for scientific studies, and all the data in the study were de-identified. This study is therefore in compliance with the Helsinki Declaration.
Target exposure-response curves
Linear assumption of logistic regression
Categorizing the exposure was another way of addressing linearity. The linearity assumption of logistic regression was satisfied when the exposure factor was measured as a categorical/dummy variable.
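One way to see why dummy coding sidesteps the assumption is sketched below, assuming a model that contains only the K exposure categories and no other covariates. The index k, the counts d_k (infected) and n_k (persons) in category k, and the coefficient symbols are notation introduced here for illustration.

```latex
\operatorname{logit}\Pr(Y=1\mid X_c=0)=\beta_0,
\qquad
\operatorname{logit}\Pr(Y=1\mid X_c=k)=\beta_0+\beta_k,\quad k=1,\dots,K-1,
\qquad
\hat\beta_0+\hat\beta_k=\ln\frac{d_k}{n_k-d_k}\quad (0<d_k<n_k)
```

Each category has its own free parameter, so the fitted log odds reproduces the observed log odds category by category and no functional form is imposed across categories. The price is resolution: every exposure value within a category receives the same estimate.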
Comparison of logistic regression models
[Table: Means and standard deviations of 3000 sample statistics (HIV prevalence = 10%); surviving labels: Coefficient of variation, ≥ 41 partners, C (Log transformation)]
Validation of real-world data
Meanwhile, we found that the exposure-response curves in the real-world data differed from those in the simulated data: the former approximated an exponential curve, and the latter a generalized logistic curve.
This study focused on assessing the risk of using logistic regression to illustrate the exposure-response relationship of HIV as an infectious disease, which distinguishes it from previous simulation studies that discussed various measurement errors in logistic regression [16–21]. Logistic regression requires linearity between the independent variables and the log odds. However, this study found that the linearity assumption usually could not be satisfied when ordinary logistic regression was used to explore the exposure-response relationship of HIV as an infectious disease. Although logarithmic transformation improved the linearity to a certain extent, it might not correct non-linearity when the exposure is very small or very large. The performance of these logistic regression models would therefore be affected by the non-linearity. To overcome the linearity issue, categorical exposures (dummy variables) are used in logistic regression because dummy variables take only the values 0 and 1.
In this study, we found that the non-linearity (two inflection points) mainly affected the predictions of Model A (raw exposure) and Model C (logarithmic transformation) when the population prevalence was low. If the prevalence is high, individuals are more likely to be infected even with few exposures, so the infectious disease can spread quickly. That is to say, the first inflection point in Figure 3 would be closer to zero, which would improve the linearity. Therefore, it is not surprising that the three models are very similar in Figure 4c.
This study showed that Model B (categorical exposure) did not depend on the linearity assumption, so Model B could give appropriate estimates regardless of the population prevalence. However, Model B is not a perfect solution either, because different studies may use different categorization rules, and a coarse categorization may be unavoidable when the sample size is small. It should also be noted that an inappropriate categorization could substantially underestimate or overestimate the real odds ratio.
In this study, the real-world data on HIV infection among MSM supported our findings from the simulated data overall, but the exposure-response curves clearly differed between the two datasets. The reason could be the larger number of risk factors and confounders in the real-world data. We could not establish the same benchmark for the two datasets, although the logistic regression models adjusted for some confounders (such as demographics and sexual behaviors). The simulated data were pure in the sense that the number of exposures was the only risk factor, whereas in the real world many known and unknown factors could affect HIV infection, even among those with no sex partners in the past six months.
This study provides several useful findings; nevertheless, there are limitations to consider when interpreting the results. First, the simulated data could not cover all circumstances (such as observation errors), so this study simulated only different population prevalences with fixed infectivity and other conditions. Second, this study could not provide an optimal solution to the linearity issue, but some recommendations for practical implementation can be made: 1) use categorical exposure if the sample size is large and the population prevalence is low; 2) use a logarithmically transformed exposure if the sample size is insufficient or the population prevalence is too high (such as 30%).
The authors thank the Shanghai Municipal Center for Disease Control and Prevention for providing the real-world data, and Ms. Marie McWhirter for proofreading the grammar and figures.
Declaration of funding
This paper was not supported by any funding.
- Negative binomial distribution. 2014, Accessed September 4, 2014, at http://en.wikipedia.org/wiki/Negative_binomial_distribution
- Binet FE: Fitting the negative binomial distribution. Biometrics. 1986, 42: 989-992. 10.2307/2530715.
- Elswick RK, Schwartz PF, Welsh JA: Interpretation of the odds ratio from logistic regression after a transformation of the covariate vector. Stat Med. 1997, 16: 1695-1703. 10.1002/(SICI)1097-0258(19970815)16:15<1695::AID-SIM601>3.0.CO;2-V.
- Wu Z, Xu J, Liu E, Mao Y, Xiao Y, Sun X, Liu Y, Jiang Y, McGongan JM, Don Z, Mi G, Wang N, Sun J, Liu Z, Wang L, Rou K, Pang L, Xing W, Wang S, Cui Y, Li Z, Bulterys M, Lin W, Zhao J, Yip R, Wu Y, Hao Y, Wang Y: HIV and syphilis prevalence among men who have sex with men: a cross-sectional survey of 61 cities in China. Clin Infect Dis. 2013, 57: 298-309. 10.1093/cid/cit210.
- Patavino GM, de Almeida-Neto C, Liu J, Wringt DJ, Mendrone-Junior A, Ferreira MI, Carneiro AB, Custer B, Ferreira JE, Busch MP, Sabino EC: Number of recent sexual partners among blood donors in Brazil: associations with donor demographics, donation characteristics, and infectious disease markers. Transfusion. 2012, 52: 151-159. 10.1111/j.1537-2995.2011.03248.x.
- Rosenberg ES, Sullivan PS, Dinenno EA, Salazar LF, Sanchez TH: Number of casual male sexual partners and associated factors among men who have sex with men: results from the National HIV Behavioral Surveillance system. BMC Public Health. 2011, 11: 189. 10.1186/1471-2458-11-189.
- Linkins RW, Chonwattana W, Holtz TH, Wasinrapee P, Chaikummao S, Varangrat A, Tongtoyai J, Mock PA, Curlin ME, Sirivongrangson P, Van Griensven F, McNicholl JM: Hepatitis A and hepatitis B infection prevalence and associated risk factors in men who have sex with men, Bangkok, 2006-2008. J Med Virol. 2013, 85: 1499-1505. 10.1002/jmv.23637.
- Bini EJ, Perumalswami PV: Hepatitis B virus infection among American patients with chronic hepatitis C virus infection: prevalence, racial/ethnic differences, and viral interactions. Hepatology. 2010, 51: 759-766.
- Ypma RJ, Altes HK, van Soolingen D, Wallinga J, van Ballegooijen WM: A sign of superspreading in tuberculosis: highly skewed distribution of genotypic cluster sizes. Epidemiology. 2013, 24: 395-400. 10.1097/EDE.0b013e3182878e19.
- Noremark M, Hakansson N, Lewerin SS, Lindberg A, Jonsson A: Network analysis of cattle and pig movements in Sweden: measures relevant for disease control and risk based surveillance. Prev Vet Med. 2011, 99: 78-90. 10.1016/j.prevetmed.2010.12.009.
- Rothenberg R, Muth SQ: Large-network concepts and small-network characteristics: fixed and variable factors. Sex Transm Dis. 2007, 34: 604-612.
- Nishiura H: Early efforts in modeling the incubation period of infectious diseases with an acute course of illness. Emerg Themes Epidemiol. 2007, 4: 2. 10.1186/1742-7622-4-2.
- Lloyd-Smith JO, Schreiber SJ, Kopp PE, Getz WM: Superspreading and the effect of individual variation on disease emergence. Nature. 2005, 438: 355-359. 10.1038/nature04153.
- HIV Transmission Risk. Accessed April 16, 2014, at http://www.cdc.gov/hiv/policies/law/risk.html
- Sheu SJ, Wei IL, Chen CH, Yu S, Tang FI: Using snowball sampling method with nurses to understand medication administration errors. J Clin Nurs. 2009, 18: 559-569. 10.1111/j.1365-2702.2007.02048.x.
- Thoresen M: A note on correlated errors in exposure and outcome in logistic regression. Am J Epidemiol. 2007, 166: 465-471. 10.1093/aje/kwm107.
- Sugar EA, Wang CY, Prentice RL: Logistic regression with exposure biomarkers and flexible measurement error. Biometrics. 2007, 63: 143-151. 10.1111/j.1541-0420.2006.00632.x.
- Gossl C, Kuchenhoff H: Bayesian analysis of logistic regression with an unknown change point and covariate measurement error. Stat Med. 2001, 20: 3109-3121. 10.1002/sim.928.
- Thoresen M, Laake P: A simulation study of measurement error correction methods in logistic regression. Biometrics. 2000, 56: 868-872. 10.1111/j.0006-341X.2000.00868.x.
- Schorgendorfer A, Branscum AJ, Hanson TE: A Bayesian goodness of fit test and semiparametric generalization of logistic regression with measurement data. Biometrics. 2013, 69: 508-519. 10.1111/biom.12007.
- Horton NJ, Laird NM: Maximum likelihood analysis of logistic regression models with incomplete covariate data and auxiliary information. Biometrics. 2001, 57: 34-42. 10.1111/j.0006-341X.2001.00034.x.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2334/14/540/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.