Inter-rater agreement in the assessment of abnormal chest X-ray findings for tuberculosis between two Asian countries
- Shinsaku Sakurada†1,
- Nguyen TL Hang†2,
- Naoki Ishizuka1,
- Emiko Toyota3,
- Le D Hung4,
- Pham T Chuc4,
- Luu T Lien4,
- Pham H Thuong4,
- Pham TN Bich2,
- Naoto Keicho1 (corresponding author) and
- Nobuyuki Kobayashi1
© Sakurada et al; BioMed Central Ltd. 2012
Received: 6 May 2011
Accepted: 1 February 2012
Published: 1 February 2012
Inter-rater agreement in the interpretation of chest X-ray (CXR) films is crucial for clinical and epidemiological studies of tuberculosis. We compared the readings of CXR films used for a survey of tuberculosis between raters from two Asian countries.
Of the 11,624 people enrolled in a prevalence survey in Hanoi, Viet Nam, in 2003, we studied 258 individuals whose CXR films did not exclude the possibility of active tuberculosis. Follow-up films obtained from accessible individuals in 2006 were also analyzed. Two Japanese and two Vietnamese raters read the CXR films based on a coding system proposed by Den Boon et al. and another system newly developed in this study. Inter-rater agreement was evaluated by kappa statistics. Marginal homogeneity was evaluated by the generalized estimating equation (GEE).
CXR findings suspected of tuberculosis differed between the four raters. The frequencies of infiltrates and fibrosis/scarring detected on the films significantly differed between the raters from the two countries (P < 0.0001 and P = 0.0082, respectively, by GEE). The definitions of findings used in the coding systems, such as primary cavity, also affected the degree of agreement.
CXR findings were inconsistent between the raters with different backgrounds. High inter-rater agreement is a necessary component of an optimal CXR coding system, particularly in international studies. An analysis of reading results and a thorough discussion to reach a consensus would be needed to further improve the consistency and quality of reading.
Despite its several disadvantages, chest radiography remains an important supporting tool in tuberculosis (TB) surveys and clinical management of active disease [1–3]. Chest X-ray (CXR) findings should be carefully assessed because of potential problems such as low specificity and insufficient reproducibility.
In this context, reading methods that are less influenced by raters are required, and several CXR coding systems have been proposed [5–7]. In general, complex interpretation codes hamper intra- and inter-rater agreement and simple codes are preferred [6, 7], because a reproducible and validated coding system may be useful in monitoring disease in clinical and epidemiological studies [8, 9].
Previous studies suggest that variability in CXR interpretation among raters is attributable to subjective reading accompanied by insufficient experience or different professional backgrounds of the raters [7, 10–12]. However, the relationship between agreement levels and the factors that may cause disagreement, particularly the influence of medical background, including different national origins, has not been characterized.
In the present study, Vietnamese and Japanese raters read CXR films with suspected TB lesions that were taken during a survey of TB prevalence in Hanoi, Viet Nam, and their readings were compared. The follow-up films were also compared with the initial films. As analytical tools, two different types of coding systems were used: one was previously reported by another group and the other was newly developed in this study. The aim of the study was to highlight inter-rater agreement between raters with different medical backgrounds. We also attempted to characterize the optimal codes or coding systems used in international studies for a simple and objective evaluation of CXR findings suspected of TB.
This study was approved by the ethics committees of the Ministry of Health, Viet Nam and the National Center for Global Health and Medicine (formerly, International Medical Center of Japan). Written informed consent was obtained from each participant prior to the investigations, including the prevalence survey and the follow-up study.
A population-based TB prevalence survey of 11,624 people aged 15 and over was conducted in Hanoi in 2003, as reported previously. Briefly, subjects suspected of having active TB based on CXR or on symptoms underwent sputum smear microscopy and/or mycobacterial culture. Details of HIV status were not obtained from the study subjects. According to a World Health Organization report covering this period, the estimated prevalence of HIV co-infection in new TB patients aged 15-49 was relatively low (2.8%) in Viet Nam.
Active TB was radiographically excluded in all but 317 individuals. Of these 317 individuals, 22 (6.9%) were diagnosed by bacteriological methods, including sputum culture. In 2004, individuals who presented with radiographic findings during the initial survey were advised to undergo sputum smear and culture tests following the World Health Organization recommendation [14, 15]. In the 2006 follow-up, the same group of individuals was recalled for plain chest radiographic examination (AGFA X-ray film, Beijing, China; Shimadzu UD 150L-30V, Kyoto, Japan) and sputum tests, including direct smear and culture. Using a questionnaire, we collected information regarding individual history, additional examinations performed, and TB treatment received after the initial survey. Demographic information (including addresses) collected during the prevalence survey was used to trace the target group in the follow-up period.
The CXR films analyzed in this study were those in which active TB had not been radiographically excluded during the prevalence survey and those taken during the follow-up in 2006. In total, 258 of the 317 films from the prevalence survey and 93 follow-up films were available at the time of analysis. The remaining TB-suspected films from the prevalence survey were missing.
CXR coding systems and reading of films
Two Japanese pulmonary physicians (E.T. and N.K.) and two Vietnamese radiologists (L.D.H. and P.T.C.) read the CXR films. These readers were different from those who read the CXR films during the initial survey. All CXR films were first read using CRRS. After the completion of readings by CRRS, CXR films were read using JVCS without the results of CRRS being made known to the readers. Each reader was also blinded to the others' readings and clinical information. Instruction and training regarding the two coding systems were given prior to the actual reading. The four raters were asked to reach a consensus while assessing 10 standard films from Japan and another 10 films from Viet Nam.
We adopted a double-entry system for data entry. JMP version 7.0.1 (SAS Institute Inc., Cary, NC, USA) and SAS version 9.1 (SAS Institute Inc.) were used for analysis. Kappa statistics were used to investigate inter-rater agreement on the presence or absence of lesions of interest. We adopted the following guidelines for interpretation of kappa coefficients: < 0, poor agreement; 0-0.20, slight; 0.21-0.40, fair; 0.41-0.60, moderate; 0.61-0.80, good; and 0.81-1.00, very good [16–18]. Weighted kappa was used to assess inter-rater agreement on variables with more than two categories. McNemar's test or its extension, Bowker's test of symmetry, was used to investigate the symmetry of disagreement between two raters, that is, whether the frequency of an abnormality detected by one rater differed significantly from that detected by another rater. The generalized estimating equation (GEE) was also used to test whether the frequencies of positive findings were similar between groups of raters (marginal homogeneity). Asymmetry or marginal heterogeneity was considered significant when P < 0.05.
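As an illustration of the kappa statistic described above, the following is a minimal sketch in Python of Cohen's kappa for one rater pair and one binary finding, together with the verbal bands quoted in the text. It is not the authors' SAS/JMP code. The function names and the example counts are hypothetical, and the cell order follows the --/-+/+-/++ layout used in the tables.

```python
def cohen_kappa(n00, n01, n10, n11):
    """Cohen's kappa for two raters and a binary finding.

    n00: both raters negative (--)
    n01: rater A negative, rater B positive (-+)
    n10: rater A positive, rater B negative (+-)
    n11: both raters positive (++)
    """
    n = n00 + n01 + n10 + n11
    p_observed = (n00 + n11) / n
    # Chance agreement, computed from each rater's marginal positive rate.
    a_pos = (n10 + n11) / n
    b_pos = (n01 + n11) / n
    p_expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    # Undefined (division by zero) if both raters give one identical answer
    # for every film; that case is not handled in this sketch.
    return (p_observed - p_expected) / (1 - p_expected)


def interpret_kappa(kappa):
    """Map a kappa value to the verbal bands quoted in the text.

    The quoted bands leave small gaps (e.g. between 0.20 and 0.21);
    here each band is treated as extending up to and including its upper limit.
    """
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "good")]:
        if kappa <= upper:
            return label
    return "very good"


if __name__ == "__main__":
    # Hypothetical counts for 258 films; not taken from the study data.
    k = cohen_kappa(150, 30, 20, 58)
    print(f"kappa = {k:.2f} ({interpret_kappa(k)})")
```

Applied to the kappa values reported in the results, `interpret_kappa(0.49)` returns "moderate" and `interpret_kappa(0.28)` returns "fair", matching the labels used in the text.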
Follow-up after TB prevalence survey
In 2004, one year after the prevalence survey, 204 (64.4%) of the 317 individuals who presented with radiographic findings of suspected TB underwent a sputum smear test, one of whom tested positive. The initial CXR film of this case showed infiltrates, fibrosis/scarring, and calcification. The follow-up radiograph in 2006 showed improvement after treatment.
In total, five individuals were reported to have active TB during the 3-year follow-up period. Two were diagnosed bacteriologically and three were diagnosed based on self-reported TB episodes. All the films were randomly mixed in the study set.
Inter-rater agreement on CXR findings
Using the two coding systems, four raters assessed the 258 films taken during the 2003 prevalence survey; two raters assessed the 93 films taken in the 2006 follow-up. A total of 2,436 readings were conducted (258 × 4 × 2 + 93 × 2 × 2; Figure 2).
Table: Inter-rater agreement with respect to general and parenchymal findings for each coding system (n = 258). Columns: kappa with 95% confidence interval and the absolute number of films (--/-+/+-/++). (Table body not recovered.)
Table: Inter-rater agreement with respect to parenchymal findings for each coding system (n = 258). Columns: kappa with 95% confidence interval. (Table body not recovered.)
Although agreement levels relating to fibrosis/scarring were also low, kappa values for secondary fibrosis/scarring with CRRS revealed fair levels of agreement between raters from the same country (kappa = 0.28 [JP-JP] and 0.22 [VN-VN]), but revealed only slight agreement between raters from different countries (kappa = 0.11 to 0.20 [JP-VN]). Among all Japanese-Vietnamese pairs, the Vietnamese raters specified secondary fibrosis/scars more frequently than the Japanese raters (P = 0.0001 or P < 0.0001 by McNemar test). The frequency of positive findings of secondary fibrosis with CRRS by both Vietnamese raters was 26/255 (10.2%), whereas that by both Japanese raters was only 7/245 (2.9%) (Table not shown). The frequency of positive findings of fibrosis/scarring with JVCS by both Vietnamese raters (56/255 = 22.0%) also tended to be higher than that by both Japanese raters (42/245 = 17.1%). GEE further confirmed the significant difference in frequencies of fibrosis/scarring between raters from different countries (P = 0.0082).
Agreement levels regarding infiltrates between the two raters from the same country were considered as moderate (kappa = 0.49 [JP-JP] and 0.57 [VN-VN]) and as fair between two raters from different countries (kappa = 0.21 to 0.30 [JP-VN]) according to JVCS (Table 2). The Japanese raters detected infiltrates more frequently than the Vietnamese raters (P < 0.0001 by McNemar test) in all comparisons. The frequency of positive findings of primary infiltrates with CRRS by both Japanese raters was 68/245 (27.8%), whereas that by both Vietnamese raters was only 22/255 (8.6%) (Table not shown). The frequency of positive findings of infiltrates with JVCS by both Japanese raters (119/245 = 48.6%) also tended to be higher than that by both Vietnamese raters (46/255 = 18.0%). The different frequencies of positive infiltrate readings between the raters from the two countries were also confirmed by using GEE (P < 0.0001).
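The McNemar comparisons reported above depend only on the discordant films, that is, those called positive by one rater and negative by the other. The following minimal Python sketch shows the asymptotic chi-square form of the test without continuity correction. The source does not state which variant was run in SAS, so this choice is an assumption, and the function name and example counts are hypothetical.

```python
from math import erfc, sqrt


def mcnemar_test(b, c):
    """McNemar chi-square test (1 degree of freedom, no continuity correction).

    b: films positive for rater A only (+-)
    c: films positive for rater B only (-+)
    Returns (statistic, p_value).
    """
    if b + c == 0:
        # No discordant films: no evidence of asymmetry.
        return 0.0, 1.0
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical discordant counts; not taken from the study data.
    stat, p = mcnemar_test(60, 15)
    print(f"chi-square = {stat:.1f}, P = {p:.2g}")
```

A small P value here means that one rater calls the finding positive systematically more often than the other, which is the asymmetry the text describes for infiltrates and secondary fibrosis/scarring.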
The levels of inter-rater agreement were considered slight to fair for nodules, irrespective of the raters' home country or the coding system used.
Overall assessment of radiographic findings after 3 years
Weighted kappa = 0.40 [0.22-0.57]
Weighted kappa = 0.47 [0.31-0.63]
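For findings with more than two ordered categories, such as an overall change assessment, weighted kappa gives partial credit to near-misses. The sketch below is a minimal Python version that assumes linear weights; the source does not state which weighting scheme was used. The three categories (improved, unchanged, deteriorated) and the counts are hypothetical.

```python
def weighted_kappa(table):
    """Linearly weighted kappa for a square table of ordinal ratings.

    table[i][j] is the number of films placed in category i by rater A
    and in category j by rater B; categories must be in rank order.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight: 0 on the diagonal, 1 for the most distant pair.
            weight = abs(i - j) / (k - 1)
            observed += weight * table[i][j] / n
            expected += weight * row_totals[i] * col_totals[j] / (n * n)
    return 1 - observed / expected


if __name__ == "__main__":
    # Hypothetical 3 x 3 table (improved / unchanged / deteriorated) for 93 films.
    example = [
        [20, 8, 2],
        [6, 30, 9],
        [1, 7, 10],
    ]
    print(f"weighted kappa = {weighted_kappa(example):.2f}")
```

With only two categories the weights reduce to 0 and 1, so the function returns the same value as unweighted Cohen's kappa.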
Our study confirmed that the readings of CXR findings of suspected TB vary significantly among the raters. Differences in the backgrounds of the raters and different coding systems were considered potential factors affecting the levels of agreement. We found the following two patterns of marked inconsistency in the CXR findings: 1) disagreement presumably attributable to the raters' home country, typically observed for infiltrates and secondary fibrosis/scarring, and 2) disagreement observed for nodules, irrespective of the rater background. Through discussions conducted with the four raters after the trial, we identified some possible causes of this disagreement, although these problems had not come to light when the standard films were checked before the study began.
First, it is likely that this disagreement was partly caused by differences between countries regarding the definition of pulmonary lesions. For example, the Vietnamese raters limited the definition of infiltrates to relatively homogeneous opacities greater than 10 mm in size, whereas the Japanese raters also included groups of smaller scattered lesions with unclear margins in this classification. As a result, positive findings of infiltrates were more frequently reported by the Japanese raters.
Second, spontaneously cured mild TB resulting in parenchymal fibrosis or scarring, which is commonly seen in countries with high prevalence of TB, is a probable reason for the more frequent detection of these lesions by the Vietnamese raters. In addition, CT scans are compared with plain CXRs more commonly in Japan than in Viet Nam. This practice in TB diagnosis and management might affect the interpretations of the Japanese raters.
Disagreement between the raters from the two Asian countries could be attributed to many background factors, including the medical educational systems and on-the-job training provided after graduation. In Japan, plain CXR films are read predominantly by clinicians, while in Viet Nam, radiologists also perform this role. Such differences are likely to affect the reading and should be taken into consideration in international studies. Even within a single country, inter-rater agreement depends on the experience of the raters [7, 10, 12] and is relatively low between raters in different centers.
The tested coding systems had both advantages and disadvantages in the context of our study. With CRRS, parenchymal abnormalities are classified into primary and secondary lesions, and it is not easy for raters to differentiate between the two. The Japanese raters emphasized cavitation and the presence of infiltrates as primary lesions of active TB, whereas the Vietnamese raters judged the primary lesions on the basis of lesion size and the proportion of the lung involved.
Although fairly reproducible, JVCS has the disadvantage that it cannot provide any information regarding the significance of active lesions. In this respect, CRRS is more informative. Activity, however, is a subjective term, and reproducibility apparently worsens when this description is included in a coding system. This reflects the limitations of the plain CXR as a classic imaging tool. It may be assumed that defining necessary medical terms carefully through training and in-depth discussion prior to actual reading would minimize misunderstandings, even with a detailed coding system. However, this was not effective in our study, possibly because of language barriers, different medical backgrounds, and insufficient recognition of the problems. Collectively, our results support the view that a simplified coding system is more reproducible [6, 7, 19], which may be critical when a system is shared by raters from different countries, even when both countries are in Asia.
On comparing CXR findings 3 years after the prevalence survey, Japanese raters detected deterioration in more cases than Vietnamese raters. The fact that the Japanese raters more frequently detected infiltrates may partly explain this discrepancy, because infiltrates generally signify active lesions, though unknown factors may also have affected their readings. This should be considered when CXRs are used for follow-up because the radiological appearance of lesions will not provide sufficient information for monitoring TB unless patient history and bacteriological examination are combined [8, 10, 19].
Our study has several limitations. First, caution should be exercised when extrapolating the results to describe the way CXRs are generally read in the two Asian countries. Although different medical backgrounds in the countries were obvious after reviewing and discussing the results, the raters' qualifications should also be considered. Second, in the present study, the overall sensitivity and specificity of CXR-based diagnosis of tuberculosis were not determined because the number of active TB cases detected in our cohort study was rather small (< 10%) and because these parameters would be influenced more by individual raters' skills and experiences than by the coding system used. Third, the coverage rate of the radiographic follow-up study after 3 years was not high, one of the reasons being the rapid speed of urbanization and an increasingly mobile population in Hanoi, which caused difficulties when tracing particular individuals. Nevertheless, our findings present an important point to be considered in international studies of TB using a CXR coding system.
In our study, CXR findings of suspected TB were inconsistent between raters with different backgrounds, presumably because of differences in medical practice and education between the two countries. Although each coding system has its advantages and disadvantages, a simplified classification system is suitable for maintaining sufficient agreement between raters from different countries. To improve the quality of future international collaborative studies, harmony could be obtained between raters of different nationalities by thorough discussion regarding the possible causes of disagreement in CXR readings, using standard films and descriptions of major findings.
The authors would like to thank Dr. Vu Cao Cuong, Dr. Nguyen Phuong Hoang, Dr. Pham Tuan Phuong, Dr. Pham Thu Anh (Hanoi Lung Hospital), Dr. Phan Thi Minh Ngoc (NCGM-BMH Medical Collaboration Center), Dr. Takahiro Terakawa, and the staff of the district TB centers in Hanoi for supporting site implementation. The authors also thank Dr. Takuro Shimbo for his technical advice. This study was supported by grants from the Program of Japan Initiative for Global Research Network on Infectious Diseases (J-GRID), MEXT, Japan.
1. Hopewell PC, Pai M, Maher D, Uplekar M, Raviglione MC: International standards for tuberculosis care. Lancet Infect Dis. 2006, 6: 710-725. 10.1016/S1473-3099(06)70628-4.
2. Golub JE, Mohan CI, Comstock GW, Chaisson RE: Active case finding of tuberculosis: historical perspective and future prospects. Int J Tuberc Lung Dis. 2005, 9: 1183-1203.
3. Horie T, Lien LT, Tuan LA, et al: A survey of tuberculosis prevalence in Hanoi, Vietnam. Int J Tuberc Lung Dis. 2007, 11: 562-566.
4. Koppaka R, Bock N: How reliable is chest radiography?. Toman's Tuberculosis Case detection, treatment, and monitoring--questions and answers. 2004, Geneva: World Health Organization, 51-60. 2nd edition.
5. Den Boon S, Bateman ED, Enarson DA, et al: Development and evaluation of a new chest radiograph reading and recording system for epidemiological surveys of tuberculosis and lung disease. Int J Tuberc Lung Dis. 2005, 9: 1088-1096.
6. Graham S, Das GK, Hidvegi RJ, et al: Chest radiograph abnormalities associated with tuberculosis: reproducibility and yield of active cases. Int J Tuberc Lung Dis. 2002, 6: 137-142.
7. Zellweger JP, Heinzer R, Touray M, Vidondo B, Altpeter E: Intra-observer and overall agreement in the radiological assessment of tuberculosis. Int J Tuberc Lung Dis. 2006, 10: 1123-1126.
8. Linh NN, Marks GB, Crawford AB: Radiographic predictors of subsequent reactivation of tuberculosis. Int J Tuberc Lung Dis. 2007, 11: 1136-1142.
9. Ralph AP, Ardian M, Wiguna A, et al: A simple, valid, numerical score for grading chest x-ray severity in adult smear-positive pulmonary tuberculosis. Thorax. 2010, 65: 863-869. 10.1136/thx.2010.136242.
10. Balabanova Y, Coker R, Fedorin I, et al: Variability in interpretation of chest radiographs among Russian clinicians and implications for screening programmes: observational study. BMJ. 2005, 331: 379-382. 10.1136/bmj.331.7513.379.
11. Brealey S, Westwood M: Are you reading what we are reading? The effect of who interprets medical images on estimates of diagnostic test accuracy in systematic reviews. Br J Radiol. 2007, 80: 674-677. 10.1259/bjr/83042364.
12. Abubakar I, Story A, Lipman M, et al: Diagnostic accuracy of digital chest radiography for pulmonary tuberculosis in a UK urban population. Eur Respir J. 2010, 35: 689-692. 10.1183/09031936.00136609.
13. Global tuberculosis control--surveillance, planning, financing. WHO Report 2005. WHO/HTM/TB/2005.349, [http://www.who.int/tb/publications/global_report/2005/en/index.html]
14. Kantor IN, Kim SJ, Frieden T, et al: Laboratory Service in Tuberculosis Control Part II: Microscopy. 1998, WHO/TB/98.258.
15. Kantor IN, Kim SJ, Frieden T, et al: Laboratory Service in Tuberculosis Control Part III: Culture. 1998, WHO/TB/98.258.
16. Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics. 1977, 33: 159-174.
17. Kundel HL, Polansky M: Measurement of observer agreement. Radiology. 2003, 228: 303-308. 10.1148/radiol.2282011860.
18. Taplin SH, Rutter CM, Elmore JG, Seger D, White D, Brenner RJ: Accuracy of screening mammography using single versus independent double interpretation. AJR Am J Roentgenol. 2000, 174: 1257-1262.
19. Van Cleeff MR, Kivihya-Ndugga LE, Meme H, Odhiambo JA, Klatser PR: The role and performance of chest X-ray for the diagnosis of tuberculosis: a cost-effectiveness analysis in Nairobi, Kenya. BMC Infect Dis. 2005, 5: 111. 10.1186/1471-2334-5-111.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2334/12/31/prepub