Inter-rater agreement in the assessment of abnormal chest X-ray findings for tuberculosis between two Asian countries

Background: Inter-rater agreement in the interpretation of chest X-ray (CXR) films is crucial for clinical and epidemiological studies of tuberculosis. We compared the readings of CXR films used for a survey of tuberculosis between raters from two Asian countries.

Methods: Of the 11,624 people enrolled in a prevalence survey in Hanoi, Viet Nam, in 2003, we studied 258 individuals whose CXR films did not exclude the possibility of active tuberculosis. Follow-up films obtained from accessible individuals in 2006 were also analyzed. Two Japanese and two Vietnamese raters read the CXR films based on a coding system proposed by Den Boon et al. and another system newly developed in this study. Inter-rater agreement was evaluated by kappa statistics. Marginal homogeneity was evaluated by the generalized estimating equation (GEE).

Results: CXR findings suspected of tuberculosis differed between the four raters. The frequencies of infiltrates and fibrosis/scarring detected on the films significantly differed between the raters from the two countries (P < 0.0001 and P = 0.0082, respectively, by GEE). The definition of findings, such as primary cavity, used in the coding systems also affected the degree of agreement.

Conclusions: CXR findings were inconsistent between the raters with different backgrounds. High inter-rater agreement is a necessary component of an optimal CXR coding system, particularly in international studies. An analysis of reading results and a thorough discussion to reach a consensus would be necessary to achieve further consistency and high quality of reading.


Background
Despite its several disadvantages, chest radiography remains an important supporting tool in tuberculosis (TB) surveys and clinical management of active disease [1][2][3]. Chest X-ray (CXR) findings should be carefully assessed because of potential problems with the method, such as low specificity and insufficient reproducibility [4].
In this context, reading methods that are less influenced by raters are required and several CXR coding systems have been proposed [5][6][7]. In general, complex interpretation codes hamper intra- and inter-rater agreement and simple codes are preferred [6,7], because a reproducible and validated coding system may be useful in monitoring disease in clinical and epidemiological studies [8,9].
Previous studies suggest that variability in CXR interpretation among raters is attributable to subjective reading accompanied by insufficient experience or different professional backgrounds of the raters [7,10,11,12]. However, the relationship between agreement levels and factors that may cause disagreement, particularly the influence of medical background including national origin, has not been characterized.
In the present study, Vietnamese and Japanese raters studied the readings of suspected TB lesions on CXR films taken during a survey of TB prevalence in Hanoi, Viet Nam [3]. The follow-up films were also compared with the initial films. As analytical tools, two different types of coding systems were used: One was previously reported by another group [5] and the other was newly developed in this study. The aim of the study was to highlight inter-rater agreement between raters with different medical backgrounds. We also attempted to characterize the optimal codes or coding systems used in international studies for a simple and objective evaluation of CXR findings suspected of TB.

Ethics approval
This study was approved by the ethics committees of the Ministry of Health, Viet Nam and the National Center for Global Health and Medicine (formerly, International Medical Center of Japan). Written informed consent was obtained from each participant prior to the investigations, including the prevalence survey and the follow-up study.

Study population
A population-based TB prevalence survey of 11,624 people aged 15 and over was conducted in Hanoi in 2003 as reported previously [3]. Briefly, subjects suspected of having active TB based on CXR or on symptoms underwent sputum smear microscopy and/or mycobacterial culture. Details of HIV status were not obtained from the study subjects. According to a World Health Organization report covering this period, the estimated prevalence of HIV co-infection in new TB patients aged 15-49 was relatively low (2.8%) in Viet Nam [13].
Active TB was radiographically excluded in all but 317 individuals. Of these 317 individuals, 22 (6.9%) were diagnosed by bacteriological methods, including sputum culture [3]. In 2004, individuals who presented with radiographic findings during the initial survey were advised to undergo sputum smear and culture tests following the World Health Organization recommendation [14,15]. In the 2006 follow-up, the same group of individuals was recalled for plain chest radiographic examination (AGFA X-ray film, Beijing, China; Shimadzu UD 150L-30V, Kyoto, Japan) and sputum tests, including direct smear and culture. Using a questionnaire, we collected information regarding individual history, additional examinations performed, and treatment for TB undergone after the initial survey. Demographic information (including addresses) collected during the prevalence survey was used to trace the target group in the follow-up period.
The CXR films analyzed in this study were those in which active TB had not been radiographically excluded during the prevalence survey and those taken during the follow-up in 2006. In total, 258 of the 317 films from the prevalence survey and 93 follow-up films were available at the time of analysis. The remaining TB-suspected films from the prevalence survey were missing.

CXR coding systems and reading of films
Two coding systems were used to classify the CXR findings. The chest radiograph reading and recording system (CRRS) was developed in 2005 to detect TB and other forms of lung disease [5]. Profusion score and details of abnormalities unrelated to TB were omitted. All the other coding items of this system were retained. A Japan-Vietnam CXR coding system (JVCS) (Figure 1) consisting of rather simple codes was also used: We newly developed this system, considering a registration form used in a public payment system for TB treatment expenses in Japan and reading practice in Viet Nam. CRRS classifies parenchymal abnormalities as primary or secondary lesions depending on the significance of the lesion. In contrast, JVCS does not consider the significance of the lesion, though it records pleural effusion and thickening separately. Additionally, CRRS classifies nodules based on their size and calcification, whereas JVCS separately records nodules and calcification.
Two Japanese pulmonary physicians (E.T. and N.K.) and two Vietnamese radiologists (L.D.H. and P.T.C.) read the CXR films. These readers were different from those who read the CXR films during the initial survey. All CXR films were first read using CRRS. After the completion of readings by CRRS, CXR films were read using JVCS without the results of CRRS being made known to the readers. Each reader was also blinded to the others' readings and clinical information. Instruction and training regarding the two coding systems were given prior to the actual reading. The four raters were asked to reach a consensus while assessing 10 standard films from Japan and another 10 films from Viet Nam.

Statistical analysis
Data were entered using a double-entry system. JMP version 7.0.1 (SAS Institute Inc., Cary, NC, USA) and SAS version 9.1 (SAS Institute Inc.) were used for analysis. Kappa statistics were used to investigate inter-rater agreement on the presence or absence of lesions of interest. We adopted the following guidelines for interpretation of kappa coefficients: < 0, poor agreement; 0-0.20, slight; 0.21-0.40, fair; 0.41-0.60, moderate; 0.61-0.80, good; and 0.81-1.00, very good [16][17][18]. Weighted kappa was used to assess inter-rater agreement on variables with more than two categories. McNemar's test or its extension, Bowker's test of symmetry, was used to investigate the symmetry of disagreement between two raters, that is, whether the frequency of an abnormality detected by one rater differs significantly from that detected by the other rater. The generalized estimating equation (GEE) was also used to test whether the frequencies of positive findings were similar between groups of raters (marginal homogeneity). Asymmetry or marginal heterogeneity was considered significant when P < 0.05.

[...] radiographic findings of suspected TB underwent a sputum smear test, one of whom tested positive. The initial CXR film of this case showed infiltrates, fibrosis/scarring, and calcification. The follow-up radiograph in 2006 showed improvement after treatment.
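As an illustration of the kappa statistic and the interpretation bands given under Statistical analysis, the following minimal Python sketch computes Cohen's kappa from a 2 x 2 table of paired presence/absence readings. The function names and the example counts are hypothetical and are not taken from the study tables; the analysis in this study was performed with JMP and SAS.

```python
def cohen_kappa(both_neg, second_only, first_only, both_pos):
    """Cohen's kappa for two raters scoring presence/absence of a finding."""
    n = both_neg + second_only + first_only + both_pos
    if n == 0:
        raise ValueError("the table is empty")
    observed = (both_neg + both_pos) / n          # observed agreement
    first_pos = (first_only + both_pos) / n       # positive rate, first rater
    second_pos = (second_only + both_pos) / n     # positive rate, second rater
    # Agreement expected by chance from the two raters' marginal rates.
    expected = first_pos * second_pos + (1 - first_pos) * (1 - second_pos)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Map a kappa coefficient to the agreement bands quoted in the text."""
    k = round(kappa, 2)  # the bands are stated to two decimal places
    if k < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "good")]:
        if k <= upper:
            return label
    return "very good"


# Hypothetical counts for 258 films (NOT from the study tables).
example = cohen_kappa(both_neg=200, second_only=20, first_only=10, both_pos=28)
print(f"kappa = {example:.2f} ({interpret_kappa(example)})")
```

The argument order follows the (-/-, -/+, +/-, +/+) layout used in the table legends, so a row of counts from such a table can be passed directly.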

Follow-up after TB prevalence survey
In the follow-up in 2006, 93 individuals were checked, one of whom was diagnosed by smear and culture as TB positive (Figure 2). Besides calcification, which was seen in the initial CXR film, infiltrates were present in the follow-up film. All raters evaluated this case as "worse" based on the radiographic findings.
In total, five individuals were reported to have active TB during the 3-year follow-up period. Two were diagnosed bacteriologically and three were diagnosed based on self-reported TB episodes. All the films were randomly mixed in the study set.

Inter-rater agreement on CXR findings
Using the two coding systems, four raters assessed the 258 films taken during the 2003 prevalence survey; two raters assessed the 93 films taken in the 2006 follow-up. A total of 2,436 readings were conducted (Figure 2). Agreement levels regarding overall parenchymal abnormalities assessed by CRRS varied. Their kappa values were interpreted as fair to good, ranging from 0.24 to 0.63, from the following six comparisons: a comparison between the two Japanese raters (JP-JP); four comparisons between Japanese and Vietnamese raters (JP-VN (1) to (4)); and a comparison between the two Vietnamese raters (VN-VN) (Table 1). Agreement levels regarding calcification also varied. They were considered as fair to good with JVCS and slight to fair with CRRS. Kappa values for pleural effusion with JVCS were interpreted as moderate to good, ranging from 0.54 to 0.77, indicating a high level of agreement irrespective of country or rater.
Major parenchymal findings (cavity, fibrosis/scarring, infiltrates, and nodules) were assessed in a similar way, as shown in Table 2. Agreement levels regarding primary and secondary cavities in CRRS were rather low (kappa values ranged from -0.02 to 0.36) except for relatively high agreement levels regarding a primary cavity between the Japanese raters (kappa = 0.60), and a secondary cavity between the Vietnamese raters (kappa = 0.43). Cavitation was, thus, mainly classified as a primary lesion by the Japanese raters and as a secondary lesion by the Vietnamese raters.
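The total number of readings follows from the counts stated above, assuming that each film was read once per rater and per coding system. A short Python check (variable names are illustrative only):

```python
# Counts stated in the text; the variable names are illustrative.
CODING_SYSTEMS = 2  # CRRS and JVCS

survey_readings = 258 * 4 * CODING_SYSTEMS     # 258 films, four raters
follow_up_readings = 93 * 2 * CODING_SYSTEMS   # 93 films, two raters
total_readings = survey_readings + follow_up_readings
print(survey_readings, follow_up_readings, total_readings)
```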
Although agreement levels relating to fibrosis/scarring were also low, kappa values for secondary fibrosis/scarring with CRRS revealed fair levels of agreement between raters from the same country (kappa = 0.28 [JP-JP] and 0.22 [VN-VN]), but revealed only slight agreement between raters from different countries (kappa = 0.11 to 0.20 [JP-VN]). Among all Japanese-Vietnamese pairs, the Vietnamese raters specified secondary fibrosis/scars more frequently than the Japanese raters (P = 0.0001 or P < 0.0001 by McNemar test). The frequency of positive findings of secondary fibrosis with CRRS by both Vietnamese raters was 26/255 (10.2%), whereas that by both Japanese raters was only 7/245 (2.9%) (Table not shown). The frequency of positive findings of fibrosis/scarring with JVCS by both Vietnamese raters (56/255 = 22.0%) also tended to be higher than that by both Japanese raters (42/245 = 17.1%). GEE further confirmed the significant difference in frequencies of fibrosis/scarring between raters from different countries (P = 0.0082).
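The symmetry comparisons reported above rest on McNemar's test, which uses only the discordant pairs, that is, films scored positive by one rater and negative by the other. The sketch below assumes the exact binomial form of the test; the text does not state which variant was used, and the counts are hypothetical because the paired tables are not reproduced here.

```python
from math import comb


def mcnemar_exact(first_only, second_only):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = first_only + second_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of asymmetry
    smaller = min(first_only, second_only)
    # Under the null hypothesis each discordant pair is equally likely to
    # fall on either side, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(smaller + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, NOT taken from the study tables.
p_value = mcnemar_exact(first_only=4, second_only=25)
print(f"P = {p_value:.6f}")
```

Concordant pairs do not enter the calculation, which is why two raters can show a significant asymmetry even when the overall proportion of agreeing readings is high.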
Agreement levels regarding infiltrates between the two raters from the same country were considered moderate [...]. Differences in infiltrate readings between the raters from the two countries were also confirmed using GEE (P < 0.0001).

The levels of inter-rater agreement were considered slight to fair for nodules, irrespective of the raters' home country or the coding system used.

Table 1 [...] JP-VN (1) to (4), comparisons between Japanese and Vietnamese raters; VN-VN, a comparison between the Vietnamese raters; (-/-, -/+, +/-, +/+), negative findings by both raters / positive findings only by the second rater / positive findings only by the first rater / positive findings by both raters.

Table 2 Inter-rater agreement with respect to parenchymal findings for each coding system (n = 258). * Primary and secondary lesions are described in CRRS. JVCS, Japan-Vietnam chest X-ray coding system; CRRS, chest radiograph reading and recording system; TB, tuberculosis; NA, not applicable; JP-JP, a comparison between the Japanese raters; JP-VN (1) to (4), comparisons between Japanese and Vietnamese raters; VN-VN, a comparison between the Vietnamese raters.

An overall assessment of CXR changes after 3 years was conducted by one of the two raters from each country. Agreement was moderate for both coding systems (weighted kappa = 0.47 and 0.40). The Japanese rater indicated deterioration more frequently than the Vietnamese rater (Table 3); this difference was considered highly significant for both JVCS and CRRS by the symmetry test (P = 0.0002 and 0.0008, respectively, by Bowker's test). When assessing changes in specific findings after 3 years, the Japanese rater detected infiltrates more frequently than the Vietnamese rater (P < 0.0001; Figure 3). Among 55 cases of infiltrates, 12 (22%) were assessed as "further spread" by the Japanese rater while 2 (8%) out of 25 cases of infiltrates were assessed as "further spread" by the Vietnamese rater (data not shown).
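For ordered categories such as the overall change after 3 years, weighted kappa gives partial credit to near-misses. The following Python sketch uses linear disagreement weights, which is an assumption: the text does not state the weighting scheme applied. The function name and the three-category table are hypothetical.

```python
def weighted_kappa(table):
    """Linear-weighted kappa for a square table of two raters' ordered scores.

    table[i][j] is the number of films placed in category i by the first
    rater and in category j by the second rater.
    """
    k = len(table)
    if k < 2 or any(len(row) != k for row in table):
        raise ValueError("table must be square with at least two categories")
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("the table is empty")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at extremes
            observed += weight * table[i][j] / n
            expected += weight * row_totals[i] * col_totals[j] / (n * n)
    if expected == 0:
        raise ValueError("weighted kappa is undefined for this table")
    return 1.0 - observed / expected


# Hypothetical counts for three ordered categories, e.g. improved,
# unchanged, and worse (NOT the data behind Table 3).
changes = [[10, 5, 1],
           [4, 30, 6],
           [1, 7, 29]]
print(f"weighted kappa = {weighted_kappa(changes):.2f}")
```

With only two categories the linear weights reduce to 0 and 1, so the function returns the ordinary (unweighted) kappa.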

Discussion
Our study confirmed that the readings of CXR findings of suspected TB vary significantly among the raters. Differences in the backgrounds of the raters and different coding systems were considered potential factors affecting the levels of agreement. We found the following two patterns of marked tendency toward inconsistency in the CXR findings: 1) disagreement presumably attributed to the raters' home country and typically observed for infiltrates and secondary fibrosis/scarring and 2) disagreement observed for nodules, irrespective of the rater background. Through discussions conducted with the four raters after the trial, we identified some possible causes of this disagreement, though pre-existing problems were not disclosed when the standard films were checked prior to commencement of the study.
First, it is likely that this disagreement was partly caused by differences between countries regarding the definition of pulmonary lesions. For example, the Vietnamese raters limited the definition of infiltrates to relatively homogeneous opacities greater than 10 mm in size, whereas the Japanese raters also included groups of smaller-sized scattered lesions with unclear margins in this classification. As a result, positive findings of infiltrates were more frequently reported by the Japanese raters.
Second, spontaneously cured mild TB resulting in parenchymal fibrosis or scarring, which is commonly seen in countries with high prevalence of TB, is a probable reason for the more frequent detection of these lesions by the Vietnamese raters. In addition, CT scans are compared with plain CXRs more commonly in Japan than in Viet Nam. This practice in TB diagnosis and management might affect the interpretations of the Japanese raters.
Disagreement between the raters from the two Asian countries could be attributed to many background factors, including the medical educational systems and on-the-job training imparted after graduation. In Japan, plain CXR films are read predominantly by clinicians, while in Viet Nam, radiologists also perform this role. Such differences are likely to affect the reading and should be taken into consideration in international studies. Even within a single country, inter-rater agreement depends on the experience of the raters [7,10,12] and is relatively low between raters in different centers [10].

Figure 3 Infiltrates on chest X-rays after a 3-year interval. Evaluation of infiltrates on chest X-ray films between 2003 and 2006 using JVCS codes shows changes in the number of affected areas (upper, middle, and lower zones of each side of the lung). "No affected zone" indicates that the rater did not indicate the presence of infiltrates in either film. The Japanese rater detected infiltrates more commonly than did the Vietnamese rater (P < 0.0001), which corresponds with the results of a greater proportion of "increased" and a smaller proportion of "no detected zone" readings after 3 years. JVCS = Japan-Vietnam chest X-ray coding system.

The tested coding systems had both advantages and disadvantages in the context of our study. With CRRS, parenchymal abnormalities are classified into primary and secondary lesions, and it is not easy for raters to differentiate between the two. The Japanese raters emphasized cavitation and the presence of infiltrates as primary lesions of active TB, but the Vietnamese raters objectively judged the primary lesions on the basis of the size of lesions and proportion of the lung involved.
Although JVCS is fairly reproducible, a disadvantage is that it cannot provide any information regarding the significance of active lesions. Thus, CRRS is more informative. Activity, however, is a subjective term and the reproducibility of this description apparently worsens when included in a coding system. This implies the limitations of the plain CXR as a classic imaging tool. It may be assumed that defining necessary medical terms carefully through training and in-depth discussion prior to actual reading would minimize misunderstandings, even with a detailed coding system. However, this was not effective in our study, possibly because of language barriers, different medical backgrounds, and insufficient recognition of the problems. Collectively, our results support the concept of reproducibility of a simplified coding system [6,7,19], which may be critical when a system is shared by raters from different countries, even when both are Asian countries.
On comparing CXR findings 3 years after the prevalence survey, Japanese raters detected deterioration in more cases than Vietnamese raters. The fact that the Japanese raters more frequently detected infiltrates may partly explain this discrepancy, because infiltrates generally signify active lesions, though unknown factors may also have affected their readings. This should be considered when CXRs are used for follow-up because the radiological appearance of lesions will not provide sufficient information for monitoring TB unless patient history and bacteriological examination are combined [8,10,19].
Our study has several limitations. First, caution should be exercised when extrapolating the results to describe the way CXRs are generally read in the two Asian countries. Although different medical backgrounds in the countries were obvious after reviewing and discussing the results, the raters' qualifications should also be considered. Second, in the present study, the overall sensitivity and specificity of CXR-based diagnosis of tuberculosis were not determined because the number of active TB cases detected in our cohort study was rather small (< 10%) and because these parameters would be influenced more by individual raters' skills and experiences than by the coding system used. Third, the coverage rate of the radiographic follow-up study after 3 years was not high, one of the reasons being the rapid speed of urbanization and an increasingly mobile population in Hanoi, which caused difficulties when tracing particular individuals. Nevertheless, our findings present an important point to be considered in international studies of TB using a CXR coding system.

Conclusions
In our study, CXR findings of suspected TB were inconsistent between raters with different backgrounds, presumably because of differences in medical practice and education between the two countries. Although each coding system has its advantages and disadvantages, a simplified classification system is suitable for maintaining sufficient agreement between raters from different countries. To improve the quality of future international collaborative studies, harmony could be obtained between raters of different nationalities by thorough discussion regarding the possible causes of disagreement in CXR readings, using standard films and descriptions of major findings.