Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy


      Purpose

      Use adjudication to quantify errors in diabetic retinopathy (DR) grading based on individual graders and majority decision, and to train an improved automated algorithm for DR grading.


      Design

      Retrospective analysis.


      Participants

      Retinal fundus images from DR screening programs.


      Methods

      Images were each graded by the algorithm, U.S. board-certified ophthalmologists, and retinal specialists. The adjudicated consensus of the retinal specialists served as the reference standard.

      Main Outcome Measures

      For agreement between different graders as well as between the graders and the algorithm, we measured the (quadratic-weighted) kappa score. To compare the performance of different forms of manual grading and the algorithm for various DR severity cutoffs (e.g., mild or worse DR, moderate or worse DR), we measured area under the curve (AUC), sensitivity, and specificity.
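The quadratic-weighted kappa named above can be illustrated with a minimal sketch. It assumes grades are integers on a 5-point scale from 0 (no DR) to 4 (proliferative DR), in the spirit of the ICDR scale; the function name and the example grades are hypothetical and are not taken from the study.

```python
from collections import Counter


def quadratic_weighted_kappa(grades_a, grades_b, num_levels=5):
    """Cohen's kappa with quadratic weights for two graders.

    Grades are assumed to be integers in [0, num_levels - 1]
    (an assumption for this sketch, not the study's data format).
    """
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two non-empty grade lists of equal length")

    n = len(grades_a)
    observed = Counter(zip(grades_a, grades_b))  # joint counts per grade pair
    hist_a = Counter(grades_a)                   # marginal counts, grader A
    hist_b = Counter(grades_b)                   # marginal counts, grader B
    max_dist = (num_levels - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Disagreement weight grows with the squared grade distance.
            weight = (i - j) ** 2 / max_dist
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both graders use one grade")
    return 1.0 - weighted_observed / weighted_expected


# Made-up grades for six images, for illustration only.
reference = [0, 1, 2, 3, 4, 2]
grader = [0, 1, 2, 3, 4, 1]
kappa = quadratic_weighted_kappa(reference, grader)
```

Because the weights are quadratic, a grader who is off by one severity level is penalized far less than one who is off by several levels, which suits an ordinal scale such as DR severity.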


      Results

      Of the 193 discrepancies between adjudication by retinal specialists and majority decision of ophthalmologists, the most common were missed microaneurysms (MAs) (36%), artifacts (20%), and misclassified hemorrhages (16%). Relative to the reference standard, the kappa ranged from 0.82 to 0.91 for individual retinal specialists and from 0.80 to 0.84 for individual ophthalmologists, and was 0.84 for the algorithm. For moderate or worse DR, the majority decision of ophthalmologists had a sensitivity of 0.838 and specificity of 0.981. The algorithm had a sensitivity of 0.971, specificity of 0.923, and AUC of 0.986. For mild or worse DR, the algorithm had a sensitivity of 0.970, specificity of 0.917, and AUC of 0.986. By using a small number of adjudicated consensus grades as a tuning dataset and higher-resolution images as input, the algorithm improved in AUC from 0.934 to 0.986 for moderate or worse DR.
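As an illustration of how a majority decision can be scored at a severity cutoff, the following sketch binarizes grades at a threshold and computes sensitivity and specificity against a reference standard. It assumes integer grades from 0 (no DR) to 4 (proliferative DR), so that "moderate or worse" corresponds to a cutoff of 2. The function names, the tie-breaking rule, and the example grades are hypothetical and do not reproduce the study's procedure or data.

```python
from collections import Counter


def majority_grade(grades):
    """Most frequent grade among graders for one image.

    Ties are broken toward the more severe grade; this rule is an
    assumption of the sketch, not something stated in the abstract.
    """
    counts = Counter(grades)
    top = max(counts.values())
    return max(g for g, c in counts.items() if c == top)


def sensitivity_specificity(reference, predicted, cutoff):
    """Sensitivity and specificity for 'grade >= cutoff' as the positive class."""
    tp = fn = tn = fp = 0
    for ref, pred in zip(reference, predicted):
        ref_pos = ref >= cutoff
        pred_pos = pred >= cutoff
        if ref_pos and pred_pos:
            tp += 1
        elif ref_pos:
            fn += 1
        elif pred_pos:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up data: adjudicated reference grades and three graders per image.
reference = [0, 1, 2, 3, 2, 0]
grader_panel = [[0, 0, 1], [1, 0, 1], [2, 1, 1], [3, 3, 2], [2, 2, 1], [0, 0, 0]]
majority = [majority_grade(g) for g in grader_panel]

# Cutoff 2 corresponds to "moderate or worse DR" under the assumed scale.
sens, spec = sensitivity_specificity(reference, majority, cutoff=2)
```

In this toy data the majority misses one image that the reference grades as moderate, which lowers sensitivity while leaving specificity unaffected.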


      Conclusions

      Adjudication reduces the errors in DR grading. A small set of adjudicated DR grades allows substantial improvements in algorithm performance. The resulting algorithm's performance was on par with that of individual U.S. board-certified ophthalmologists and retinal specialists.

      Abbreviations and Acronyms:

      AUC (area under the curve), DME (diabetic macular edema), DR (diabetic retinopathy), ICDR (International Clinical Diabetic Retinopathy), MA (microaneurysm)

