Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/108595
DC Field / Value / Language
dc.contributor: School of Optometry [en_US]
dc.contributor: Research Centre for SHARP Vision [en_US]
dc.creator: Wang, Y [en_US]
dc.creator: Han, X [en_US]
dc.creator: Li, C [en_US]
dc.creator: Luo, L [en_US]
dc.creator: Yin, Q [en_US]
dc.creator: Zhang, J [en_US]
dc.creator: Peng, G [en_US]
dc.creator: Shi, D [en_US]
dc.creator: He, M [en_US]
dc.date.accessioned: 2024-08-20T01:52:23Z
dc.date.available: 2024-08-20T01:52:23Z
dc.identifier.issn: 1439-4456 [en_US]
dc.identifier.uri: http://hdl.handle.net/10397/108595
dc.language.iso: en [en_US]
dc.publisher: JMIR Publications, Inc. [en_US]
dc.rights: © Yueye Wang, Xiaotong Han, Cong Li, Lixia Luo, Qiuxia Yin, Jian Zhang, Guankai Peng, Danli Shi, Mingguang He. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 14.08.2024. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included. [en_US]
dc.rights: The following publication Wang Y, Han X, Li C, Luo L, Yin Q, Zhang J, Peng G, Shi D, He M. Impact of Gold-Standard Label Errors on Evaluating Performance of Deep Learning Models in Diabetic Retinopathy Screening: Nationwide Real-World Validation Study. J Med Internet Res 2024;26:e52506 is available at https://doi.org/10.2196/52506. [en_US]
dc.subject: Artificial intelligence [en_US]
dc.subject: Diabetes [en_US]
dc.subject: Diabetic retinopathy [en_US]
dc.subject: Real world [en_US]
dc.title: Impact of gold-standard label errors on evaluating performance of deep learning models in diabetic retinopathy screening: nationwide real-world validation study [en_US]
dc.type: Journal/Magazine Article [en_US]
dc.identifier.volume: 26 [en_US]
dc.identifier.doi: 10.2196/52506 [en_US]
dcterms.abstract: Background: For medical artificial intelligence (AI) training and validation, human expert labels are considered the gold standard representing the correct answers or desired outputs for a given data set. These labels serve as a reference or benchmark against which the model's predictions are compared. [en_US]
dcterms.abstract: Objective: This study aimed to assess the accuracy of a custom deep learning (DL) algorithm in classifying diabetic retinopathy (DR) and to demonstrate how label errors may affect this assessment in a nationwide DR-screening program. [en_US]
dcterms.abstract: Methods: Fundus photographs from the Lifeline Express, a nationwide DR-screening program, were analyzed to identify the presence of referable DR using both (1) manual grading by National Health Service England–certified graders and (2) a DL-based DR-screening algorithm with validated laboratory performance. To assess label accuracy, a random sample of images on which the DL algorithm and the labels disagreed was adjudicated by ophthalmologists masked to the previous grading results. The label error rates in this sample were then used to correct the numbers of negative and positive cases in the entire data set, yielding postcorrection labels. The DL algorithm's performance was evaluated against both pre- and postcorrection labels. [en_US]
dcterms.abstract: Results: The analysis included 736,083 images from 237,824 participants. The DL algorithm exhibited a gap between its real-world and lab-reported performance in this nationwide data set, with a sensitivity increase of 12.5% (from 79.6% to 92.5%, P<.001) and a specificity increase of 6.9% (from 91.6% to 98.5%, P<.001). In the random sample, 63.6% (560/880) of negative images and 5.2% (140/2710) of positive images were misclassified in the precorrection human labels. High myopia was the primary reason non-DR images were misclassified as referable DR, while laser spots were predominantly responsible for misclassified referable cases. The estimated label error rate for the entire data set was 1.2%. Label correction was estimated to raise the estimated sensitivity of the DL algorithm by 12.5% (P<.001). [en_US]
dcterms.abstract: Conclusions: Label errors from human image grading, although affecting only a small percentage of images, can significantly alter the performance evaluation of DL algorithms in real-world DR screening. [en_US]
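The label-correction arithmetic summarized in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name and the full-dataset disagreement counts in the test usage are hypothetical, while the two error rates reuse the adjudicated-sample fractions reported in the Results.

```python
# Sketch of the correction step: error rates estimated from an
# adjudicated sample of model-label disagreements are extrapolated
# to the whole data set, and sensitivity is recomputed against the
# postcorrection labels.

def corrected_sensitivity(tp, fn, n_label_neg_disagree, n_label_pos_disagree,
                          neg_err_rate, pos_err_rate):
    """Sensitivity of the model against postcorrection labels.

    tp, fn               -- counts against precorrection labels
    n_label_neg_disagree -- model-positive images labeled negative
    n_label_pos_disagree -- model-negative images labeled positive
    neg_err_rate         -- fraction of label-negative disagreements
                            adjudicated as truly positive
    pos_err_rate         -- fraction of label-positive disagreements
                            adjudicated as truly negative
    """
    # Label-negative images the model flagged that are truly positive
    # move from false positives into the true-positive pool.
    tp_post = tp + neg_err_rate * n_label_neg_disagree
    # Label-positive images the model missed that are truly negative
    # leave the false-negative pool.
    fn_post = fn - pos_err_rate * n_label_pos_disagree
    return tp_post / (tp_post + fn_post)

# Error rates from the adjudicated sample reported in the abstract:
NEG_ERR = 560 / 880    # ~63.6% of label-negative disagreements
POS_ERR = 140 / 2710   # ~5.2% of label-positive disagreements
```

Because far more label-negative disagreements than label-positive ones turn out to be true disease, the correction shifts cases into the true-positive pool and out of the false-negative pool, which is why the estimated sensitivity rises after correction.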
dcterms.accessRights: open access [en_US]
dcterms.bibliographicCitation: Journal of Medical Internet Research, 2024, v. 26, e52506 [en_US]
dcterms.isPartOf: Journal of Medical Internet Research [en_US]
dcterms.issued: 2024
dc.identifier.eissn: 1438-8871 [en_US]
dc.identifier.artn: e52506 [en_US]
dc.description.validate: 202408 bcch [en_US]
dc.description.oa: Version of Record [en_US]
dc.identifier.FolderNumber: OA_Others
dc.description.fundingSource: Others [en_US]
dc.description.fundingText: National Natural Science Foundation of China; Global STEM Professorship Scheme; Fundamental Research Funds of the State Key Laboratory of Ophthalmology; Outstanding PI Research Funds of the State Key Laboratory of Ophthalmology [en_US]
dc.description.pubStatus: Published [en_US]
dc.description.oaCategory: CC [en_US]
Appears in Collections: Journal/Magazine Article
Files in This Item:
File: jmir-2024-1-e52506.pdf (1.98 MB, Adobe PDF)
Open Access Information
Status: open access
File Version: Version of Record

Page views: 107 (as of Nov 10, 2025)
Downloads: 71 (as of Nov 10, 2025)
Web of Science citations: 7 (as of Dec 18, 2025)
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.