Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/115683
Title: Intra- and inter-observer reliability of ChatGPT-4o in thyroid nodule ultrasound feature analysis based on ACR TI-RADS: an image-based study
Authors: Chen, Z 
Chambara, N
Liu, SYW
Chow, TCM
Lai, CMS
Ying, MTC 
Issue Date: Oct-2025
Source: Diagnostics, Oct. 2025, v. 15, no. 20, 2617
Abstract: Background/Objectives: Advances in large language models like ChatGPT-4o have extended their use to medical image analysis. Accurate assessment of thyroid nodule ultrasound features using ACR TI-RADS is crucial for clinical practice. This study aims to evaluate ChatGPT-4o’s intra-observer consistency and its agreement with an expert when assessing these features on ultrasound images according to ACR TI-RADS.
Methods: This cross-sectional study used ultrasound images from 100 thyroid nodules collected prospectively between May 2019 and August 2021. Ultrasound images were analyzed by ChatGPT-4o, following ACR TI-RADS guidelines, to assess thyroid nodule features, including composition, echogenicity, shape, margin, and echogenic foci. The analysis was repeated after one week to evaluate intra-observer reliability. The same ultrasound images were also analyzed by an ultrasound expert to evaluate inter-observer reliability. Agreement was measured using Cohen’s Kappa coefficient, and concordance rates were calculated based on alignment with the expert’s reference classifications.
Results: Intra-observer agreement for ChatGPT-4o was moderate for composition (Kappa = 0.449) and echogenic foci (Kappa = 0.404), with substantial agreement for echogenicity (Kappa = 0.795). Agreement was notably low for shape (Kappa = −0.051) and margin (Kappa = 0.154). Inter-observer agreement between ChatGPT-4o and the expert was generally low, with Kappa values ranging from −0.006 to 0.238, the highest being for echogenic foci. Overall concordance rates between ChatGPT-4o and expert evaluations ranged from 46.6% to 48.2%, with the highest for shape (65%) and the lowest for echogenicity (29%).
Conclusions: ChatGPT-4o showed favorable consistency in assessing some thyroid nodule features in intra-observer analysis, but notable variability in others. Inter-observer comparisons with expert evaluations revealed generally low agreement across all features, despite acceptable concordance for certain imaging characteristics. While promising for specific ultrasound features, ChatGPT-4o’s consistency and accuracy still vary significantly compared to expert assessments.
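The two agreement measures named in the Methods can be illustrated with a minimal Python sketch. It assumes the standard definition of Cohen's Kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each rater's category frequencies, and it treats the concordance rate as the plain proportion of nodules on which the model matches the expert. The function names and the ten-nodule label lists are hypothetical and chosen only for illustration; they are not the study's data or the authors' code.

```python
from collections import Counter
import math


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two equal-length lists of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over categories of the product of marginal shares.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if math.isclose(expected, 1.0):
        # Both raters used one identical category; Kappa is undefined here.
        return float("nan")
    return (observed - expected) / (1 - expected)


def concordance_rate(model_ratings, reference_ratings):
    """Proportion of items where the model matches the reference rater."""
    if len(model_ratings) != len(reference_ratings) or not model_ratings:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(m == r for m, r in zip(model_ratings, reference_ratings))
    return matches / len(model_ratings)


# Hypothetical shape labels for ten nodules (illustration only).
balanced_model = ["wider"] * 6 + ["taller"] * 4
balanced_expert = ["wider"] * 5 + ["taller", "wider"] + ["taller"] * 3

# Same raw concordance, but one category dominates both raters' labels.
skewed_model = ["wider"] * 9 + ["taller"]
skewed_expert = ["taller"] + ["wider"] * 9

for name, model, expert in [
    ("balanced", balanced_model, balanced_expert),
    ("skewed", skewed_model, skewed_expert),
]:
    print(name, round(concordance_rate(model, expert), 3),
          round(cohen_kappa(model, expert), 3))
```

In both hypothetical cases the raters agree on 8 of 10 nodules, yet Kappa is about 0.58 in the balanced case and about -0.11 in the skewed case, because chance agreement is much higher when one category dominates. This shows in general terms how a concordance rate and a Kappa value for the same feature can point in different directions; it is not a claim about the cause of the values reported above.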
Keywords: ChatGPT
Large language model
Observer agreement
Thyroid nodule
Ultrasound features
Publisher: MDPI AG
Journal: Diagnostics 
DOI: 10.3390/diagnostics15202617
Rights: Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
The following publication Chen, Z., Chambara, N., Liu, S. Y. W., Chow, T. C. M., Lai, C. M. S., & Ying, M. T. C. (2025). Intra- and Inter-Observer Reliability of ChatGPT-4o in Thyroid Nodule Ultrasound Feature Analysis Based on ACR TI-RADS: An Image-Based Study. Diagnostics, 15(20), 2617 is available at https://doi.org/10.3390/diagnostics15202617.
Appears in Collections: Journal/Magazine Article

Files in This Item:
File: diagnostics-15-02617.pdf
Size: 646.02 kB
Format: Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.