ChatGPT-5–based large language model analysis versus an FDA-approved AI-CAD system for thyroid nodule ultrasound evaluation

Chen, Z; Ye, M; Liang, J; Chen, F; Ying, MTC

doi:10.1016/j.ejrad.2025.112639

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/118223

Title:	ChatGPT-5–based large language model analysis versus an FDA-approved AI-CAD system for thyroid nodule ultrasound evaluation
Authors:	Chen, Z; Ye, M; Liang, J; Chen, F; Ying, MTC
Issue Date:	Feb-2026
Source:	European journal of radiology, Feb. 2026, v. 195, 112639
Abstract:	Purpose: Recent advances in multimodal large language models (LLMs) have demonstrated promising potential for medical image analysis, yet their diagnostic capability in thyroid ultrasound remains unverified. This study explored the feasibility of ChatGPT-5, the latest multimodal LLM, for thyroid nodule classification and contextualized its diagnostic performance against S-Detect, an FDA-approved commercial computer-aided diagnosis system. Methods: In this prospective study, 141 patients with 186 nodules who underwent preoperative ultrasound and subsequent surgery were enrolled. For S-Detect, the largest transverse grayscale ultrasound image of each nodule was analyzed with automated contouring for binary classification. For ChatGPT-5, cropped transverse and longitudinal nodule ultrasound images were analyzed using a standardized diagnostic prompt for binary classification. Agreement with histopathology was assessed using Kappa statistics; sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (AUC) were calculated. Results: Both systems showed statistically significant ability to distinguish benign from malignant nodules (P < 0.05). Agreement with histopathology was fair for ChatGPT-5 (Kappa = 0.224) and moderate for S-Detect (Kappa = 0.579). ChatGPT-5 demonstrated a sensitivity of 50.8 %, a specificity of 75.8 %, and an accuracy of 59.1 %, whereas S-Detect achieved higher sensitivity (91.9 %) and accuracy (82.3 %) but lower specificity (62.9 %). The AUC for S-Detect (77.4 %) was significantly greater than that for ChatGPT-5 (63.3 %, P < 0.001). Conclusions: ChatGPT-5 demonstrated feasibility for thyroid nodule classification but showed lower diagnostic performance than the licensed, pre-trained S-Detect system and is not yet adequate for medical imaging applications.
Keywords:	ChatGPT; Large language model; S-Detect; Thyroid nodule; Ultrasound
Publisher:	Elsevier Ireland Ltd.
Journal:	European journal of radiology
ISSN:	0720-048X
EISSN:	1872-7727
DOI:	10.1016/j.ejrad.2025.112639
Appears in Collections:	Journal/Magazine Article
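
The agreement and accuracy measures named in the abstract can all be derived from a 2x2 table of classifier output against histopathology. The sketch below is illustrative only and is not the authors' analysis code. The abstract does not report per-cell counts, so the counts used here (124 malignant and 62 benign nodules, and the cell values for each system) are an assumption, back-calculated to be consistent with the reported percentages for 186 nodules. The class and function names are hypothetical, and `binary_auc` assumes the AUC was computed from a single binary operating point, where it reduces to the mean of sensitivity and specificity; the abstract does not state how the AUC was obtained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 table against histopathology (positive = malignant)."""
    tp: int  # malignant, classified malignant
    fn: int  # malignant, classified benign
    tn: int  # benign, classified benign
    fp: int  # benign, classified malignant

    @property
    def n(self) -> int:
        return self.tp + self.fn + self.tn + self.fp


def sensitivity(c: Confusion) -> float:
    return c.tp / (c.tp + c.fn)


def specificity(c: Confusion) -> float:
    return c.tn / (c.tn + c.fp)


def accuracy(c: Confusion) -> float:
    return (c.tp + c.tn) / c.n


def cohen_kappa(c: Confusion) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_observed = accuracy(c)
    actual_pos, actual_neg = c.tp + c.fn, c.tn + c.fp
    pred_pos, pred_neg = c.tp + c.fp, c.tn + c.fn
    p_chance = (actual_pos * pred_pos + actual_neg * pred_neg) / c.n ** 2
    return (p_observed - p_chance) / (1 - p_chance)


def binary_auc(c: Confusion) -> float:
    """AUC of a single-threshold binary classifier (an assumption here)."""
    return (sensitivity(c) + specificity(c)) / 2


# ASSUMED counts, back-calculated from the abstract's percentages;
# they are not reported in the source.
chatgpt5 = Confusion(tp=63, fn=61, tn=47, fp=15)
s_detect = Confusion(tp=114, fn=10, tn=39, fp=23)

for name, c in (("ChatGPT-5", chatgpt5), ("S-Detect", s_detect)):
    print(
        f"{name}: sens={sensitivity(c):.1%} spec={specificity(c):.1%} "
        f"acc={accuracy(c):.1%} kappa={cohen_kappa(c):.3f} "
        f"auc={binary_auc(c):.1%}"
    )
```

Under these assumed counts, every computed value rounds to the figure reported in the abstract, which suggests (but does not confirm) that the reported AUCs correspond to the binary classifications rather than to a continuous score.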

Open Access Information

Status	embargoed access
Embargo End Date	2027-02-28
