Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/115813
Title: DeepSeek-R1 outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in bilingual complex ophthalmology reasoning
Authors: Xu, P 
Wu, Y 
Jin, K
Chen, X 
He, M 
Shi, D 
Issue Date: Aug-2025
Source: Advances in Ophthalmology Practice and Research, Aug.-Sept. 2025, v. 5, no. 3, p. 189-195
Abstract: Purpose: To evaluate the accuracy and reasoning ability of DeepSeek-R1 and three recently released large language models (LLMs) in bilingual complex ophthalmology cases.
Methods: A total of 130 multiple-choice questions (MCQs) related to diagnosis (n = 39) and management (n = 91) were collected from the Chinese ophthalmology senior professional title examination and categorized into six topics. These MCQs were translated into English. Responses from DeepSeek-R1, Gemini 2.0 Pro, OpenAI o1, and o3-mini were generated under default configurations between February 15 and February 20, 2025. Accuracy was calculated as the proportion of correctly answered questions, with omissions and extra answers considered incorrect. Reasoning ability was evaluated by analyzing reasoning logic and the causes of reasoning errors.
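The scoring rule described above (omissions and extra answers both counted as incorrect) amounts to exact-match grading of answer sets. Below is a minimal Python sketch of that rule; the function and variable names (grade_response, accuracy) are illustrative, not taken from the paper.

    # Exact-match grading: a response is correct only if the predicted
    # answer set equals the answer key exactly, so omitted options and
    # extra options both make the response incorrect.
    def grade_response(predicted: set[str], key: set[str]) -> bool:
        return predicted == key

    def accuracy(predictions: list[set[str]], keys: list[set[str]]) -> float:
        correct = sum(grade_response(p, k) for p, k in zip(predictions, keys))
        return correct / len(keys)

    # Toy example: the second response has an extra answer, so 2/3 correct.
    print(accuracy([{"A"}, {"B", "C"}, {"D"}],
                   [{"A"}, {"B"}, {"D"}]))  # 0.666...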
Results: DeepSeek-R1 demonstrated the highest overall accuracy, achieving 0.862 in Chinese MCQs and 0.808 in English MCQs. Gemini 2.0 Pro, OpenAI o1, and OpenAI o3-mini attained accuracies of 0.715, 0.685, and 0.692 in Chinese MCQs (all P < 0.001 compared with DeepSeek-R1), and 0.746 (P = 0.115), 0.723 (P = 0.027), and 0.577 (P < 0.001) in English MCQs, respectively. DeepSeek-R1 achieved the highest accuracy across five of the six topics in both Chinese and English MCQs. It also excelled in management questions posed in Chinese (all P < 0.05). Reasoning ability analysis showed that the four LLMs shared similar reasoning logic. Ignoring key positive history, ignoring key positive signs, misinterpretation of medical data, and overuse of non-first-line interventions were the most common causes of reasoning errors.
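The abstract does not state which statistical test produced the P values above; since all four models answered the same 130 questions, McNemar's test for paired binary outcomes is one conventional choice. The Python sketch below rests on that assumption and is not the paper's documented method.

    # Hypothetical pairwise comparison of per-question correctness for two
    # models graded on the same MCQs (McNemar's test is assumed here; the
    # abstract does not name the test actually used).
    from statsmodels.stats.contingency_tables import mcnemar

    def compare_models(correct_a: list[bool], correct_b: list[bool]) -> float:
        both    = sum(a and b for a, b in zip(correct_a, correct_b))
        a_only  = sum(a and not b for a, b in zip(correct_a, correct_b))
        b_only  = sum(b and not a for a, b in zip(correct_a, correct_b))
        neither = sum(not a and not b for a, b in zip(correct_a, correct_b))
        table = [[both, a_only], [b_only, neither]]
        return mcnemar(table, exact=True).pvalue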
Conclusions: DeepSeek-R1 demonstrated superior performance in bilingual complex ophthalmology reasoning tasks compared with three state-of-the-art LLMs. These findings highlight the potential of advanced LLMs to assist in clinical decision-making and suggest a framework for evaluating reasoning capabilities.
Keywords: Clinical decision support
DeepSeek
Gemini
Large language models
OpenAI
Ophthalmology professional examination
Reasoning ability
Publisher: Elsevier Inc.
Journal: Advances in Ophthalmology Practice and Research
EISSN: 2667-3762
DOI: 10.1016/j.aopr.2025.05.001
Rights: © 2025 The Author(s). Published by Elsevier Inc. on behalf of Zhejiang University Press. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
The following publication Xu, P., Wu, Y., Jin, K., Chen, X., He, M., & Shi, D. (2025). DeepSeek-R1 outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in bilingual complex ophthalmology reasoning. Advances in Ophthalmology Practice and Research, 5(3), 189–195 is available at https://doi.org/10.1016/j.aopr.2025.05.001.
Appears in Collections: Journal/Magazine Article

Files in This Item:
File: 1-s2.0-S2667376225000290-main.pdf (1.55 MB, Adobe PDF)
Open Access Information
Status: open access
File Version: Version of Record


