Meyer, Annika ORCID: 0000-0002-8411-8799, Karay, Yassin ORCID: 0009-0005-6380-158X, Steinbicker, Andrea U ORCID: 0000-0002-5237-961X, Streichert, Thomas ORCID: 0000-0002-6588-720X and Overbeek, Remco ORCID: 0000-0002-4046-0234 (2025). Performance of DeepSeek-R1, ChatGPT (GPT-o3-mini), and Gemini 2.0 Flash on German Medical Multiple-Choice Questions: Comparative Evaluation. JMIR Formative Research, 9. pp. 1-9. JMIR Publications. ISSN 2561-326X

PDF: formative-2025-1-e77357.pdf
Provided under CC license: Creative Commons Attribution.
Download (368 kB)

Abstract

Background: Despite the transformative potential of artificial intelligence (AI)–based chatbots in medicine, their implementation is hindered by data privacy and security concerns. DeepSeek offers a conceivable solution through its capability for local offline operation. However, as of 2025, it remains unclear whether DeepSeek can achieve an accuracy comparable to that of conventional, cloud-based AI chatbots.
Objective: This study aims to evaluate whether DeepSeek, an AI-based chatbot capable of offline operation, achieves answer accuracy on German medical multiple-choice questions (MCQs) comparable to that of the leading chatbots ChatGPT and Gemini, thereby assessing its potential as a privacy-preserving alternative for clinical use.
Methods: A total of 200 interdisciplinary MCQs from the German Progress Test Medicine were administered to ChatGPT (GPT-o3-mini), DeepSeek (DeepSeek-R1), and Gemini (Gemini 2.0 Flash). Accuracy was defined as the proportion of correctly solved questions. Overall differences among the 3 models were tested with the Cochran Q test, while pairwise comparisons were conducted using the McNemar test. Subgroup analyses were performed by medical domain (Fisher exact test) and question length (Wilcoxon rank-sum test). An a priori power analysis indicated a minimum sample size of 195 questions.
Results: All 3 chatbots surpassed the conventional passing threshold of 60%, with accuracies of 96% (192/200) for DeepSeek, 94% (188/200) for Gemini, and 92.5% (185/200) for ChatGPT. The overall difference among models was not statistically significant (P=.10), nor were any pairwise comparisons. However, incorrect responses were significantly associated with longer question length for DeepSeek (P=.049) and ChatGPT (P=.04) but not for Gemini. No significant differences in performance were observed across clinical versus preclinical domains or medical specialties (all P>.05).
Conclusions: Overall, DeepSeek demonstrates outstanding performance on German medical MCQs, comparable to the widely used chatbots ChatGPT and Gemini. Similar to ChatGPT, DeepSeek's performance declined with increasing question length, highlighting verbosity as a persistent challenge for large language models. While DeepSeek's offline capability and lower operational costs are advantageous, its safe and reliable application in clinical contexts requires further investigation.
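The statistical comparisons described in the Methods (Cochran Q across the three related binary outcome series, exact McNemar tests for pairwise contrasts) can be sketched in plain Python. The data below are hypothetical per-question 0/1 accuracy rows, not the study's data; with three models the Cochran Q statistic has 2 degrees of freedom, for which the chi-square survival function reduces to exp(-Q/2).

```python
import math

def cochran_q_3(results):
    """Cochran's Q test for k = 3 related binary samples (df = 2).

    results: list of rows; each row holds the 0/1 outcomes of the
    three models on one question. Returns (Q, p); for df = 2 the
    chi-square survival function is exactly exp(-Q / 2).
    """
    k = 3
    col_totals = [sum(row[j] for row in results) for j in range(k)]
    row_totals = [sum(row) for row in results]
    n = sum(row_totals)
    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:  # all questions answered identically: no signal
        return 0.0, 1.0
    q = k * (k - 1) * sum((t - n / k) ** 2 for t in col_totals) / denom
    return q, math.exp(-q / 2)

def mcnemar_exact(x, y):
    """Exact two-sided McNemar test on paired 0/1 outcome lists,
    based on the binomial distribution of the discordant pairs."""
    b = sum(1 for a, c in zip(x, y) if a == 1 and c == 0)
    c = sum(1 for a, d in zip(x, y) if a == 0 and d == 1)
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(2 * tail, 1.0)

# Hypothetical outcomes for 4 questions x 3 models (illustration only)
demo = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
q, p = cochran_q_3(demo)
```

In practice one would run `mcnemar_exact` on each of the three model pairs only after a significant Cochran Q result, applying a multiple-comparison correction as appropriate.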

Item Type: Article
Creators:
- Meyer, Annika
- Karay, Yassin
- Steinbicker, Andrea U
- Streichert, Thomas
- Overbeek, Remco
(Email and ORCID Put Code: UNSPECIFIED for all creators)
URN: urn:nbn:de:hbz:38-799521
Identification Number: 10.2196/77357
Journal or Publication Title: JMIR Formative Research
Volume: 9
Page Range: pp. 1-9
Number of Pages: 1
Date: 18 December 2025
Publisher: JMIR Publications
ISSN: 2561-326X
Language: English
Faculty: Faculty of Medicine
Divisions: Faculty of Medicine > Anästhesiologie und Operative Intensivmedizin > Klinik für Anästhesiologie und Operative Intensivmedizin
Subjects: Medical sciences; Medicine
OA Funders: Publikationsfonds UzK
Refereed: Yes
URI: http://kups.ub.uni-koeln.de/id/eprint/79952
