- Tıp Eğitimi Dünyası
- Volume: 24 Issue: 74
Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students
Authors : Ahmet Ozan Kaleci, Burcu Şahinbaş, Ezgi Ağadayı, Sümeyye İdil Çelikkaya, Ahmet Altun, Emre Kemal Kardan
Pages : 135-143
DOI: 10.25282/ted.1729174
Publication Date : 2025-12-22
Article Type : Research Paper
Abstract: Background: Medical education in Türkiye is delivered through a six-year, discipline-based curriculum aligned with global trends. Assessment relies largely on multiple-choice questions (MCQs), placing a significant preparation burden on faculty members. AI-powered large language models such as ChatGPT have the potential to ease exam preparation, enhance feedback quality, and support personalized learning. The aim of this study is to evaluate how successfully the ChatGPT-4o model answers MCQs on medical education exams. Additionally, by comparing the model's exam performance and consistency with student achievement, we explore the potential benefits of AI-supported models for medical education.

Methods: This cross-sectional, analytical study was carried out at the [XX] University Faculty of Medicine in Türkiye. During the 2023–2024 academic year, ChatGPT solved multiple-choice questions from seven board exams and one final exam for third-year students, and the results were compared with the students' achievements. Statistical analysis included descriptive statistics, correlation analyses, chi-square tests, McNemar tests, and independent-samples t-tests.

Results: With a correct response rate of 90.2%, ChatGPT outperformed all 293 students in the class. There was no significant difference in correct response rates between the surgical, internal, and fundamental medical sciences (p = 0.742). In several fields, such as psychiatry, neurology, and medical genetics, 100% success was attained. Forensic medicine, family medicine, medical ethics, pulmonary medicine, and thoracic surgery all had success rates below 80%. A retest conducted two months later showed that ChatGPT's success rate had risen slightly, with response consistency at 91.4%.
Conclusions: With its high success rate on medical education exams, ChatGPT has shown considerable promise for helping both students and instructors. However, given its limitations in areas such as clinical reasoning, ethical evaluation, and human-centered medical education, the integration of AI models into educational systems should be carried out strategically and with a human-centered approach. Future instructional strategies should be designed to combine artificial intelligence technologies with human skills.

Keywords: Medical education, Artificial intelligence, Natural language processing, Educational measurement
