- Turkish Journal of Clinics and Laboratory
- Volume:15 Issue:4
Comparative analysis of large language models' performance in breast imaging
Authors : Muhammed Said Beşler
Pages : 542-546
Doi:10.18663/tjcl.1561361
Publication Date : 2024-12-31
Article Type : Research Paper
Abstract : Aim: To evaluate the performance of two flagship models, OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, in breast imaging cases. Material and Methods: The dataset consisted of cases from the publicly available Case of the Month archive of the Society of Breast Imaging. Questions were classified as text-based or as containing images from mammography, ultrasound, magnetic resonance imaging, or hybrid imaging. The accuracy rates of GPT-4o and Claude 3.5 Sonnet were compared using the Mann-Whitney U test. Results: Of the 94 questions in total, 61.7% were image-based. The overall accuracy rate of GPT-4o was higher than that of Claude 3.5 Sonnet (75.4% vs. 67.7%, p=0.432). GPT-4o achieved higher scores on questions based on ultrasound and hybrid imaging, while Claude 3.5 Sonnet performed better on mammography-based questions. Both models reached higher accuracy rates in tumor group cases than in non-tumor group cases (p>0.05 for both). The models' performance in breast imaging cases overall exceeded 75%, ranging between 64-83% for questions involving different imaging modalities. Conclusion: In breast imaging cases, GPT-4o generally achieved higher accuracy rates than Claude 3.5 Sonnet on image-based and other types of questions, but the two models' performances were comparable.
Keywords : artificial intelligence, large language model, breast imaging, mammography, ultrasound