AI vs AI: clinical reasoning performance of language models in orthopedic rehabilitation

Ertuğrul Safran; Yusuf Yaşasın

doi:10.32322/jhsm.1743257

Pages : 825-831

Publication Date : 2025-09-16

Article Type : Research Paper

Abstract :
Aims: This study compared the clinical reasoning and treatment planning performance of three advanced large language models (LLMs), ChatGPT-4o, Gemini 2.5 Pro, and DeepSeek-V3, in orthopedic rehabilitation. Their responses to standardized clinical scenarios were evaluated for alignment with evidence-based physiotherapy practice, focusing on relevance, accuracy, completeness, applicability, and safety awareness.
Methods: An experienced physiotherapist developed three fictional but clinically realistic scenarios involving rotator cuff tendinopathy, lumbar disc herniation with radiculopathy, and anterior cruciate ligament (ACL) reconstruction. The scenarios were submitted to each of the three AI models independently, on the same day, using identical prompts. A blinded expert physiotherapist rated each model's detailed responses on a 5-point Likert scale across five domains: clinical accuracy, relevance, completeness, applicability, and safety awareness. Mean scores and descriptive statistics were calculated.
Results: DeepSeek-V3 was consistently rated highest (5/5) across all domains and scenarios, producing comprehensive and clinically rigorous plans. ChatGPT-4o performed strongly overall, with total scores ranging from 19 to 20 out of 25, though its completeness scores were lower because its milestones were less specific. Gemini 2.5 Pro scored lower overall (average total score 18/25), with particular weaknesses in applicability and clinical relevance in complex cases such as lumbar disc herniation. All models provided evidence-based treatment approaches emphasizing pain management, postural correction, gradual strengthening, and return-to-activity progression. The models differed in their emphasis on lifestyle modification, the depth of patient education, and the integration of psychosocial factors; Gemini was the only model to address psychological readiness in ACL rehabilitation.
Conclusion: AI-generated rehabilitation plans show substantial concordance with current physiotherapy guidelines but vary in detail and clinical practicality. DeepSeek-V3 outperformed the other models in consistency and safety considerations, while ChatGPT-4o balanced clinical accuracy with moderate completeness. Gemini 2.5 Pro’s inclusion of biopsychosocial components offers valuable insights but may require further refinement for clinical applicability. These findings highlight the potential and current limitations of AI tools in orthopedic rehabilitation, suggesting careful model selection based on clinical context and user needs.
Keywords : Artificial Intelligence, Clinical Reasoning, Language Models, Musculoskeletal Diseases, Orthopedic Rehabilitation, Physiotherapy Methods
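
The scoring scheme described in the Methods (five domains, each rated from 1 to 5, summed to a total out of 25 per scenario) can be illustrated with a minimal Python sketch. This is not the authors' analysis code; the function names `total_score` and `mean_total` are hypothetical, and the only ratings used are the ones the abstract states explicitly, namely 5/5 for DeepSeek-V3 in every domain and scenario.

```python
from statistics import mean

# The five rating domains named in the Methods.
DOMAINS = (
    "clinical accuracy",
    "relevance",
    "completeness",
    "applicability",
    "safety awareness",
)

# The three clinical scenarios named in the Methods.
SCENARIOS = (
    "rotator cuff tendinopathy",
    "lumbar disc herniation with radiculopathy",
    "ACL reconstruction",
)


def total_score(ratings):
    """Sum one scenario's five domain ratings (each 1-5) into a total out of 25."""
    for domain in DOMAINS:
        if domain not in ratings:
            raise ValueError(f"missing rating for domain: {domain}")
        if not 1 <= ratings[domain] <= 5:
            raise ValueError(f"rating out of range for domain: {domain}")
    return sum(ratings[domain] for domain in DOMAINS)


def mean_total(ratings_by_scenario):
    """Average the per-scenario totals for one model (hypothetical helper)."""
    return mean(total_score(r) for r in ratings_by_scenario.values())


# Ratings stated in the abstract: DeepSeek-V3 scored 5/5 in all domains and scenarios.
deepseek_v3 = {scenario: {domain: 5 for domain in DOMAINS} for scenario in SCENARIOS}

print(total_score(deepseek_v3["ACL reconstruction"]))
print(mean_total(deepseek_v3))
```

The per-domain ratings for ChatGPT-4o and Gemini 2.5 Pro are not given in the abstract, so they are not reproduced here; the same functions would apply to them once those ratings are taken from the full article.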