Fabricated or accurate? Ethical concerns and citation hallucination in AI-generated scientific writing on musculoskeletal topics

Ertuğrul Safran; Adem Çalı

doi:10.38053/acmj.1746227

Pages : 695-702

Publication Date : 2025-09-15

Article Type : Research Paper

Abstract :
Aims: Large language models (LLMs) such as ChatGPT are increasingly used in academic and clinical writing. While these tools can generate coherent and domain-specific text, concerns persist regarding the accuracy of their automatically generated references. In musculoskeletal rehabilitation, a field heavily reliant on current evidence, the reliability of citations is especially critical. Yet systematic evaluations of citation accuracy in AI-generated scientific content are lacking. This study aimed to evaluate the reference accuracy of scientific texts generated by ChatGPT (GPT-4) in response to musculoskeletal rehabilitation prompts, and to determine whether reference accuracy improves following structured post-generation verification.
Methods: ChatGPT was prompted to generate four scientific paragraphs on musculoskeletal rehabilitation topics (manual therapy, ACL reconstruction, low back pain, and rotator cuff repair), each including 10 references with DOIs. The resulting 40 references were scored for citation quality on a 3-point scale (0=fabricated, 1=partially correct, 2=fully accurate). After the initial evaluation, ChatGPT was asked to verify and revise its references. Scores before and after this step were compared descriptively and with Wilcoxon signed-rank tests, and effect sizes (r) were calculated to estimate the magnitude of improvement.
Results: In the initial generation, only 7.5% of references were fully accurate, 42.5% were completely fabricated, and the remaining 50% were partially correct. After verification, the proportion of fully accurate references rose to 77.5%. Wilcoxon signed-rank testing confirmed a statistically significant improvement in accuracy across all prompts (W=561.0, p<0.001, r=0.60). The most common errors included invalid DOIs, fabricated article titles, and mismatched metadata.
Conclusion: ChatGPT can generate coherent scientific content, but its initial references are frequently inaccurate or fabricated. Structured post-generation verification significantly improves reference accuracy, as confirmed by statistical testing. These findings suggest that LLMs may be integrated as drafting tools in academic and clinical musculoskeletal contexts, but only when accompanied by strict human-led verification of citations.
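The scoring and comparison steps described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' analysis code, and the function names are hypothetical. The initial score counts (3, 20, and 17 of 40) are back-calculated from the reported percentages. The paired toy scores are invented purely to exercise the test and are not study data. Two conventions are assumed because the abstract does not state them: W is taken as the sum of positive ranks, and the effect size as r = |Z| / sqrt(N), with N the number of pairs.

```python
from collections import Counter
from math import sqrt

# 3-point scale from the abstract.
SCORE_LABELS = {0: "fabricated", 1: "partially correct", 2: "fully accurate"}


def score_distribution(scores):
    """Percentage of references at each score level (hypothetical helper)."""
    counts = Counter(scores)
    total = len(scores)
    return {SCORE_LABELS[s]: 100.0 * counts.get(s, 0) / total for s in SCORE_LABELS}


def wilcoxon_signed_rank(before, after):
    """Wilcoxon signed-rank test using the normal approximation.

    Returns (w_plus, z, r). Zero differences are dropped, tied absolute
    differences receive average ranks, and the variance is tie-corrected.
    Assumption: r = |z| / sqrt(number of pairs).
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / sqrt(variance)
    r = abs(z) / sqrt(len(before))
    return w_plus, z, r


# Initial scores implied by the reported percentages for 40 references.
initial_scores = [2] * 3 + [1] * 20 + [0] * 17
print(score_distribution(initial_scores))

# Invented toy pairs (NOT study data), only to show the call.
toy_before = [0, 1, 1, 0, 2]
toy_after = [2, 2, 2, 1, 2]
print(wilcoxon_signed_rank(toy_before, toy_after))
```

For real analyses, an established implementation such as `scipy.stats.wilcoxon` is preferable; the hand-written version above only makes the ranking and effect-size arithmetic explicit.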
Keywords : ChatGPT, artificial intelligence, musculoskeletal rehabilitation, scientific writing, reference accuracy, citation hallucination
