IAD Index of Academic Documents
  • Eğitim ve Yeni Yaklaşımlar Dergisi
  • Volume 8, Issue 2

Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability

Author: Samet Taşçı
Pages: 191-210
DOI: 10.52974/jena.1785369
Views: 100 | Downloads: 161
Publication Date: 2025-12-31
Article Type: Research Paper
Abstract: This study investigates the reliability of large language models (LLMs) in assessing English as a Foreign Language (EFL) writing compared to human raters. Specifically, the performances of ChatGPT 4.0 and DeepSeek R1 were examined across three genres (argumentative, opinion, and persuasive essays) under rubric-free and rubric-based scoring conditions. Participants were 65 undergraduate ELT students at a Turkish university who produced a total of 162 essays. Two experienced human raters scored all essays, and their evaluations demonstrated near-perfect inter-rater reliability, providing a stable benchmark for comparison. The same essays were then rated by ChatGPT and DeepSeek under both scoring conditions. Statistical analyses included intraclass correlation coefficients (ICC), Pearson correlations, paired-samples t-tests, and ANOVAs. Findings revealed that rubric integration substantially improved alignment between AI and human scores, particularly for ChatGPT, which showed stronger sensitivity to rubric criteria than DeepSeek. Genre effects were also evident: opinion essays yielded the highest AI-human agreement, persuasive texts moderate alignment, and argumentative essays the weakest consistency. While both AI tools produced more centralized scores with less variability than human raters, they also exhibited risk-averse tendencies, especially without rubric guidance. The results indicate that AI-based scoring can complement, but not replace, human evaluation, especially in cognitively demanding genres. The study highlights the importance of rubric clarity, prompt design, and genre awareness in maximizing the educational value of AI-assisted writing assessment.
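As an illustration of two of the agreement statistics named in the abstract, the sketch below computes a Pearson correlation and a two-way random, absolute-agreement, single-measures ICC, often written ICC(2,1), for a pair of raters scoring the same essays. This is not the paper's analysis code, and the scores are hypothetical; it only shows the standard formulas those statistics are based on.

```python
# Minimal sketch of rater-agreement statistics (hypothetical data, not the study's).

def pearson(a, b):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def icc2_1(scores):
    """ICC(2,1): rows = essays, columns = raters (two-way random, absolute agreement)."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ms_rows = ss_rows / (n - 1)                              # between-essay mean square
    ms_cols = ss_cols / (k - 1)                              # between-rater mean square
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual mean square
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical 0-100 scores for five essays from a human rater and an AI rater.
human = [72, 85, 64, 90, 78]
ai = [70, 88, 66, 87, 80]
pairs = list(zip(human, ai))
print(f"Pearson r = {pearson(human, ai):.3f}")
print(f"ICC(2,1)  = {icc2_1(pairs):.3f}")
```

In practice such coefficients would be computed with an established statistics package rather than by hand, but the formulas above make explicit what "agreement" means in each case: Pearson captures only rank/linear association, while ICC(2,1) also penalizes systematic score differences between raters.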
Keywords: ChatGPT, DeepSeek, automated writing assessment, scoring rubric, assessment method

ORIGINAL ARTICLE URL

* There may have been changes in the journal, article, conference, book, or preprint information. It is therefore advisable to consult the official page of the source. The information here is shared for informational purposes only; IAD is not responsible for incorrect or missing information.


İzmir Academy Association
Copyright © 2023-2026