IAD Index of Academic Documents
  • İnönü Üniversitesi Eğitim Bilimleri Enstitüsü Dergisi
  • Volume 12, Issue 24

Assessing the Reliability of Open-Ended Exams: A Generalizability Theory Approach to Item and Rater Variance

Authors: Mustafa Köroğlu
Pages: 56-69
DOI: 10.29129/inujgse.1740879
Views: 68 | Downloads: 657
Publication Date: 2025-10-24
Article Type: Research Paper
Abstract: This study examines the reliability of open-ended university exams through the lens of Generalizability Theory (GT), aiming to identify key sources of measurement error. Using a fully crossed person × item × rater (p × i × r) design, a five-item written exam administered to 76 students was scored by two raters. The Generalizability Study (G-Study) revealed that the largest portions of total score variance stemmed from individual student differences (62.2%) and the person × item interaction (30.7%), while the item-related (3.9%) and rater-related (1.5%) variance components were relatively minor. These results suggest that the exam effectively captures individual performance differences and that increasing item coverage may significantly reduce measurement error. Findings from the Decision Study (D-Study) indicated that expanding the number of items from 4 to 10 and the number of raters from 1 to 5 led to substantial reductions in both relative (σ²δ) and absolute (σ²Δ) error variances. Correspondingly, the generalizability and Phi coefficients increased from 0.81 to 0.95. The low rater variance implies that the use of detailed scoring rubrics and rater training contributed to consistent scoring. Moreover, residual error was minimal (1.6%), suggesting strong model fit. From a practical standpoint, the results support increasing the item count to at least eight and involving at least three raters to optimize reliability. The study demonstrates the effectiveness of GT in dissecting multiple sources of error and offers guidance for improving assessment quality in higher education. Emphasizing item diversity, rater standardization, and data-informed decision-making can strengthen the validity and fairness of exam-based evaluations.
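The D-Study logic described in the abstract can be sketched as a small calculation. The snippet below is a minimal illustration, not the authors' code: it takes the variance proportions reported in the abstract (person 62.2%, person × item 30.7%, item 3.9%, rater 1.5%, residual 1.6%; the person × rater component is not reported and is assumed to be ~0) and applies the standard GT formulas for relative error σ²δ, absolute error σ²Δ, and the G and Phi coefficients in a fully crossed p × i × r design. Because the published components are rounded percentages, the resulting coefficients will only approximate the values quoted in the abstract.

```python
# Hedged sketch of a Generalizability Theory D-Study for a fully crossed
# p x i x r design, using the (rounded) variance proportions reported in
# the abstract. The person x rater component is an assumption (set to 0).

def d_study(var, n_items, n_raters):
    """Return (G, Phi) for given numbers of items and raters.

    Relative error includes only interactions involving persons;
    absolute error also includes the item and rater main effects.
    """
    rel_err = (var["pi"] / n_items
               + var.get("pr", 0.0) / n_raters
               + var["pir_e"] / (n_items * n_raters))
    abs_err = (rel_err
               + var["i"] / n_items
               + var["r"] / n_raters
               + var.get("ir", 0.0) / (n_items * n_raters))
    g = var["p"] / (var["p"] + rel_err)      # generalizability coefficient
    phi = var["p"] / (var["p"] + abs_err)    # Phi (dependability) coefficient
    return g, phi

# Variance proportions from the abstract (rounded percentages).
components = {"p": 0.622, "i": 0.039, "r": 0.015,
              "pi": 0.307, "pir_e": 0.016}

for n_i, n_r in [(4, 1), (5, 2), (8, 3), (10, 5)]:
    g, phi = d_study(components, n_i, n_r)
    print(f"items={n_i:2d} raters={n_r}: G={g:.3f}, Phi={phi:.3f}")
```

Running the loop shows the pattern the D-Study reports: both coefficients rise as items and raters are added, with item count dominating because the person × item interaction is by far the largest error source.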
Keywords: Generalizability theory, measurement and evaluation, written exam, reliability

ORIGINAL ARTICLE URL

* The journal, article, conference, book, or preprint information listed here may have changed; it is therefore advisable to consult the official page of the source. The information here is shared for informational purposes only, and IAD is not responsible for incorrect or missing information.


Index of Academic Documents
İzmir Academy Association
Copyright © 2023-2026