BMC Medical Education, vol. 25, no. 1, p. 641, 2025 (SCI-Expanded)
Background: Objective Structured Clinical Examinations (OSCEs) are widely used in medical education to assess students’ clinical and professional skills. Recent advances in artificial intelligence (AI) offer opportunities to complement human evaluation. This study aims to explore the consistency between human and AI evaluators in assessing medical students’ clinical skills during OSCEs.
Methods: This cross-sectional study was conducted at a state university in Turkey and focused on pre-clinical medical students (Years 1, 2, and 3). Four clinical skills (intramuscular injection, square knot tying, basic life support, and urinary catheterization) were evaluated during OSCEs at the end of the 2023–2024 academic year. Each student’s performance was recorded on video and assessed by five evaluators: a real-time human assessor, two video-based expert human assessors, and two AI-based systems (ChatGPT-4o and Gemini 1.5 Flash). Evaluations were based on standardized checklists validated by the university. Data were collected from 196 students, with per-skill sample sizes ranging from 43 to 58. Inter-rater consistency among the evaluators was analyzed statistically.
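The abstract does not name the specific statistical methods used for the consistency analysis. The Python sketch below is only an illustration of one common approach to quantifying agreement among multiple raters (an intraclass correlation coefficient); the data, rater labels, and choice of ICC are assumptions for demonstration, not the authors’ actual analysis.

    # Illustrative sketch only: the abstract does not specify the method, so an
    # intraclass correlation coefficient (ICC) is assumed here as one common
    # measure of inter-rater consistency. All data and labels are hypothetical.
    import pandas as pd
    import pingouin as pg

    # Long-format table: one row per (student, rater) pair holding the total
    # checklist score for a single skill, rated by all five evaluators.
    scores = pd.DataFrame({
        "student": ["S1"] * 5 + ["S2"] * 5 + ["S3"] * 5,
        "rater": ["live", "video1", "video2", "chatgpt4o", "gemini"] * 3,
        "score": [25, 24, 26, 28, 29,
                  22, 23, 21, 27, 26,
                  27, 26, 28, 29, 30],
    })

    # pingouin reports several ICC variants (e.g., two-way random effects,
    # absolute agreement); the variant should match the rater design.
    icc = pg.intraclass_corr(data=scores, targets="student",
                             raters="rater", ratings="score")
    print(icc[["Type", "ICC", "CI95%"]])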
Results: The AI models generally assigned higher scores than the human evaluators. For intramuscular injection, the mean total score given by AI was 28.23, versus 25.25 for human evaluators. For knot tying, AI scores averaged 16.07 versus 10.44 for humans. For basic life support, AI scores averaged 17.05 versus 16.48 for humans. For urinary catheterization, mean total scores were similar (AI: 26.68; humans: 27.02), but scores for individual checklist criteria varied considerably. Inter-rater consistency was higher for visually observable steps, while steps requiring auditory cues produced greater discrepancies between AI and human evaluators.
Conclusions: AI shows promise as a supplemental tool for OSCE evaluation, especially for visually based clinical skills. However, its reliability varies with the perceptual demands of the skill being assessed. The higher and more uniform scores given by AI suggest potential for standardization, but further refinement is needed before AI can accurately assess skills that rely on verbal communication or auditory cues.