Pediatric Dentistry, vol. 47, no. 6, pp. 408-448, 2025 (SCI-Expanded)
Purpose: To evaluate and compare the accuracy and completeness of responses provided by ScholarGPT and ChatGPT-4 Omni (ChatGPT-4o) to clinical questions in pediatric dentistry.

Methods: Thirty clinical questions across six clinical topics were developed. Responses were collected from ScholarGPT and ChatGPT-4o and independently evaluated by six experienced pediatric dentists. The evaluators rated accuracy on a five-point Likert scale and completeness on a three-point scale. Accuracy reflected factual correctness, relevance, and coherence, while completeness reflected how fully the response addressed the question. Statistical analysis was performed using non-parametric tests, including the Wilcoxon signed-rank and Kruskal-Wallis tests.

Results: ScholarGPT received significantly higher median accuracy scores (5, interquartile range [IQR]=0) than ChatGPT-4o (4, IQR=1) across all topics (P<0.001). ScholarGPT (3, IQR=1) also outperformed ChatGPT-4o (2, IQR=0) in completeness scores (P<0.001). ScholarGPT showed the highest accuracy in "fissure sealants" (5, IQR=0) and the lowest in "development of dentition and occlusion" (5, IQR=1). ChatGPT-4o yielded the lowest accuracy in "development of dentition and occlusion" (4, IQR=2) and the highest in "fluoride" (4, IQR=1). Accuracy scores varied significantly across topics for both ChatGPT-4o (P=0.012) and ScholarGPT (P=0.001), whereas significant differences in completeness across topics were observed only for ScholarGPT (P=0.008).

Conclusions: ScholarGPT provided more accurate and complete responses to pediatric dentistry questions than ChatGPT-4o, suggesting that domain-specific artificial intelligence tools can aid dental education and clinical support, though further refinement is needed.
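For readers unfamiliar with the tests named in the Methods, the sketch below illustrates how such an analysis can be run in Python with SciPy: a Wilcoxon signed-rank test for the paired model-to-model comparison and a Kruskal-Wallis test for differences across topics. All rating values here are hypothetical placeholders, not the study's data.

```python
# Minimal sketch of the abstract's non-parametric analysis, assuming
# question-paired Likert ratings for each model. All values below are
# invented for illustration; they do not reproduce the study's ratings.
from scipy.stats import wilcoxon, kruskal

# Hypothetical accuracy ratings (1-5 Likert) for the same ten questions,
# paired by question across the two models.
scholar_gpt = [5, 5, 4, 5, 5, 4, 5, 5, 5, 4]
chatgpt_4o = [4, 4, 4, 5, 3, 4, 4, 5, 4, 3]

# Wilcoxon signed-rank test: paired comparison of the two models'
# ratings on identical questions.
stat, p = wilcoxon(scholar_gpt, chatgpt_4o)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.4f}")

# Kruskal-Wallis test: do one model's ratings differ across
# (hypothetical) clinical topics?
topic_a = [5, 5, 5, 4]
topic_b = [4, 5, 4, 4]
topic_c = [5, 4, 3, 4]
stat, p = kruskal(topic_a, topic_b, topic_c)
print(f"Kruskal-Wallis: H={stat:.2f}, p={p:.4f}")
```

Both tests are rank-based, which suits ordinal Likert data where parametric assumptions such as normality are unlikely to hold.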