Evaluation of ChatGPT-4's performance on pediatric dentistry questions: accuracy and completeness analysis



Sezer B., Okutan A. E.

BMC ORAL HEALTH, vol. 25, no. 1, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 25 Issue: 1
  • Publication Date: 2025
  • DOI: 10.1186/s12903-025-06791-9
  • Journal Name: BMC ORAL HEALTH
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, EMBASE, MEDLINE, Directory of Open Access Journals
  • Keywords: Artificial intelligence, Chatbot, Dental caries, Dental occlusion, Dental pulp, Fissure sealant, Fluoride, Generative artificial intelligence, Large language models, Oral hygiene
  • Open Archive Collection: AVESİS Open Access Collection
  • Affiliated with Çanakkale Onsekiz Mart University: Yes

Abstract

Background: This study aimed to evaluate the accuracy and completeness of Chat Generative Pre-trained Transformer-4 (ChatGPT-4) responses to frequently asked questions (FAQs) posed by patients and parents, as well as to curricular questions related to pediatric dentistry. Additionally, it sought to determine whether ChatGPT-4's performance varied across different question topics.

Methods: Responses from ChatGPT-4 to 30 FAQs posed by patients and parents and 30 curricular questions covering six pediatric dentistry topics (fissure sealants, fluoride, early childhood caries, oral hygiene practices, development of dentition and occlusion, and pulpal therapy) were evaluated by 30 pediatric dentists. Accuracy was rated on a five-point Likert scale, while completeness was assessed on a three-point scale, each capturing a distinct aspect of response quality. Statistical analyses included Fisher's exact test, the Mann-Whitney U test, the Kruskal-Wallis test, and Bonferroni-adjusted post hoc comparisons.

Results: ChatGPT-4's responses demonstrated high overall accuracy across all question types. Mean accuracy scores were 4.21 ± 0.55 for FAQs and 4.16 ± 0.70 for curricular questions, indicating that responses were generally rated as "good" to "excellent" by pediatric dentists, with no statistically significant difference between the two groups (p = 0.942). Completeness scores were moderate overall, with means of 2.51 ± 0.40 (median: 3) for FAQs and 2.61 ± 1.53 (median: 3) for curricular questions (p = 0.563), reflecting generally acceptable response coverage. Accuracy scores for curricular questions varied significantly by topic (p = 0.007), with the highest score for fissure sealants (4.45 ± 0.62; median: 5) and the lowest for pulpal therapy (3.93 ± 0.93; median: 4).

Conclusion: From a clinical perspective, ChatGPT-4 demonstrates promising accuracy and acceptable completeness in pediatric dental communication. However, its performance in certain curricular areas, particularly fluoride and pulpal therapy, warrants cautious interpretation and requires professional oversight.
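For readers who want to see how the nonparametric comparisons reported in the abstract are typically run, the sketch below is a minimal illustration using SciPy, not the authors' code. All ratings, sample sizes, and topic scores are hypothetical placeholders; Fisher's exact test for the categorical completeness ratings is omitted for brevity.

```python
# Minimal sketch (hypothetical data) of the comparisons described in the Methods:
# Mann-Whitney U for FAQ vs. curricular accuracy scores, and Kruskal-Wallis across
# the six curricular topics with Bonferroni-adjusted pairwise post hoc tests.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 5-point Likert accuracy ratings (one rating per evaluator)
faq_scores = rng.integers(3, 6, size=30)
curricular_scores = rng.integers(3, 6, size=30)

# FAQ vs. curricular questions
u_stat, p_groups = stats.mannwhitneyu(faq_scores, curricular_scores, alternative="two-sided")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_groups:.3f}")

# Accuracy by curricular topic (hypothetical ratings per topic)
topics = {
    "fissure sealants": rng.integers(4, 6, size=30),
    "fluoride": rng.integers(3, 6, size=30),
    "early childhood caries": rng.integers(3, 6, size=30),
    "oral hygiene practices": rng.integers(3, 6, size=30),
    "dentition and occlusion": rng.integers(3, 6, size=30),
    "pulpal therapy": rng.integers(3, 5, size=30),
}
h_stat, p_topic = stats.kruskal(*topics.values())
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_topic:.3f}")

# Bonferroni-adjusted pairwise Mann-Whitney post hoc comparisons
pairs = list(combinations(topics, 2))
alpha_adj = 0.05 / len(pairs)
for a, b in pairs:
    _, p = stats.mannwhitneyu(topics[a], topics[b], alternative="two-sided")
    if p < alpha_adj:
        print(f"{a} vs. {b}: p={p:.4f} (significant at Bonferroni-adjusted alpha={alpha_adj:.4f})")
```

The Bonferroni step simply divides the 0.05 significance threshold by the number of pairwise comparisons, which is the standard way to control the family-wise error rate after a significant Kruskal-Wallis result.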