Aesthetic Component of the Index of Orthodontic Treatment Need (IOTN): Can large language models match the performance of orthodontists?
Korean Journal of Orthodontics, vol.56, no.3, pp.207-220, 2026 (SCI-Expanded, Scopus)
- Publication Type: Article / Full Article
- Volume: 56, Issue: 3
- Publication Date: 2026
- DOI Number: 10.4041/kjod25.320
- Journal Name: Korean Journal of Orthodontics
- Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
- Page Numbers: pp.207-220
- Keywords: Artificial intelligence, Index of orthodontic treatment need, Large language models
- Çanakkale Onsekiz Mart University Affiliated: Yes
Abstract
Objective: The assessment of orthodontic treatment needs often involves subjective judgment, particularly when using esthetic indices such as the Aesthetic Component (AC) of the Index of Orthodontic Treatment Need (IOTN). This cross-sectional diagnostic agreement study evaluated whether large language models (LLMs) could provide consistent and reliable IOTN-AC scores comparable to those assigned by expert orthodontists.
Methods: Three experienced orthodontists (with 8–15 years of clinical experience) independently scored 147 standardized frontal intraoral photographs using the IOTN-AC at two time points (Time 1 and Time 2). Five LLMs (GPT-4.0, GPT-o3, Claude, Manus, and Grok) were used to evaluate the dataset. Agreement and reliability were assessed using intraclass correlation coefficients (ICCs), Pearson and Spearman correlation values, mean absolute error (MAE), and match analyses (exact, near, and group matches).
Results: The orthodontists showed moderate inter-rater reliability (ICC = 0.649; 95% confidence interval [CI], 0.563–0.742). Among the LLMs, GPT-4.0 showed the highest agreement with expert scores (ICC = 0.771; 95% CI, 0.697–0.855), followed by GPT-o3 (ICC = 0.663). GPT-4.0 also yielded the strongest correlation (r = 0.773; P < 0.001) and the lowest MAE (1.09). In the match analyses, GPT-4.0 achieved the highest exact match (28.6%), near-match (47.6%), and group match (66.0%) rates. The remaining models showed lower performance across all metrics.
Conclusions: Multimodal LLMs, particularly GPT-4.0, demonstrated substantial agreement with expert orthodontists in IOTN-AC scoring. These findings suggest that LLMs may serve as adjunct tools in assessments for orthodontic treatment needs. However, clinical decision-making should continue to rely on expert judgment.
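As an illustration of the agreement metrics named in the Methods (MAE and the exact, near, and group match rates), the following is a minimal Python sketch, not the authors' analysis code. The abstract does not define the match criteria, so several points here are assumptions: "near match" is taken to mean a difference of at most one grade (exact matches included), "group match" is taken to mean that both scores fall in the same IOTN-AC need band (grades 1–4, 5–7, 8–10, a commonly used grouping), and the function names `iotn_ac_group` and `agreement_summary` are hypothetical. The scores are made-up toy values, not study data.

```python
from statistics import mean


def iotn_ac_group(grade: int) -> str:
    """Map an IOTN-AC grade (1-10) to a need band.

    Assumption: the 1-4 / 5-7 / 8-10 banding; the abstract does not
    state which grouping the study used.
    """
    if not 1 <= grade <= 10:
        raise ValueError(f"IOTN-AC grade must be between 1 and 10, got {grade}")
    if grade <= 4:
        return "no or slight need"
    if grade <= 7:
        return "borderline need"
    return "definite need"


def agreement_summary(expert: list[int], model: list[int]) -> dict[str, float]:
    """Compute MAE and exact / near / group match rates for paired scores."""
    if not expert or len(expert) != len(model):
        raise ValueError("expert and model must be non-empty and of equal length")
    diffs = [abs(e - m) for e, m in zip(expert, model)]
    n = len(diffs)
    return {
        # Mean absolute difference between paired grades.
        "mae": mean(diffs),
        # Share of photographs with identical grades.
        "exact": sum(d == 0 for d in diffs) / n,
        # Assumption: "near" means within one grade, exact matches included.
        "near": sum(d <= 1 for d in diffs) / n,
        # Share of photographs placed in the same need band.
        "group": sum(
            iotn_ac_group(e) == iotn_ac_group(m) for e, m in zip(expert, model)
        ) / n,
    }


# Toy scores for six photographs (illustrative only, not from the study).
expert_scores = [2, 5, 8, 4, 10, 6]
model_scores = [2, 6, 6, 5, 9, 3]

summary = agreement_summary(expert_scores, model_scores)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

If the study counted near matches separately from exact matches, the `near` line would need `d == 1` instead of `d <= 1`; the abstract alone does not settle which definition applies.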