Aesthetic Component of the Index of Orthodontic Treatment Need (IOTN): Can large language models match the performance of orthodontists?

Arısan, Arda; GENÇ, CELAL; DURAN, GÖKHAN

doi:10.4041/kjod25.320

Arısan A., GENÇ C., DURAN G. S.

Korean Journal of Orthodontics, vol. 56, no. 3, pp. 207-220, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 56 Issue: 3
Publication Date: 2026
DOI: 10.4041/kjod25.320
Journal Name: Korean Journal of Orthodontics
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
Pages: pp. 207-220
Keywords: Artificial intelligence, Index of orthodontic treatment need, Large language models
Affiliated with Çanakkale Onsekiz Mart University: Yes

Abstract

Objective: The assessment of orthodontic treatment needs often involves subjective judgment, particularly when using esthetic indices such as the Aesthetic Component (AC) of the Index of Orthodontic Treatment Need (IOTN). This cross-sectional diagnostic agreement study evaluated whether large language models (LLMs) could provide consistent and reliable IOTN-AC scores comparable to those assigned by expert orthodontists.

Methods: Three experienced orthodontists (with 8–15 years of clinical experience) independently scored 147 standardized frontal intraoral photographs using the IOTN-AC at two time points (Time 1 and Time 2). Five LLMs (GPT-4.0, GPT-o3, Claude, Manus, and Grok) were used to evaluate the dataset. Agreement and reliability were assessed using intraclass correlation coefficients (ICCs), Pearson and Spearman correlation values, mean absolute error (MAE), and match analyses (exact, near, and group matches).

Results: The orthodontists showed moderate inter-rater reliability (ICC = 0.649; 95% confidence interval [CI], 0.563–0.742). Among the LLMs, GPT-4.0 showed the highest agreement with expert scores (ICC = 0.771; 95% CI, 0.697–0.855), followed by GPT-o3 (ICC = 0.663). GPT-4.0 also yielded the strongest correlation (r = 0.773; P < 0.001) and the lowest MAE (1.09). In the match analyses, GPT-4.0 achieved the highest exact match (28.6%), near-match (47.6%), and group match (66.0%) rates. The remaining models showed lower performance across all metrics.

Conclusions: Multimodal LLMs, particularly GPT-4.0, demonstrated substantial agreement with expert orthodontists in IOTN-AC scoring. These findings suggest that LLMs may serve as adjunct tools in assessments for orthodontic treatment needs. However, clinical decision-making should continue to rely on expert judgment.
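
The MAE and match analyses named in the Methods can be illustrated with a minimal Python sketch. The abstract does not define the match categories, so the definitions here are assumptions: a "near" match is taken as a difference of at most one grade (which also counts exact matches), and a "group" match uses the commonly cited IOTN-AC bands of grades 1-4, 5-7, and 8-10. The function names (`ac_group`, `agreement_summary`) and the sample scores are hypothetical and are not data from the study.

```python
from statistics import mean


def ac_group(score: int) -> str:
    """Map an IOTN-AC grade (1-10) to a treatment-need band.

    The cut-offs (1-4, 5-7, 8-10) are an assumption; the abstract
    does not state which grouping the authors used.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"IOTN-AC grade out of range: {score}")
    if score <= 4:
        return "no/slight need"
    if score <= 7:
        return "borderline"
    return "definite need"


def agreement_summary(reference: list[int], predicted: list[int]) -> dict[str, float]:
    """Compute MAE and exact/near/group match rates for paired grades."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(reference, predicted))
    n = len(pairs)
    return {
        # Mean absolute difference between reference and predicted grades.
        "mae": mean(abs(r - p) for r, p in pairs),
        # Identical grade.
        "exact": sum(r == p for r, p in pairs) / n,
        # Assumed definition: within one grade (includes exact matches).
        "near": sum(abs(r - p) <= 1 for r, p in pairs) / n,
        # Same treatment-need band under the assumed cut-offs.
        "group": sum(ac_group(r) == ac_group(p) for r, p in pairs) / n,
    }


# Hypothetical scores for five photographs (illustration only).
reference = [1, 4, 5, 8, 10]   # e.g. expert grades
predicted = [1, 5, 7, 8, 7]    # e.g. model grades
summary = agreement_summary(reference, predicted)
print(summary)
```

Under these assumed definitions, a pair such as 4 versus 5 counts as a near match but not as a group match, because the two grades fall on opposite sides of a band boundary. This shows why the near-match and group-match rates can diverge.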