Enhancing systematic reviews in orthodontics: a comparative examination of GPT-3.5 and GPT-4 for generating PICO-based queries with tailored prompts and configurations

Demir, Gizem; Süküt, Yağızalp; Duran, Gökhan; Topsakal, Kübra; Görgülü, Serkan

doi:10.1093/ejo/cjae011

Demir G. B., Süküt Y., Duran G. S., Topsakal K. G., Görgülü S.

European Journal of Orthodontics, vol.46, no.2, 2024 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 46 Issue: 2
Publication Date: 2024
DOI: 10.1093/ejo/cjae011
Journal Name: European Journal of Orthodontics
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, EMBASE
Keywords: artificial intelligence (AI), ChatGPT, prompt engineering, PICO-based queries, database search
Affiliated with Çanakkale Onsekiz Mart University: No

Abstract

Objectives: The rapid advancement of Large Language Models (LLMs) has prompted an exploration of their efficacy in generating PICO-based (Patient, Intervention, Comparison, Outcome) queries, especially in the field of orthodontics. This study aimed to assess the usability of LLMs in aiding systematic review processes, with a specific focus on comparing the performance of ChatGPT 3.5 and ChatGPT 4 using a specialized prompt tailored for orthodontics. Materials/Methods: Five databases were searched to curate a sample of 77 systematic reviews and meta-analyses published between 2016 and 2021. Using prompt engineering techniques, the LLMs were directed to formulate PICO questions, Boolean queries, and relevant keywords. The outputs were subsequently evaluated for accuracy and consistency by independent researchers using three-point and six-point Likert scales. Furthermore, the PICO records of 41 studies, which were compatible with the PROSPERO records, were compared with the responses provided by the models. Results: ChatGPT 3.5 and 4 showed a consistent ability to craft PICO-based queries. Statistically significant differences in accuracy were observed in specific categories, with GPT-4 often outperforming GPT-3.5. Limitations: The study's test set might not cover the full range of LLM application scenarios. The emphasis on specific question types may also not reflect the complete capabilities of the models. Conclusions/Implications: Both ChatGPT 3.5 and 4 can be valuable tools for generating PICO-driven queries in orthodontics when optimally configured. However, the precision required in medical research calls for critical evaluation of LLM-generated outputs and a cautious integration into scientific investigations.