Enhancing systematic reviews in orthodontics: a comparative examination of GPT-3.5 and GPT-4 for generating PICO-based queries with tailored prompts and configurations

Demir G. B., Süküt Y., Duran G. S., Topsakal K. G., Görgülü S.

European Journal of Orthodontics, vol.46, no.2, 2024 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 46 Issue: 2
  • Publication Date: 2024
  • Doi Number: 10.1093/ejo/cjae011
  • Journal Name: European Journal of Orthodontics
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, EMBASE
  • Keywords: artificial intelligence (AI), ChatGPT, prompt engineering, PICO-based queries, database search
  • Çanakkale Onsekiz Mart University Affiliated: No


Objectives: The rapid advancement of large language models (LLMs) has prompted exploration of their efficacy in generating PICO-based (Patient, Intervention, Comparison, Outcome) queries, especially in orthodontics. This study aimed to assess the usability of LLMs in aiding systematic review processes, with a specific focus on comparing the performance of ChatGPT 3.5 and ChatGPT 4 using a specialized prompt tailored for orthodontics.

Materials/Methods: Five databases were searched to curate a sample of 77 systematic reviews and meta-analyses published between 2016 and 2021. Using prompt engineering techniques, the LLMs were directed to formulate PICO questions, Boolean queries, and relevant keywords. The outputs were then evaluated for accuracy and consistency by independent researchers using three-point and six-point Likert scales. In addition, the PICO records of 41 studies that were compatible with the PROSPERO records were compared with the responses provided by the models.

Results: ChatGPT 3.5 and 4 showed a consistent ability to craft PICO-based queries. Statistically significant differences in accuracy were observed in specific categories, with GPT-4 often outperforming GPT-3.5.

Limitations: The study’s test set might not cover the full range of LLM application scenarios. The emphasis on specific question types may also not reflect the complete capabilities of the models.

Conclusions/Implications: Both ChatGPT 3.5 and 4 can be useful tools for generating PICO-driven queries in orthodontics when optimally configured. However, the precision required in medical research demands careful, critical evaluation of LLM-generated outputs and a cautious integration into scientific investigations.
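
For readers unfamiliar with the Boolean queries mentioned under Materials/Methods, the following is a minimal sketch of the conventional way PICO elements are combined into a database search string: synonyms within one element are joined with OR, and the elements are joined with AND. It is not the prompt or procedure used in the study. The function name `build_boolean_query` and the example terms are hypothetical and chosen only for illustration.

```python
def build_boolean_query(pico):
    """Combine PICO terms into a Boolean search string.

    Synonyms inside one PICO element are joined with OR;
    the elements themselves are joined with AND.
    Illustrative only; not the method used in the study.
    """
    groups = []
    for element in ("patient", "intervention", "comparison", "outcome"):
        terms = pico.get(element, [])
        if not terms:
            continue  # skip elements with no terms
        # Quote multi-word terms so they are searched as phrases.
        quoted = ['"{}"'.format(t) if " " in t else t for t in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Hypothetical orthodontic example terms (not taken from the study).
example = {
    "patient": ["malocclusion", "crowding"],
    "intervention": ["clear aligners"],
    "comparison": ["fixed appliances"],
    "outcome": ["treatment duration"],
}
print(build_boolean_query(example))
```

A string built this way would still need adapting to each database's own syntax (field tags, truncation, controlled vocabulary), which is one reason the abstract calls for critical evaluation of generated queries.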