Application and efficacy of artificial intelligence in patient education on spinal cord injuries

Jonas Krueckel, Melanie Ardelt, David Schiffelholz, Josina Straub, Sebastian Siller, Vanessa Hubertus, Sonja Häckel, Denis Bratelj, Christof Wutte, Helena Arias, Franz Hilber, Volker Alt, Siegmund Lang

Research output: Journal article (peer-reviewed)

Abstract

Introduction/background: Spinal cord injuries (SCI) present complex challenges for patients, who increasingly turn to online resources for supplementary information. Large language models (LLMs) such as ChatGPT and Google Gemini have emerged as potential tools for patient education. However, concerns about the accuracy, clarity, and comprehensiveness of their responses remain, particularly in specialized fields such as SCI. This study aimed to evaluate the performance of ChatGPT 4, ChatGPT 3.5, and Google Gemini in addressing common patient questions about SCI. Material and methods: A systematic process was used to identify 10 key patient questions related to SCI from online sources, PubMed, and Google Trends. These questions were submitted to ChatGPT 4, ChatGPT 3.5, and Google Gemini using a standardized prompt and a 150-word response cap to elicit expert-like responses. Eight blinded spine surgeons evaluated the chatbot-generated answers for quality, clarity, empathy, and comprehensiveness using a validated rating system. Responses were categorized as “excellent,” “satisfactory with minimal clarification,” “satisfactory with moderate clarification,” or “unsatisfactory.” Results: Across all three models, the majority of responses were rated as either excellent or requiring only minimal clarification. ChatGPT 4 achieved the highest proportion of high-quality responses, with nearly 90% rated as “excellent” or “minimal clarification required.” ChatGPT 3.5 and Google Gemini performed similarly, with slightly lower percentages of high-quality responses. No statistically significant differences were observed between the models in overall performance. Conclusion: In a standardized single-turn, 150-word setting, publicly available LLMs produced largely satisfactory answers to common SCI questions, with comparable performance across models. LLMs can be recommended as adjuncts for general patient education, although their outputs should be reviewed by clinicians within routine care.
Further studies should test multi-turn interactions, include patient and multidisciplinary evaluators, compare chatbot responses with clinician-authored answers, and evaluate the performance of domain-specific medical LLMs. Level of evidence: II.

Original language: English
Journal: European Spine Journal
Early online date: 27 Feb 2026
Publication status: E-pub ahead of print - 27 Feb 2026