TY - JOUR
AU - Dennstädt, Fabio
AU - Schmerder, Max
AU - Riggenbach, Elena
AU - Mose, Lucas
AU - Bryjova, Katarina
AU - Bachmann, Nicolas
AU - Mackeprang, Paul-Henry
AU - Ahmadsei, Maiwand
AU - Sinovcic, Dubravko
AU - Windisch, Paul
AU - Zwahlen, Daniel
AU - Rogers, Susanne
AU - Riesterer, Oliver
AU - Maffei, Martin
AU - Gkika, Eleni
AU - Haddad, Hathal
AU - Peeken, Jan
AU - Putora, Paul Martin
AU - Glatzer, Markus
AU - Putz, Florian
AU - Hoefler, Daniel
AU - Christ, Sebastian M
AU - Filchenko, Irina
AU - Hastings, Janna
AU - Gaio, Roberto
AU - Chiang, Lawrence
AU - Aebersold, Daniel M
AU - Cihoric, Nikola
TI - Comparative Evaluation of a Medical Large Language Model in Answering Real-World Radiation Oncology Questions: Multicenter Observational Study.
JO - Journal of Medical Internet Research
VL - 27
SN - 1439-4456
CY - Richmond, Va.
PB - Healthcare World
M1 - DKFZ-2025-01948
SP - e69752
PY - 2025
AB - Large language models (LLMs) hold promise for supporting clinical tasks, particularly in data-driven and technical disciplines such as radiation oncology. While prior evaluation studies have focused on examination-style settings, LLM performance in real-life clinical scenarios remains unclear. In the future, LLMs might be used as general AI assistants to answer questions arising in clinical practice. It is unclear how well a modern LLM, executed locally within a hospital's infrastructure, would answer such questions compared with clinical experts. This study aimed to assess the performance of a locally deployed, state-of-the-art medical LLM in answering real-world clinical questions in radiation oncology compared with clinical experts. The aim was to evaluate the overall quality of the answers, as well as their potential harmfulness if used for clinical decision-making. Physicians from 10 departments of European hospitals collected questions arising in the clinical practice of radiation oncology. Fifty of these questions were answered by 3 senior radiation oncology experts with at least 10 years of work experience, as well as by the LLM Llama3-OpenBioLLM-70B (Ankit Pal and Malaikannan Sankarasubbu). In a blinded review, physicians rated the overall answer quality on a 5-point Likert scale (quality), assessed whether an answer might be potentially harmful if used for clinical decision-making (harmfulness), and determined whether responses came from an expert or the LLM (recognizability). Clinical experts and the LLM were then compared on quality, harmfulness, and recognizability. There were no significant differences in answer quality between the LLM and the clinical experts (mean scores of 3.38 vs 3.63; median 4.00, IQR 3.00-4.00 vs median 3.67, IQR 3.33-4.00; P=.26; Wilcoxon signed rank test). The answers were deemed potentially harmful in 13
KW - Radiation Oncology
KW - Humans
KW - Language
KW - Large Language Models
KW - Llama-3
KW - artificial intelligence
KW - benchmarking
KW - evaluation
KW - natural language processing
LB - PUB:(DE-HGF)16
C6 - pmid:40986858
DO - 10.2196/69752
UR - https://inrepo02.dkfz.de/record/304842
ER -