%0 Journal Article
%A Dennstädt, Fabio
%A Schmerder, Max
%A Riggenbach, Elena
%A Mose, Lucas
%A Bryjova, Katarina
%A Bachmann, Nicolas
%A Mackeprang, Paul-Henry
%A Ahmadsei, Maiwand
%A Sinovcic, Dubravko
%A Windisch, Paul
%A Zwahlen, Daniel
%A Rogers, Susanne
%A Riesterer, Oliver
%A Maffei, Martin
%A Gkika, Eleni
%A Haddad, Hathal
%A Peeken, Jan
%A Putora, Paul Martin
%A Glatzer, Markus
%A Putz, Florian
%A Hoefler, Daniel
%A Christ, Sebastian M
%A Filchenko, Irina
%A Hastings, Janna
%A Gaio, Roberto
%A Chiang, Lawrence
%A Aebersold, Daniel M
%A Cihoric, Nikola
%T Comparative Evaluation of a Medical Large Language Model in Answering Real-World Radiation Oncology Questions: Multicenter Observational Study.
%J Journal of medical internet research
%V 27
%@ 1439-4456
%C Richmond, Va.
%I Healthcare World
%M DKFZ-2025-01948
%P e69752
%D 2025
%X Large language models (LLMs) hold promise for supporting clinical tasks, particularly in data-driven and technical disciplines such as radiation oncology. While prior studies have evaluated LLMs in examination-style settings, their performance in real-life clinical scenarios remains unclear. In the future, LLMs might be used as general AI assistants to answer questions arising in clinical practice. It is unclear how well a modern LLM, locally executed within the infrastructure of a hospital, would answer such questions compared with clinical experts. This study aimed to assess the performance of a locally deployed, state-of-the-art medical LLM in answering real-world clinical questions in radiation oncology compared with clinical experts. The aim was to evaluate the overall quality of the answers, as well as their potential harmfulness if used for clinical decision-making. Physicians from 10 departments of European hospitals collected questions arising in the clinical practice of radiation oncology. Fifty of these questions were answered by 3 senior radiation oncology experts with at least 10 years of work experience, as well as by the LLM Llama3-OpenBioLLM-70B (Ankit Pal and Malaikannan Sankarasubbu). In a blinded review, physicians rated the overall answer quality on a 5-point Likert scale (quality), assessed whether an answer might be potentially harmful if used for clinical decision-making (harmfulness), and determined whether responses came from an expert or the LLM (recognizability). Comparisons between the clinical experts and the LLM were then made for quality, harmfulness, and recognizability. There was no significant difference in answer quality between the LLM and the clinical experts (mean scores of 3.38 vs 3.63; median 4.00, IQR 3.00-4.00 vs median 3.67, IQR 3.33-4.00; P=.26; Wilcoxon signed rank test). The answers were deemed potentially harmful in 13
%K Radiation Oncology
%K Humans
%K Language
%K Large Language Models
%K Llama-3 (Other)
%K artificial intelligence (Other)
%K benchmarking (Other)
%K evaluation (Other)
%K large language models (Other)
%K natural language processing (Other)
%K radiation oncology (Other)
%F PUB:(DE-HGF)16
%9 Journal Article
%$ pmid:40986858
%R 10.2196/69752
%U https://inrepo02.dkfz.de/record/304842