Journal Article DKFZ-2025-01872

The imitation game: large language models versus multidisciplinary tumor boards: benchmarking AI against 21 sarcoma centers from the ring trial.


2025
Springer Heidelberg

Journal of cancer research and clinical oncology 151(9), 248 [10.1007/s00432-025-06304-9]


Please use a persistent id in citations: doi:

Abstract: This study aimed to compare the treatment recommendations generated by four leading large language models (LLMs) with those from the multidisciplinary tumor boards (MTBs) of the 21 sarcoma centers in the sarcoma ring trial in managing complex soft tissue sarcoma (STS) cases. We simulated STS-MTBs using four LLMs (Llama 3.2-vision:90b, Claude 3.5 Sonnet, DeepSeek-R1, and OpenAI-o1) across five anonymized STS cases from the sarcoma ring trial. Each model was queried 21 times per case using a standardized prompt, and the responses were compared with those of the human MTBs in terms of intra-model consistency, treatment recommendation alignment, alternative recommendations, and source citation. LLMs demonstrated high inter-model and intra-model consistency in only 20% of cases, and their recommendations aligned with human consensus in only 20-60% of cases. The model with the highest concordance with the most common MTB recommendation, Claude 3.5 Sonnet, aligned with experts in only 60% of cases. Notably, the recommendations across MTBs were themselves highly heterogeneous, which puts the variable LLM performance in context. Discrepancies were particularly notable where common human recommendations were absent from LLM outputs, which was often the case. Additionally, the sources for the LLMs' recommendation rationale were clearly derived from the German S3 sarcoma guidelines in only 24.8% to 55.2% of the responses. Potentially harmful suggestions were also occasionally observed in the LLMs' alternative recommendations. Despite the considerable heterogeneity observed in MTB recommendations, the significant discrepancies and potentially harmful recommendations highlight the limitations of current AI tools, underscoring that referral to high-volume sarcoma centers remains essential for optimal patient care. At the same time, LLMs could serve as an excellent tool to prepare for MDT discussions.

Keyword(s): Humans (MeSH) ; Sarcoma: therapy (MeSH) ; Sarcoma: pathology (MeSH) ; Benchmarking: methods (MeSH) ; Cancer Care Facilities (MeSH) ; Language (MeSH) ; Large Language Models (MeSH) ; Artificial intelligence ; Clinical decision ; Large language model ; Multidisciplinary tumor board ; Soft tissue sarcoma


Contributing Institute(s):
  1. DKTK Koordinierungsstelle Berlin (BE01)
Research Program(s):
  1. 899 - ohne Topic (POF4-899) (POF4-899)

Appears in the scientific report 2025
Database coverage:
Medline ; BIOSIS Previews ; Biological Abstracts ; Clarivate Analytics Master Journal List ; Current Contents - Life Sciences ; DEAL Springer ; Ebsco Academic Search ; Essential Science Indicators ; SCOPUS ; Science Citation Index Expanded ; Web of Science Core Collection

The record appears in these collections:
Document types > Articles > Journal Article
Public records
Publications database

 Record created 2025-09-10, last modified 2025-09-14


