Benchmarking large language model-based agent systems for clinical decision tasks.

Liu, Yunsong; Jiang, Xiaofeng; Zhang, Li; Jayabalan, Sanddhya; Wölflein, Georg; Lenz, Tim; Hui, Zhouguang; Ferber, Dyke Steven; Kather, Jakob Nikolas; Carrero, Zunamys I

doi:10.1038/s41746-026-02443-6

Journal Article

DKFZ-2026-00392

Liu, Y. ; Carrero, Z. I. ; Jiang, X. ; Ferber, D. S. (DKFZ*) ; Wölflein, G. ; Zhang, L. ; Jayabalan, S. ; Lenz, T. ; Hui, Z. ; Kather, J. N. (DKFZ*)

2026
Macmillan Publishers Limited [Basingstoke]

npj digital medicine nn, nn (2026) [10.1038/s41746-026-02443-6]

Abstract: Agentic artificial intelligence (AI) systems, designed to autonomously reason, plan, and invoke tools, have shown promise in healthcare, yet systematic benchmarking of their real-world performance remains limited. In this study, we evaluate two such systems: the open-source OpenManus, built on Meta's Llama-4 and extended with medically customized agents; and Manus, a proprietary agent system employing a multistep planner-executor-verifier architecture. Both systems were assessed across three benchmark families: AgentClinic, a stepwise dialog-based diagnostic simulation; MedAgentsBench, a knowledge-intensive medical QA dataset; and Humanity's Last Exam (HLE), a suite of challenging text-only and multimodal questions. Despite access to advanced tools (e.g., web browsing, code development and execution, and text file editing), agent systems yielded only modest accuracy gains over baseline LLMs, reaching 60.3% and 28.0% on AgentClinic MedQA and MIMIC, respectively, 30.3% on MedAgentsBench, and 8.6% on HLE text. Multimodal accuracy remained low (15.5% on multimodal HLE, 29.2% on AgentClinic NEJM), while resource demands increased substantially, with >10× token usage and >2× latency. Although 89.9% of hallucinations were filtered by in-agent safeguards, hallucinations remained prevalent. These findings reveal that current agentic designs offer modest performance benefits at significant computational and workflow cost, underscoring the need for more accurate, efficient, and clinically viable agent systems.

Classification:

ddc:610

Note: #NCTZFB9# / epub

Contributing Institute(s):

Koordinierungsstelle NCT Heidelberg (HD02)

Research Program(s):

899 - without topic (POF4-899)

Appears in the scientific report 2026

Database coverage:
Medline ; Article Processing Charges ; Clarivate Analytics Master Journal List ; Current Contents - Clinical Medicine ; DOAJ Seal ; Essential Science Indicators ; Fees ; IF >= 15 ; JCR ; PubMed Central ; SCOPUS ; Science Citation Index Expanded ; Web of Science Core Collection

Record created 2026-02-19, last modified 2026-02-20
