TY  - JOUR
AU  - Liu, Yunsong
AU  - Carrero, Zunamys I
AU  - Jiang, Xiaofeng
AU  - Ferber, Dyke Steven
AU  - Wölflein, Georg
AU  - Zhang, Li
AU  - Jayabalan, Sanddhya
AU  - Lenz, Tim
AU  - Hui, Zhouguang
AU  - Kather, Jakob Nikolas
TI  - Benchmarking large language model-based agent systems for clinical decision tasks.
JO  - npj Digital Medicine
VL  - nn
SN  - 2398-6352
CY  - [Basingstoke]
PB  - Macmillan Publishers Limited
M1  - DKFZ-2026-00392
SP  - nn
PY  - 2026
N1  - #NCTZFB9# / epub
AB  - Agentic artificial intelligence (AI) systems, designed to autonomously reason, plan, and invoke tools, have shown promise in healthcare, yet systematic benchmarking of their real-world performance remains limited. In this study, we evaluate two such systems: the open-source OpenManus, built on Meta's Llama-4 and extended with medically customized agents; and Manus, a proprietary agent system employing a multistep planner-executor-verifier architecture. Both systems were assessed across three benchmark families: AgentClinic, a stepwise dialog-based diagnostic simulation; MedAgentsBench, a knowledge-intensive medical QA dataset; and Humanity's Last Exam (HLE), a suite of challenging text-only and multimodal questions. Despite access to advanced tools (e.g., web browsing, code development and execution, and text file editing), agent systems yielded only modest accuracy gains over baseline LLMs, reaching 60.3
LB  - PUB:(DE-HGF)16
C6  - pmid:41708802
DO  - DOI:10.1038/s41746-026-02443-6
UR  - https://inrepo02.dkfz.de/record/309943
ER  -