001     309943
005     20260220120944.0
024 7 _ |a 10.1038/s41746-026-02443-6
|2 doi
024 7 _ |a pmid:41708802
|2 pmid
037 _ _ |a DKFZ-2026-00392
041 _ _ |a English
082 _ _ |a 610
100 1 _ |a Liu, Yunsong
|b 0
245 _ _ |a Benchmarking large language model-based agent systems for clinical decision tasks.
260 _ _ |a [Basingstoke]
|c 2026
|b Macmillan Publishers Limited
336 7 _ |a article
|2 DRIVER
336 7 _ |a Output Types/Journal article
|2 DataCite
336 7 _ |a Journal Article
|b journal
|m journal
|0 PUB:(DE-HGF)16
|s 1771510234_3004207
|2 PUB:(DE-HGF)
336 7 _ |a ARTICLE
|2 BibTeX
336 7 _ |a JOURNAL_ARTICLE
|2 ORCID
336 7 _ |a Journal Article
|0 0
|2 EndNote
500 _ _ |a #NCTZFB9# / epub
520 _ _ |a Agentic artificial intelligence (AI) systems, designed to autonomously reason, plan, and invoke tools, have shown promise in healthcare, yet systematic benchmarking of their real-world performance remains limited. In this study, we evaluate two such systems: the open-source OpenManus, built on Meta's Llama-4 and extended with medically customized agents; and Manus, a proprietary agent system employing a multistep planner-executor-verifier architecture. Both systems were assessed across three benchmark families: AgentClinic, a stepwise dialog-based diagnostic simulation; MedAgentsBench, a knowledge-intensive medical QA dataset; and Humanity's Last Exam (HLE), a suite of challenging text-only and multimodal questions. Despite access to advanced tools (e.g., web browsing, code development and execution, and text file editing), agent systems yielded only modest accuracy gains over baseline LLMs, reaching 60.3% and 28.0% in AgentClinic MedQA and MIMIC, 30.3% on MedAgentsBench, and 8.6% on HLE text. Multimodal accuracy remained low (15.5% on multimodal HLE, 29.2% on AgentClinic NEJM), while resource demands increased substantially, with >10× token usage and >2× latency. Although 89.9% of hallucinations were filtered by in-agent safeguards, hallucinations remained prevalent. These findings reveal that current agentic designs offer modest performance benefits at significant computational and workflow cost, underscoring the need for more accurate, efficient, and clinically viable agent systems.
536 _ _ |a 899 - ohne Topic (POF4-899)
|0 G:(DE-HGF)POF4-899
|c POF4-899
|f POF IV
|x 0
588 _ _ |a Dataset connected to CrossRef, PubMed, Journals: inrepo02.dkfz.de
700 1 _ |a Carrero, Zunamys I
|b 1
700 1 _ |a Jiang, Xiaofeng
|b 2
700 1 _ |a Ferber, Dyke Steven
|0 P:(DE-He78)166430fa64b69b0f218010db468b95bd
|b 3
700 1 _ |a Wölflein, Georg
|b 4
700 1 _ |a Zhang, Li
|b 5
700 1 _ |a Jayabalan, Sanddhya
|b 6
700 1 _ |a Lenz, Tim
|b 7
700 1 _ |a Hui, Zhouguang
|b 8
700 1 _ |a Kather, Jakob Nikolas
|0 P:(DE-He78)761f5d0f73e0d8f170394b29448a9e8d
|b 9
|u dkfz
773 _ _ |a 10.1038/s41746-026-02443-6
|0 PERI:(DE-600)2925182-5
|p nn
|t npj digital medicine
|v nn
|y 2026
|x 2398-6352
909 C O |o oai:inrepo02.dkfz.de:309943
|p VDB
910 1 _ |a Deutsches Krebsforschungszentrum
|0 I:(DE-588b)2036810-0
|k DKFZ
|b 3
|6 P:(DE-He78)166430fa64b69b0f218010db468b95bd
910 1 _ |a Deutsches Krebsforschungszentrum
|0 I:(DE-588b)2036810-0
|k DKFZ
|b 9
|6 P:(DE-He78)761f5d0f73e0d8f170394b29448a9e8d
913 1 _ |a DE-HGF
|b Programmungebundene Forschung
|l ohne Programm
|1 G:(DE-HGF)POF4-890
|0 G:(DE-HGF)POF4-899
|3 G:(DE-HGF)POF4
|2 G:(DE-HGF)POF4-800
|4 G:(DE-HGF)POF
|v ohne Topic
|x 0
914 1 _ |y 2026
915 _ _ |a JCR
|0 StatID:(DE-HGF)0100
|2 StatID
|b NPJ DIGIT MED : 2022
|d 2025-11-05
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0200
|2 StatID
|b SCOPUS
|d 2025-11-05
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0300
|2 StatID
|b Medline
|d 2025-11-05
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0320
|2 StatID
|b PubMed Central
|d 2025-11-05
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0501
|2 StatID
|b DOAJ Seal
|d 2025-08-21T14:06:20Z
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0500
|2 StatID
|b DOAJ
|d 2025-08-21T14:06:20Z
915 _ _ |a Peer Review
|0 StatID:(DE-HGF)0030
|2 StatID
|b DOAJ : Anonymous peer review
|d 2025-08-21T14:06:20Z
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0199
|2 StatID
|b Clarivate Analytics Master Journal List
|d 2025-11-05
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)1110
|2 StatID
|b Current Contents - Clinical Medicine
|d 2025-11-05
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0160
|2 StatID
|b Essential Science Indicators
|d 2025-11-05
915 _ _ |a WoS
|0 StatID:(DE-HGF)0113
|2 StatID
|b Science Citation Index Expanded
|d 2025-11-05
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0150
|2 StatID
|b Web of Science Core Collection
|d 2025-11-05
915 _ _ |a IF >= 15
|0 StatID:(DE-HGF)9915
|2 StatID
|b NPJ DIGIT MED : 2022
|d 2025-11-05
915 _ _ |a Article Processing Charges
|0 StatID:(DE-HGF)0561
|2 StatID
|d 2025-11-05
915 _ _ |a Fees
|0 StatID:(DE-HGF)0700
|2 StatID
|d 2025-11-05
920 1 _ |0 I:(DE-He78)HD02-20160331
|k HD02
|l Koordinierungsstelle NCT Heidelberg
|x 0
980 _ _ |a journal
980 _ _ |a VDB
980 _ _ |a I:(DE-He78)HD02-20160331
980 _ _ |a UNRESTRICTED

