000309943 001__ 309943
000309943 005__ 20260220120944.0
000309943 0247_ $$2doi$$a10.1038/s41746-026-02443-6
000309943 0247_ $$2pmid$$apmid:41708802
000309943 037__ $$aDKFZ-2026-00392
000309943 041__ $$aEnglish
000309943 082__ $$a610
000309943 1001_ $$aLiu, Yunsong$$b0
000309943 245__ $$aBenchmarking large language model-based agent systems for clinical decision tasks.
000309943 260__ $$a[Basingstoke]$$bMacmillan Publishers Limited$$c2026
000309943 3367_ $$2DRIVER$$aarticle
000309943 3367_ $$2DataCite$$aOutput Types/Journal article
000309943 3367_ $$0PUB:(DE-HGF)16$$2PUB:(DE-HGF)$$aJournal Article$$bjournal$$mjournal$$s1771510234_3004207
000309943 3367_ $$2BibTeX$$aARTICLE
000309943 3367_ $$2ORCID$$aJOURNAL_ARTICLE
000309943 3367_ $$00$$2EndNote$$aJournal Article
000309943 500__ $$a#NCTZFB9# / epub
000309943 520__ $$aAgentic artificial intelligence (AI) systems, designed to autonomously reason, plan, and invoke tools, have shown promise in healthcare, yet systematic benchmarking of their real-world performance remains limited. In this study, we evaluate two such systems: the open-source OpenManus, built on Meta's Llama-4 and extended with medically customized agents; and Manus, a proprietary agent system employing a multistep planner-executor-verifier architecture. Both systems were assessed across three benchmark families: AgentClinic, a stepwise dialog-based diagnostic simulation; MedAgentsBench, a knowledge-intensive medical QA dataset; and Humanity's Last Exam (HLE), a suite of challenging text-only and multimodal questions. Despite access to advanced tools (e.g., web browsing, code development and execution, and text file editing), agent systems yielded only modest accuracy gains over baseline LLMs, reaching 60.3% and 28.0% in AgentClinic MedQA and MIMIC, 30.3% on MedAgentsBench, and 8.6% on HLE text. Multimodal accuracy remained low (15.5% on multimodal HLE, 29.2% on AgentClinic NEJM), while resource demands increased substantially, with >10× token usage and >2× latency. Although 89.9% of hallucinations were filtered by in-agent safeguards, hallucinations remained prevalent. These findings reveal that current agentic designs offer modest performance benefits at significant computational and workflow cost, underscoring the need for more accurate, efficient, and clinically viable agent systems.
000309943 536__ $$0G:(DE-HGF)POF4-899$$a899 - ohne Topic (POF4-899)$$cPOF4-899$$fPOF IV$$x0
000309943 588__ $$aDataset connected to CrossRef, PubMed, Journals: inrepo02.dkfz.de
000309943 7001_ $$aCarrero, Zunamys I$$b1
000309943 7001_ $$aJiang, Xiaofeng$$b2
000309943 7001_ $$0P:(DE-He78)166430fa64b69b0f218010db468b95bd$$aFerber, Dyke Steven$$b3
000309943 7001_ $$aWölflein, Georg$$b4
000309943 7001_ $$aZhang, Li$$b5
000309943 7001_ $$aJayabalan, Sanddhya$$b6
000309943 7001_ $$aLenz, Tim$$b7
000309943 7001_ $$aHui, Zhouguang$$b8
000309943 7001_ $$0P:(DE-He78)761f5d0f73e0d8f170394b29448a9e8d$$aKather, Jakob Nikolas$$b9$$udkfz
000309943 773__ $$0PERI:(DE-600)2925182-5$$a10.1038/s41746-026-02443-6$$pnn$$tnpj digital medicine$$vnn$$x2398-6352$$y2026
000309943 909CO $$ooai:inrepo02.dkfz.de:309943$$pVDB
000309943 9101_ $$0I:(DE-588b)2036810-0$$6P:(DE-He78)166430fa64b69b0f218010db468b95bd$$aDeutsches Krebsforschungszentrum$$b3$$kDKFZ
000309943 9101_ $$0I:(DE-588b)2036810-0$$6P:(DE-He78)761f5d0f73e0d8f170394b29448a9e8d$$aDeutsches Krebsforschungszentrum$$b9$$kDKFZ
000309943 9131_ $$0G:(DE-HGF)POF4-899$$1G:(DE-HGF)POF4-890$$2G:(DE-HGF)POF4-800$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$aDE-HGF$$bProgrammungebundene Forschung$$lohne Programm$$vohne Topic$$x0
000309943 9141_ $$y2026
000309943 915__ $$0StatID:(DE-HGF)0100$$2StatID$$aJCR$$bNPJ DIGIT MED : 2022$$d2025-11-05
000309943 915__ $$0StatID:(DE-HGF)0200$$2StatID$$aDBCoverage$$bSCOPUS$$d2025-11-05
000309943 915__ $$0StatID:(DE-HGF)0300$$2StatID$$aDBCoverage$$bMedline$$d2025-11-05
000309943 915__ $$0StatID:(DE-HGF)0320$$2StatID$$aDBCoverage$$bPubMed Central$$d2025-11-05
000309943 915__ $$0StatID:(DE-HGF)0501$$2StatID$$aDBCoverage$$bDOAJ Seal$$d2025-08-21T14:06:20Z
000309943 915__ $$0StatID:(DE-HGF)0500$$2StatID$$aDBCoverage$$bDOAJ$$d2025-08-21T14:06:20Z
000309943 915__ $$0StatID:(DE-HGF)0030$$2StatID$$aPeer Review$$bDOAJ : Anonymous peer review$$d2025-08-21T14:06:20Z
000309943 915__ $$0StatID:(DE-HGF)0199$$2StatID$$aDBCoverage$$bClarivate Analytics Master Journal List$$d2025-11-05
000309943 915__ $$0StatID:(DE-HGF)1110$$2StatID$$aDBCoverage$$bCurrent Contents - Clinical Medicine$$d2025-11-05
000309943 915__ $$0StatID:(DE-HGF)0160$$2StatID$$aDBCoverage$$bEssential Science Indicators$$d2025-11-05
000309943 915__ $$0StatID:(DE-HGF)0113$$2StatID$$aWoS$$bScience Citation Index Expanded$$d2025-11-05
000309943 915__ $$0StatID:(DE-HGF)0150$$2StatID$$aDBCoverage$$bWeb of Science Core Collection$$d2025-11-05
000309943 915__ $$0StatID:(DE-HGF)9915$$2StatID$$aIF >= 15$$bNPJ DIGIT MED : 2022$$d2025-11-05
000309943 915__ $$0StatID:(DE-HGF)0561$$2StatID$$aArticle Processing Charges$$d2025-11-05
000309943 915__ $$0StatID:(DE-HGF)0700$$2StatID$$aFees$$d2025-11-05
000309943 9201_ $$0I:(DE-He78)HD02-20160331$$kHD02$$lKoordinierungsstelle NCT Heidelberg$$x0
000309943 980__ $$ajournal
000309943 980__ $$aVDB
000309943 980__ $$aI:(DE-He78)HD02-20160331
000309943 980__ $$aUNRESTRICTED