%0 Journal Article
%A Dorfner, Felix J
%A Dada, Amin
%A Busch, Felix
%A Makowski, Marcus R
%A Han, Tianyu
%A Truhn, Daniel
%A Kleesiek, Jens
%A Sushil, Madhumita
%A Adams, Lisa C
%A Bressem, Keno K
%T Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks.
%J Journal of the American Medical Informatics Association
%V 32
%N 6
%@ 1067-5027
%C Oxford
%I Oxford Univ. Press
%M DKFZ-2025-00736
%P 1015-1024
%D 2025
%Z 2025 Jun 1;32(6):1015-1024
%X Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks. We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization, and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities. Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4
%K benchmarking (Other)
%K biomedical fine-tuning (Other)
%K domain-specific adaptation (Other)
%K hallucination in AI models (Other)
%K large language models (LLMs) (Other)
%F PUB:(DE-HGF)16
%9 Journal Article
%$ pmid:40190132
%R 10.1093/jamia/ocaf045
%U https://inrepo02.dkfz.de/record/300283