Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks.

Dorfner, Felix J; Makowski, Marcus R; Sushil, Madhumita; Adams, Lisa C; Han, Tianyu; Dada, Amin; Busch, Felix; Bressem, Keno K; Truhn, Daniel; Kleesiek, Jens

doi:10.1093/jamia/ocaf045

Journal Article

DKFZ-2025-00736

Dorfner, F. J. ; Dada, A. ; Busch, F. ; Makowski, M. R. ; Han, T. ; Truhn, D. ; Kleesiek, J. (DKFZ*) ; Sushil, M. ; Adams, L. C. ; Bressem, K. K.

2025
Oxford Univ. Press Oxford

Journal of the American Medical Informatics Association 32(6), 1015-1024 (2025) [10.1093/jamia/ocaf045]

Please use a persistent id in citations: doi:10.1093/jamia/ocaf045

Abstract: Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks. We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities. Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), smaller biomedical models showed more pronounced underperformance (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate. Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation. Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.

Keyword(s): benchmarking ; biomedical fine-tuning ; domain-specific adaptation ; hallucination in AI models ; large language models (LLMs)

Classification:

ddc:610

Note: 2025 Jun 1;32(6):1015-1024

Contributing Institute(s):

DKTK Koordinierungsstelle Essen/Düsseldorf (ED01)

Research Program(s):

899 - no topic (POF4-899)

Appears in the scientific report 2025

Database coverage:
Medline ; Clarivate Analytics Master Journal List ; Current Contents - Clinical Medicine ; Current Contents - Social and Behavioral Sciences ; Essential Science Indicators ; IF >= 5 ; JCR ; Nationallizenz ; SCOPUS ; Science Citation Index Expanded ; Social Sciences Citation Index ; Web of Science Core Collection

The record appears in these collections:
Document types > Articles > Journal Article
Public records
Publications database

Record created 2025-04-08, last modified 2025-05-20
