Journal Article DKFZ-2025-00736

Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks.


2025
Oxford Univ. Press Oxford

Journal of the American Medical Informatics Association 32(6), 1015-1024 (2025) [10.1093/jamia/ocaf045]


Please use a persistent id in citations: doi:10.1093/jamia/ocaf045

Abstract: Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks. We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization, and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities. Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), smaller biomedical models showed more pronounced underperformance (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate. Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation. Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.

Keyword(s): benchmarking ; biomedical fine-tuning ; domain-specific adaptation ; hallucination in AI models ; large language models (LLMs)


Note: 2025 Jun 1;32(6):1015-1024

Contributing Institute(s):
  1. DKTK Koordinierungsstelle Essen/Düsseldorf (ED01)
Research Program(s):
  1. 899 - without topic (POF4-899)

Appears in the scientific report 2025
Database coverage:
Medline ; Clarivate Analytics Master Journal List ; Current Contents - Clinical Medicine ; Current Contents - Social and Behavioral Sciences ; Essential Science Indicators ; IF >= 5 ; JCR ; Nationallizenz ; SCOPUS ; Science Citation Index Expanded ; Social Sciences Citation Index ; Web of Science Core Collection

The record appears in these collections:
Document types > Articles > Journal Article
Public records
Publications database

 Record created 2025-04-08, last modified 2025-05-20


