Classifying Clinical Evidence Levels of Cancer Variants in Biomedical Literature Using Machine Learning and Large Language Models.

Credidio, Graziella; Größler, Michael; Riemann, Layla Tabea; Knurr, Alexander; Roth, Benjamin

doi:10.3233/SHTI260300

Journal Article

DKFZ-2026-01219

Credidio, G. ; Größler, M. ; Roth, B. (DKFZ*) ; Knurr, A. (DKFZ*) ; Riemann, L. T.

2026
IOS Press Amsterdam

Studies in health technology and informatics 336, 854-858 (2026) [DOI:10.3233/SHTI260300]

Abstract: Automating the classification of clinical evidence levels in biomedical literature can support precision oncology by accelerating variant interpretation and informing decision-making. This study compares the performance of two state-of-the-art large language models (LLMs), GPT-4.1-mini and Gemini-2.5-Flash, and two machine learning (ML) algorithms, decision tree and XGBoost, for classifying publications according to the Clinical Interpretation of Variants in Cancer (CIViC) evidence level system. Zero- and few-shot prompting strategies were tested for the LLMs, while Term Frequency-Inverse Document Frequency (TF-IDF) and word embedding representations were evaluated for the ML models. XGBoost with TF-IDF achieved the highest performance (micro-F1 = 0.83), outperforming both LLMs and decision trees. All models performed best on mid-range evidence levels (B to D) and struggled with high (A) and inferential (E) levels, reflecting dataset imbalance and linguistic ambiguity. These findings suggest that, at present, abstract-level evidence classification is largely driven by explicit lexical cues, with limited added benefit from standalone LLM-based approaches.

Keyword(s): Machine Learning (MeSH) ; Humans (MeSH) ; Neoplasms: genetics (MeSH) ; Neoplasms: classification (MeSH) ; Neoplasms: diagnosis (MeSH) ; Natural Language Processing (MeSH) ; Data Mining: methods (MeSH) ; Algorithms (MeSH) ; Large Language Models (MeSH) ; Clinical Evidence Level ; Large Language Models ; Machine Learning ; Text Classification
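
Note on the metric: the abstract reports micro-F1 and observes that models struggled on levels A and E. The sketch below is not the authors' evaluation code; it is a minimal, standard-library illustration of how micro-F1 pools counts over all five evidence levels and can therefore stay high while rare levels score poorly. The labels and predictions are invented toy data, and the function names are hypothetical.

```python
LEVELS = ["A", "B", "C", "D", "E"]  # CIViC evidence levels named in the abstract


def counts(y_true, y_pred, label):
    """Return (tp, fp, fn) for one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_class_f1(y_true, y_pred, label):
    """F1 for a single evidence level."""
    return f1_from_counts(*counts(y_true, y_pred, label))


def micro_f1(y_true, y_pred, labels=LEVELS):
    """Micro-F1: sum TP/FP/FN over all labels, then compute one F1."""
    tp = fp = fn = 0
    for label in labels:
        c_tp, c_fp, c_fn = counts(y_true, y_pred, label)
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    return f1_from_counts(tp, fp, fn)


# Invented toy data (assumption): imbalanced toward mid-range levels.
y_true = ["A", "B", "B", "B", "B", "C", "C", "C", "D", "E"]
y_pred = ["B", "B", "B", "B", "B", "C", "C", "C", "D", "D"]

if __name__ == "__main__":
    print("micro-F1:", round(micro_f1(y_true, y_pred), 3))
    for level in LEVELS:
        print(level, round(per_class_f1(y_true, y_pred, level), 3))
```

On this toy data the pooled micro-F1 is 0.8 even though the per-class F1 for A and E is 0.0, which is the kind of gap that dataset imbalance can produce. For single-label classification, micro-F1 coincides with plain accuracy.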