From 049c1617e908870d5c2082e81daab85ebf148380 Mon Sep 17 00:00:00 2001 From: Zhiyuan Li <147009309+LZYEIL@users.noreply.github.com> Date: Wed, 4 Mar 2026 20:07:06 -0600 Subject: [PATCH 1/8] Re-format the section 4 --- content/04.adapting.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/content/04.adapting.md b/content/04.adapting.md index f0cdfa8..03b67bd 100644 --- a/content/04.adapting.md +++ b/content/04.adapting.md @@ -1,3 +1,7 @@ -### Adapting General Foundation Models to Biomedical Tasks +## Adapting General Foundation Models to Biomedical Tasks + +*General: +How general-purpose foundation models (e.g., large language and vision models) are adapted to biomedical applications through prompting, fine-tuning, and tool use.* + + -How general-purpose foundation models (e.g., large language and vision models) are adapted to biomedical applications through prompting, fine-tuning, and tool use. From db49b703d9d0f76b4b00fd1246b52028405f7616 Mon Sep 17 00:00:00 2001 From: Zhiyuan Li <147009309+LZYEIL@users.noreply.github.com> Date: Wed, 4 Mar 2026 22:28:36 -0600 Subject: [PATCH 2/8] Add the motivation and intro of section 4 --- content/04.adapting.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/content/04.adapting.md b/content/04.adapting.md index 03b67bd..1a009ad 100644 --- a/content/04.adapting.md +++ b/content/04.adapting.md @@ -4,4 +4,17 @@ How general-purpose foundation models (e.g., large language and vision models) are adapted to biomedical applications through prompting, fine-tuning, and tool use.* +One of the core assumptions of Domain-Adaptive Pretraining (DAPT) models is that foundation models can be pretrained on biomedical corpora in large scale to better process complex downstream tasks. However, recent empirical research by Jeong et al. 
(2025) [@doi:10.48550/arXiv.2411.08870] has raised a systematic challenge to this assumption: "all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question +answering (QA)." + +In fact, modern general-purpose foundational models (such as Llama-3 and GPT-4) [@doi:10.48550/arXiv.2407.21783; @doi:10.48550/arXiv.2303.08774] have already internalized a certain level of biological and medical knowledge with reasoning capabilities: The Llama-3-8B foundational model outperforms the earlier specialized model MEDITRON-70B on multiple medical benchmarks [@doi:10.48550/arXiv.2411.08870]. + +This prompts us to think that if the diminishing marginal returns of domain-specific pre-training become apparent, directly adapting powerful general-purpose foundational models to medical tasks will emerge as a more scalable approach. Compared to training from scratch, adapting general models offers three core advantages: +- Transferable Emergent Abilities: The complex logical reasoning and instruction-following capabilities that general models acquire from massive, heterogeneous data can directly generalize to clinical and biological tasks [@doi:10.48550/arXiv.2005.14165; @doi:10.48550/arXiv.2206.07682]. +- Low Data Requirement: By leveraging prompting or fine-tuning, they can mitigate the bottleneck of data scarcity in the biomedical field [@doi:10.1038/s41586-023-05881-4]. +- Fast Iteration Speed: The rapid pace at which the open-source community and industry update the architectures of foundation models (e.g., the Llama and Qwen series) unlocks new possibilities for the future [@doi:10.48550/arXiv.2001.08361]. + +In this section, we propose a double-class taxonomy of adaptation: +- By Parameter Intervention Level: From prompting, to parameter-efficient fine-tuning (PEFT), and further to instruction tuning. 
+- By Knowledge Injection Method: With internalized parametric knowledge, retrieval-augmented generation (RAG), and the combination of external tools (Tool Use & Agents). From 0cd6a19581c165b85bec428d32d4701a046badd9 Mon Sep 17 00:00:00 2001 From: Zhiyuan Li <147009309+LZYEIL@users.noreply.github.com> Date: Wed, 4 Mar 2026 22:34:11 -0600 Subject: [PATCH 3/8] Fix grammar in section 4 --- content/04.adapting.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/04.adapting.md b/content/04.adapting.md index 1a009ad..10e029b 100644 --- a/content/04.adapting.md +++ b/content/04.adapting.md @@ -10,7 +10,7 @@ answering (QA)." In fact, modern general-purpose foundational models (such as Llama-3 and GPT-4) [@doi:10.48550/arXiv.2407.21783; @doi:10.48550/arXiv.2303.08774] have already internalized a certain level of biological and medical knowledge with reasoning capabilities: The Llama-3-8B foundational model outperforms the earlier specialized model MEDITRON-70B on multiple medical benchmarks [@doi:10.48550/arXiv.2411.08870]. This prompts us to think that if the diminishing marginal returns of domain-specific pre-training become apparent, directly adapting powerful general-purpose foundational models to medical tasks will emerge as a more scalable approach. Compared to training from scratch, adapting general models offers three core advantages: -- Transferable Emergent Abilities: The complex logical reasoning and instruction-following capabilities that general models acquire from massive, heterogeneous data can directly generalize to clinical and biological tasks [@doi:10.48550/arXiv.2005.14165; @doi:10.48550/arXiv.2206.07682]. +- Transferable Emergent Abilities: The complex logical reasoning and instruction-following capabilities that general models acquire from massive, heterogeneous data can directly generalize to clinical and biological tasks [@doi:10.48550/arXiv.2005.14165; @doi:10.48550/arXiv.2206.07682]. 
- Low Data Requirement: By leveraging prompting or fine-tuning, they can mitigate the bottleneck of data scarcity in the biomedical field [@doi:10.1038/s41586-023-05881-4]. - Fast Iteration Speed: The rapid pace at which the open-source community and industry update the architectures of foundation models (e.g., the Llama and Qwen series) unlocks new possibilities for the future [@doi:10.48550/arXiv.2001.08361]. From 2aa3fa73d07682c0f24060309bf5db97e2955bb9 Mon Sep 17 00:00:00 2001 From: Zhiyuan Li <147009309+LZYEIL@users.noreply.github.com> Date: Tue, 10 Mar 2026 13:42:58 -0500 Subject: [PATCH 4/8] Update section 4 subsection --- content/04.adapting.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/content/04.adapting.md b/content/04.adapting.md index 10e029b..c6b6e22 100644 --- a/content/04.adapting.md +++ b/content/04.adapting.md @@ -18,3 +18,7 @@ In this section, we propose a double-class taxonomy of adaptation: - By Parameter Intervention Level: From prompting, to parameter-efficient fine-tuning (PEFT), and further to instruction tuning. - By Knowledge Injection Method: With internalized parametric knowledge, retrieval-augmented generation (RAG), and the combination of external tools (Tool Use & Agents). 
+ +### Prompting-based Adaptation +#### Zero-shot & Few-shot Prompting + From d7ad64c3d48617701ae0536ca00ba38bf6758c45 Mon Sep 17 00:00:00 2001 From: Zhiyuan Li <147009309+LZYEIL@users.noreply.github.com> Date: Sat, 14 Mar 2026 20:40:18 -0500 Subject: [PATCH 5/8] Update section 4 --- content/04.adapting.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/content/04.adapting.md b/content/04.adapting.md index c6b6e22..0aca861 100644 --- a/content/04.adapting.md +++ b/content/04.adapting.md @@ -21,4 +21,10 @@ In this section, we propose a double-class taxonomy of adaptation: ### Prompting-based Adaptation #### Zero-shot & Few-shot Prompting +Zero-shot and few-shot prompting are the fundamental paradigms for evaluating the “out-of-the-box” capabilities of general foundational models. In this setting, the model’s parameters remain completely frozen, and the model is required to perform specific tasks solely through the input prompts. Zero-shot does not provide any task examples to the model; few-shot, on the other hand, incorporates a small number of “input–output” examples to help the model understand the reasoning patterns [@doi:10.48550/arXiv.2005.14165]. In the biomedical field, this adaptation method gives a crucial window into the domain knowledge implicit in general-purpose models. +In clinical medicine, early studies demonstrated that modern large language models (LLMs) already possess substantial medical knowledge under zero-shot conditions. A common strategy is to convert clinical questions into standardized multiple-choice templates—explicitly formatting inputs as “Question – Options (A) to (E) – Answer”. This allows LLMs to process complex diagnostic scenarios in a consistent evaluation framework [@doi:10.48550/arXiv.2303.13375]. 
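As a concrete sketch, the "Question – Options (A) to (E) – Answer" template can be produced by a small formatter; the wording below is illustrative, not the exact template from the cited studies:

```python
# Sketch: wrap a clinical question into the standardized multiple-choice
# template ("Question - Options (A) to (E) - Answer") used for zero-shot QA.
# Template wording and the sample question are illustrative placeholders.

def format_mcq_prompt(question: str, options: list[str]) -> str:
    letters = "ABCDE"
    lines = [f"Question: {question}", "Options:"]
    for letter, option in zip(letters, options):
        lines.append(f"({letter}) {option}")
    lines.append("Answer:")  # the model is expected to continue with a letter
    return "\n".join(lines)

prompt = format_mcq_prompt(
    "Which electrolyte abnormality classically produces peaked T waves on ECG?",
    ["Hyponatremia", "Hyperkalemia", "Hypocalcemia", "Hypomagnesemia", "Hypokalemia"],
)
print(prompt)
```

The same formatter can batch-convert an entire benchmark, which is what makes this evaluation framework consistent across models.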
Such structured prompting has been shown to let models pass professional medical examinations such as the USMLE (United States Medical Licensing Examination) without any domain-specific fine-tuning, and even to outperform earlier specialized medical systems on certain tasks. Moreover, incorporating a small number of expert demonstrations in the prompt has proven highly effective for aligning model outputs with professional medical reasoning patterns. In particular, inserting three to five clinician-authored question–answer examples as few-shot demonstrations in the prompt prefix significantly improves performance on comprehensive benchmarks such as MultiMedQA [@doi:10.1038/s41586-023-06291-2]. These examples provide implicit guidance for both reasoning structure and output style, allowing the model to emulate real-world decision-making processes.
+
+In the fields of molecular biology and chemistry, the primary challenge of prompting lies in mapping non-textual biological sequences into linguistic representations that large language models can process natively. Recent research has explored strategies for representing biological macromolecules as “language-like sequences,” enabling general-purpose LLMs to interpret biological data through purely text-based prompts. One notable approach treats protein sequences as a "second language": by structuring the prompt as a translation task (e.g., directly inputting the amino acid sequence with an instruction such as "Please translate the following protein sequence into its biological function"), the model can infer underlying properties entirely zero-shot [@doi:10.48550/arXiv.2510.11188]. Alternatively, alignment between natural language and protein sequences can be achieved by dynamically concatenating knowledge-guided instructions (such as Gene Ontology definitions) with the target sequence in the prompt [@doi:10.18653/v1/2024.acl-long.62].
In the field of small-molecule chemistry, 1D molecular string representations are routinely utilized to construct few-shot prompts. By wrapping SMILES strings within natural language query templates (e.g., "Describe the pharmacological properties of the molecule: [SMILES]"), general-purpose language models can effectively perform molecular property prediction and chemical reasoning tasks [@doi:10.48550/arXiv.2406.06777]. + +Overall, zero-shot and few-shot prompting showcase the powerful potential of general-purpose foundational models in the biomedical field. Simply through prompt design at the input level, researchers can activate the implicit biomedical knowledge acquired during large-scale pre-training. This capability enables effective adaptation to domain tasks, while also laying the conceptual foundation for later more complex adaptation methods. From 11f32e149800efa48def247b62d6a23e4a4741f0 Mon Sep 17 00:00:00 2001 From: Zhiyuan Li <147009309+LZYEIL@users.noreply.github.com> Date: Sat, 11 Apr 2026 21:16:21 -0500 Subject: [PATCH 6/8] Add RAG --- content/04.adapting.md | 129 ++++++++++++++++++++++++++++++++++++++--- 1 file changed, 120 insertions(+), 9 deletions(-) diff --git a/content/04.adapting.md b/content/04.adapting.md index 0aca861..a3c8238 100644 --- a/content/04.adapting.md +++ b/content/04.adapting.md @@ -3,28 +3,139 @@ *General: How general-purpose foundation models (e.g., large language and vision models) are adapted to biomedical applications through prompting, fine-tuning, and tool use.* - -One of the core assumptions of Domain-Adaptive Pretraining (DAPT) models is that foundation models can be pretrained on biomedical corpora in large scale to better process complex downstream tasks. However, recent empirical research by Jeong et al. 
(2025) [@doi:10.48550/arXiv.2411.08870] has raised a systematic challenge to this assumption: "all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question -answering (QA)." +One of the core assumptions of Domain-Adaptive Pretraining (DAPT) models is that foundation models can be pretrained on biomedical corpora in large scale to better process complex downstream tasks. However, recent empirical research [@doi:10.48550/arXiv.2411.08870] has raised a systematic challenge to this assumption: "all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question answering (QA)." In fact, modern general-purpose foundational models (such as Llama-3 and GPT-4) [@doi:10.48550/arXiv.2407.21783; @doi:10.48550/arXiv.2303.08774] have already internalized a certain level of biological and medical knowledge with reasoning capabilities: The Llama-3-8B foundational model outperforms the earlier specialized model MEDITRON-70B on multiple medical benchmarks [@doi:10.48550/arXiv.2411.08870]. This prompts us to think that if the diminishing marginal returns of domain-specific pre-training become apparent, directly adapting powerful general-purpose foundational models to medical tasks will emerge as a more scalable approach. Compared to training from scratch, adapting general models offers three core advantages: -- Transferable Emergent Abilities: The complex logical reasoning and instruction-following capabilities that general models acquire from massive, heterogeneous data can directly generalize to clinical and biological tasks [@doi:10.48550/arXiv.2005.14165; @doi:10.48550/arXiv.2206.07682]. 
-- Low Data Requirement: By leveraging prompting or fine-tuning, they can mitigate the bottleneck of data scarcity in the biomedical field [@doi:10.1038/s41586-023-05881-4]. -- Fast Iteration Speed: The rapid pace at which the open-source community and industry update the architectures of foundation models (e.g., the Llama and Qwen series) unlocks new possibilities for the future [@doi:10.48550/arXiv.2001.08361]. + +- **Transferable Emergent Abilities:** The complex logical reasoning and instruction-following capabilities that general models acquire from massive, heterogeneous data can directly generalize to clinical and biological tasks [@doi:10.48550/arXiv.2005.14165; @doi:10.48550/arXiv.2206.07682]. +- **Low Data Requirement:** By leveraging prompting or fine-tuning, they can mitigate the bottleneck of data scarcity in the biomedical field [@doi:10.1038/s41586-023-05881-4]. +- **Fast Iteration Speed:** The rapid pace at which the open-source community and industry update the architectures of foundation models (e.g., the Llama and Qwen series) unlocks new possibilities for the future [@doi:10.48550/arXiv.2001.08361]. In this section, we propose a double-class taxonomy of adaptation: -- By Parameter Intervention Level: From prompting, to parameter-efficient fine-tuning (PEFT), and further to instruction tuning. -- By Knowledge Injection Method: With internalized parametric knowledge, retrieval-augmented generation (RAG), and the combination of external tools (Tool Use & Agents). + +- **By Parameter Intervention Level:** From prompting, to parameter-efficient fine-tuning (PEFT), and further to instruction tuning. +- **By Knowledge Injection Method:** With internalized parametric knowledge, retrieval-augmented generation (RAG), and the combination of external tools (Tool Use & Agents). 
+ + + + ### Prompting-based Adaptation + #### Zero-shot & Few-shot Prompting + Zero-shot and few-shot prompting are the fundamental paradigms for evaluating the “out-of-the-box” capabilities of general foundational models. In this setting, the model’s parameters remain completely frozen, and the model is required to perform specific tasks solely through the input prompts. Zero-shot does not provide any task examples to the model; few-shot, on the other hand, incorporates a small number of “input–output” examples to help the model understand the reasoning patterns [@doi:10.48550/arXiv.2005.14165]. In the biomedical field, this adaptation method gives a crucial window into the domain knowledge implicit in general-purpose models. + + + In clinical medicine, early studies demonstrated that modern large language models (LLMs) already possess substantial medical knowledge under zero-shot conditions. A common strategy is to convert clinical questions into standardized multiple-choice templates—explicitly formatting inputs as “Question – Options (A) to (E) – Answer”. This allows LLMs to process complex diagnostic scenarios in a consistent evaluation framework [@doi:10.48550/arXiv.2303.13375]. Such structured prompting have been shown to let models pass professional medical examinations such as the USMLE (United States Medical Licensing Examination) without any domain-specific fine-tuning, and even outperform earlier specialized medical systems in certain tasks. What's more, incorporating a small number of expert demonstrations in the prompt has proven highly effective for aligning model outputs with professional medical reasoning patterns. In particular, inserting three to five clinician-authored question–answer examples as few-shot demonstrations in the prompt prefix significantly improves performance on comprehensive benchmarks such as MultiMedQA [@doi:10.1038/s41586-023-06291-2]. 
These examples provide implicit guidance for both reasoning structure and output style, allowing the model to emulate real-world decision-making processes. + + In the fields of molecular biology and chemistry, the primary challenge of prompting lies in mapping non-textual biological sequences into linguistic representations that large language models can process natively. Recent research has explored strategies for representing biological macromolecules as “language-like sequences,” enabling general-purpose LLMs to interpret biological data through purely text-based prompts. One notable approach treats protein sequences as a "second language." By structuring the prompt as a translation task (e.g., Directly inputting the amino acid sequence with an instruction: "Please translate the following protein sequence into its biological function"), models are enabled to infer underlying properties entirely zero-shot [@doi:10.48550/arXiv.2510.11188]. Alternatively, alignment between natural language and protein sequences can be achieved by dynamically concatenating knowledge-guided instructions (such as Gene Ontology definitions) with the target sequence in the prompt [@doi:10.18653/v1/2024.acl-long.62]. In the field of small-molecule chemistry, 1D molecular string representations are routinely utilized to construct few-shot prompts. By wrapping SMILES strings within natural language query templates (e.g., "Describe the pharmacological properties of the molecule: [SMILES]"), general-purpose language models can effectively perform molecular property prediction and chemical reasoning tasks [@doi:10.48550/arXiv.2406.06777]. -Overall, zero-shot and few-shot prompting showcase the powerful potential of general-purpose foundational models in the biomedical field. Simply through prompt design at the input level, researchers can activate the implicit biomedical knowledge acquired during large-scale pre-training. 
This capability enables effective adaptation to domain tasks, while also laying the conceptual foundation for later more complex adaptation methods.
+
+
+#### Chain-of-Thought & Structured Reasoning Prompting
+
+While standard few-shot prompting excels at extracting explicit knowledge, complex biomedical tasks—such as differential diagnosis or biological pathway analysis—demand multi-step causal reasoning. To bridge this gap, Chain-of-Thought (CoT) prompting and structured reasoning strategies have emerged as critical adaptation paradigms. By explicitly prompting the model to generate intermediate rationales before giving a final answer, CoT unlocks the latent sequential logic embedded within foundational models [@doi:10.48550/arXiv.2201.11903].
+
+
+
+In clinical settings, diagnostic accuracy and reliability are significantly enhanced when models are guided through structured cognitive processes. Beyond basic CoT, self-consistency (SC) decoding strategies—which sample multiple independent reasoning paths and aggregate the most frequent final answer via majority voting—have proven essential for stabilizing medical reasoning [@doi:10.48550/arXiv.2203.11171]. This concept later evolved into Ensemble Refinement (ER) in specialized medical frameworks like Med-PaLM 2, where the model is prompted to refine its own final answer based on a synthesized review of multiple generated rationales [@doi:10.1038/s41591-024-03423-7]. Furthermore, diagnostic performance can be optimized by mimicking specific clinical styles, such as Differential, Analytical, and Bayesian reasoning prompts, which allow models to generate interpretable reasoning chains without sacrificing precision [@doi:10.1038/s41746-024-01010-1].
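At its core, self-consistency is a sample-then-vote loop; the sketch below stubs out the LLM sampler with a toy random reasoner in order to show only the aggregation step:

```python
import random
from collections import Counter

# Sketch of self-consistency (SC) decoding: sample several independent
# reasoning paths and keep the most frequent final answer. The "sampler"
# here is a stub standing in for temperature-based LLM sampling.

def sample_final_answer(rng: random.Random) -> str:
    # Stub: a noisy reasoner that answers "B" about 70% of the time.
    return "B" if rng.random() < 0.7 else rng.choice(["A", "C", "D"])

def self_consistency(n_paths: int, seed: int = 0) -> str:
    rng = random.Random(seed)
    votes = Counter(sample_final_answer(rng) for _ in range(n_paths))
    answer, _count = votes.most_common(1)[0]  # majority vote
    return answer

print(self_consistency(n_paths=25))
```

In a real pipeline each call would be one full chain-of-thought completion, with only the extracted final answer entering the vote.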
Practical implementations in radiology have shown that a two-step structured approach—first categorizing clinical information into history and imaging findings, then performing synthesis—outperforms standard CoT in diagnostic accuracy for complex cases [@doi:10.1007/s11604-024-01712-2]. The potential of such approaches is further explored in randomized clinical trials; while implementations in which physicians freely query LLMs show no significant gains in diagnostic reasoning, the fact that standalone LLMs can exceed physician performance highlights a promising opportunity: systematically integrating structured-reasoning prompting into human-AI workflows to advance clinical impact [@doi:10.1001/jamanetworkopen.2024.40969].
+
+
+
+In genomics, structured reasoning is often used to decompose high-dimensional data into interpretable insights. Rather than direct functional prediction, a "decompose-then-analyze" CoT strategy is employed for gene set functional discovery. This involves guiding the model to analyze individual gene functions and their biological intersections before assigning a name to a gene set, a method that has successfully identified novel functional modules missed by traditional enrichment analysis [@doi:10.1038/s41592-024-02525-x]. For biological summaries, structured scoring prompts—which require models to evaluate candidates across multiple dimensions such as biomarker relevance, therapeutic value, and biological significance—transform qualitative knowledge into quantitative decision-making frameworks [@doi:10.1186/s12967-023-04576-8]. These structured prompting strategies demonstrate that general-purpose models can move beyond standard clinical text generation to perform knowledge-driven biological inference.
+
+
+
+#### Multi-modal Prompting
+
+Beyond textual sequences, in medical image analysis, the paradigm of "promptable segmentation" has redefined zero-shot interaction.
Rather than being confined to predefined anatomical categories, models like the Segment Anything Model (SAM) utilize discrete spatial prompts—such as points, bounding boxes, or masks—to redirect model attention and isolate arbitrary anatomical structures [@doi:10.48550/arXiv.2304.02643]. However, systematic evaluations across MRI, CT, and ultrasound modalities reveal that while zero-shot SAM performs impressively for well-circumscribed objects, its performance is highly variable and drops significantly in ambiguous scenarios, such as brain tumor segmentation [@doi:10.1016/j.media.2023.102918]. For diagnostic classification, the "label-to-prompt" framework established by CLIP, where discrete categories are wrapped into natural language templates, has paved the way for zero-shot pathology recognition without relying on standard supervised classifiers [@doi:10.48550/arXiv.2103.00020].
+
+
+
+For high-level clinical reasoning, interleaved text-image prompting enables general Multimodal Large Language Models (MLLMs) to activate latent medical expertise directly from raw visual inputs. To enhance the interpretability of these zero-shot decisions, strategies like MedCoT (Medical Chain of Thought) utilize hierarchical expert prompts—simulating a process of initial diagnosis, review verification, and expert consensus—to generate explicit reasoning chains for medical VQA [@doi:10.18653/v1/2024.emnlp-main.962]. Systematic evaluations of general models further confirm that prompt engineering can successfully elicit high-quality medical image understanding for report generation and VQA. However, tasks requiring precise medical visual grounding still need substantial improvement, and conventional metrics often fail to capture clinical efficacy [@doi:10.48550/arXiv.2310.20381].
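The label-to-prompt idea reduces to template expansion plus a similarity argmax; below is a minimal sketch with stubbed image–text similarity scores (CLIP itself would compute these from joint embeddings; labels and template wording are hypothetical):

```python
# Sketch of CLIP-style "label-to-prompt" zero-shot classification: each
# discrete label is wrapped into a natural-language template, and the image
# is assigned the label whose text prompt scores highest. The score table
# below is a stub for real image-text similarities.

LABELS = ["normal chest X-ray", "pneumonia", "pleural effusion"]
TEMPLATE = "a radiograph showing {label}"

def label_prompts(labels: list[str]) -> list[str]:
    return [TEMPLATE.format(label=label) for label in labels]

def classify(image_scores: dict[str, float]) -> str:
    # image_scores maps each prompt to a (stubbed) image-text similarity.
    prompts = label_prompts(LABELS)
    best = max(prompts, key=lambda p: image_scores.get(p, float("-inf")))
    return LABELS[prompts.index(best)]

scores = {"a radiograph showing pneumonia": 0.83,
          "a radiograph showing normal chest X-ray": 0.41,
          "a radiograph showing pleural effusion": 0.37}
print(classify(scores))  # the highest-similarity prompt wins
```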
+
+
+
+
+
+#### Limitations of Prompting
+
+Despite the flexibility of prompting-based adaptation, it introduces several structural limitations that compromise its reliability in biomedical environments.
+
+
+
+First, prompting-based adaptation suffers from extreme prompt sensitivity. Few-shot performance is highly susceptible to the permutation order of examples, where even a slight reordering can reduce accuracy to near-random levels [@doi:10.18653/v1/2022.acl-long.556]. In clinical Natural Language Processing (NLP), evaluations across multiple prompt types demonstrate that zero-shot prompting is sensitive to task-specific nuances. Without task-specific tailoring, a significant performance gap remains across different clinical scenarios [@doi:10.2196/55318].
+
+
+
+
+Medical hallucinations also remain a fundamental challenge. Benchmarks like Med-HALT reveal that models frequently generate plausible yet unverified information, exposing significant flaws in their reasoning capabilities [@doi:10.48550/arXiv.2307.15343]. Furthermore, evaluations on real-world healthcare queries (MedHalu) demonstrate that LLMs are profoundly vulnerable to these errors—often underperforming human experts in detecting hallucinations within their own generated responses [@doi:10.48550/arXiv.2409.19492].
+
+
+
+Finally, the processing of biomedical data is hindered by context-dependent degradation. General models universally exhibit a "lost in the middle" phenomenon, where they robustly utilize information at the beginning of a long input context but fail to extract critical features from the center [@doi:10.1162/tacl_a_00638]. When models attempt to leverage long patient contexts, they suffer from underlying memorization issues and struggle severely with cases requiring temporal reasoning [@doi:10.48550/arXiv.2510.18691].
+
+
+
+These structural limitations underscore the necessity of moving beyond pure prompting toward Retrieval-Augmented Generation (RAG).
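The order sensitivity noted in this subsection can be probed systematically by scoring every permutation of the few-shot demonstrations; a stdlib sketch of such a harness, in which the accuracy function is a stub standing in for a real LLM evaluation:

```python
from itertools import permutations

# Sketch: enumerate orderings of few-shot demonstrations to measure
# permutation sensitivity. `evaluate` is a stub standing in for running
# each prompt against an LLM and computing task accuracy.

examples = ["demo_1", "demo_2", "demo_3"]

def build_prompt(ordered: tuple[str, ...], query: str) -> str:
    return "\n\n".join([*ordered, f"Input: {query}\nOutput:"])

def evaluate(prompt: str) -> float:
    # Stub accuracy in [0, 1): pretend order matters by hashing the prompt.
    return (hash(prompt) % 100) / 100

results = {order: evaluate(build_prompt(order, "new case"))
           for order in permutations(examples)}
spread = max(results.values()) - min(results.values())
print(f"{len(results)} orderings, accuracy spread = {spread:.2f}")
```

A large spread across the n! orderings is exactly the instability the cited study reports.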
+
+
+
+
+
+### Retrieval-Augmented Generation (RAG)
+
+To mitigate these problems, Retrieval-Augmented Generation (RAG) introduces a paradigm that combines zero parameter modification with external knowledge injection. This "retrieve-and-generate" framework connects a pre-trained model’s parametric memory with a non-parametric memory (such as a dense vector index of Wikipedia). By grounding generation in retrieved documents, this approach significantly improves factual faithfulness [@doi:10.48550/arXiv.2005.11401].
+
+
+
+#### Motivation for RAG in Biomedicine
+
+The adoption of RAG in the biomedical domain is driven by two core motivations:
+
+- **Literature Explosion:** The sheer volume of biomedical research far exceeds the capacity for foundation models to be frequently re-trained. PubMed currently contains over 37 million citations, with roughly 1.5 million new articles added each year. As a result, continuous pre-training to internalize this massive influx of literature is computationally difficult [@doi:10.48550/arXiv.2501.07171].
+- **The Criticality of Medical Timeliness:** Biomedical knowledge is highly dynamic. The "half-life" of medical knowledge—the time it takes for half of what is currently known to be proven false—has drastically shortened, dropping from 50 years in the 1950s to an estimated 73 days in 2020 [@pmid:21686208]. Clinical guidelines are continuously updated (e.g., pandemic protocols), and since LLMs have a fixed pre-training data cutoff, their static parametric memory becomes obsolete once training ends. RAG bypasses this limitation by fetching real-time data, ensuring that clinical decisions are grounded in current evidence.
+
+
+
+#### Biomedical-Specific Retrieval System Design
+
+The efficacy of a RAG system intrinsically depends on the quality of its retrieval component.
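As a concrete reference point for the retriever designs discussed in this subsection, a stdlib-only sketch of Okapi BM25 scoring over a toy corpus (illustrative only; real biomedical retrieval adds tokenization, stemming, and an inverted index):

```python
import math
from collections import Counter

# Minimal Okapi BM25 sketch with common default parameters (k1=1.5, b=0.75).
# The toy corpus is an illustrative stand-in for a biomedical document store.

corpus = ["metformin lowers blood glucose in type 2 diabetes",
          "statins reduce ldl cholesterol",
          "insulin regulates blood glucose"]
docs = [doc.split() for doc in corpus]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(term for d in docs for term in set(d))  # document frequencies

def bm25(query: str, doc: list[str], k1: float = 1.5, b: float = 0.75) -> float:
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

query = "blood glucose"
ranked = sorted(range(N), key=lambda i: bm25(query, docs[i]), reverse=True)
print(corpus[ranked[0]])
```

Note how the length normalization favors the shorter matching document; it is exactly this purely lexical behavior, blind to synonymy, that motivates the dense retrievers described next.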
The design of has evolved from naive lexical matching to domain-specific semantic architectures to address the unique linguistic complexities of biomedical terminology [@doi:10.48550/arXiv.2505.01146]. + + + +Historically, sparse retrieval methods based on probabilistic relevance frameworks, such as BM25 [@doi:10.1561/1500000019], have served as the baseline for document fetching. While computationally efficient, sparse retrieval struggles with the severe synonymy in biomedicine. To overcome lexical limitations, Dense Passage Retrieval (DPR) [@doi:10.18653/v1/2020.emnlp-main.550] introduced a dual-encoder architecture that maps both queries and documents into a shared continuous vector space. However, general-domain dense retrievers often experience performance degradation when applied out-of-the-box to medical corpora. As exposed by heterogeneous benchmarks like BEIR [@doi:10.48550/arXiv.2104.08663], general dense models even underperform traditional sparse methods like BM25 on clinical datasets due to domain shift. + + + +To bridge the gap, advanced data construction and domain-specific pre-training are strictly required. A prominent milestone is MedCPT (Contrastive Pre-trained Transformers for Medicine) [@doi:10.1093/bioinformatics/btad651]. It achieved SOTA zero-shot performance by pre-training a retriever-reranker architecture on an unprecedented scale of real-world PubMed user search logs (over 255 million interactions), proving that domain-specific contrastive alignment is far more efficient than scaling general parameters. + + + +Also, identifying distinctions between similar but diagnostically distinct medical documents remains a core bottleneck. Recent frameworks like BiCA [@doi:10.48550/arXiv.2511.08029] tackle this by proposing citation-aware hard negative mining. 
By utilizing multi-hop citation links within PubMed to generate contextually irrelevant negative samples, BiCA compels the dense retriever to learn highly precise semantic boundaries, paving the way toward highly data-efficient domain adaptation in biomedical RAG systems.
+
+
+
+#### Applications of RAG in Specific Biomedical Tasks
+
+Under clinical settings, RAG primarily aims to enhance clinical question answering (QA) by grounding foundation models in evidence-based guidelines. MIRAGE benchmark and the MedRAG toolkit [@doi:10.18653/v1/2024.findings-acl.372] have comprehensively demonstrated the impact of retrieval augmentation. Across 41 combinations of corpora, retrievers, and LLMs, RAG improved the accuracy of various models by up to 18%, enabling smaller models like GPT-3.5 and Mixtral to rival the zero-shot performance of GPT-4. However, these benchmarks also highlight the "lost-in-the-middle" effect, echoing the context-dependent degradation seen in prompting-based adaptation.
+
+
+
+To translate performance gains into practical tools, frameworks like Almanac [@doi:10.1056/AIoa2300068] integrate manually curated clinical guidelines and treatment recommendations as high-quality retrieval sources. Evaluated blindly by clinicians across hundreds of clinical queries, Almanac significantly outperformed standard search-augmented LLMs (e.g., ChatGPT-4, Bing, and Bard) in factual accuracy, completeness, and adversarial safety. Furthermore, to address the multi-step nature of complex medical diagnosis, i-MedRAG [@doi:10.48550/arXiv.2408.00727] introduces an iterative follow-up mechanism: the model autonomously generates subsequent queries based on the initial retrieved context, overcoming the bottleneck of insufficient information retrieval inherent in single-turn medical RAG systems.
+
+
+
+In biology, RAG applications shift to querying highly structured bioinformatics databases.
This structural shift often blurs the boundary between traditional RAG and tool-augmented LLMs (a concept further explored in Section *'Tool Use and LLM-based Agents'*). GeneGPT [@doi:10.1093/bioinformatics/btae075] bypasses standard text retrieval by teaching LLMs to execute API calls to NCBI databases (e.g., for gene queries and sequence searches). This tool-augmented RAG paradigm significantly outperforms parametric baselines on genomics benchmarks like GeneTuring. + + + +Similarly, mitigating hallucinations in functional genomics remains critical, as fabricated gene functions can misguide biological experiments. GeneAgent [@doi:10.1038/s41592-025-02748-6] tackles this by introducing a self-verification language agent for gene set analysis. After generating initial functional annotations, GeneAgent automatically queries external biological databases to verify and fact-check its own outputs. This integration of retrieval augmentation and self-correction yields significantly higher biological consistency than standalone GPT-4. From 11a2ff60c969ce106b4b05640affb7b7b78d47b4 Mon Sep 17 00:00:00 2001 From: Zhiyuan Li <147009309+LZYEIL@users.noreply.github.com> Date: Sat, 11 Apr 2026 22:46:18 -0500 Subject: [PATCH 7/8] Add RAG (v2) --- content/04.adapting.md | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-) diff --git a/content/04.adapting.md b/content/04.adapting.md index a3c8238..19042a0 100644 --- a/content/04.adapting.md +++ b/content/04.adapting.md @@ -108,7 +108,7 @@ The adoption of RAG in the biomedical domain is driven by three core motivations #### Biomedical-Specific Retrieval System Design -The efficacy of a RAG system intrinsically depends on the quality of its retrieval component. The design of has evolved from naive lexical matching to domain-specific semantic architectures to address the unique linguistic complexities of biomedical terminology [@doi:10.48550/arXiv.2505.01146]. 
+The efficacy of a RAG system intrinsically depends on the quality of its retrieval component. The design has evolved from naive lexical matching to domain-specific semantic architectures to address the unique linguistic complexities of biomedical terminology [@doi:10.48550/arXiv.2505.01146]. @@ -124,9 +124,13 @@ Also, identifying distinctions between similar but diagnostically distinct medic +Beyond optimizing retrieval matching, processing long biomedical documents (e.g., genomic sequencing reports) requires specialized structural designs. Standard vector databases require text to be split into fixed-length segments (chunks). In medicine, this arbitrary segmentation destroys semantic integrity—separating a patient's symptoms from their diagnosis, or isolating a gene from its regulatory pathway. To preserve structural dependencies, modern biomedical RAG pipelines are shifting towards advanced segmentation strategies. Techniques like Parent-Child Retrieval [@doi:10.48550/arXiv.2401.18059] and Semantic Chunking [@doi:10.48550/arXiv.2312.06648] are increasingly utilized to ensure that complete biological mechanisms remain intact during chunking. Moreover, Graph-based RAG paradigms have emerged to bypass linear text segmentation entirely. Recent frameworks such as MedGraphRAG [@doi:10.48550/arXiv.2408.04187] extract medical documents into hierarchical knowledge graphs, linking information via biomedical entity relationships to enable multi-hop diagnostic reasoning. + + + #### Applications of RAG in Specific Biomedical Tasks -Under clinical settings, RAG primarily aims to enhance clinical question answering (QA) by grounding foundation models in evidence-based guidelines. MIRAGE benchmark and the MedRAG toolkit [@doi:10.18653/v1/2024.findings-acl.372] have comprehensively demonstrated the impact of retrieval augmentation. 
Across 41 combinations of corpora, retrievers, and LLMs, RAG improved the accuracy of various models by up to 18%, enabling smaller models like GPT-3.5 and Mixtral to rival the zero-shot performance of GPT-4. However, these benchmarks also highlight the "lost-in-the-middle" effect, echoing the context-dependent degradation seen in prompting-based adaptation. +Under clinical settings, RAG primarily aims to enhance clinical question answering (QA) by grounding foundation models in retrieved medical knowledge — including clinical guidelines, textbooks, and curated QA datasets. MIRAGE benchmark and the MedRAG toolkit [@doi:10.18653/v1/2024.findings-acl.372] have comprehensively demonstrated the impact of retrieval augmentation. Across 41 combinations of corpora, retrievers, and LLMs, RAG improved the accuracy of various models by up to 18%, enabling smaller models like GPT-3.5 and Mixtral to rival the zero-shot performance of GPT-4. However, these benchmarks also highlight the "lost-in-the-middle" effect, echoing the context-dependent degradation seen in prompting-based adaptation. @@ -139,3 +143,22 @@ In biology, RAG applications shift to querying highly structured bioinformatics Similarly, mitigating hallucinations in functional genomics remains critical, as fabricated gene functions can misguide biological experiments. GeneAgent [@doi:10.1038/s41592-025-02748-6] tackles this by introducing a self-verification language agent for gene set analysis. After generating initial functional annotations, GeneAgent automatically queries external biological databases to verify and fact-check its own outputs. This integration of retrieval augmentation and self-correction yields significantly higher biological consistency than standalone GPT-4. + + + + + +#### Limitations of RAG + +A core assumption of RAG is that the retrieved documents are inherently helpful. However, biomedical literature and clinical records are fraught with conflicting studies. 
When general-purpose LLMs are fed contradictory context—often termed "retrieval noise"—their reasoning ability can be seriously compromised. Retrieval noise actively disrupts the LLM's causal reasoning and leads to confusion. Recent work has proposed a 'self-reflection' mechanism, where the model evaluates the relevance of retrieved documents before generation [@doi:10.1093/bioinformatics/btae238]. However, such approaches remain preliminary. + + + +As discussed, the processing of long biomedical documents exposes an architectural flaw in standard RAG pipelines. While emerging GraphRAG and semantic chunking attempt to reconstruct contextual relationships, they introduce massive computational overhead. Preserving the structural dependencies in a computationally scalable way remains an inherent challenge. + + + +In time-critical healthcare settings (e.g., ICU), decision support must be instantaneous. However, RAG's cascading architecture introduces unavoidable latency bottlenecks. While RAG ensures evidence-based accuracy, its sequential processing delay severely restricts its viability for synchronous, point-of-care clinical deployments. 
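The latency concern above can be made concrete: a standard RAG cascade is strictly sequential (embed, search, rerank, generate), so end-to-end latency is the sum of every stage and the slowest stage sets the floor. The sketch below uses purely illustrative stage names and timings, not measurements from any cited system.

```python
# Minimal sketch of why a sequential RAG cascade bounds response time.
# Stage names and timings are hypothetical placeholders, not benchmarks.
STAGE_LATENCY_S = {
    "embed_query": 0.05,
    "vector_search": 0.15,
    "rerank": 0.40,
    "generate": 1.80,
}

def cascade_latency(stages: dict) -> float:
    """Sequential pipeline: total latency is the sum of all stage latencies."""
    return sum(stages.values())

total = cascade_latency(STAGE_LATENCY_S)
slowest = max(STAGE_LATENCY_S, key=STAGE_LATENCY_S.get)
# Even if retrieval were free, the generation stage alone keeps the total
# above its own latency; a serial chain cannot be hidden by faster hardware
# at any single stage.
```

Under these toy numbers a single cascade takes roughly 2.4 s per query, and iterative schemes such as i-MedRAG multiply this cost further, since each follow-up query pays for a full cascade of its own.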
+ + + From c8a5dedd01a0651325a8b34cf702cd6cade4c1f4 Mon Sep 17 00:00:00 2001 From: Zhiyuan Li <147009309+LZYEIL@users.noreply.github.com> Date: Wed, 29 Apr 2026 23:53:12 -0500 Subject: [PATCH 8/8] Revise Section4 --- content/04.adapting.md | 135 +++++------------------------------ 1 file changed, 15 insertions(+), 120 deletions(-) diff --git a/content/04.adapting.md b/content/04.adapting.md index 19042a0..21e02bd 100644 --- a/content/04.adapting.md +++ b/content/04.adapting.md @@ -1,164 +1,59 @@ ## Adapting General Foundation Models to Biomedical Tasks -*General: -How general-purpose foundation models (e.g., large language and vision models) are adapted to biomedical applications through prompting, fine-tuning, and tool use.* +A common assumption in biomedical AI holds that domain-specific pretraining is usually a prerequisite for clinical and biological competence. Recent evidence challenges this directly: Most medical vision-language models and LLMs fail to consistently outperform their general base models under standard prompting and fine-tuning regimes [@doi:10.48550/arXiv.2411.08870], and Llama-3-8B even surpasses the domain-pretrained MEDITRON-70B on multiple benchmarks. This implies that the bottleneck may not be biomedical knowledge, but how effectively that knowledge is elicited and behaviorally aligned. -One of the core assumptions of Domain-Adaptive Pretraining (DAPT) models is that foundation models can be pretrained on biomedical corpora in large scale to better process complex downstream tasks. However, recent empirical research [@doi:10.48550/arXiv.2411.08870] has raised a systematic challenge to this assumption: "all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question answering (QA)."
-In fact, modern general-purpose foundational models (such as Llama-3 and GPT-4) [@doi:10.48550/arXiv.2407.21783; @doi:10.48550/arXiv.2303.08774] have already internalized a certain level of biological and medical knowledge with reasoning capabilities: The Llama-3-8B foundational model outperforms the earlier specialized model MEDITRON-70B on multiple medical benchmarks [@doi:10.48550/arXiv.2411.08870]. -This prompts us to think that if the diminishing marginal returns of domain-specific pre-training become apparent, directly adapting powerful general-purpose foundational models to medical tasks will emerge as a more scalable approach. Compared to training from scratch, adapting general models offers three core advantages: +This section argues that adaptation methods reveal three things in sequence—that general models are far more capable than assumed, that the limits of adaptation are structural rather than engineering issues, and that the endpoint is not a better biomedical foundation model but a fundamentally different system architecture. -- **Transferable Emergent Abilities:** The complex logical reasoning and instruction-following capabilities that general models acquire from massive, heterogeneous data can directly generalize to clinical and biological tasks [@doi:10.48550/arXiv.2005.14165; @doi:10.48550/arXiv.2206.07682]. -- **Low Data Requirement:** By leveraging prompting or fine-tuning, they can mitigate the bottleneck of data scarcity in the biomedical field [@doi:10.1038/s41586-023-05881-4]. -- **Fast Iteration Speed:** The rapid pace at which the open-source community and industry update the architectures of foundation models (e.g., the Llama and Qwen series) unlocks new possibilities for the future [@doi:10.48550/arXiv.2001.08361]. -In this section, we propose a double-class taxonomy of adaptation: -- **By Parameter Intervention Level:** From prompting, to parameter-efficient fine-tuning (PEFT), and further to instruction tuning. 
-- **By Knowledge Injection Method:** With internalized parametric knowledge, retrieval-augmented generation (RAG), and the combination of external tools (Tool Use & Agents). +### The Underestimated Baseline +The most surprising finding across adaptation research is not how much work is required to make general models clinically useful, but how little. Zero- and few-shot prompting alone enable models to pass the USMLE without any domain-specific tuning [@doi:10.48550/arXiv.2303.13375], and structured clinical demonstrations yield further gains on comprehensive benchmarks [@doi:10.1038/s41586-023-06291-2]. In biochemistry, casting biological sequences into linguistic forms, whether by treating protein sequences as a translatable "second language" [@doi:10.48550/arXiv.2502.17504] or by leveraging prompt engineering to elicit molecular reasoning from SMILES representations [@doi:10.1021/acscentsci.4c01935], enables zero-shot functional inference and molecular property prediction without task-specific fine-tuning. Chain-of-thought prompting unlocks latent multi-step clinical reasoning, and the resulting paradox is revealing: AI offers no diagnostic advantage when queried informally by physicians, yet outperforms them when reasoning independently [@doi:10.1001/jamanetworkopen.2024.40969]. The bottleneck here is not model capability; it is the absence of reasoning structure in clinical workflows. -### Prompting-based Adaptation +Parameter-efficient fine-tuning (PEFT) sharpens this argument. Updating less than 1% of parameters via LoRA enables protein language models to exceed full fine-tuning performance across diverse prediction tasks while reducing training time 4.5-fold [@doi:10.1038/s41467-024-51844-2], and PEFT models even outperform full fine-tuning on protein-protein interaction tasks with two orders of magnitude fewer parameters [@doi:10.1073/pnas.2405840121].
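The parameter arithmetic behind these LoRA results can be checked directly. Below is a minimal NumPy sketch of a LoRA-style adapter; the layer dimensions and rank are illustrative choices, not those of any protein language model. It demonstrates the two properties the cited results rely on: a zero-initialized update leaves the base model's behavior untouched at the start of training, and the trainable fraction stays well under 1%.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 1024, 1024, 4             # illustrative sizes only

W = rng.standard_normal((d_in, d_out))        # frozen base weight
A = rng.standard_normal((d_in, rank)) * 0.01  # trainable down-projection
B = np.zeros((rank, d_out))                   # trainable up-projection, zero-init

def adapted_forward(x: np.ndarray) -> np.ndarray:
    """LoRA-style forward pass: base path plus low-rank update x @ A @ B."""
    return x @ W + x @ A @ B

x = rng.standard_normal((4, d_in))
# Zero-initialized B makes the adapter an exact no-op before training:
starts_at_base = np.allclose(adapted_forward(x), x @ W)

trainable = A.size + B.size          # 1024*4 + 4*1024 = 8,192 parameters
total = trainable + W.size           # versus 1,048,576 frozen base parameters
trainable_fraction = trainable / total
```

Here `trainable_fraction` is under 1% for a single layer, and in a real multi-layer model, where adapters typically attach only to attention projections, the fraction shrinks further.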
In clinical text summarization, lightly fine-tuned LLMs surpass human medical expert consensus [@doi:10.1038/s41591-024-02855-5]. Taken together, prompting and PEFT establish a clear prior: The biomedical capability of general foundation models has been systematically underestimated, and the marginal return of domain-specific pretraining is far lower than the field has assumed. -#### Zero-shot & Few-shot Prompting -Zero-shot and few-shot prompting are the fundamental paradigms for evaluating the “out-of-the-box” capabilities of general foundational models. In this setting, the model’s parameters remain completely frozen, and the model is required to perform specific tasks solely through the input prompts. Zero-shot does not provide any task examples to the model; few-shot, on the other hand, incorporates a small number of “input–output” examples to help the model understand the reasoning patterns [@doi:10.48550/arXiv.2005.14165]. In the biomedical field, this adaptation method gives a crucial window into the domain knowledge implicit in general-purpose models. +### Structural Limits -In clinical medicine, early studies demonstrated that modern large language models (LLMs) already possess substantial medical knowledge under zero-shot conditions. A common strategy is to convert clinical questions into standardized multiple-choice templates—explicitly formatting inputs as “Question – Options (A) to (E) – Answer”. This allows LLMs to process complex diagnostic scenarios in a consistent evaluation framework [@doi:10.48550/arXiv.2303.13375]. Such structured prompting have been shown to let models pass professional medical examinations such as the USMLE (United States Medical Licensing Examination) without any domain-specific fine-tuning, and even outperform earlier specialized medical systems in certain tasks. 
What's more, incorporating a small number of expert demonstrations in the prompt has proven highly effective for aligning model outputs with professional medical reasoning patterns. In particular, inserting three to five clinician-authored question–answer examples as few-shot demonstrations in the prompt prefix significantly improves performance on comprehensive benchmarks such as MultiMedQA [@doi:10.1038/s41586-023-06291-2]. These examples provide implicit guidance for both reasoning structure and output style, allowing the model to emulate real-world decision-making processes. +Yet adaptation has a ceiling: General-purpose pretraining objectives are misaligned with the precision, consistency, and accountability demands of biomedical reasoning. This is not a problem that better prompts or more LoRA layers can solve. -In the fields of molecular biology and chemistry, the primary challenge of prompting lies in mapping non-textual biological sequences into linguistic representations that large language models can process natively. Recent research has explored strategies for representing biological macromolecules as “language-like sequences,” enabling general-purpose LLMs to interpret biological data through purely text-based prompts. One notable approach treats protein sequences as a "second language." By structuring the prompt as a translation task (e.g., Directly inputting the amino acid sequence with an instruction: "Please translate the following protein sequence into its biological function"), models are enabled to infer underlying properties entirely zero-shot [@doi:10.48550/arXiv.2510.11188]. Alternatively, alignment between natural language and protein sequences can be achieved by dynamically concatenating knowledge-guided instructions (such as Gene Ontology definitions) with the target sequence in the prompt [@doi:10.18653/v1/2024.acl-long.62]. 
In the field of small-molecule chemistry, 1D molecular string representations are routinely utilized to construct few-shot prompts. By wrapping SMILES strings within natural language query templates (e.g., "Describe the pharmacological properties of the molecule: [SMILES]"), general-purpose language models can effectively perform molecular property prediction and chemical reasoning tasks [@doi:10.48550/arXiv.2406.06777]. +RAG exposes this clearly. Biomedical knowledge evolves faster than any model can be retrained: PubMed adds 1.5 million articles annually [@doi:10.48550/arXiv.2509.04304], and the effective half-life of clinical knowledge has shrunk to mere months [@doi:10.1136/bmjopen-2023-072374]. RAG therefore appears to be a scalable path, and when retrieval works, the gains are real: MedCPT, trained on 255 million PubMed user queries, enables accurate zero-shot biomedical retrieval [@doi:10.1093/bioinformatics/btad651], and high-quality RAG pipelines improve medical QA accuracy by ~18% [@doi:10.18653/v1/2024.findings-acl.372]. But the dependency on retrieval quality is itself a structural problem: General-purpose dense retrievers fail on clinical text due to domain shift, and biomedical literature with contradictory evidence can amplify retrieval noise [@doi:10.1093/bioinformatics/btae238]. More critically, retrieval is based on semantic similarity rather than methodological rigor, failing to adhere to the evidence hierarchy of Evidence-Based Medicine (EBM). Ultimately, RAG does not solve the knowledge grounding problem; it relocates it from the model to the retrieval system. +PEFT faces a mirror-image constraint: It can enhance latent knowledge, but cannot invent representations the model never learned. In low-resource clinical tasks involving rare diseases, LoRA adapters consistently fail to compensate for gaps in pretraining coverage [@doi:10.48550/arXiv.2407.19299].
This yields a sharp conclusion: PEFT renders domain-specific pretraining redundant when general pretraining already covers the relevant knowledge, and is itself ineffective when it does not. +Instruction tuning reveals the deepest tension in alignment. The goal of making a general model reason like a clinician runs into conflicting optimization objectives. Clinical expert preferences tend to be liability-aware and oriented toward deferral under uncertainty, whereas RLHF optimizes for helpfulness and fluency. These are not tunable parameters but competing design goals. BioMed-VITAL partially bridges this gap in the multimodal setting by embedding clinician preferences into instruction generation and selection [@doi:10.48550/arXiv.2406.13173], and Balanced Fine-Tuning sidesteps reward model complexity through token- and sample-level reweighting, outperforming SFT on sparse biomedical reasoning tasks [@doi:10.48550/arXiv.2511.21075]. But these are only patches over a structural mismatch. High-quality instruction datasets, such as BioInstruct's 25,000 GPT-4-generated clinical instructions that improve QA by 17.3% [@doi:10.1093/jamia/ocae122] and Mol-Instructions spanning 17 biomolecular tasks [@doi:10.48550/arXiv.2306.08018], raise the ceiling of alignment but do not remove it. The ceiling itself is locked by an irreconcilable tension: The objectives of optimization versus the norms of clinical accountability.
Instruction tuning, under this framing, is not a path to safe clinical AI; rather, it merely shows how far we can go before hitting that inherent bound. -In clinical settings, diagnostic accuracy and reliability are significantly enhanced when models are guided through structured cognitive processes. Beyond basic CoT, Self-consistency (SC) decoding strategies—which sample multiple independent reasoning paths and aggregate the most frequent final answer via majority voting—have proven essential for stabilizing medical reasoning [@doi:10.48550/arXiv.2203.11171]. This concept later evolved into Ensemble Refinement (ER) in specialized medical frameworks like Med-PaLM 2, where the model is prompted to refine its own final answer based on a synthesized review of multiple generated rationales [@doi:10.1038/s41591-024-03423-7]. Furthermore, diagnostic performance can be optimized by mimicking specific clinical styles, such as Differential, Analytical, and Bayesian reasoning prompts, which allow models to generate interpretable reasoning chains without sacrificing precision [@doi:10.1038/s41746-024-01010-1]. Practical implementations in radiology have shown that a two-step structured approach—first categorizing clinical information into history and imaging findings, then performing synthesis—outperforms standard CoT in diagnostic accuracy for complex cases [@doi:10.1007/s11604-024-01712-2]. The potential of such approaches is further explored in randomized clinical trials; while implementations in which physicians freely query LLMs show no significant gains in diagnostic reasoning, the fact that standalone LLMs can exceed physician performance highlights a promising opportunity: systematically integrating structured-reasoning prompting into human-AI workflows to advance clinical impact[@doi:10.1001/jamanetworkopen.2024.40969]. -In genomics, structured reasoning is often used to decompose high-dimensional data into interpretable insights. 
Rather than direct functional prediction, a "decompose-then-analyze" CoT strategy is employed for gene set functional discovery. This involves guiding the model to analyze individual gene functions and their biological intersections before assigning a name to a gene set, a method that has successfully identified novel functional modules missed by traditional enrichment analysis [@doi:10.1038/s41592-024-02525-x]. For biological summaries, structured scoring prompts—which require models to evaluate candidates across multiple dimensions such as biomarker relevance, therapeutic value, and biological significance—transform qualitative knowledge into quantitative decision-making frameworks [@doi:10.1186/s12967-023-04576-8]. These structured prompting strategies demonstrate that general-purpose models can move beyond standard clinical texts generation to perform knowledge-driven biological inference. +### Toward a Decentralized Architecture +The pattern across all adaptation methods points to the same conclusion: A single model, however trained, fine-tuned, or aligned cannot simultaneously maintain up-to-date knowledge, reason multi-step with precision, and operate reliably under distribution shift. These are not properties of any single model; they are properties of systems. -#### Multi-modal Prompting -Beyond textual sequences, in medical image analysis, the paradigm of "promptable segmentation" has redefined zero-shot interaction. Rather than being confined to predefined anatomical categories, models like the Segment Anything Model (SAM) utilize discrete spatial prompts—such as points, bounding boxes, or masks—to redirect model attention and isolate arbitrary anatomical structures [@doi:10.48550/arXiv.2304.02643]. 
However, systematic evaluations across MRI, CT, and ultrasound modalities reveal that while zero-shot SAM performs impressively for well-circumscribed objects, its performance is highly variable and drops significantly in ambiguous scenarios, such as brain tumor segmentation [@doi:10.1016/j.media.2023.102918]. For diagnostic classification, the "label-to-prompt" framework established by CLIP, where discrete categories are wrapped into natural language templates, has paved the way for zero-shot pathology recognition without relying on standard supervised classifiers. [@doi:10.48550/arXiv.2103.00020]. +The endpoint should be a decentralized architecture in which general foundation models serve as reasoning orchestrators, and tools help to connect them to the external world. This is the architecture that agent-based systems are already beginning to instantiate. -For high-level clinical reasoning, interleaved text-image prompting enables general Multimodal Large Language Models (MLLMs) to activate latent medical expertise directly from raw visual inputs. To enhance the interpretability of these zero-shot decisions, strategies like MedCoT (Medical Chain of Thought) utilize hierarchical expert prompts—simulating a process of initial diagnosis, review verification, and expert consensus—to generate explicit reasoning chains for medical VQA [@doi:10.18533/v1/2024.emnlp-main.962]. Systematic evaluations of general models further confirm that prompt engineering can successfully elicit high-quality medical image understanding for report generation and VQA. However, tasks requiring precise medical visual grounding still need substantial improvement, and that conventional metrics often fail to capture clinical efficacy [@doi:10.48550/arXiv.2310.20381]. + In chemistry, ChemCrow demonstrated that equipping a general LLM with 18 professional tools enables autonomous synthesis planning comparable to expert scientists without any pretraining [@doi:10.1038/s42256-023-00788-1]. 
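The orchestration pattern behind such agents is, at its core, a dispatch loop: the model selects a named tool, and the tool, rather than the model's parametric memory, supplies the answer. The toy sketch below substitutes keyword routing and stub tools for ChemCrow's LLM-driven selection and real chemistry tools; every name in it is hypothetical.

```python
from typing import Callable, Dict

# Hypothetical tool registry. Real agents expose tool names and descriptions
# to the LLM and let it choose; keyword matching here is a stand-in.
TOOLS: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Decorator that adds a function to the tool registry under `name`."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register("molar_mass")
def molar_mass(query: str) -> str:
    # Stub standing in for a real chemistry calculator (handles water only).
    return "18.02 g/mol" if "water" in query else "unknown"

@register("literature_search")
def literature_search(query: str) -> str:
    return f"top hit for: {query}"

def agent_step(query: str) -> str:
    """Orchestrator: route the query to a tool, then ground the answer in it."""
    name = "molar_mass" if "molar mass" in query else "literature_search"
    return f"[{name}] {TOOLS[name](query)}"

answer = agent_step("what is the molar mass of water?")
```

The division of labor is the point: the orchestrating model needs no chemistry weights of its own, because correctness lives in the tools it calls.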
The same principle extends across the full arc of biological discovery, from self-verifying gene-set annotation agents [@doi:10.1038/s41592-025-02748-6], to multi-agent frameworks that autonomously cycle through hypothesis generation, experimental design, and result interpretation [@doi:10.1016/j.cell.2024.09.022]. In clinical settings, EHRAgent converts natural-language queries into executable code for complex EHR reasoning, improving success rates by 29.6% over direct prompting [@doi:10.18653/v1/2024.emnlp-main.1245], not by knowing more medicine, but by knowing how to use the right tools. - -#### Limitations of Prompting - -Despite the flexibility of prompting-based adaptation, it introduces several structural limitations that compromise its reliability in biomedical environments. - - - -First, prompting-based adaptation suffers from extreme sensitivity. Few-shot performance is highly susceptible to the permutation order of examples, where even a slight reordering can lead to random guessing performance [@doi:10.18653/v1/2022.acl-long.556]. In clinical Natural Language Processing (NLP), evaluations across multiple prompt types demonstrate that zero-shot is sensitive to task-specific nuances. Without task-specific tailoring, there remains a significant performance gap when facing different clinical scenarios [@doi:10.2196/55318]. - - - - -Medical hallucinations also remains a fundamental challenge. Benchmarks like Med-HALT reveal that models frequently generate plausible yet unverified information, exposing significant flaws in their reasoning capabilities [@doi:10.48550/arXiv.2307.15343]. Furthermore, evaluations on real-world healthcare queries (MedHalu) demonstrate that LLMs are also profoundly vulnerable to these errors—often underperforming human experts in detecting hallucinations within their own generated responses [@doi:10.48550/arXiv.2409.19492]. - - - -Finally, the processing of biomedical data is hindered by context-dependent degradation. 
General models universally exhibit a "lost in the middle" phenomenon, where they robustly utilize information at the beginning of a long input context but fail to extract critical features from the center [@doi:10.1162/tacl_a_00638]. When models attempt to leverage long patient contexts, they suffer from underlying memorization issues and struggle severely with cases requiring temporal reasoning [@doi:10.48550/arXiv.2510.18691]. - - - -The structural limitations underscore the necessity of moving beyond pure prompting toward Retrieval-Augmented Generation (RAG). - - - - - -### Retrieval-Augmented Generation (RAG) - -To mitigate previous problems, Retrieval-Augmented Generation (RAG) introduces a paradigm that combines zero parameter modification with external knowledge injection. This "retrieve-and-generate" framework connects a pre-trained model’s parametric memory with a non-parametric memory (such as a dense vector index of Wikipedia). By grounding generation in retrieved documents, this approach significantly improves factual faithfulness [@doi:10.48550/arXiv.2005.11401]. - - - -#### Motivation for RAG in Biomedicine - -The adoption of RAG in the biomedical domain is driven by three core motivations: - -- **Literature Explosion:** The sheer volume of biomedical research far exceeds the capacity for foundation models to be frequently re-trained. PubMed currently contains over 37 million citations, with roughly 1.5 million new articles added each year. As a result, continuous pre-training to internalize this massive influx of literature is computationally difficult [@doi:10.48550/arXiv.2501.07171]. -- **The Criticality of Medical Timeliness:** Biomedical knowledge is highly dynamic. The "half-life" of medical knowledge—the time it takes for half of what is currently known to be proven false—has drastically shortened, dropping from 50 years in the 1950s to an estimated 73 days in 2020 [@pmid:21686208]. 
Clinical guidelines are continuously updated (e.g., pandemic protocols), and since LLMs have a fixed pre-training data cutoff, their static parametric memory becomes obsolete when the training ends. RAG bypasses this limitation by fetching real-time data, ensuring that clinical decisions are grounded in the current evidence. - - - -#### Biomedical-Specific Retrieval System Design - -The efficacy of a RAG system intrinsically depends on the quality of its retrieval component. The design has evolved from naive lexical matching to domain-specific semantic architectures to address the unique linguistic complexities of biomedical terminology [@doi:10.48550/arXiv.2505.01146]. - - - -Historically, sparse retrieval methods based on probabilistic relevance frameworks, such as BM25 [@doi:10.1561/1500000019], have served as the baseline for document fetching. While computationally efficient, sparse retrieval struggles with the severe synonymy in biomedicine. To overcome lexical limitations, Dense Passage Retrieval (DPR) [@doi:10.18653/v1/2020.emnlp-main.550] introduced a dual-encoder architecture that maps both queries and documents into a shared continuous vector space. However, general-domain dense retrievers often experience performance degradation when applied out-of-the-box to medical corpora. As exposed by heterogeneous benchmarks like BEIR [@doi:10.48550/arXiv.2104.08663], general dense models even underperform traditional sparse methods like BM25 on clinical datasets due to domain shift. - - - -To bridge the gap, advanced data construction and domain-specific pre-training are strictly required. A prominent milestone is MedCPT (Contrastive Pre-trained Transformers for Medicine) [@doi:10.1093/bioinformatics/btad651]. 
It achieved SOTA zero-shot performance by pre-training a retriever-reranker architecture on an unprecedented scale of real-world PubMed user search logs (over 255 million interactions), proving that domain-specific contrastive alignment is far more efficient than scaling general parameters. - - - -Also, identifying distinctions between similar but diagnostically distinct medical documents remains a core bottleneck. Recent frameworks like BiCA [@doi:10.48550/arXiv.2511.08029] tackle this by proposing citation-aware hard negative mining. By utilizing multi-hop citation links within PubMed to generate contextually irrelevant negative samples, BiCA compels the dense retriever to learn highly precise semantic boundaries, paving the path toward highly data-efficient domain adaptation in biomedical RAG systems. - - - -Beyond optimizing retrieval matching, processing long biomedical documents (e.g., genomic sequencing reports) requires specialized structural designs. Standard vector databases require text to be split into fixed-length segments (chunks). In medicine, this arbitrary segmentation destroys semantic integrity—separating a patient's symptoms from their diagnosis, or isolating a gene from its regulatory pathway. To preserve structural dependencies, modern biomedical RAG pipelines are shifting towards advanced segmentation strategies. Techniques like Parent-Child Retrieval [@doi:10.48550/arXiv.2401.18059] and Semantic Chunking [@doi:10.48550/arXiv.2312.06648] are increasingly utilized to ensure that complete biological mechanisms remain intact during chunking. Moreover, Graph-based RAG paradigms have emerged to bypass linear text segmentation entirely. Recent frameworks such as MedGraphRAG [@doi:10.48550/arXiv.2408.04187] extract medical documents into hierarchical knowledge graphs, linking information via biomedical entity relationships to enable multi-hop diagnostic reasoning. 
#### Applications of RAG in Specific Biomedical Tasks

In clinical settings, RAG primarily aims to enhance clinical question answering (QA) by grounding foundation models in retrieved medical knowledge, including clinical guidelines, textbooks, and curated QA datasets. The MIRAGE benchmark and the MedRAG toolkit [@doi:10.18653/v1/2024.findings-acl.372] have comprehensively demonstrated the impact of retrieval augmentation: across 41 combinations of corpora, retrievers, and LLMs, RAG improved the accuracy of various models by up to 18%, enabling smaller models like GPT-3.5 and Mixtral to rival the zero-shot performance of GPT-4. However, these benchmarks also highlight the "lost-in-the-middle" effect, echoing the context-dependent degradation seen in prompting-based adaptation.

To translate these performance gains into practical tools, frameworks like Almanac [@doi:10.1056/AIoa2300068] integrate manually curated clinical guidelines and treatment recommendations as high-quality retrieval sources. Evaluated blindly by clinicians across hundreds of clinical queries, Almanac significantly outperformed standard search-augmented LLMs (e.g., ChatGPT-4, Bing, and Bard) in factual accuracy, completeness, and adversarial safety. Furthermore, to address the multi-step nature of complex medical diagnosis, i-MedRAG [@doi:10.48550/arXiv.2408.00727] introduces an iterative follow-up mechanism: the model autonomously generates subsequent queries based on the initially retrieved context, overcoming the single-turn information bottleneck inherent in standard medical RAG systems.

In biology, RAG applications shift toward querying highly structured bioinformatics databases. This structural shift often blurs the boundary between traditional RAG and tool-augmented LLMs (a concept further explored in Section *'Tool Use and LLM-based Agents'*).
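Concretely, querying such structured resources means issuing parameterized API calls rather than embedding free text. A minimal offline sketch, assuming the public NCBI E-utilities `esearch` endpoint and Entrez field tags (`[sym]`, `[orgn]`); the helper name is ours, and the URL is constructed but deliberately never sent:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_gene_search_url(symbol, organism="Homo sapiens"):
    """Build an E-utilities esearch URL that looks up a gene symbol
    in the NCBI Gene database, restricted to one organism."""
    params = {
        "db": "gene",
        "term": f"{symbol}[sym] AND {organism}[orgn]",
        "retmode": "json",
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

# The agent would emit this call as text, execute it, and condition
# its answer on the returned records rather than parametric memory.
print(build_gene_search_url("BRCA1"))
```

Systems in this family replace the retriever's similarity search with exact, schema-aware lookups, trading recall over free text for the precision of a curated database.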
GeneGPT [@doi:10.1093/bioinformatics/btae075] bypasses standard text retrieval by teaching LLMs to execute API calls to NCBI databases (e.g., for gene queries and sequence searches). This tool-augmented RAG paradigm significantly outperforms parametric baselines on genomics benchmarks like GeneTuring.

Similarly, mitigating hallucinations in functional genomics remains critical, as fabricated gene functions can misguide biological experiments. GeneAgent [@doi:10.1038/s41592-025-02748-6] addresses this by introducing a self-verifying language agent for gene set analysis: after generating initial functional annotations, GeneAgent automatically queries external biological databases to fact-check its own outputs. This integration of retrieval augmentation and self-correction yields significantly higher biological consistency than standalone GPT-4.

#### Limitations of RAG

A core assumption of RAG is that the retrieved documents are inherently helpful. However, biomedical literature and clinical records are fraught with conflicting studies. When general-purpose LLMs are fed contradictory context, often termed "retrieval noise", their reasoning can be seriously compromised: the noise actively disrupts the model's causal reasoning and leads to confusion. Recent work has proposed a self-reflection mechanism in which the model evaluates the relevance of retrieved documents before generation [@doi:10.1093/bioinformatics/btae238], but such approaches remain preliminary.

As discussed, processing long biomedical documents exposes an architectural flaw in standard RAG pipelines. While emerging GraphRAG and semantic chunking approaches attempt to reconstruct contextual relationships, they introduce substantial computational overhead; preserving structural dependencies in a computationally scalable way remains an open challenge.

In time-critical healthcare settings (e.g., the ICU), decision support must be near-instantaneous.
However, RAG's cascading architecture introduces unavoidable latency: retrieval, reranking, and generation run sequentially. While RAG ensures evidence-based accuracy, this sequential processing delay restricts its viability for synchronous, point-of-care clinical deployments.

In these systems, domain knowledge is no longer baked into model weights but externalized into tools that any sufficiently capable general model can orchestrate [@doi:10.1038/s42256-024-00944-1]. This reframes the central question of this perspective. Domain-specific pretraining is not obsolete, but its role narrows: from the primary source of biomedical competence to one component among many in a larger architecture that no single model was ever meant to carry alone.