A research implementation exploring Retrieval-Augmented Generation (RAG) approaches for efficient external tool selection in Large Language Models, with planned extensions using hybrid search techniques.
As LLMs integrate with growing toolsets through protocols like Model Context Protocol (MCP), prompt bloat becomes a critical issue:
- Including all tool descriptions in prompts overwhelms the LLM's context window
- Tool selection accuracy drops dramatically (from ~90% → 13.62% as tools scale)
- Token costs and latency increase proportionally
- Decision complexity causes model confusion and hallucinations
Example: An LLM with access to 1,000 tools cannot efficiently determine which tool to use for "Find recent papers about climate change" when all 1,000 tool descriptions are in the prompt.
This project implements and compares 7 different approaches to tool selection for LLMs:
Pure Retrieval Methods (No LLM):
- Dense Retrieval Only (top-1) - Cosine similarity on embeddings, select top-1 tool
- BM25 Only (top-1) - Lexical search, select top-1 tool
- BM25 + Dense Hybrid (top-1) - RRF fusion of BM25 + Dense, select top-1 (ablation study)
LLM-Based Methods:
4. LLM Only (Full Context) - All tools provided to LLM (naive MCP baseline; demonstrates prompt bloat)
5. Dense Retrieval + LLM (top-k) - RAG-MCP: embedding-based retrieval → LLM selects from top-k
6. BM25 + LLM (top-k) - BM25 retrieval → LLM selects from top-k
7. Hybrid Retrieval + LLM (top-k) - Combined dense + BM25 retrieval → LLM selects from top-k
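The split matters because approaches 5-7 all share one selection step: the retriever narrows the toolset to k candidates, and only those k descriptions reach the model. A minimal sketch of that prompt construction (illustrative only; the project's actual `llm_selector.py` may format the prompt differently):

```python
def build_selection_prompt(query, candidates):
    """Build a compact tool-selection prompt from the top-k retrieved tools.

    candidates: list of (tool_name, description) pairs from the retriever.
    Only these k descriptions enter the LLM context, instead of all 297 tools.
    """
    lines = [f"User query: {query}", "", "Candidate tools:"]
    for i, (name, desc) in enumerate(candidates, start=1):
        lines.append(f"{i}. {name}: {desc}")
    lines.append("")
    lines.append("Answer with the number of the single best tool.")
    return "\n".join(lines)
```

With k = 5 the prompt carries five short descriptions rather than hundreds, which is where the >50% token reduction reported below comes from.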
Key Benefits:
- Systematic comparison from pure retrieval to hybrid approaches
- Demonstrates trade-offs between speed, accuracy, and context efficiency
- Validates RAG-MCP methodology and explores improvements
- Ablation study (Approach 3) isolates retrieval quality from LLM contribution
Based on Gan & Sun (2025) and our experimental design:
| Approach | Accuracy (Expected) | Token Usage | Latency | k Values Tested | Notes |
|---|---|---|---|---|---|
| 1. Dense Retrieval Only | Low (~20-30%) | Minimal (0 LLM tokens) | Fastest | Retrieve top-7, select top-1 | Reports Recall@1/3/5/7 |
| 2. BM25 Only | Low (~15-25%) | Minimal (0 LLM tokens) | Fastest | Retrieve top-7, select top-1 | Reports Recall@1/3/5/7 |
| 3. BM25 + Dense (No LLM) | Medium (~25-35%) | Minimal (0 LLM tokens) | Fastest | Retrieve top-7, select top-1 | Ablation: Hybrid retrieval quality |
| 4. LLM Only (Full Context) | ~13% | Highest (100% baseline) | Context Length Failure | All 297 tools | Prompt bloat - exceeds context window |
| 5. Dense Retrieval + LLM | ~43% | ~50% reduction | Fast | k = 3, 5, 7 | RAG-MCP from paper |
| 6. BM25 + LLM | ~35-40% | ~50% reduction | Fast | k = 3, 5, 7 | Lexical filtering |
| 7. Hybrid Retrieval + LLM | >50% (goal) | ~50% reduction | Fast | k = 3, 5, 7 | Best of both worlds |
Key Hypothesis: Hybrid approach (7) should outperform both pure retrieval and single-retrieval methods by combining semantic understanding with keyword matching.
Note on Approach 4: LLM Only fails on datasets with 200+ tools due to context length limitations. With 297 tools, the prompt exceeds Ollama's context window, causing all queries to fail. This validates the need for retrieval-based filtering.
```
                     User Query
                         |
         +---------------+---------------+
         |               |               |
  Dense Retrieval    BM25 Search    Hybrid Fusion
   (Embeddings)       (Lexical)   (Both Combined)
         |               |               |
         +---------------+---------------+
                 |               |
         Direct Selection   LLM Selection
          (top-1 only)    (top-k reasoning)
                 |               |
                 v               v
          Tool Selection   Tool Selection
```
Components:
- Tool Indexer:
- Dense: Embeds tool descriptions into vector space (FAISS)
- Sparse: BM25 index for keyword matching
- Retriever: Multiple strategies (dense, BM25, hybrid)
- LLM Selector: Optional reasoning layer for top-k candidates
- Evaluator: Measures accuracy, token usage, and latency across all 7 approaches
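The sparse side of the Tool Indexer is handled by `rank-bm25` in the project; as a reference point, the Okapi BM25 score it computes can be sketched in pure Python (the defaults k1=1.5, b=0.75 are the commonly used values, not necessarily this project's settings):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    docs_tokens: list of token lists (one per tool description).
    Returns one score per document; higher means a better lexical match.
    """
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # document frequency: how many descriptions contain each term
    df = Counter()
    for d in docs_tokens:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Approach 2 (BM25 Only) amounts to running this scoring over all tool descriptions and taking the argmax as the top-1 selection.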
Comparison Level:
- Server-level comparison (not individual tool level)
- Evaluates if the approach selects tools from the correct server
Accuracy Metrics:
- Accuracy: Top-1 server selection correctness (%)
- Recall@k: Is correct server in top-k candidates? (k = 1, 3, 5, 7)
- Mean Reciprocal Rank (MRR): Average rank of correct server (0-1, higher is better)
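Both ranking metrics are straightforward to compute from a ranked candidate list; a minimal sketch (hypothetical helper names; the project's `benchmarker.py` is the authoritative implementation):

```python
def recall_at_k(ranked, correct, k):
    """1.0 if the correct server appears in the top-k candidates, else 0.0."""
    return 1.0 if correct in ranked[:k] else 0.0

def mean_reciprocal_rank(results):
    """Average reciprocal rank of the correct server across queries.

    results: list of (ranked_server_list, correct_server) pairs.
    Reciprocal rank is 1/position of the correct server (0 if absent),
    so MRR lies in [0, 1] and higher is better.
    """
    total = 0.0
    for ranked, correct in results:
        if correct in ranked:
            total += 1.0 / (ranked.index(correct) + 1)
    return total / len(results)
```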
Efficiency Metrics (LLM approaches only):
- Average Prompt Tokens
- Average Completion Tokens
- Total Token Usage
- Token Reduction vs Baseline (%)
- Cost per Query ($)
Latency Metrics:
- Total Query Latency (seconds)
- Retrieval Latency (approaches 1-3, 5-7)
- LLM Inference Latency (approaches 4-7)
This project is based on:
Primary Paper:
Gan, T., & Sun, Q. (2025). RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation. arXiv preprint arXiv:2505.03275.
Supporting Work:
- Luo et al. (2025) - MCPBench evaluation framework
- Lewis et al. (2020) - RAG foundations
- Gao et al. (2023) - Hybrid retrieval survey
Key Insights:
- Tool selection degrades significantly as toolsets scale (13.62% accuracy at 11,100 tools)
- Semantic retrieval restores accuracy to ~43% while reducing tokens by >50%
- Hybrid approaches may further improve by combining semantic + keyword matching
Enhancing_RAG_MCP/
├── src/
│ ├── indexing/ # Index building components
│ │ ├── tool_indexer.py # Dense embeddings (FAISS)
│ │ └── bm25_indexer.py # Sparse BM25 index
│ ├── retrieval/ # Retrieval components
│ │ ├── dense_retriever.py # Dense/semantic retrieval
│ │ ├── bm25_retriever.py # Sparse/lexical retrieval
│ │ └── hybrid_retriever.py # Hybrid fusion (RRF)
│ ├── approaches/ # Core implementations of 7 approaches
│ │ ├── dense_only.py # Approach 1: Dense Retrieval Only
│ │ ├── bm25_only.py # Approach 2: BM25 Only
│ │ ├── bm25_plus_dense.py # Approach 3: BM25 + Dense (No LLM)
│ │ ├── llm_only.py # Approach 4: LLM Only (Full Context)
│ │ ├── dense_llm.py # Approach 5: Dense + LLM
│ │ ├── bm25_llm.py # Approach 6: BM25 + LLM
│ │ └── llm_hybrid.py # Approach 7: Hybrid + LLM
│ └── llm/ # LLM integration
│ └── llm_selector.py # LLM tool selection logic
├── benchmarking/ # Evaluation framework
│ └── benchmarker.py # Unified benchmarking suite
├── data/
│ ├── tools/ # Tool definitions (JSON)
│ ├── queries/ # Test queries with ground truth
│ ├── indexes/ # Pre-built FAISS and BM25 indexes
│ └── results/ # Experiment results
└── tests/ # Unit and integration tests
Core Libraries:
- `sentence-transformers` - Dense embeddings (semantic search)
- `faiss-cpu` - Vector similarity search
- `rank-bm25` - Sparse lexical search (BM25)
- `vllm` or `ollama` - Open-source LLM serving
- `pandas`, `numpy` - Data processing
Retrieval Components:
- Dense: `all-MiniLM-L6-v2` (fast baseline) or `all-mpnet-base-v2` (higher quality)
- Sparse: BM25 with custom tokenization
- Hybrid: Reciprocal Rank Fusion (RRF) or weighted combination
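Reciprocal Rank Fusion needs only ranks, not normalized scores, which is why it combines BM25 and cosine-similarity rankings cleanly. A minimal sketch, assuming the constant k=60 commonly used in the RRF literature (the project's actual fusion lives in `hybrid_retriever.py`):

```python
def rrf_fuse(dense_ranking, bm25_ranking, k=60):
    """Fuse two ranked lists of tool ids with Reciprocal Rank Fusion.

    Each tool's fused score is the sum of 1/(k + rank) over the rankings
    that contain it; higher is better. The constant k dampens the
    dominance of the very top ranks.
    """
    scores = {}
    for ranking in (dense_ranking, bm25_ranking):
        for rank, tool in enumerate(ranking, start=1):
            scores[tool] = scores.get(tool, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A tool ranked well by both retrievers (like "b" below, ranked 2nd and 1st) beats a tool that tops only one list, which is the behavior the hybrid approach relies on.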
LLMs (Self-Hosted):
- Primary: Mistral 7B Instruct / Mixtral 8x7B Instruct
- Alternative: Qwen2.5-7B-Instruct / LLaMA 3.1-8B-Instruct
- Deployment: vLLM server via SSH (GPU-accelerated)
Infrastructure:
- Remote GPU server access via SSH
- Model serving: vLLM / Text Generation Inference / Ollama
- GPU Requirements: 40GB+ VRAM (A100 or equivalent)
1. Install Dependencies:

```shell
pip install -r requirements.txt
```

2. Install Ollama for Local Testing:

```shell
# See Ollama_Setup_Guide.md for detailed instructions
ollama run mistral:7b-instruct-q4_0
```

3. Test Individual Approaches:

```shell
# Test Approach 1 (Dense Retrieval Only - no LLM needed)
python src/approaches/dense_only.py

# Test Approach 2 (BM25 Only - no LLM needed)
python src/approaches/bm25_only.py

# Test Approach 3 (BM25 + Dense Hybrid - no LLM needed)
python src/approaches/bm25_plus_dense.py

# Test Approaches 4-7 (LLM-based - requires Ollama running)
python src/approaches/llm_only.py
python src/approaches/dense_llm.py
python src/approaches/bm25_llm.py
python src/approaches/llm_hybrid.py
```

4. Run on HPC Cluster (after local testing):

```shell
# Create benchmarking script
python scripts/run_full_benchmark.py

# Submit SLURM job
sbatch scripts/submit_benchmark.sh
```

| Environment | Purpose | Dataset | LLM Serving |
|---|---|---|---|
| Local | Development & debugging | Sample tools (3-5) | Ollama (CPU/small GPU) |
| HPC | Final benchmarking | Full dataset (200+ tools) | vLLM (A100 GPU) |
Implementation Progress:
- Project setup and infrastructure
- Dense retrieval implementation (FAISS + embeddings)
- BM25 retrieval implementation (sparse lexical search)
- Hybrid retrieval implementation (RRF fusion)
- LLM integration (vLLM + Ollama support, multi-tool selection)
- Approach 1: Dense Retrieval Only
- Approach 2: BM25 Only
- Approach 3: BM25 + Dense Hybrid (ablation study)
- Approach 4: LLM Only (Full Context) - Context length failure with 297 tools
- Approach 5: Dense + LLM (RAG-MCP)
- Approach 6: BM25 + LLM
- Approach 7: Hybrid Retrieval + LLM
- Unified benchmarking script for all 7 approaches
- Recall@k and MRR metrics implementation
- HPC cluster deployment & large-scale evaluation
Development Workflow:
Phase 1: Local Development (Current)
- Build all 7 approaches with modular design
- Test each approach locally with Ollama (3-5 sample tools)
- Debug and validate implementation
- Commit each working approach
Phase 2: HPC Benchmarking (After all approaches complete)
- Create unified benchmarking script
- Prepare SLURM job submission script
- Deploy to university HPC cluster
- Run comprehensive evaluation on full dataset (200+ tools)
- Collect results and perform analysis