-
Notifications
You must be signed in to change notification settings - Fork 16
Agent Skill BM25 Search Guide
Rick Hightower edited this page Feb 1, 2026
·
2 revisions
BM25 (Best Matching 25) is a keyword-based search algorithm that finds exact term matches in your indexed documents. It's excellent for technical queries where you need precise word matching rather than semantic understanding.
Choose BM25 when:
- Looking for specific function names, class names, or API endpoints
- Searching for error codes, status codes, or technical terms
- Need exact word matches rather than conceptual similarity
- Working with code documentation, API references, or technical specifications
- The query contains specific technical jargon or identifiers
Examples of BM25 queries:
-
"AuthenticationError"- Find exact error class references -
"HTTP 404"- Find status code documentation -
"recursiveCharacterTextSplitter"- Find specific function names -
"OAuth2 flow"- Find exact OAuth implementation details
# Basic BM25 search
agent-brain query "your exact terms" --mode bm25
# With custom threshold (lower for more results)
agent-brain query "functionName" --mode bm25 --threshold 0.1
# With result count limit
agent-brain query "error code" --mode bm25 --top-k 10# POST /query endpoint
curl -X POST http://localhost:8000/query/ \
-H "Content-Type: application/json" \
-d '{
"query": "AuthenticationError",
"mode": "bm25",
"threshold": 0.2,
"top_k": 5
}'| Option | Default | Description | Use Case |
|---|---|---|---|
--mode bm25 |
Required | Selects BM25 algorithm | All BM25 searches |
--threshold F |
0.7 | Minimum relevance score (0.0-1.0) | Lower for more results, higher for precision |
--top-k N |
5 | Maximum results to return | Increase for comprehensive results |
BM25 Advantages:
- ⚡ Fast: ~10-20ms response time
- 🎯 Precise: Finds exact word matches
- 🔍 Predictable: Results based on term frequency and document length
- 💾 Lightweight: No API keys required
When BM25 is better than Hybrid:
- Searching for specific identifiers (function names, error codes)
- Technical documentation with exact terminology
- When you need guaranteed exact matches
- Performance-critical applications
When BM25 is better than Vector:
- Non-English text or technical jargon
- When semantic meaning could be misleading
- Code search and technical documentation
- Exact string matching requirements
BM25 scores documents based on:
- Term Frequency (TF): How often the search term appears
- Inverse Document Frequency (IDF): How rare the term is across documents
- Document Length Normalization: Shorter documents score higher for same term frequency
Formula: score = Σ IDF(q_i) × (TF(q_i,D) × (k₁ + 1)) / (TF(q_i,D) + k₁ × (1 - b + b × |D|/avgDL))
Where:
-
q_i: Query terms -
D: Document -
k₁ = 1.5(term frequency saturation) -
b = 0.75(length normalization factor)
Query: agent-brain query "recursiveCharacterTextSplitter" --mode bm25
Response:
{
"results": [
{
"text": "The recursiveCharacterTextSplitter splits text recursively using character separators...",
"source": "/docs/api/text-splitters.md",
"score": 0.85,
"vector_score": null,
"bm25_score": 0.85,
"chunk_id": "chunk_123",
"metadata": {
"file_name": "text-splitters.md",
"chunk_index": 0
}
}
],
"query_time_ms": 12.5,
"total_results": 1
}Query: agent-brain query "HTTP 404" --mode bm25
Response:
{
"results": [
{
"text": "HTTP 404 Not Found indicates the requested resource could not be found...",
"source": "/docs/api/http-status-codes.md",
"score": 0.92,
"vector_score": null,
"bm25_score": 0.92,
"chunk_id": "chunk_456",
"metadata": {
"file_name": "http-status-codes.md",
"chunk_index": 2
}
},
{
"text": "404 errors commonly occur when:\n- URL is mistyped\n- Resource was deleted...",
"source": "/docs/troubleshooting/404-errors.md",
"score": 0.78,
"vector_score": null,
"bm25_score": 0.78,
"chunk_id": "chunk_789",
"metadata": {
"file_name": "404-errors.md",
"chunk_index": 0
}
}
],
"query_time_ms": 15.2,
"total_results": 2
}- Response Time: 10-50ms (fastest of all modes)
- CPU Usage: Low (pure algorithmic scoring)
- Memory Usage: Minimal (uses pre-built BM25 index)
- Scalability: Excellent (index built once, queried many times)
- Use exact terms: BM25 works best with specific words, not general concepts
-
Lower threshold for technical searches: Technical docs may need
threshold 0.1-0.3 - Combine with domain knowledge: Know what terms are likely to appear in your docs
- Use for code search: Perfect for finding function definitions, class names, imports
- No results found: Try lowering the threshold or using different terminology
- Too many results: Increase threshold or add more specific terms
-
Index not ready: Ensure documents are indexed before searching (
agent-brain status)
#!/bin/bash
# Search for specific error codes
agent-brain query "$1" --mode bm25 --json | jq '.results[0].text'# Find all mentions of a function
agent-brain query "myFunction" --mode bm25 --json | jq -r '.results[].source'import requests
response = requests.post('http://localhost:8000/query/', json={
'query': 'AuthenticationError',
'mode': 'bm25',
'threshold': 0.2
})
results = response.json()['results']- Design-Architecture-Overview
- Design-Query-Architecture
- Design-Storage-Architecture
- Design-Class-Diagrams
- GraphRAG-Guide
- Agent-Skill-Hybrid-Search-Guide
- Agent-Skill-Graph-Search-Guide
- Agent-Skill-Vector-Search-Guide
- Agent-Skill-BM25-Search-Guide
Search
Server
Setup
- Pluggable-Providers-Spec
- GraphRAG-Integration-Spec
- Agent-Brain-Plugin-Spec
- Multi-Instance-Architecture-Spec