The Document Embedder creates vector embeddings for documents in your BMLibrarian knowledge base. These embeddings enable semantic search capabilities, allowing you to find documents based on meaning rather than just keywords.
- Automatic Processing: Finds and embeds documents without existing embeddings
- Multiple Models: Support for any Ollama embedding model
- Source Filtering: Embed specific document sources (medRxiv, PubMed, etc.)
- Progress Tracking: Visual progress bars and detailed statistics
- Batch Processing: Efficient batch processing for large datasets
- Database Integration: Seamless integration with bmlibrarian's chunk and embedding tables
The document embedder requires:
```bash
# Core dependency
uv pip install ollama

# Optional: progress bars
uv pip install tqdm
```

Embed 100 documents from medRxiv:

```bash
uv run python embed_documents_cli.py embed --source medrxiv --limit 100
```

View embedding statistics:

```bash
uv run python embed_documents_cli.py status
```

Count documents that need embeddings:

```bash
uv run python embed_documents_cli.py count --source medrxiv
```

Generate embeddings for documents without existing embeddings.
Usage:
```bash
uv run python embed_documents_cli.py embed [OPTIONS]
```

Options:

- `--source NAME`: Filter by source name (e.g., medrxiv, pubmed)
- `--limit N`: Maximum number of documents to embed
- `--batch-size N`: Documents per batch (default: 100)
- `--model NAME`: Ollama model to use (default: snowflake-arctic-embed2:latest)
- `-v, --verbose`: Enable verbose logging
Examples:
- Embed medRxiv abstracts:

  ```bash
  uv run python embed_documents_cli.py embed --source medrxiv --limit 100
  ```

- Embed all documents:

  ```bash
  uv run python embed_documents_cli.py embed
  ```

- Use a different model:

  ```bash
  uv run python embed_documents_cli.py embed --model nomic-embed-text:latest --limit 50
  ```

- Large batch processing:

  ```bash
  uv run python embed_documents_cli.py embed --source medrxiv --batch-size 500
  ```
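The `--batch-size` option controls how many documents are processed per round. The batching itself amounts to a simple slicing loop; a minimal sketch (illustrative only, not the actual implementation):

```python
def batched(items, batch_size=100):
    """Yield successive fixed-size batches from a list of documents."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 250 document IDs split into batches of 100 -> batch sizes 100, 100, 50
doc_ids = list(range(250))
sizes = [len(batch) for batch in batched(doc_ids, batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Larger batches mean fewer database commits but more work lost if a batch fails partway.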
Count how many documents don't have embeddings yet.
Usage:
```bash
uv run python embed_documents_cli.py count [OPTIONS]
```

Options:

- `--source NAME`: Filter by source name
- `--model NAME`: Check for a specific model's embeddings
Examples:
- Count all unembedded documents:

  ```bash
  uv run python embed_documents_cli.py count
  ```

- Count by source:

  ```bash
  uv run python embed_documents_cli.py count --source medrxiv
  ```
Display detailed statistics about embeddings in the database.
Usage:
```bash
uv run python embed_documents_cli.py status [OPTIONS]
```

Options:

- `--model NAME`: Check status for a specific model
Output includes:
- Model information (name, ID, dimension)
- Total documents with abstracts
- Documents with/without embeddings
- Breakdown by source
- Percentage embedded per source
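The per-source percentage is derived directly from the embedded and total counts; a minimal sketch (the counts below are hypothetical):

```python
def pct_embedded(embedded, total):
    """Percentage of documents with embeddings, guarding against empty sources."""
    return 0.0 if total == 0 else round(100 * embedded / total, 1)

# Hypothetical per-source counts: (embedded, total with abstracts)
counts = {'medrxiv': (850, 1000), 'pubmed': (0, 0)}
report = {src: pct_embedded(e, t) for src, (e, t) in counts.items()}
print(report)  # {'medrxiv': 85.0, 'pubmed': 0.0}
```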
BMLibrarian supports any Ollama embedding model. Common choices:
| Model | Dimension | Speed | Quality | Use Case |
|---|---|---|---|---|
| snowflake-arctic-embed2:latest | 1024 | Medium | Excellent | Recommended default |
| nomic-embed-text:latest | 768 | Fast | Good | Quick embeddings |
| mxbai-embed-large:latest | 1024 | Slow | Excellent | Highest quality |
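Whatever the dimension, embeddings from these models are compared the same way at query time: by cosine similarity between vectors. A minimal sketch of that computation (illustrative, using toy 4-dimensional vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Real embeddings are 768- or 1024-dim; toy vectors shown here
print(cosine_similarity([1, 0, 1, 0], [1, 0, 1, 0]))  # 1.0 (identical)
print(cosine_similarity([1, 0, 1, 0], [0, 1, 0, 1]))  # 0.0 (orthogonal)
```

Note that vectors from different models (or different dimensions) are not comparable with each other.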
To use a different model:
```bash
uv run python embed_documents_cli.py embed --model nomic-embed-text:latest
```

The embedder connects to Ollama at http://localhost:11434 by default. Ensure Ollama is running:

```bash
# Check Ollama status
ollama list

# Pull a model if needed
ollama pull snowflake-arctic-embed2:latest
```

You can use the embedder directly in Python code:
```python
from bmlibrarian.embeddings import DocumentEmbedder

# Initialize embedder
embedder = DocumentEmbedder(model_name="snowflake-arctic-embed2:latest")

# Embed documents
stats = embedder.embed_documents(
    source_name='medrxiv',
    limit=100,
    batch_size=50
)

print(f"Embedded {stats['embedded_count']} documents")
print(f"Failed: {stats['failed_count']}")

# Count unembedded documents
count = embedder.count_documents_without_embeddings(source_name='medrxiv')
print(f"Remaining: {count} documents")
```
1. Start with a small batch to test:

   ```bash
   uv run python embed_documents_cli.py embed --source medrxiv --limit 10
   ```

2. Check status:

   ```bash
   uv run python embed_documents_cli.py status
   ```

3. Scale up gradually:

   ```bash
   uv run python embed_documents_cli.py embed --source medrxiv --limit 1000
   ```

For ongoing updates:

1. Import documents (e.g., from medRxiv):

   ```bash
   uv run python medrxiv_import_cli.py update --days-to-fetch 7
   ```

2. Embed new documents:

   ```bash
   uv run python embed_documents_cli.py embed --source medrxiv
   ```

3. Monitor progress:

   ```bash
   uv run python embed_documents_cli.py status
   ```
Set up automated embedding generation:
```bash
#!/bin/bash
# daily_embedding.sh

# Embed new medRxiv papers
uv run python embed_documents_cli.py embed --source medrxiv --limit 500

# Check status
uv run python embed_documents_cli.py status
```

Schedule with cron:

```bash
# Run daily at 4 AM
0 4 * * * cd /path/to/bmlibrarian && ./daily_embedding.sh
```

Typical performance on a modern CPU:
| Model | Embeddings/sec | Time for 1000 docs |
|---|---|---|
| snowflake-arctic-embed2 | 5-10 | 2-3 minutes |
| nomic-embed-text | 10-20 | 1-2 minutes |
| mxbai-embed-large | 2-5 | 3-8 minutes |
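These rates translate directly into wall-clock estimates; a quick back-of-envelope helper (rates taken from the table above):

```python
def minutes_for(docs, rate_per_sec):
    """Estimated wall-clock minutes to embed `docs` documents at a given rate."""
    return docs / rate_per_sec / 60

# snowflake-arctic-embed2 at 5-10 embeddings/sec
print(round(minutes_for(1000, 5), 1))   # 3.3
print(round(minutes_for(1000, 10), 1))  # 1.7
```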
Factors affecting speed:
- CPU performance
- Model size
- Abstract length
- Ollama configuration
The embedder:
- Uses connection pooling for efficiency
- Commits in batches of 100 documents
- Checks for existing embeddings to avoid duplicates
- Creates indexes automatically via schema
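The duplicate check amounts to subtracting already-embedded document IDs from the candidate set. Sketched here with plain Python sets (in reality this is a SQL query; the IDs below are made up):

```python
def pending_documents(candidate_ids, embedded_ids):
    """Return candidate IDs that do not yet have an embedding, preserving order."""
    done = set(embedded_ids)
    return [doc_id for doc_id in candidate_ids if doc_id not in done]

candidates = [101, 102, 103, 104]
already_embedded = [102, 104]
print(pending_documents(candidates, already_embedded))  # [101, 103]
```

This makes re-running the `embed` command safe: already-processed documents are skipped.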
Memory usage:

- Small models (768-dim): ~2GB RAM
- Large models (1024-dim): ~4GB RAM
- Batch processing: Minimal memory overhead
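The stored vectors themselves are small compared with model RAM; a back-of-envelope estimate (assuming 4-byte float32 components, as pgvector stores):

```python
def vector_storage_bytes(n_docs, dim, bytes_per_component=4):
    """Approximate raw storage for n_docs embeddings of a given dimension."""
    return n_docs * dim * bytes_per_component

# 1000 documents with 1024-dim embeddings ≈ 4 MB of raw vector data
print(vector_storage_bytes(1000, 1024) / 1_000_000)  # 4.096
```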
`chunks` - Text chunks:

```sql
CREATE TABLE chunks (
    id INT PRIMARY KEY,
    document_id INT REFERENCES document(id),
    chunking_strategy_id INT,
    chunktype_id INT,
    document_title TEXT,
    text TEXT,
    chunklength INT,
    chunk_no INT
);
```

`embedding_base` - Base embedding info:

```sql
CREATE TABLE embedding_base (
    id INT PRIMARY KEY,
    chunk_id INT REFERENCES chunks(id),
    model_id INT REFERENCES embedding_models(id)
);
```

`emb_768` / `emb_1024` - Actual vectors:

```sql
CREATE TABLE emb_768 (
    embedding vector(768)
) INHERITS (embedding_base);

CREATE TABLE emb_1024 (
    embedding vector(1024)
) INHERITS (embedding_base);
```

The embedder creates:
- One chunk per document (chunk_no=0)
- Chunk text = document abstract
- Embedding stored in emb_768 or emb_1024 based on model
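The choice between `emb_768` and `emb_1024` follows mechanically from the model's output dimension; a sketch of that routing (illustrative only — the table names come from the schema above):

```python
SUPPORTED_TABLES = {768: 'emb_768', 1024: 'emb_1024'}

def table_for_dimension(dim):
    """Map an embedding dimension to its storage table, rejecting unsupported sizes."""
    try:
        return SUPPORTED_TABLES[dim]
    except KeyError:
        raise ValueError(f"Unsupported embedding dimension: {dim}")

print(table_for_dimension(768))   # emb_768
print(table_for_dimension(1024))  # emb_1024
```

A model producing any other dimension is rejected up front (see the dimension-mismatch entry under troubleshooting).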
Problem: Ollama Python package not found.

Solution:

```bash
uv pip install ollama
```

Problem: Specified model not available.

Solution:

```bash
# Pull the model
ollama pull snowflake-arctic-embed2:latest

# Or use a different model
uv run python embed_documents_cli.py embed --model nomic-embed-text:latest
```

Problem: Ollama server not running.

Solution:

```bash
# Start Ollama (varies by OS)
ollama serve

# Or check if it's running
curl http://localhost:11434/api/tags
```

Problem: Embeddings taking too long.
Solutions:
- Use a faster model: `--model nomic-embed-text:latest`
- Reduce batch size: `--batch-size 50`
- Check Ollama configuration
- Consider GPU acceleration for Ollama
Problem: Model produces dimension other than 768 or 1024.
Solution: BMLibrarian currently supports 768 and 1024 dimensions. Use a supported model:
- 768-dim: nomic-embed-text, gte-base
- 1024-dim: snowflake-arctic-embed2, mxbai-embed-large
Once documents are embedded, use semantic search in queries:
```python
from bmlibrarian.database import search_with_semantic

# Semantic search uses embeddings automatically
results = search_with_semantic(
    search_text="cardiovascular benefits of exercise",
    threshold=0.7,
    max_results=50
)

for doc in results:
    print(f"{doc['title']} - Similarity: {doc['semantic_score']:.3f}")
```

Combine keyword and semantic search:
```python
from bmlibrarian.database import search_hybrid

documents, metadata = search_hybrid(
    search_text="cardiovascular benefits of exercise",
    query_text="cardiovascular & exercise",
    search_config={
        'semantic': {'enabled': True, 'similarity_threshold': 0.7},
        'bm25': {'enabled': True},
        'keyword': {'enabled': True}
    }
)
```

Planned features:
- Full-text chunking: Embed document full text in addition to abstracts
- Chunk overlap: Sliding window chunking for better coverage
- Multiple embedding models: Use different models for different purposes
- Incremental updates: Re-embed when documents are updated
- Quality metrics: Track embedding quality and coverage
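Of these, the planned sliding-window chunking is the easiest to picture; a hedged sketch of what overlapping word-level chunks might look like (window and overlap sizes are arbitrary here, not a committed design):

```python
def sliding_chunks(text, window=5, overlap=2):
    """Split text into overlapping word-level chunks with a sliding window."""
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
    return chunks

for chunk in sliding_chunks("one two three four five six seven eight nine"):
    print(chunk)
```

The overlap means a sentence straddling a chunk boundary still appears whole in at least one chunk, at the cost of some redundant embeddings.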
- MedRxiv Import Guide - Import documents to embed
- Query Agent Guide - Use embeddings in queries
- Multi-Model Query Guide - Advanced search with embeddings