bmlibrarian/doc/users/document_embedding_guide.md at master · hherb/bmlibrarian

Document Embedding Guide

Overview

The Document Embedder creates vector embeddings for documents in your BMLibrarian knowledge base. These embeddings enable semantic search capabilities, allowing you to find documents based on meaning rather than just keywords.

Features

Automatic Processing: Finds and embeds documents without existing embeddings
Multiple Models: Support for any Ollama embedding model
Source Filtering: Embed specific document sources (medRxiv, PubMed, etc.)
Progress Tracking: Visual progress bars and detailed statistics
Batch Processing: Efficient batch processing for large datasets
Database Integration: Seamless integration with bmlibrarian's chunk and embedding tables

Installation Requirements

The document embedder requires:

# Core dependency
uv pip install ollama

# Optional: Progress bars
uv pip install tqdm

Quick Start

Basic Embedding

Embed 100 documents from medRxiv:

uv run python embed_documents_cli.py embed --source medrxiv --limit 100

Check Status

View embedding statistics:

uv run python embed_documents_cli.py status

Count Unembedded Documents

Count documents that need embeddings:

uv run python embed_documents_cli.py count --source medrxiv

Command Reference

`embed` - Generate Embeddings

Generate embeddings for documents without existing embeddings.

Usage:

uv run python embed_documents_cli.py embed [OPTIONS]

Options:

--source NAME: Filter by source name (e.g., medrxiv, pubmed)
--limit N: Maximum number of documents to embed
--batch-size N: Documents per batch (default: 100)
--model NAME: Ollama model to use (default: snowflake-arctic-embed2:latest)
-v, --verbose: Enable verbose logging

Examples:

Embed medRxiv abstracts:

uv run python embed_documents_cli.py embed --source medrxiv --limit 100

Embed all documents:

uv run python embed_documents_cli.py embed

Use a different model:

uv run python embed_documents_cli.py embed --model nomic-embed-text:latest --limit 50

Large batch processing:

uv run python embed_documents_cli.py embed --source medrxiv --batch-size 500

`count` - Count Unembedded Documents

Count how many documents don't have embeddings yet.

Usage:

uv run python embed_documents_cli.py count [OPTIONS]

Options:

--source NAME: Filter by source name
--model NAME: Check for specific model's embeddings

Examples:

Count all unembedded documents:

uv run python embed_documents_cli.py count

Count by source:

uv run python embed_documents_cli.py count --source medrxiv

`status` - Show Embedding Statistics

Display detailed statistics about embeddings in the database.

Usage:

uv run python embed_documents_cli.py status [OPTIONS]

Options:

--model NAME: Check status for specific model

Output includes:

Model information (name, ID, dimension)
Total documents with abstracts
Documents with/without embeddings
Breakdown by source
Percentage embedded per source

Configuration

Model Selection

BMLibrarian supports any Ollama embedding model. Common choices:

Model	Dimension	Speed	Quality	Use Case
`snowflake-arctic-embed2:latest`	1024	Medium	Excellent	Recommended default
`nomic-embed-text:latest`	768	Fast	Good	Quick embeddings
`mxbai-embed-large:latest`	1024	Slow	Excellent	Highest quality

To use a different model:

uv run python embed_documents_cli.py embed --model nomic-embed-text:latest

Ollama Configuration

The embedder connects to Ollama at http://localhost:11434 by default. Ensure Ollama is running:

# Check Ollama status
ollama list

# Pull a model if needed
ollama pull snowflake-arctic-embed2:latest

Programmatic Usage

You can use the embedder directly in Python code:

from bmlibrarian.embeddings import DocumentEmbedder

# Initialize embedder
embedder = DocumentEmbedder(model_name="snowflake-arctic-embed2:latest")

# Embed documents
stats = embedder.embed_documents(
    source_name='medrxiv',
    limit=100,
    batch_size=50
)

print(f"Embedded {stats['embedded_count']} documents")
print(f"Failed: {stats['failed_count']}")

# Count unembedded documents
count = embedder.count_documents_without_embeddings(source_name='medrxiv')
print(f"Remaining: {count} documents")

Workflow Recommendations

Initial Setup

Start with a small batch to test:

uv run python embed_documents_cli.py embed --source medrxiv --limit 10

Check status:

uv run python embed_documents_cli.py status

Scale up gradually:

uv run python embed_documents_cli.py embed --source medrxiv --limit 1000

Production Workflow

Import documents (e.g., from medRxiv):

uv run python medrxiv_import_cli.py update --days-to-fetch 7

Embed new documents:

uv run python embed_documents_cli.py embed --source medrxiv

Monitor progress:

uv run python embed_documents_cli.py status

Daily Maintenance

Set up automated embedding generation:

#!/bin/bash
# daily_embedding.sh

# Embed new medRxiv papers
uv run python embed_documents_cli.py embed --source medrxiv --limit 500

# Check status
uv run python embed_documents_cli.py status

Schedule with cron:

# Run daily at 4 AM
0 4 * * * cd /path/to/bmlibrarian && ./daily_embedding.sh

Performance Considerations

Embedding Speed

Typical performance on a modern CPU:

Model	Embeddings/sec	Time for 1000 docs
snowflake-arctic-embed2	5-10	2-3 minutes
nomic-embed-text	10-20	1-2 minutes
mxbai-embed-large	2-5	3-5 minutes

Factors affecting speed:

CPU performance
Model size
Abstract length
Ollama configuration

Database Performance

The embedder:

Uses connection pooling for efficiency
Batch commits every 100 documents
Checks for existing embeddings to avoid duplicates
Creates indexes automatically via schema

Memory Usage

Small models (768-dim): ~2GB RAM
Large models (1024-dim): ~4GB RAM
Batch processing: Minimal memory overhead

Database Schema

Tables Involved

chunks - Text chunks:

CREATE TABLE chunks (
    id INT PRIMARY KEY,
    document_id INT REFERENCES document(id),
    chunking_strategy_id INT,
    chunktype_id INT,
    document_title TEXT,
    text TEXT,
    chunklength INT,
    chunk_no INT
);

embedding_base - Base embedding info:

CREATE TABLE embedding_base (
    id INT PRIMARY KEY,
    chunk_id INT REFERENCES chunks(id),
    model_id INT REFERENCES embedding_models(id)
);

emb_768 / emb_1024 - Actual vectors:

CREATE TABLE emb_768 (
    embedding vector(768)
) INHERITS (embedding_base);

CREATE TABLE emb_1024 (
    embedding vector(1024)
) INHERITS (embedding_base);

For Abstracts

The embedder creates:

One chunk per document (chunk_no=0)
Chunk text = document abstract
Embedding stored in emb_768 or emb_1024 based on model

Troubleshooting

"ollama not installed"

Problem: Ollama Python package not found.

Solution:

uv pip install ollama

"Model not found in Ollama"

Problem: Specified model not available.

Solution:

# Pull the model
ollama pull snowflake-arctic-embed2:latest

# Or use a different model
uv run python embed_documents_cli.py embed --model nomic-embed-text:latest

"Connection refused to Ollama"

Problem: Ollama server not running.

Solution:

# Start Ollama (varies by OS)
ollama serve

# Or check if it's running
curl http://localhost:11434/api/tags

Slow Embedding Generation

Problem: Embeddings taking too long.

Solutions:

Use a faster model: --model nomic-embed-text:latest
Reduce batch size: --batch-size 50
Check Ollama configuration
Consider GPU acceleration for Ollama

"Unsupported embedding dimension"

Problem: Model produces dimension other than 768 or 1024.

Solution: BMLibrarian currently supports 768 and 1024 dimensions. Use a supported model:

768-dim: nomic-embed-text, gte-base
1024-dim: snowflake-arctic-embed2, mxbai-embed-large

Integration with Research Workflows

Semantic Search

Once documents are embedded, use semantic search in queries:

from bmlibrarian.database import search_with_semantic

# Semantic search uses embeddings automatically
results = search_with_semantic(
    search_text="cardiovascular benefits of exercise",
    threshold=0.7,
    max_results=50
)

for doc in results:
    print(f"{doc['title']} - Similarity: {doc['semantic_score']:.3f}")

Hybrid Search

Combine keyword and semantic search:

from bmlibrarian.database import search_hybrid

documents, metadata = search_hybrid(
    search_text="cardiovascular benefits of exercise",
    query_text="cardiovascular & exercise",
    search_config={
        'semantic': {'enabled': True, 'similarity_threshold': 0.7},
        'bm25': {'enabled': True},
        'keyword': {'enabled': True}
    }
)

Future Enhancements

Planned features:

Full-text chunking: Embed document full text in addition to abstracts
Chunk overlap: Sliding window chunking for better coverage
Multiple embedding models: Use different models for different purposes
Incremental updates: Re-embed when documents are updated
Quality metrics: Track embedding quality and coverage

FilesExpand file tree

document_embedding_guide.md

Latest commit

History

document_embedding_guide.md

File metadata and controls

Document Embedding Guide

Overview

Features

Installation Requirements

Quick Start

Basic Embedding

Check Status

Count Unembedded Documents

Command Reference

embed - Generate Embeddings

count - Count Unembedded Documents

status - Show Embedding Statistics

Configuration

Model Selection

Ollama Configuration

Programmatic Usage

Workflow Recommendations

Initial Setup

Production Workflow

Daily Maintenance

Performance Considerations

Embedding Speed

Database Performance

Memory Usage

Database Schema

Tables Involved

For Abstracts

Troubleshooting

"ollama not installed"

"Model not found in Ollama"

"Connection refused to Ollama"

Slow Embedding Generation

"Unsupported embedding dimension"

Integration with Research Workflows

Semantic Search

Hybrid Search

Future Enhancements

See Also

`embed` - Generate Embeddings

`count` - Count Unembedded Documents

`status` - Show Embedding Statistics