labrat-0/rag-embedding-generator


RAG Embedding Generator

Apify Actor that generates vector embeddings from text or chunked datasets using OpenAI or Cohere. It chains directly with RAG Content Chunker or any crawler's output dataset, and emits flat embedding objects with pass-through metadata, ready for any vector database. No vendor lock-in. MCP-ready for AI agent integration.

Features

  • Two embedding providers: OpenAI (text-embedding-3-small/large, ada-002) and Cohere (embed-english/multilingual-v3.0, light variants)
  • Three input modes: single text, text list, or dataset chaining from any previous actor
  • Pass-through metadata from RAG Content Chunker (chunk_id, source_url, page_title, section_heading)
  • Batched API requests for throughput (up to 2048 texts per OpenAI call, 96 per Cohere call)
  • Exponential backoff retry on rate limits and transient failures (3 attempts)
  • API key marked isSecret -- never logged, never stored, never included in output
  • Hardcoded API base URLs to prevent SSRF attacks
  • Input validation and sanitization (key format checks, dataset ID regex, text length limits)
  • Output: raw float arrays compatible with any vector DB (Pinecone, Qdrant, Weaviate, Chroma, etc.)
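
The retry behavior above can be sketched in Python. This is illustrative only, not the actor's actual code; the delay schedule and exception handling are assumptions:

```python
import time

def with_retry(fn, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff (1s, 2s, ...)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

In practice the retried call would be the provider's embeddings request, and only rate-limit or transient errors would be retried.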

Requirements

  • Python 3.11+
  • Apify platform account (for running as Actor)
  • OpenAI or Cohere API key

Install dependencies:

pip install -r requirements.txt

Configuration

Actor Inputs

Defined in .actor/INPUT_SCHEMA.json:

  • api_key (string, required) -- your OpenAI or Cohere API key. Marked isSecret
  • provider (string, optional) -- "openai" (default) or "cohere"
  • model (string, optional) -- embedding model to use. Default: "text-embedding-3-small"
  • text (string, optional) -- a single text string to embed, max 100,000 characters
  • texts (array, optional) -- a list of text strings to embed, max 10,000 items
  • dataset_id (string, optional) -- Apify dataset ID from a previous actor run (e.g., RAG Content Chunker). Takes priority over text/texts
  • dataset_field (string, optional) -- field to read from each dataset item. Default: "text". Supports dot notation
  • batch_size (integer, optional) -- texts per API request. Default: 128. Max: 2048 (OpenAI) or 96 (Cohere)
  • include_text (boolean, optional) -- include original text in output. Default: false

api_key is always required, together with at least one of text, texts, or dataset_id.
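
Dot notation in dataset_field presumably resolves nested keys one level at a time. A minimal sketch of such a resolver (hypothetical helper, not the actor's real code):

```python
def get_field(item, path, default=None):
    """Resolve a dot-notation path like 'metadata.text' against a dataset item."""
    value = item
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return default
        value = value[key]
    return value
```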

Supported Models

| Provider | Model | Dimensions | Notes |
|----------|-------|------------|-------|
| OpenAI | text-embedding-3-small | 1536 | Default. Cheapest, good quality |
| OpenAI | text-embedding-3-large | 3072 | Best quality, higher cost |
| OpenAI | text-embedding-ada-002 | 1536 | Legacy, widely deployed |
| Cohere | embed-english-v3.0 | 1024 | English-optimized |
| Cohere | embed-multilingual-v3.0 | 1024 | 100+ languages |
| Cohere | embed-english-light-v3.0 | 384 | Faster, smaller vectors |
| Cohere | embed-multilingual-light-v3.0 | 384 | Faster, multilingual |
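
Downstream vector indexes must be created with a dimension that matches the chosen model. A lookup table with the values from the table above, for use in your own pipeline code:

```python
# Model -> embedding dimension, per the Supported Models table.
MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
    "embed-english-v3.0": 1024,
    "embed-multilingual-v3.0": 1024,
    "embed-english-light-v3.0": 384,
    "embed-multilingual-light-v3.0": 384,
}
```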

Usage

Local (CLI)

APIFY_TOKEN=your-token apify run

Single Text Input

{
  "api_key": "sk-your-openai-key",
  "provider": "openai",
  "model": "text-embedding-3-small",
  "text": "This is a sample text to embed into a vector representation."
}

Text List Input

{
  "api_key": "sk-your-openai-key",
  "texts": [
    "First document to embed.",
    "Second document to embed.",
    "Third document to embed."
  ]
}

Dataset Chaining (from RAG Content Chunker)

{
  "api_key": "sk-your-openai-key",
  "dataset_id": "abc123XYZ",
  "dataset_field": "text",
  "model": "text-embedding-3-small",
  "batch_size": 256
}
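
Note that batch_size is capped per provider, so with Cohere a value like 256 would presumably be clamped down to 96. A sketch of batching under that assumption (illustrative, not the actor's code):

```python
def make_batches(texts, batch_size, provider):
    """Split texts into batches, clamped to the provider's per-request cap."""
    caps = {"openai": 2048, "cohere": 96}
    size = min(batch_size, caps[provider])
    return [texts[i:i + size] for i in range(0, len(texts), size)]
```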

Example Output

Each embedding is a separate dataset item:

{
  "index": 0,
  "embedding": [0.0123, -0.0456, 0.0789, "...1536 floats total"],
  "dimensions": 1536,
  "token_count": 12,
  "chunk_id": "a1b2c3d4e5f67890",
  "source_url": "https://example.com/page",
  "page_title": "Example Page",
  "section_heading": "Introduction"
}

A summary item is appended at the end:

{
  "_summary": true,
  "total_embeddings": 42,
  "total_tokens": 8374,
  "provider": "openai",
  "model": "text-embedding-3-small",
  "dimensions": 1536,
  "processing_time": 3.241,
  "billing": {
    "total_embeddings": 42,
    "amount": 0.0126,
    "rate_per_embedding": 0.0003
  }
}
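
When loading results into a vector store, the dataset items can be mapped to the id/values/metadata shape most vector DBs expect, skipping the summary row. An illustrative transform (the function name and the chunk_id-or-index fallback are assumptions; field names come from the example output above):

```python
def to_vectors(items):
    """Map actor output items to (id, values, metadata) records for a vector DB."""
    vectors = []
    for item in items:
        if item.get("_summary"):
            continue  # skip the appended summary item
        vectors.append({
            "id": item.get("chunk_id", str(item["index"])),
            "values": item["embedding"],
            "metadata": {
                k: item[k]
                for k in ("source_url", "page_title", "section_heading")
                if k in item
            },
        })
    return vectors
```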

Pipeline Position

This actor fills the embedding step in a standard RAG pipeline:

Crawl (Website Content Crawler, 101K+ users)
  -> Clean (optional preprocessing)
    -> Chunk (RAG Content Chunker)
      -> Embed (this actor)
        -> Store (Pinecone, Qdrant, Weaviate integrations)

Chaining with RAG Content Chunker

  1. Run RAG Content Chunker on your text or crawler output
  2. Copy the output dataset ID from the chunker run
  3. Pass it as dataset_id to this actor
  4. This actor reads each chunk, skips _summary rows, and passes through chunk_id, source_url, page_title, and section_heading metadata

The output vectors include all the metadata needed to store them in a vector database with proper source attribution.
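
Step 4 above can be sketched as a small transform (illustrative; the function name and defaults are assumptions, and the metadata keys come from the chaining description):

```python
PASSTHROUGH_KEYS = ("chunk_id", "source_url", "page_title", "section_heading")

def prepare_chunks(items, field="text"):
    """Drop _summary rows and keep each chunk's text plus pass-through metadata."""
    out = []
    for item in items:
        if item.get("_summary"):
            continue
        out.append({
            "text": item.get(field, ""),
            **{k: item[k] for k in PASSTHROUGH_KEYS if k in item},
        })
    return out
```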

Architecture

  • src/agent/main.py -- Actor entry point, input routing (text/texts/dataset), dataset loading, output
  • src/agent/embedder.py -- Core embedding engine, OpenAI + Cohere API calls, batching, retry logic
  • src/agent/validation.py -- Input validation, API key format checks, provider/model whitelist, sanitization
  • src/agent/pricing.py -- PPE billing calculator ($0.0003/embedding)
  • skill.md -- Machine-readable skill contract for agent discovery

Security

  • API key handling: Marked isSecret in input schema, validated for format only, never logged or stored, stripped from error messages
  • SSRF prevention: Outbound requests hardcoded to api.openai.com and api.cohere.ai only -- no user-supplied URLs
  • Provider/model whitelist: Only known provider+model combinations accepted, prevents arbitrary endpoint injection
  • Input sanitization: Control characters stripped, dataset IDs and field names regex-validated, text length bounded
  • Error safety: All error messages pass through _sanitize_error() to ensure API keys are never leaked in logs or output
  • No data retention: Texts and embeddings exist only in memory during the run
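
The key-redaction idea behind _sanitize_error() might look like this. This is a sketch only; the actor's actual pattern and implementation are not shown in this README:

```python
import re

# Matches strings that look like provider API keys (illustrative pattern).
_KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{10,}")

def sanitize_error(message):
    """Redact anything that looks like an API key before logging or output."""
    return _KEY_PATTERN.sub("[REDACTED]", message)
```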

Pricing

Pay-Per-Event (PPE): $0.0003 per embedding ($0.30 per 1,000 embeddings).

This is the actor's platform fee only. You also pay the embedding provider (OpenAI or Cohere) directly via your own API key.

| Content Size | Approx. Embeddings | Actor Fee | Provider Fee (OpenAI 3-small) |
|--------------|--------------------|-----------|-------------------------------|
| Single blog post | 10-20 | $0.003-$0.006 | ~$0.001 |
| 10-page website | 50-100 | $0.015-$0.03 | ~$0.005 |
| 100-page docs site | 500-1,000 | $0.15-$0.30 | ~$0.05 |
| Large knowledge base | 5,000-10,000 | $1.50-$3.00 | ~$0.50 |
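
The actor fee is simply embeddings x rate. A sketch matching the billing object shown in Example Output (rate from this section; the rounding is an assumption):

```python
RATE_PER_EMBEDDING = 0.0003  # USD per embedding, i.e. $0.30 per 1,000

def billing_summary(total_embeddings):
    """Compute the actor's platform fee, mirroring the output's billing object."""
    return {
        "total_embeddings": total_embeddings,
        "amount": round(total_embeddings * RATE_PER_EMBEDDING, 6),
        "rate_per_embedding": RATE_PER_EMBEDDING,
    }
```

For example, 42 embeddings cost 42 x $0.0003 = $0.0126 in platform fees, matching the summary item shown earlier. Provider fees are billed separately by OpenAI or Cohere.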

Troubleshooting

  • "API key is required": Provide your OpenAI or Cohere API key in the api_key field
  • "Invalid OpenAI API key format": OpenAI keys start with sk- followed by alphanumeric characters
  • "Invalid model for provider": Check the supported models table above. Model names are case-sensitive
  • "No input provided": Supply at least one of text, texts, or dataset_id
  • "Text exceeds maximum length": Individual texts are limited to 100K characters. Use texts or dataset_id for bulk
  • "Invalid dataset_id format": Must be alphanumeric with hyphens/underscores, 1-64 characters
  • "API key is invalid or expired": Your provider API key was rejected. Verify it in your OpenAI/Cohere dashboard
  • "Failed after 3 attempts": Transient API error. Try again, or reduce batch_size if hitting rate limits
  • Dataset errors: Verify the dataset ID exists and the actor has access to it

License

See LICENSE file for details.


MCP Integration

This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.

  • Endpoint: https://mcp.apify.com?tools=labrat011/rag-embedding-generator
  • Auth: Authorization: Bearer <APIFY_TOKEN>
  • Transport: Streamable HTTP
  • Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI

Example MCP config (Claude Desktop / Cursor):

{
    "mcpServers": {
        "rag-embedding-generator": {
            "url": "https://mcp.apify.com?tools=labrat011/rag-embedding-generator",
            "headers": {
                "Authorization": "Bearer <APIFY_TOKEN>"
            }
        }
    }
}

AI agents can use this actor to generate vector embeddings from text using OpenAI or Cohere, embed chunked documents, and prepare data for vector database storage -- all as a callable MCP tool.
