labrat-0/rag-embedding-generator


RAG Embedding Generator

Apify Actor that generates vector embeddings from text or chunked datasets using OpenAI or Cohere. It chains directly with RAG Content Chunker or any crawler's output dataset, and emits flat embedding objects with pass-through metadata, ready for any vector database. No vendor lock-in. MCP-ready for AI agent integration.

Features

  • Two embedding providers: OpenAI (text-embedding-3-small/large, ada-002) and Cohere (embed-english/multilingual-v3.0, light variants)
  • Three input modes: single text, text list, or dataset chaining from any previous actor
  • Pass-through metadata from RAG Content Chunker (chunk_id, source_url, page_title, section_heading)
  • Batched API requests for throughput (up to 2048 texts per OpenAI call, 96 per Cohere call)
  • Exponential backoff retry on rate limits and transient failures (3 attempts)
  • API key marked isSecret -- never logged, never stored, never included in output
  • Hardcoded API base URLs to prevent SSRF attacks
  • Input validation and sanitization (key format checks, dataset ID regex, text length limits)
  • Output: raw float arrays compatible with any vector DB (Pinecone, Qdrant, Weaviate, Chroma, etc.)
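
The retry behavior above can be sketched in Python. This is illustrative only, not the actor's actual code; the delay schedule and exception handling are assumptions:

```python
import time

def with_retry(fn, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff (1s, 2s, ...)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

In practice the retried call would be the provider's embeddings request, and only rate-limit or transient errors would be retried.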

Requirements

  • Python 3.11+
  • Apify platform account (for running as Actor)
  • OpenAI or Cohere API key

Install dependencies:

pip install -r requirements.txt

Configuration

Actor Inputs

Defined in .actor/INPUT_SCHEMA.json:

  • api_key (string, required) -- your OpenAI or Cohere API key. Marked isSecret
  • provider (string, optional) -- "openai" (default) or "cohere"
  • model (string, optional) -- embedding model to use. Default: "text-embedding-3-small"
  • text (string, optional) -- a single text string to embed, max 100,000 characters
  • texts (array, optional) -- a list of text strings to embed, max 10,000 items
  • dataset_id (string, optional) -- Apify dataset ID from a previous actor run (e.g., RAG Content Chunker). Takes priority over text/texts
  • dataset_field (string, optional) -- field to read from each dataset item. Default: "text". Supports dot notation
  • batch_size (integer, optional) -- texts per API request. Default: 128. Max: 2048 (OpenAI) or 96 (Cohere)
  • include_text (boolean, optional) -- include original text in output. Default: false

api_key is always required, together with at least one of text, texts, or dataset_id.
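
Dot notation in dataset_field presumably resolves nested keys one level at a time. A minimal sketch of such a resolver (hypothetical helper, not the actor's real code):

```python
def get_field(item, path, default=None):
    """Resolve a dot-notation path like 'metadata.text' against a dataset item."""
    value = item
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return default
        value = value[key]
    return value
```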

Supported Models

| Provider | Model | Dimensions | Notes |
|----------|-------|------------|-------|
| OpenAI | text-embedding-3-small | 1536 | Default. Cheapest, good quality |
| OpenAI | text-embedding-3-large | 3072 | Best quality, higher cost |
| OpenAI | text-embedding-ada-002 | 1536 | Legacy, widely deployed |
| Cohere | embed-english-v3.0 | 1024 | English-optimized |
| Cohere | embed-multilingual-v3.0 | 1024 | 100+ languages |
| Cohere | embed-english-light-v3.0 | 384 | Faster, smaller vectors |
| Cohere | embed-multilingual-light-v3.0 | 384 | Faster, multilingual |
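
Downstream vector indexes must be created with a dimension that matches the chosen model. A lookup table with the values from the table above, for use in your own pipeline code:

```python
# Model -> embedding dimension, per the Supported Models table.
MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
    "embed-english-v3.0": 1024,
    "embed-multilingual-v3.0": 1024,
    "embed-english-light-v3.0": 384,
    "embed-multilingual-light-v3.0": 384,
}
```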

Usage

Local (CLI)

APIFY_TOKEN=your-token apify run

Single Text Input

{
  "api_key": "sk-your-openai-key",
  "provider": "openai",
  "model": "text-embedding-3-small",
  "text": "This is a sample text to embed into a vector representation."
}

Text List Input

{
  "api_key": "sk-your-openai-key",
  "texts": [
    "First document to embed.",
    "Second document to embed.",
    "Third document to embed."
  ]
}

Dataset Chaining (from RAG Content Chunker)

{
  "api_key": "sk-your-openai-key",
  "dataset_id": "abc123XYZ",
  "dataset_field": "text",
  "model": "text-embedding-3-small",
  "batch_size": 256
}
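
Note that batch_size is capped per provider, so with Cohere a value like 256 would presumably be clamped down to 96. A sketch of batching under that assumption (illustrative, not the actor's code):

```python
def make_batches(texts, batch_size, provider):
    """Split texts into batches, clamped to the provider's per-request cap."""
    caps = {"openai": 2048, "cohere": 96}
    size = min(batch_size, caps[provider])
    return [texts[i:i + size] for i in range(0, len(texts), size)]
```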

Example Output

Each embedding is a separate dataset item:

{
  "index": 0,
  "embedding": [0.0123, -0.0456, 0.0789, "...1536 floats total"],
  "dimensions": 1536,
  "token_count": 12,
  "chunk_id": "a1b2c3d4e5f67890",
  "source_url": "https://example.com/page",
  "page_title": "Example Page",
  "section_heading": "Introduction"
}

A summary item is appended at the end:

{
  "_summary": true,
  "total_embeddings": 42,
  "total_tokens": 8374,
  "provider": "openai",
  "model": "text-embedding-3-small",
  "dimensions": 1536,
  "processing_time": 3.241,
  "billing": {
    "total_embeddings": 42,
    "amount": 0.0126,
    "rate_per_embedding": 0.0003
  }
}
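
When loading results into a vector store, the dataset items can be mapped to the id/values/metadata shape most vector DBs expect, skipping the summary row. An illustrative transform (the function name and the chunk_id-or-index fallback are assumptions; field names come from the example output above):

```python
def to_vectors(items):
    """Map actor output items to (id, values, metadata) records for a vector DB."""
    vectors = []
    for item in items:
        if item.get("_summary"):
            continue  # skip the appended summary item
        vectors.append({
            "id": item.get("chunk_id", str(item["index"])),
            "values": item["embedding"],
            "metadata": {
                k: item[k]
                for k in ("source_url", "page_title", "section_heading")
                if k in item
            },
        })
    return vectors
```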

Pipeline Position

This actor fills the embedding step in a standard RAG pipeline:

Crawl (Website Content Crawler, 101K+ users)
  -> Clean (optional preprocessing)
    -> Chunk (RAG Content Chunker)
      -> Embed (this actor)
        -> Store (Pinecone, Qdrant, Weaviate integrations)

Chaining with RAG Content Chunker

  1. Run RAG Content Chunker on your text or crawler output
  2. Copy the output dataset ID from the chunker run
  3. Pass it as dataset_id to this actor
  4. This actor reads each chunk, skips _summary rows, and passes through chunk_id, source_url, page_title, and section_heading metadata

The output vectors include all the metadata needed to store them in a vector database with proper source attribution.
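
Step 4 above can be sketched as a small transform (illustrative; the function name and defaults are assumptions, and the metadata keys come from the chaining description):

```python
PASSTHROUGH_KEYS = ("chunk_id", "source_url", "page_title", "section_heading")

def prepare_chunks(items, field="text"):
    """Drop _summary rows and keep each chunk's text plus pass-through metadata."""
    out = []
    for item in items:
        if item.get("_summary"):
            continue
        out.append({
            "text": item.get(field, ""),
            **{k: item[k] for k in PASSTHROUGH_KEYS if k in item},
        })
    return out
```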

Architecture

  • src/agent/main.py -- Actor entry point, input routing (text/texts/dataset), dataset loading, output
  • src/agent/embedder.py -- Core embedding engine, OpenAI + Cohere API calls, batching, retry logic
  • src/agent/validation.py -- Input validation, API key format checks, provider/model whitelist, sanitization
  • src/agent/pricing.py -- PPE billing calculator ($0.0003/embedding)
  • skill.md -- Machine-readable skill contract for agent discovery

Security

  • API key handling: Marked isSecret in input schema, validated for format only, never logged or stored, stripped from error messages
  • SSRF prevention: Outbound requests hardcoded to api.openai.com and api.cohere.ai only -- no user-supplied URLs
  • Provider/model whitelist: Only known provider+model combinations accepted, prevents arbitrary endpoint injection
  • Input sanitization: Control characters stripped, dataset IDs and field names regex-validated, text length bounded
  • Error safety: All error messages pass through _sanitize_error() to ensure API keys are never leaked in logs or output
  • No data retention: Texts and embeddings exist only in memory during the run
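
The key-redaction idea behind _sanitize_error() might look like this. This is a sketch only; the actor's actual pattern and implementation are not shown in this README:

```python
import re

# Matches strings that look like provider API keys (illustrative pattern).
_KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{10,}")

def sanitize_error(message):
    """Redact anything that looks like an API key before logging or output."""
    return _KEY_PATTERN.sub("[REDACTED]", message)
```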

Pricing

Pay-Per-Event (PPE): $0.0003 per embedding ($0.30 per 1,000 embeddings).

This is the actor's platform fee only. You also pay the embedding provider (OpenAI or Cohere) directly via your own API key.

| Content Size | Approx. Embeddings | Actor Fee | Provider Fee (OpenAI 3-small) |
|--------------|--------------------|-----------|-------------------------------|
| Single blog post | 10-20 | $0.003-$0.006 | ~$0.001 |
| 10-page website | 50-100 | $0.015-$0.03 | ~$0.005 |
| 100-page docs site | 500-1,000 | $0.15-$0.30 | ~$0.05 |
| Large knowledge base | 5,000-10,000 | $1.50-$3.00 | ~$0.50 |
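
The actor fee is simply embeddings x rate. A sketch matching the billing object shown in Example Output (rate from this section; the rounding is an assumption):

```python
RATE_PER_EMBEDDING = 0.0003  # USD per embedding, i.e. $0.30 per 1,000

def billing_summary(total_embeddings):
    """Compute the actor's platform fee, mirroring the output's billing object."""
    return {
        "total_embeddings": total_embeddings,
        "amount": round(total_embeddings * RATE_PER_EMBEDDING, 6),
        "rate_per_embedding": RATE_PER_EMBEDDING,
    }
```

For example, 42 embeddings cost 42 x $0.0003 = $0.0126 in platform fees, matching the summary item shown earlier. Provider fees are billed separately by OpenAI or Cohere.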

Troubleshooting

  • "API key is required": Provide your OpenAI or Cohere API key in the api_key field
  • "Invalid OpenAI API key format": OpenAI keys start with sk- followed by alphanumeric characters
  • "Invalid model for provider": Check the supported models table above. Model names are case-sensitive
  • "No input provided": Supply at least one of text, texts, or dataset_id
  • "Text exceeds maximum length": Individual texts are limited to 100K characters. Use texts or dataset_id for bulk
  • "Invalid dataset_id format": Must be alphanumeric with hyphens/underscores, 1-64 characters
  • "API key is invalid or expired": Your provider API key was rejected. Verify it in your OpenAI/Cohere dashboard
  • "Failed after 3 attempts": Transient API error. Try again, or reduce batch_size if hitting rate limits
  • Dataset errors: Verify the dataset ID exists and the actor has access to it

License

See LICENSE file for details.


MCP Integration

This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.

  • Endpoint: https://mcp.apify.com?tools=labrat011/rag-embedding-generator
  • Auth: Authorization: Bearer <APIFY_TOKEN>
  • Transport: Streamable HTTP
  • Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI

Example MCP config (Claude Desktop / Cursor):

{
    "mcpServers": {
        "rag-embedding-generator": {
            "url": "https://mcp.apify.com?tools=labrat011/rag-embedding-generator",
            "headers": {
                "Authorization": "Bearer <APIFY_TOKEN>"
            }
        }
    }
}

AI agents can use this actor to generate vector embeddings from text using OpenAI or Cohere, embed chunked documents, and prepare data for vector database storage -- all as a callable MCP tool.
