
Data Flow

Dmitrii Karataev edited this page Feb 26, 2026 · 2 revisions


This page explains how data moves through RocketRide pipelines, from ingestion to output.

Pipeline Overview

A pipeline is a directed graph of processing nodes. Each node receives data on one or more input lanes, processes it, and emits results on one or more output lanes. The engine orchestrates execution order based on the edges defined in the pipeline JSON.

Source ──► Preprocessor ──► Embeddings ──► Vector Store ──► LLM ──► Output
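Because a pipeline is a directed graph, the execution order the engine derives from the edges amounts to a topological sort. The following is a minimal sketch of that idea, not the engine's actual scheduler, using the stage names from the diagram above:

```python
from collections import defaultdict, deque

def execution_order(edges):
    """Topologically sort pipeline nodes from (from, to) edge pairs."""
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        succ[src].append(dst)
        indegree[dst] += 1
    # Start from nodes with no incoming edges (the sources).
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in succ[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order

edges = [("source", "preprocessor"), ("preprocessor", "embeddings"),
         ("embeddings", "store"), ("store", "llm"), ("llm", "output")]
print(execution_order(edges))
# ['source', 'preprocessor', 'embeddings', 'store', 'llm', 'output']
```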

Lanes

Lanes are typed channels that carry data between nodes. Each lane carries a specific kind of data:

  • text -- raw text content
  • documents -- parsed/chunked document objects with metadata
  • questions -- user questions (for search or chat)
  • answers -- LLM-generated answers
  • tags -- classification tags and labels
  • image -- image data
  • audio -- audio data
  • video -- video data
  • table -- structured tabular data
  • classifications -- classification results
  • _source -- internal source metadata

A node declares which lanes it accepts as input and which lanes it produces as output in its services.json file. For example, the OpenAI LLM node accepts questions and produces answers:

"lanes": {
    "questions": ["answers"]
}

A vector store like Chroma accepts documents (for storage, producing no output, hence the empty list) and questions (for search), from which it can emit documents, answers, or questions:

"lanes": {
    "documents": [],
    "questions": ["documents", "answers", "questions"]
}
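In other words, the lane declaration maps each accepted input lane to the lanes the node may emit on. A small sketch of that lookup, illustrative only (the engine's actual routing logic is internal):

```python
# Lane declaration shaped like the Chroma example above (from services.json).
chroma_lanes = {
    "documents": [],  # accepted for storage; nothing is emitted
    "questions": ["documents", "answers", "questions"],
}

def accepts(lanes, lane):
    """Does the node take input on this lane?"""
    return lane in lanes

def outputs_for(lanes, lane):
    """Which lanes may the node emit in response to this input lane?"""
    return lanes.get(lane, [])

print(accepts(chroma_lanes, "documents"))      # True
print(outputs_for(chroma_lanes, "questions"))  # ['documents', 'answers', 'questions']
```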

Processing Stages

A typical RAG (Retrieval-Augmented Generation) pipeline flows through these stages:

1. Source / Ingestion

Data enters the pipeline from one of several sources:

  • Webhook -- HTTP POST from external systems
  • Chat -- user messages via the Chat UI or SDK
  • Dropper -- file uploads via the Dropper UI
  • Filesystem -- local file monitoring
  • Cloud -- cloud storage connectors (S3, GCS, etc.)

2. Document Parsing

Raw files are parsed into text using Apache Tika or specialized parsers:

  • Tika -- PDF, DOCX, XLSX, PPTX, HTML, and 1000+ file formats
  • LlamaParse -- advanced document parsing service
  • Reducto -- document reduction and extraction
  • OCR -- image-to-text extraction

3. Preprocessing / Chunking

Text is split into chunks suitable for embedding:

  • General Text (preprocessor_langchain) -- LangChain-based splitters (recursive, character, markdown, LaTeX, NLTK, spaCy)
  • Code (preprocessor_code) -- language-aware code splitting
  • LLM (preprocessor_llm) -- LLM-based semantic chunking
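To make the chunking step concrete, here is a simplified character-based splitter with overlap. This is a sketch of the general technique, not the LangChain implementation the preprocessor_langchain node wraps:

```python
def split_text(text, chunk_size=200, overlap=20):
    """Split text into overlapping fixed-size chunks (simplified sketch)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Step forward by less than a full chunk so chunks share context.
        start += chunk_size - overlap
    return chunks

chunks = split_text("a" * 120, chunk_size=50, overlap=10)
print([len(c) for c in chunks])  # [50, 50, 40]
```

Real splitters also try to break on natural boundaries (paragraphs, sentences, code blocks) rather than at fixed character offsets.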

4. Embeddings

Chunks are converted to vector representations:

  • OpenAI (embedding_openai) -- text-embedding-3-small, text-embedding-3-large, ada-002
  • Transformer (embedding_transformer) -- local transformer models
  • Image (embedding_image) -- image embeddings
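Whichever backend produces the vectors, downstream retrieval compares them by similarity; cosine similarity is the usual metric. A self-contained sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```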

5. Vector Storage

Embeddings are stored and indexed for similarity search:

  • Chroma, Pinecone, Qdrant, Weaviate, Milvus, Astra DB, MongoDB Atlas, PostgreSQL (pgvector)
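Conceptually, the query step in any of these stores reduces to ranking stored vectors by similarity to the question's embedding. A brute-force sketch of that idea (production stores use approximate nearest-neighbour indexes such as HNSW instead of a linear scan):

```python
import math

def top_k(query, stored, k=2):
    """Return indices of the k stored vectors most similar to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    ranked = sorted(range(len(stored)),
                    key=lambda i: cos(query, stored[i]), reverse=True)
    return ranked[:k]

stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], stored, k=2))  # [0, 2]
```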

6. LLM Generation

Questions are answered using retrieved context and an LLM:

  • OpenAI, Anthropic, Gemini, Mistral, Bedrock, Ollama, DeepSeek, xAI, Perplexity, IBM Watson, Vertex AI
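The generation step typically stitches the retrieved chunks into the prompt alongside the user's question. A minimal sketch; the prompt wording here is illustrative, not the engine's actual template:

```python
def build_prompt(question, contexts):
    """Combine retrieved context chunks with the user question."""
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("What is a lane?",
                      ["Lanes are typed channels between nodes."])
print(prompt)
```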

7. Output

Results are formatted and returned to the client:

  • Text Output -- plain text or formatted responses
  • HTTP Results -- structured response objects returned to the calling client
  • Local Text Output -- writes results to the local filesystem

Pipeline JSON Structure

A pipeline is defined in a JSON file with components (nodes) and edges (connections):

{
    "version": "1.0",
    "components": [
        {
            "id": "embedder",
            "type": "embedding_openai://",
            "parameters": {
                "embedding_openai": {
                    "profile": "text-embedding-3-small",
                    "text-embedding-3-small": {
                        "apikey": "${OPENAI_API_KEY}"
                    }
                }
            }
        },
        {
            "id": "store",
            "type": "chroma://",
            "parameters": {
                "chroma": {
                    "profile": "local",
                    "local": {
                        "host": "localhost",
                        "port": "8000",
                        "collection": "my_collection"
                    }
                }
            }
        },
        {
            "id": "llm",
            "type": "llm_openai://",
            "parameters": {
                "llm_openai": {
                    "profile": "openai-5-2",
                    "openai-5-2": {
                        "apikey": "${OPENAI_API_KEY}"
                    }
                }
            }
        }
    ],
    "edges": [
        { "from": "embedder", "to": "store", "lane": "documents" },
        { "from": "store", "to": "llm", "lane": "questions" },
        { "from": "llm", "to": null, "lane": "answers" }
    ]
}
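Because edges reference components by id, a useful sanity check when writing pipeline JSON by hand is verifying that every edge endpoint resolves (a "to" of null marks a terminal edge, as in the example above). A hedged sketch of such a check; RocketRide may perform its own validation at load time:

```python
import json

def validate_pipeline(pipeline):
    """Check that every edge references a declared component id."""
    ids = {c["id"] for c in pipeline["components"]}
    errors = []
    for edge in pipeline["edges"]:
        if edge["from"] not in ids:
            errors.append(f"unknown 'from' id: {edge['from']}")
        # A null "to" is a terminal edge and needs no target component.
        if edge["to"] is not None and edge["to"] not in ids:
            errors.append(f"unknown 'to' id: {edge['to']}")
    return errors

pipeline = json.loads("""{
    "components": [{"id": "store"}, {"id": "llm"}],
    "edges": [
        {"from": "store", "to": "llm", "lane": "questions"},
        {"from": "llm", "to": null, "lane": "answers"}
    ]
}""")
print(validate_pipeline(pipeline))  # []
```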

See Pipeline API for the full specification and Component Reference for all available nodes.

Environment Variable Substitution

Configuration values in pipeline JSON can reference environment variables using the ${VAR_NAME} syntax. The engine resolves these at pipeline load time. Clients can also pass environment variables via the env parameter when connecting.
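A sketch of what that resolution amounts to; the engine's own implementation may differ, for example in how it handles unset variables:

```python
import os
import re

def resolve_env(text):
    """Replace ${VAR_NAME} references with values from the environment."""
    def repl(match):
        name = match.group(1)
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"environment variable not set: {name}")
        return value
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", repl, text)

os.environ["OPENAI_API_KEY"] = "sk-demo"  # example value for illustration
print(resolve_env('"apikey": "${OPENAI_API_KEY}"'))  # "apikey": "sk-demo"
```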
