
Data Flow

Dmitrii Karataev edited this page Feb 26, 2026 · 2 revisions


This page explains how data moves through RocketRide pipelines, from ingestion to output.

Pipeline Overview

A pipeline is a directed graph of processing nodes. Each node receives data on one or more input lanes, processes it, and emits results on one or more output lanes. The engine orchestrates execution order based on the edges defined in the pipeline JSON.

Source ──► Preprocessor ──► Embeddings ──► Vector Store ──► LLM ──► Output
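Because a pipeline is a directed graph, the execution order the engine derives from the edges amounts to a topological sort. The following is a minimal sketch of that idea, not the engine's actual scheduler, using the stage names from the diagram above:

```python
from collections import defaultdict, deque

def execution_order(edges):
    """Topologically sort pipeline nodes from (from, to) edge pairs."""
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        succ[src].append(dst)
        indegree[dst] += 1
    # Start from nodes with no incoming edges (the sources).
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in succ[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order

edges = [("source", "preprocessor"), ("preprocessor", "embeddings"),
         ("embeddings", "store"), ("store", "llm"), ("llm", "output")]
print(execution_order(edges))
# ['source', 'preprocessor', 'embeddings', 'store', 'llm', 'output']
```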

Lanes

Lanes are typed channels that carry data between nodes. Each lane carries a specific kind of data:

  • text -- raw text content
  • documents -- parsed/chunked document objects with metadata
  • questions -- user questions (for search or chat)
  • answers -- LLM-generated answers
  • tags -- classification tags and labels
  • image -- image data
  • audio -- audio data
  • video -- video data
  • table -- structured tabular data
  • classifications -- classification results
  • _source -- internal source metadata

A node declares which lanes it accepts as input and which lanes it produces as output in its services.json file. For example, the OpenAI LLM node accepts questions and produces answers:

"lanes": {
    "questions": ["answers"]
}

A vector store like Chroma accepts documents (for storage, producing no output, hence the empty list) and questions (for search), from which it can emit documents, answers, or questions:

"lanes": {
    "documents": [],
    "questions": ["documents", "answers", "questions"]
}
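In other words, the lane declaration maps each accepted input lane to the lanes the node may emit on. A small sketch of that lookup, illustrative only (the engine's actual routing logic is internal):

```python
# Lane declaration shaped like the Chroma example above (from services.json).
chroma_lanes = {
    "documents": [],  # accepted for storage; nothing is emitted
    "questions": ["documents", "answers", "questions"],
}

def accepts(lanes, lane):
    """Does the node take input on this lane?"""
    return lane in lanes

def outputs_for(lanes, lane):
    """Which lanes may the node emit in response to this input lane?"""
    return lanes.get(lane, [])

print(accepts(chroma_lanes, "documents"))      # True
print(outputs_for(chroma_lanes, "questions"))  # ['documents', 'answers', 'questions']
```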

Processing Stages

A typical RAG (Retrieval-Augmented Generation) pipeline flows through these stages:

1. Source / Ingestion

Data enters the pipeline from one of several sources:

  • Webhook -- HTTP POST from external systems
  • Chat -- user messages via the Chat UI or SDK
  • Dropper -- file uploads via the Dropper UI
  • Filesystem -- local file monitoring
  • Cloud -- cloud storage connectors (S3, GCS, etc.)

2. Document Parsing

Raw files are parsed into text using Apache Tika or specialized parsers:

  • Tika -- PDF, DOCX, XLSX, PPTX, HTML, and 1000+ file formats
  • LlamaParse -- advanced document parsing service
  • Reducto -- document reduction and extraction
  • OCR -- image-to-text extraction

3. Preprocessing / Chunking

Text is split into chunks suitable for embedding:

  • General Text (preprocessor_langchain) -- LangChain-based splitters (recursive, character, markdown, LaTeX, NLTK, spaCy)
  • Code (preprocessor_code) -- language-aware code splitting
  • LLM (preprocessor_llm) -- LLM-based semantic chunking
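To make the chunking step concrete, here is a simplified character-based splitter with overlap. This is a sketch of the general technique, not the LangChain implementation the preprocessor_langchain node wraps:

```python
def split_text(text, chunk_size=200, overlap=20):
    """Split text into overlapping fixed-size chunks (simplified sketch)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Step forward by less than a full chunk so chunks share context.
        start += chunk_size - overlap
    return chunks

chunks = split_text("a" * 120, chunk_size=50, overlap=10)
print([len(c) for c in chunks])  # [50, 50, 40]
```

Real splitters also try to break on natural boundaries (paragraphs, sentences, code blocks) rather than at fixed character offsets.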

4. Embeddings

Chunks are converted to vector representations:

  • OpenAI (embedding_openai) -- text-embedding-3-small, text-embedding-3-large, ada-002
  • Transformer (embedding_transformer) -- local transformer models
  • Image (embedding_image) -- image embeddings
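Whichever backend produces the vectors, downstream retrieval compares them by similarity; cosine similarity is the usual metric. A self-contained sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```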

5. Vector Storage

Embeddings are stored and indexed for similarity search:

  • Chroma, Pinecone, Qdrant, Weaviate, Milvus, Astra DB, MongoDB Atlas, PostgreSQL (pgvector)
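Conceptually, the query step in any of these stores reduces to ranking stored vectors by similarity to the question's embedding. A brute-force sketch of that idea (production stores use approximate nearest-neighbour indexes such as HNSW instead of a linear scan):

```python
import math

def top_k(query, stored, k=2):
    """Return indices of the k stored vectors most similar to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    ranked = sorted(range(len(stored)),
                    key=lambda i: cos(query, stored[i]), reverse=True)
    return ranked[:k]

stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], stored, k=2))  # [0, 2]
```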

6. LLM Generation

Questions are answered using retrieved context and an LLM:

  • OpenAI, Anthropic, Gemini, Mistral, Bedrock, Ollama, DeepSeek, xAI, Perplexity, IBM Watson, Vertex AI
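The generation step typically stitches the retrieved chunks into the prompt alongside the user's question. A minimal sketch; the prompt wording here is illustrative, not the engine's actual template:

```python
def build_prompt(question, contexts):
    """Combine retrieved context chunks with the user question."""
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("What is a lane?",
                      ["Lanes are typed channels between nodes."])
print(prompt)
```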

7. Output

Results are formatted and returned to the client:

  • Text Output -- plain text or formatted responses
  • HTTP Results -- structured response objects returned to the calling client
  • Local Text Output -- writes results to the local filesystem

Pipeline JSON Structure

A pipeline is defined in a JSON file with components (nodes) and edges (connections):

{
    "version": "1.0",
    "components": [
        {
            "id": "embedder",
            "type": "embedding_openai://",
            "parameters": {
                "embedding_openai": {
                    "profile": "text-embedding-3-small",
                    "text-embedding-3-small": {
                        "apikey": "${OPENAI_API_KEY}"
                    }
                }
            }
        },
        {
            "id": "store",
            "type": "chroma://",
            "parameters": {
                "chroma": {
                    "profile": "local",
                    "local": {
                        "host": "localhost",
                        "port": "8000",
                        "collection": "my_collection"
                    }
                }
            }
        },
        {
            "id": "llm",
            "type": "llm_openai://",
            "parameters": {
                "llm_openai": {
                    "profile": "openai-5-2",
                    "openai-5-2": {
                        "apikey": "${OPENAI_API_KEY}"
                    }
                }
            }
        }
    ],
    "edges": [
        { "from": "embedder", "to": "store", "lane": "documents" },
        { "from": "store", "to": "llm", "lane": "questions" },
        { "from": "llm", "to": null, "lane": "answers" }
    ]
}
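Because edges reference components by id, a useful sanity check when writing pipeline JSON by hand is verifying that every edge endpoint resolves (a "to" of null marks a terminal edge, as in the example above). A hedged sketch of such a check; RocketRide may perform its own validation at load time:

```python
import json

def validate_pipeline(pipeline):
    """Check that every edge references a declared component id."""
    ids = {c["id"] for c in pipeline["components"]}
    errors = []
    for edge in pipeline["edges"]:
        if edge["from"] not in ids:
            errors.append(f"unknown 'from' id: {edge['from']}")
        # A null "to" is a terminal edge and needs no target component.
        if edge["to"] is not None and edge["to"] not in ids:
            errors.append(f"unknown 'to' id: {edge['to']}")
    return errors

pipeline = json.loads("""{
    "components": [{"id": "store"}, {"id": "llm"}],
    "edges": [
        {"from": "store", "to": "llm", "lane": "questions"},
        {"from": "llm", "to": null, "lane": "answers"}
    ]
}""")
print(validate_pipeline(pipeline))  # []
```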

See Pipeline API for the full specification and Component Reference for all available nodes.

Environment Variable Substitution

Configuration values in pipeline JSON can reference environment variables using the ${VAR_NAME} syntax. The engine resolves these at pipeline load time. Clients can also pass environment variables via the env parameter when connecting.
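A sketch of what that resolution amounts to; the engine's own implementation may differ, for example in how it handles unset variables:

```python
import os
import re

def resolve_env(text):
    """Replace ${VAR_NAME} references with values from the environment."""
    def repl(match):
        name = match.group(1)
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"environment variable not set: {name}")
        return value
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", repl, text)

os.environ["OPENAI_API_KEY"] = "sk-demo"  # example value for illustration
print(resolve_env('"apikey": "${OPENAI_API_KEY}"'))  # "apikey": "sk-demo"
```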
