# Data Flow
This page explains how data moves through RocketRide pipelines, from ingestion to output.
A pipeline is a directed graph of processing nodes. Each node receives data on one or more input lanes, processes it, and emits results on one or more output lanes. The engine orchestrates execution order based on the edges defined in the pipeline JSON.
```
Source ──► Preprocessor ──► Embeddings ──► Vector Store ──► LLM ──► Output
```
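Because a pipeline is a directed graph, the engine can derive an execution order from the edge list alone. A minimal sketch using a topological sort (node names follow the diagram above; the engine's actual scheduler is not specified here):

```python
from collections import defaultdict, deque

def topo_order(edges):
    """Kahn's algorithm: return nodes in an order that respects every edge."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
        nodes.update((src, dst))
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in succ[node]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    return order

edges = [
    ("source", "preprocessor"),
    ("preprocessor", "embeddings"),
    ("embeddings", "vector_store"),
    ("vector_store", "llm"),
    ("llm", "output"),
]
```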
Lanes are typed channels that carry data between nodes. Each lane carries a specific kind of data:
| Lane | Description |
|---|---|
| `text` | Raw text content |
| `documents` | Parsed/chunked document objects with metadata |
| `questions` | User questions (for search or chat) |
| `answers` | LLM-generated answers |
| `tags` | Classification tags and labels |
| `image` | Image data |
| `audio` | Audio data |
| `video` | Video data |
| `table` | Structured tabular data |
| `classifications` | Classification results |
| `_source` | Internal source metadata |
A node declares which lanes it accepts as input and which lanes it produces as output in its `services.json` file. For example, the OpenAI LLM node accepts `questions` and produces `answers`:
```json
"lanes": {
  "questions": ["answers"]
}
```

A vector store like Chroma accepts both `documents` (for storage) and `questions` (for search), and can produce `documents`, `answers`, or `questions`:
```json
"lanes": {
  "documents": [],
  "questions": ["documents", "answers", "questions"]
}
```

A typical RAG (Retrieval-Augmented Generation) pipeline flows through these stages:
Data enters the pipeline from one of several sources:
- Webhook -- HTTP POST from external systems
- Chat -- user messages via the Chat UI or SDK
- Dropper -- file uploads via the Dropper UI
- Filesystem -- local file monitoring
- Cloud -- cloud storage connectors (S3, GCS, etc.)
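The webhook source, for instance, accepts an HTTP POST body from an external system. A sketch of what a client payload might look like (the field names and envelope here are illustrative assumptions, not the documented schema):

```python
import json

# Hypothetical webhook payload; the real schema may differ.
payload = {
    "lane": "text",                    # which lane the data enters on
    "content": "Hello, RocketRide!",   # raw text to ingest
}
body = json.dumps(payload)
# POST `body` to the pipeline's webhook URL, e.g. via urllib.request or:
#   curl -X POST -H "Content-Type: application/json" -d "$BODY" <webhook-url>
```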
Raw files are parsed into text using Apache Tika or specialized parsers:
- Tika -- PDF, DOCX, XLSX, PPTX, HTML, and 1000+ file formats
- LlamaParse -- advanced document parsing service
- Reducto -- document reduction and extraction
- OCR -- image-to-text extraction
Text is split into chunks suitable for embedding:
- General Text (`preprocessor_langchain`) -- LangChain-based splitters (recursive, character, markdown, LaTeX, NLTK, spaCy)
- Code (`preprocessor_code`) -- language-aware code splitting
- LLM (`preprocessor_llm`) -- LLM-based semantic chunking
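The simplest chunking strategy is a fixed-size window with overlap, so that text straddling a boundary appears in both chunks. A minimal sketch of the idea (not the `preprocessor_langchain` implementation):

```python
def chunk_text(text, max_len=200, overlap=20):
    """Split text into windows of max_len characters, each overlapping
    the previous one by `overlap` characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break
        start += max_len - overlap
    return chunks
```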
Chunks are converted to vector representations:
- OpenAI (`embedding_openai`) -- text-embedding-3-small, text-embedding-3-large, ada-002
- Transformer (`embedding_transformer`) -- local transformer models
- Image (`embedding_image`) -- image embeddings
Embeddings are stored and indexed for similarity search:
- Chroma, Pinecone, Qdrant, Weaviate, Milvus, Astra DB, MongoDB Atlas, PostgreSQL (pgvector)
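Conceptually, similarity search over the stored vectors reduces to ranking by cosine similarity. A brute-force sketch (production stores use approximate indexes such as HNSW; the tiny vectors below are only illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, index, k=2):
    """index: list of (doc_id, vector); return the k most similar doc_ids."""
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

index = [
    ("doc_a", [1.0, 0.0, 0.0]),
    ("doc_b", [0.9, 0.1, 0.0]),
    ("doc_c", [0.0, 0.0, 1.0]),
]
```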
Questions are answered using retrieved context and an LLM:
- OpenAI, Anthropic, Gemini, Mistral, Bedrock, Ollama, DeepSeek, xAI, Perplexity, IBM Watson, Vertex AI
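At this stage the retrieved chunks are stitched into the LLM prompt alongside the user's question. A minimal sketch of that assembly (the exact prompt template each LLM node uses is not documented here):

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved context and the user question into one prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```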
Results are formatted and returned to the client:
- Text Output -- plain text or formatted responses
- HTTP Results -- structured response objects returned to the calling client
- Local Text Output -- writes results to the local filesystem
A pipeline is defined in a JSON file with components (nodes) and edges (connections):
```json
{
  "version": "1.0",
  "components": [
    {
      "id": "embedder",
      "type": "embedding_openai://",
      "parameters": {
        "embedding_openai": {
          "profile": "text-embedding-3-small",
          "text-embedding-3-small": {
            "apikey": "${OPENAI_API_KEY}"
          }
        }
      }
    },
    {
      "id": "store",
      "type": "chroma://",
      "parameters": {
        "chroma": {
          "profile": "local",
          "local": {
            "host": "localhost",
            "port": "8000",
            "collection": "my_collection"
          }
        }
      }
    },
    {
      "id": "llm",
      "type": "llm_openai://",
      "parameters": {
        "llm_openai": {
          "profile": "openai-5-2",
          "openai-5-2": {
            "apikey": "${OPENAI_API_KEY}"
          }
        }
      }
    }
  ],
  "edges": [
    { "from": "embedder", "to": "store", "lane": "documents" },
    { "from": "store", "to": "llm", "lane": "questions" },
    { "from": "llm", "to": null, "lane": "answers" }
  ]
}
```

See Pipeline API for the full specification and Component Reference for all available nodes.
Configuration values in pipeline JSON can reference environment variables using the `${VAR_NAME}` syntax. The engine resolves these at pipeline load time. Clients can also pass environment variables via the `env` parameter when connecting.
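The substitution step can be sketched as a regex pass over string values, with client-supplied variables taking precedence over the process environment (a simplification of whatever the engine actually does; the error behavior for undefined variables is an assumption):

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def resolve_env(value, extra_env=None):
    """Replace ${VAR_NAME} references in `value` with environment values,
    letting client-supplied variables (extra_env) override os.environ."""
    env = {**os.environ, **(extra_env or {})}
    def repl(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"undefined environment variable: {name}")
        return env[name]
    return _VAR.sub(repl, value)
```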
- Pipeline API -- full pipeline JSON specification
- Component Reference -- all pipeline nodes by category
- System Overview -- architecture layers