Composable blocks and flows for synthetic data generation
SDG Hub is a Python framework for building synthetic data generation pipelines. Chain LLM, parsing, transform, filtering, and agent blocks into YAML-defined flows -- then generate training data at scale.
pip install sdg-hub

from sdg_hub import FlowRegistry, Flow
# Discover and load a built-in flow
FlowRegistry.discover_flows()
flow = Flow.from_yaml(FlowRegistry.get_flow_path("MCP Server Distillation"))
# Configure and run
flow.set_model_config(model="openai/gpt-4o")
result = flow.generate(dataset)

See the Quick Start for a full walkthrough, or browse all built-in flows.
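
To make the idea of chaining blocks into a flow concrete, here is a minimal, library-independent sketch in plain Python. It does not use SDG Hub's API: `Row`, `Block`, `template_block`, `parse_block`, `filter_block`, and `run_flow` are hypothetical names chosen for illustration, and the "LLM" step is a fixed template rather than a model call. Real flows are defined in YAML and run through `Flow`, as shown above.

```python
from typing import Callable, Dict, List

# Hypothetical types for illustration only; these are not SDG Hub classes.
Row = Dict[str, str]
Block = Callable[[List[Row]], List[Row]]


def template_block(rows: List[Row]) -> List[Row]:
    """Stand-in for an LLM block: fills a fixed template instead of calling a model."""
    return [{**r, "response": f"Q: {r['question']} A: placeholder"} for r in rows]


def parse_block(rows: List[Row]) -> List[Row]:
    """Parsing step: extract the text after ' A: ' into an 'answer' column."""
    return [{**r, "answer": r["response"].split(" A: ", 1)[1]} for r in rows]


def filter_block(rows: List[Row]) -> List[Row]:
    """Filtering step: drop rows whose question is five characters or shorter."""
    return [r for r in rows if len(r["question"]) > 5]


def run_flow(blocks: List[Block], rows: List[Row]) -> List[Row]:
    """Apply each block in order, feeding each block's output to the next."""
    for block in blocks:
        rows = block(rows)
    return rows


dataset = [{"question": "What is a flow?"}, {"question": "Hi?"}]
result = run_flow([template_block, parse_block, filter_block], dataset)
```

Each block takes a list of rows and returns a new list, so blocks can be reordered or swapped without touching the others. That is the property a declarative flow definition relies on.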
Full documentation at ai-innovation.team/sdg_hub
- Installation -- setup, optional dependencies, development install
- Quick Start -- end-to-end walkthrough from loading a flow to generating data
- Core Concepts -- blocks, flows, registries, and dataset handling
- Block Reference -- LLM, parsing, transform, filtering, agent, and custom blocks
- Flow Reference -- YAML schema, built-in flows, custom flows
- API Reference -- auto-generated from source
- Contributing -- development setup and contribution guidelines
Apache License 2.0 -- see LICENSE.
Built by the Red Hat AI Innovation Team
