DataSQRL is an open-source data engineering harness that provides guardrails and feedback for AI coding agents to build reliable data pipelines, data APIs, and data products.
DataSQRL ensures coding agents meet the non-functional requirements of production data systems for data quality, scalability, governance, and reliability. It provides deep inspection of SQL, relational validators, and deterministic event-replay simulation so that agent-generated code meets these requirements through iterative feedback loops.
DataSQRL provides three capabilities that coding agents need to produce production-grade data systems:
- Conceptual Framework: A SQL-based logical layer grounded in relational algebra and stream processing, with a physical layer that maps to execution engines. This gives agents a precise vocabulary for reasoning about data transformations.
- Comprehensive Validation: Verification at every level across syntax, schema, data flow semantics, physical plans, and deployment assets, with actionable error messages that guide agents toward correct solutions.
- Real-World Feedback: A simulator for local testing with timestamp-accurate replay, plus production telemetry hooks that correlate runtime behavior back to source code for autonomous troubleshooting.
DataSQRL compiles SQL scripts into deployment artifacts for PostgreSQL, Apache Kafka, Apache Flink, and Apache Iceberg—running on your existing infrastructure with Docker, Kubernetes, or cloud-managed services.
Create a new data project with the init command:

    docker run --rm -v $PWD:/build datasqrl/cmd init api messenger

(Use ${PWD} in PowerShell on Windows.)
This creates a data API project with sample data sources and a processing script called messenger.sqrl.
Run the project:

    docker run -it --rm -p 8888:8888 -p 8081:8081 -v $PWD:/build datasqrl/cmd run messenger-prod-package.json

Access the API at http://localhost:8888/v1/graphiql/. Add messages:

    mutation {
      Messages(event: {message: "Hello World"}) {
        message_time
      }
    }

Query messages:

    {
      Messages {
        message
        message_time
      }
    }

The API is also available via REST or MCP. Terminate the server with CTRL-C.
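Because GraphiQL is a front end for a GraphQL API, the same mutation and query can also be sent from a script. The following is a minimal Python sketch using only the standard library. The endpoint URL `http://localhost:8888/v1/graphql` is an assumption inferred from the GraphiQL address above and is not confirmed by this README, and the helper names `build_request` and `execute` are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint, inferred from the GraphiQL URL; verify against your server.
GRAPHQL_URL = "http://localhost:8888/v1/graphql"

# The two GraphQL documents shown above, unchanged.
ADD_MESSAGE = 'mutation { Messages(event: {message: "Hello World"}) { message_time } }'
LIST_MESSAGES = "{ Messages { message message_time } }"


def build_request(query, url=GRAPHQL_URL):
    """Wrap a GraphQL document in the conventional JSON-over-POST envelope."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def execute(query, url=GRAPHQL_URL):
    """Send the document and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(query, url), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server from the run step still up, calling `execute(ADD_MESSAGE)` and then `execute(LIST_MESSAGES)` should, by GraphQL convention, return the results under the `data` key of the response.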
Instruct your favorite coding agent to update messenger.sqrl with test coverage and iterate until tests pass with:

    docker run -it --rm -v $PWD:/build datasqrl/cmd test messenger-test-package.json

For example, to expose an endpoint for total messages:

    TotalMessages := SELECT COUNT(*) as num_messages, MAX(message_time) as latest_timestamp
                     FROM Messages LIMIT 1;

Finally, compile deployment artifacts to deploy to Kubernetes or cloud services:

    docker run --rm -v $PWD:/build datasqrl/cmd compile messenger-prod-package.json

The build/deploy directory contains Flink compiled plans, Kafka topic definitions, PostgreSQL schemas, server queries, MCP tool definitions, and GraphQL models, ready for Kubernetes or cloud deployment.
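To get a quick overview of what the compile step produced, a small script can list the generated files. The sketch below is an illustration only: this README does not specify the internal layout of `build/deploy`, so the hypothetical helper `summarize_artifacts` simply groups files by their top-level subdirectory rather than assuming any particular folder names.

```python
from collections import defaultdict
from pathlib import Path


def summarize_artifacts(deploy_dir="build/deploy"):
    """Group every file under deploy_dir by its top-level subdirectory."""
    root = Path(deploy_dir)
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Files that sit directly in deploy_dir are grouped under ".".
        key = rel.parts[0] if len(rel.parts) > 1 else "."
        groups[key].append(rel.as_posix())
    return dict(groups)
```

Printing the returned dictionary after a compile shows at a glance which artifact groups exist before they are handed to a deployment tool.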
Read the Getting Started tutorial, or explore the AI-generated data products for a fictional bank, built from this catalog definition, to see a use case inspired by a real-world organization.
Coding agents can generate SQL queries that produce correct results on test data. But will those queries perform at scale? Handle late-arriving events correctly? Maintain data quality when upstream schemas change? Provide data lineage and governance, and meet compliance requirements?
These non-functional requirements — data quality, scalability, governance, reliability, cost efficiency — are what distinguish data engineering from general software development. General-purpose coding agents aren't equipped to handle them consistently.
DataSQRL provides the guardrails, feedback loops, and domain-specific constraints that coding agents need. Without a harness, you get pipelines that work in demos but fail in production. With a harness, you get pipelines that embody data engineering best practices and domain-specific knowledge.
To see DataSQRL guiding an AI coding agent, watch this demo.
DataSQRL is a harness and framework that deterministically automates data plumbing, reducing the complexity that coding agents must handle while providing feedback through deep introspection.
- Write SQL: Define data transformations in SQRL (SQL with stream processing and API extensions)
- Compile: DataSQRL builds a computational DAG, validates semantics, and optimizes execution
- Analyze: The compiler detects data inconsistencies, performance issues, and capability mismatches
- Generate: Cost-based optimization assigns operators to engines (Flink, Kafka, Postgres, Vert.x) and generates deployment artifacts
- Iterate: Compilation output helps the agent refine its solution, while simulation provides real-world feedback
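The Iterate step above amounts to a loop that alternates between running the checks and handing their output back to the agent. The following Python sketch shows one way to drive that loop. It rests on several assumptions: that the `test` command exits with a non-zero status when tests fail (not stated in this README), that `-it` can be dropped when no terminal is attached, and that `revise` is a hypothetical callback standing in for whatever interface your coding agent exposes.

```python
import subprocess

# Test command from this README, without -it because scripts have no TTY.
# $PWD is expanded by the shell that subprocess starts.
TEST_COMMAND = (
    "docker run --rm -v $PWD:/build datasqrl/cmd test messenger-test-package.json"
)


def run_tests(command=TEST_COMMAND):
    """Run the test command and return (passed, combined output).

    Assumes a non-zero exit status signals failing tests.
    """
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def iterate(check, revise, max_attempts=5):
    """Alternate between checking and revising until the check passes.

    check:  callable returning (passed, feedback)
    revise: callable receiving the feedback text (hypothetical agent hook)
    """
    for _ in range(max_attempts):
        passed, feedback = check()
        if passed:
            return True
        # Hand the compiler or test output back to the agent.
        revise(feedback)
    return False
```

Calling `iterate(run_tests, my_agent_hook)` would rerun the tests after each revision and stop at the first passing run or after `max_attempts` failures, where `my_agent_hook` is a placeholder for your own integration.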
The entire pipeline is defined in SQL, which makes it easy to understand, verify, and maintain. DataSQRL handles the complex mapping to physical infrastructure so agents can focus on business logic. DataSQRL is compatible with any coding agent and can be extended to incorporate organizational knowledge and meet custom compliance requirements.
DataSQRL includes a function library and connectors for Kafka, Iceberg, Postgres, and more. The framework is extensible: you can add custom functions, connectors, or execution engines.
Read the in-depth explanation or view the full documentation.
Our goal is to build a data engineering harness that enables safe, reliable automation of data platforms. We believe anyone who can read SQL should be empowered to build complex data systems that are robust and production-ready.
Your feedback is invaluable. Let us know what works and what doesn't by filing GitHub issues or starting discussions.
We welcome code contributions. See CONTRIBUTING.md for details.


