feat: wire data_diff tool through reladiff engine #104
suryaiyer95 wants to merge 3 commits into main
- New `data-diff` primary agent mode for cross-database data validation with progressive checks: row counts → column profiles → segment checksums → row-level diffs
- New `/data-validate` skill with dialect-specific SQL templates for Snowflake, Postgres, BigQuery, DuckDB, Databricks, ClickHouse, MySQL
- Prompt covers 4 validation levels, cross-database checksum awareness, and structured PASS/FAIL reporting
- Added `/data-validate` to migrator and validator skill lists so both modes can invoke it

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
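The progressive checks named in this commit (row counts → column profiles → segment checksums → row-level diffs) could be driven by a loop like the one below. This is only a sketch of the idea, not the agent's implementation: `CheckResult`, `run_cascade`, `format_report` and the stub checks are hypothetical names made up for illustration, and stopping at the first failing level is an assumption.

```python
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    level: str
    passed: bool
    detail: str


# Each check is a zero-argument callable returning (passed, detail).
Check = Callable[[], tuple[bool, str]]


def run_cascade(checks: list[tuple[str, Check]]) -> list[CheckResult]:
    """Run checks from cheapest to most expensive, stopping at the first FAIL."""
    results: list[CheckResult] = []
    for level, check in checks:
        passed, detail = check()
        results.append(CheckResult(level, passed, detail))
        if not passed:
            break  # later, costlier levels are skipped
    return results


def format_report(results: list[CheckResult]) -> str:
    """Render one PASS/FAIL line per level that actually ran."""
    return "\n".join(
        f"{'PASS' if r.passed else 'FAIL'} {r.level}: {r.detail}" for r in results
    )


# Illustrative stubs; real checks would issue SQL against both databases.
demo_checks = [
    ("row counts", lambda: (True, "1000 == 1000")),
    ("column profiles", lambda: (False, "max(amount) differs")),
    ("segment checksums", lambda: (True, "not reached in this demo")),
    ("row-level diffs", lambda: (True, "not reached in this demo")),
]
demo_results = run_cascade(demo_checks)
```

Calling `format_report(demo_results)` yields one line per level that ran, so a failure at the profile level leaves the two most expensive levels unexecuted.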
… data validation
Adds the full pipeline: TypeScript tool → Bridge → Python orchestrator → Rust engine.
- `data-diff-run.ts`: TypeScript tool wrapping `Bridge.call("data_diff.run")`
- `data_diff.py`: Python orchestrator driving the cooperative state machine loop
via `altimate_core.ReladiffSession` (start → execute SQL → step → repeat)
- `server.py`: Added `data_diff.run` dispatch to JSON-RPC bridge
- `protocol.ts`: `DataDiffRunParams`/`DataDiffRunResult` interfaces + bridge method
- `registry.ts`: Registered `DataDiffRunTool` in tool registry
- `agent.ts`: Added `data_diff: "allow"` to data-diff agent permissions
- `data-diff.txt`: Rewrote prompt to use `data_diff` tool as primary approach,
with manual SQL as fallback
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
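The cooperative loop described in this commit (start → execute SQL → step → repeat) can be sketched as follows. This is a minimal illustration, not the code in `data_diff.py`: `FakeSession`, the shape of what its `start()` and `step()` return, and `drive` are hypothetical stand-ins, because the real `altimate_core.ReladiffSession` interface is not shown in this PR.

```python
from typing import Callable, Optional

Rows = list[list[Optional[str]]]


class FakeSession:
    """Hypothetical stand-in for a cooperative diff session.

    It never runs SQL itself: it hands out tasks and consumes the rows
    that the caller brings back.
    """

    def __init__(self, table1: str, table2: str) -> None:
        self._tables = (table1, table2)
        self._counts: dict[str, int] = {}

    def start(self) -> dict:
        tasks = [{"id": t, "sql": f"SELECT COUNT(*) FROM {t}"} for t in self._tables]
        return {"done": False, "tasks": tasks}

    def step(self, responses: list[dict]) -> dict:
        for r in responses:
            self._counts[r["id"]] = int(r["rows"][0][0])
        left, right = (self._counts[t] for t in self._tables)
        return {"done": True, "tasks": [], "result": {"match": left == right}}


def drive(session: FakeSession, run_sql: Callable[[str], Rows]) -> dict:
    """Run the start -> execute SQL -> step loop until the session is done."""
    state = session.start()
    while not state["done"]:
        responses = [{"id": t["id"], "rows": run_sql(t["sql"])} for t in state["tasks"]]
        state = session.step(responses)
    return state["result"]


# A dict plays the role of the warehouse in this demo.
fake_warehouse: dict[str, Rows] = {
    "SELECT COUNT(*) FROM orders_src": [["1000"]],
    "SELECT COUNT(*) FROM orders_dst": [["998"]],
}
outcome = drive(FakeSession("orders_src", "orders_dst"), fake_warehouse.__getitem__)
```

The point of the design is that the session only plans queries and interprets results, while the orchestrator owns all database access.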
Reflects altimate-core change: `column_lineage` and `track_lineage` now work without credentials. SDK logging activates when initialized.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This PR doesn't fully meet our contributing guidelines and PR template. What needs to be fixed:
Please edit this PR description to address the above within 2 hours, or it will be automatically closed. If you believe this was flagged incorrectly, please let a maintainer know.
    """Execute a single SQL task against the given warehouse."""
    result = execute_sql(
        SqlExecuteParams(sql=task["sql"], warehouse=warehouse, limit=100_000)
    )

    # Convert SqlExecuteResult rows to the format expected by ReladiffSession.step()
    rows: list[list[str | None]] = []
    for row in result.rows:
        rows.append([str(v) if v is not None else None for v in row])

    return {"id": task["id"], "rows": rows}
Bug: SQL errors are returned as data rows and processed as valid results. The Rust engine parses the error string, defaults to 0, causing false-positive data diff outcomes.
Severity: CRITICAL
Suggested Fix
Modify `_execute_task` to inspect the result from `execute_sql`. If the result's columns indicate an error (e.g., `result.columns == ["error"]`), an exception should be raised. This will allow the `try`/`except` block in the calling `run_data_diff` function to catch the failure and report it correctly, preventing incorrect data from being sent to the Rust engine.
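One way to express the suggested fix, as a minimal sketch. `SqlResult`, `SqlTaskError` and the injected `run_sql` callable are hypothetical stand-ins introduced only for this example; the sketch also assumes, as the review does, that a failed query surfaces as a result whose `columns` equal `["error"]`.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SqlResult:
    """Hypothetical stand-in for the executor's result object."""
    columns: list
    rows: list = field(default_factory=list)


class SqlTaskError(RuntimeError):
    """Raised when a task's SQL failed instead of returning data."""


def execute_task(task: dict, run_sql: Callable[[str], SqlResult]) -> dict:
    """Run one SQL task and refuse to forward error rows as data."""
    result = run_sql(task["sql"])

    # Assumed error format: a single "error" column carrying the message.
    if result.columns == ["error"]:
        has_message = bool(result.rows) and bool(result.rows[0])
        message = result.rows[0][0] if has_message else "unknown error"
        raise SqlTaskError(f"task {task['id']} failed: {message}")

    # Stringify values, keeping NULLs as None.
    rows = [[str(v) if v is not None else None for v in row] for row in result.rows]
    return {"id": task["id"], "rows": rows}
```

With this shape, the caller's existing `try`/`except` sees an exception for a failed query rather than a row that merely looks like data.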
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: packages/altimate-engine/src/altimate_engine/sql/data_diff.py#L52-L63
Potential issue: The `execute_sql` function returns SQL errors, such as connection
failures, as data rows instead of raising exceptions. The `_execute_task` function in
`data_diff.py` does not check for this error format and processes the error message as
if it were valid query data. This error data is then passed to the Rust engine, which
attempts to parse the error string as an integer. The parse fails and the engine
silently defaults the value to `0`. This causes the data diff to incorrectly report that
tables match when a query has actually failed, leading to silent, false-positive
validation results.
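The false positive described above can be reproduced in a few lines. The sketch below mimics in Python the behavior the report attributes to the Rust engine; it is not the engine's code, and `lenient_count` and `strict_count` are hypothetical helpers written for this illustration.

```python
def lenient_count(cell):
    """Mimic the reported behavior: unparsable values silently become 0."""
    try:
        return int(cell)
    except (TypeError, ValueError):
        return 0


def strict_count(cell):
    """Fail loudly, so a broken query cannot look like an empty table."""
    try:
        return int(cell)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"expected a row count, got {cell!r}") from exc


# Both queries failed, and the error text came back as a data row.
source_rows = [["connection refused"]]
target_rows = [["connection refused"]]

# Lenient parsing compares 0 with 0 and reports a match.
lenient_match = lenient_count(source_rows[0][0]) == lenient_count(target_rows[0][0])
```

Here `lenient_match` is `True` even though neither query returned a count, whereas `strict_count` raises on the same input.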
This pull request has been automatically closed because it was not updated to meet our contributing guidelines within the 2-hour window. Feel free to open a new pull request that follows our guidelines.
Summary
Adds a `data_diff` tool and `data-diff` agent mode that wraps the Rust reladiff engine (from internal PR) for deterministic table-to-table data validation. Tested end-to-end on Snowflake with up to 1M rows.

What changed:
- `data-diff-run.ts`: TypeScript tool that calls the Python bridge via `Bridge.call("data_diff.run", params)`
- `data_diff.py`: Python orchestrator that drives the cooperative state machine loop (`session.start()` → execute SQL → `session.step()` → repeat)
- `server.py`: Registers `data_diff.run` in the JSON-RPC dispatcher
- `protocol.ts`: Adds `DataDiffRunParams`/`DataDiffRunResult` to the bridge protocol
- `agent.ts`: Registers `data-diff` agent mode with all SQL/warehouse tool permissions
- `data-diff.txt`: System prompt for the data-diff agent (uses `data_diff` tool as primary, manual SQL as fallback)
- `SKILL.md`: `/data-validate` skill for guided validation workflows
- `guard.py`: Updated docstrings (no longer requires API keys)

Pipeline: TypeScript tool → Bridge → Python orchestrator → Rust engine
Depends on: AltimateAI/internal PR (reladiff Rust module)
Test Results (Snowflake, verified end-to-end)
Example Prompts
Use `--agent data-diff` to enter data-diff mode. Example prompts to try:

Quick validation (JoinDiff, same database)
Cross-database validation (HashDiff)
Profile comparison (column-level stats only)
Cascade (progressive — stops early if counts differ)
With WHERE filter
Schema discovery first
Test Plan
- `data_diff` tool appears in data-diff agent mode tool list
- `_execute_task` guards against synthetic status rows from executor
- `altimate_core`, invalid table names, SQL failures

🤖 Generated with Claude Code