Skip to content

feat: wire data_diff tool through reladiff engine#104

Closed
suryaiyer95 wants to merge 3 commits intomainfrom
feat/data-validation-mode
Closed

feat: wire data_diff tool through reladiff engine#104
suryaiyer95 wants to merge 3 commits intomainfrom
feat/data-validation-mode

Conversation

@suryaiyer95
Copy link
Contributor

@suryaiyer95 suryaiyer95 commented Mar 11, 2026

Summary

Adds a data_diff tool and data-diff agent mode that wraps the Rust reladiff engine (from internal PR) for deterministic table-to-table data validation. Tested end-to-end on Snowflake with up to 1M rows.

What changed:

  • data-diff-run.ts — TypeScript tool that calls the Python bridge via Bridge.call("data_diff.run", params)
  • data_diff.py — Python orchestrator that drives the cooperative state machine loop (session.start() → execute SQL → session.step() → repeat)
  • server.py — Registers data_diff.run in the JSON-RPC dispatcher
  • protocol.ts — Adds DataDiffRunParams/DataDiffRunResult to the bridge protocol
  • agent.ts — Registers data-diff agent mode with all SQL/warehouse tool permissions
  • data-diff.txt — System prompt for the data-diff agent (uses data_diff tool as primary, manual SQL as fallback)
  • SKILL.md/data-validate skill for guided validation workflows
  • guard.py — Updated docstrings (no longer requires API keys)

Pipeline:

LLM (data-diff mode) → data_diff tool (TS) → Bridge.call("data_diff.run")
→ JSON-RPC → server.py → run_data_diff() → altimate_core.ReladiffSession (Rust)
→ cooperative loop (SQL tasks ↔ ConnectionRegistry) → structured result

Depends on: AltimateAI/internal PR (reladiff Rust module)

Test Results (Snowflake, verified end-to-end)

Test Algorithm Rows Time Result
T1 vs T1_COPY (identical) JoinDiff 5 <1s PASS — 0 diffs, 5 unchanged
T1 vs T2 (3 diffs) JoinDiff 5 <1s PASS — carol changed, eve removed, frank added
T1 vs T2 HashDiff 5 <1s PASS — detected mismatch
T1 vs T2 Profile 5 <1s PASS — all columns show correct mismatch
T1 vs T2 Cascade 5 <1s PASS — counts match, stops early
STRESS_SOURCE vs STRESS_TARGET JoinDiff 500K 11s PASS — 495K unchanged, 2.5K updated, 2.5K exclusive each
STRESS_SOURCE vs STRESS_TARGET HashDiff 500K 1.6s PASS — detected mismatch
STRESS_SOURCE vs STRESS_TARGET Profile 500K 4.1s PASS — correct column verdicts
MEGA_SOURCE vs MEGA_TARGET JoinDiff 1M 10.8s PASS — 996K unchanged, 2K updated, 2K exclusive each
MEGA_SOURCE vs MEGA_TARGET Profile 1M 3.5s PASS — correct verdicts

Example Prompts

Use --agent data-diff to enter data-diff mode. Example prompts to try:

Quick validation (JoinDiff — same database)

Compare SOURCE_DB.TEST.TABLE_A against SOURCE_DB.TEST.TABLE_B
using key column ID on the my-snowflake warehouse.
Include columns NAME and AMOUNT.

Cross-database validation (HashDiff)

Validate that staging.orders in my-postgres warehouse matches
analytics.orders in my-snowflake warehouse.
Key columns: order_id. Extra columns: customer_id, total_amount, status.
Use hashdiff algorithm since they're on different databases.

Profile comparison (column-level stats only)

Run a profile comparison of PROD.PUBLIC.CUSTOMERS vs STAGING.PUBLIC.CUSTOMERS
on my-snowflake warehouse. Key column: CUSTOMER_ID.
I just want column-level statistics, not a full row diff.

Cascade (progressive — stops early if counts differ)

Use cascade algorithm to compare WAREHOUSE_A.SCHEMA.LARGE_TABLE
against WAREHOUSE_B.SCHEMA.LARGE_TABLE on snowflake-prod.
Key: ID. This table has 50M rows so start cheap.

With WHERE filter

Compare sales_2024 vs sales_2024_backup on analytics warehouse.
Key: transaction_id. Extra columns: amount, currency, created_at.
Only check rows where created_at >= '2024-06-01'.

Schema discovery first

I need to validate two tables but I'm not sure about the column names.
First show me the schema of PROD.PUBLIC.ORDERS on my-snowflake,
then run a JoinDiff against STAGING.PUBLIC.ORDERS using the primary key.

Test Plan

  • TypeScript type check passes
  • Build succeeds
  • data_diff tool appears in data-diff agent mode tool list
  • End-to-end: JoinDiff correctly finds row-level diffs (5-row tables)
  • End-to-end: JoinDiff at scale (500K, 1M rows) — correct stats, <12s
  • End-to-end: HashDiff detects mismatches
  • End-to-end: Profile reports correct column-level verdicts
  • End-to-end: Cascade stops at count stage when counts match
  • Bug fix: _execute_task guards against synthetic status rows from executor
  • Error handling: missing altimate_core, invalid table names, SQL failures

🤖 Generated with Claude Code

suryaiyer95 and others added 3 commits March 9, 2026 19:07
- New `data-diff` primary agent mode for cross-database data validation
  with progressive checks: row counts → column profiles → segment
  checksums → row-level diffs
- New `/data-validate` skill with dialect-specific SQL templates for
  Snowflake, Postgres, BigQuery, DuckDB, Databricks, ClickHouse, MySQL
- Prompt covers 4 validation levels, cross-database checksum awareness,
  and structured PASS/FAIL reporting
- Added `/data-validate` to migrator and validator skill lists so both
  modes can invoke it

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… data validation

Adds the full pipeline: TypeScript tool → Bridge → Python orchestrator → Rust engine.

- `data-diff-run.ts`: TypeScript tool wrapping `Bridge.call("data_diff.run")`
- `data_diff.py`: Python orchestrator driving the cooperative state machine loop
  via `altimate_core.ReladiffSession` (start → execute SQL → step → repeat)
- `server.py`: Added `data_diff.run` dispatch to JSON-RPC bridge
- `protocol.ts`: `DataDiffRunParams`/`DataDiffRunResult` interfaces + bridge method
- `registry.ts`: Registered `DataDiffRunTool` in tool registry
- `agent.ts`: Added `data_diff: "allow"` to data-diff agent permissions
- `data-diff.txt`: Rewrote prompt to use `data_diff` tool as primary approach,
  with manual SQL as fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reflects altimate-core change: `column_lineage` and `track_lineage`
now work without credentials. SDK logging activates when initialized.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link

This PR doesn't fully meet our contributing guidelines and PR template.

What needs to be fixed:

  • PR description is missing required template sections. Please use the PR template.

Please edit this PR description to address the above within 2 hours, or it will be automatically closed.

If you believe this was flagged incorrectly, please let a maintainer know.

Comment on lines +53 to +63
"""Execute a single SQL task against the given warehouse."""
result = execute_sql(
SqlExecuteParams(sql=task["sql"], warehouse=warehouse, limit=100_000)
)

# Convert SqlExecuteResult rows to the format expected by ReladiffSession.step()
rows: list[list[str | None]] = []
for row in result.rows:
rows.append([str(v) if v is not None else None for v in row])

return {"id": task["id"], "rows": rows}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: SQL errors are returned as data rows and processed as valid results. The Rust engine parses the error string, defaults to 0, causing false-positive data diff outcomes.
Severity: CRITICAL

Suggested Fix

Modify _execute_task to inspect the result from execute_sql. If the result's columns indicate an error (e.g., result.columns == ["error"]), an exception should be raised. This will allow the try/except block in the calling run_data_diff function to catch the failure and report it correctly, preventing incorrect data from being sent to the Rust engine.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: packages/altimate-engine/src/altimate_engine/sql/data_diff.py#L52-L63

Potential issue: The `execute_sql` function returns SQL errors, such as connection
failures, as data rows instead of raising exceptions. The `_execute_task` function in
`data_diff.py` does not check for this error format and processes the error message as
if it were valid query data. This error data is then passed to the Rust engine, which
attempts to parse the error string as an integer. The parse fails and the engine
silently defaults the value to `0`. This causes the data diff to incorrectly report that
tables match when a query has actually failed, leading to silent, false-positive
validation results.

Did we get this right? 👍 / 👎 to inform future reviews.

@github-actions
Copy link

This pull request has been automatically closed because it was not updated to meet our contributing guidelines within the 2-hour window.

Feel free to open a new pull request that follows our guidelines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant