feat: wire data_diff tool through reladiff engine by suryaiyer95 · Pull Request #104 · AltimateAI/altimate-code

suryaiyer95 · 2026-03-11T00:29:21Z

Summary

Adds a data_diff tool and data-diff agent mode that wraps the Rust reladiff engine (from internal PR) for deterministic table-to-table data validation. Tested end-to-end on Snowflake with up to 1M rows.

What changed:

data-diff-run.ts — TypeScript tool that calls the Python bridge via Bridge.call("data_diff.run", params)
data_diff.py — Python orchestrator that drives the cooperative state machine loop (session.start() → execute SQL → session.step() → repeat)
server.py — Registers data_diff.run in the JSON-RPC dispatcher
protocol.ts — Adds DataDiffRunParams/DataDiffRunResult to the bridge protocol
agent.ts — Registers data-diff agent mode with all SQL/warehouse tool permissions
data-diff.txt — System prompt for the data-diff agent (uses data_diff tool as primary, manual SQL as fallback)
SKILL.md — /data-validate skill for guided validation workflows
guard.py — Updated docstrings (no longer requires API keys)

Pipeline:

LLM (data-diff mode) → data_diff tool (TS) → Bridge.call("data_diff.run")
→ JSON-RPC → server.py → run_data_diff() → altimate_core.ReladiffSession (Rust)
→ cooperative loop (SQL tasks ↔ ConnectionRegistry) → structured result

Depends on: AltimateAI/internal PR (reladiff Rust module)

Test Results (Snowflake, verified end-to-end)

Test	Algorithm	Rows	Time	Result
T1 vs T1_COPY (identical)	JoinDiff	5	<1s	PASS — 0 diffs, 5 unchanged
T1 vs T2 (3 diffs)	JoinDiff	5	<1s	PASS — carol changed, eve removed, frank added
T1 vs T2	HashDiff	5	<1s	PASS — detected mismatch
T1 vs T2	Profile	5	<1s	PASS — all columns show correct mismatch
T1 vs T2	Cascade	5	<1s	PASS — counts match, stops early
STRESS_SOURCE vs STRESS_TARGET	JoinDiff	500K	11s	PASS — 495K unchanged, 2.5K updated, 2.5K exclusive each
STRESS_SOURCE vs STRESS_TARGET	HashDiff	500K	1.6s	PASS — detected mismatch
STRESS_SOURCE vs STRESS_TARGET	Profile	500K	4.1s	PASS — correct column verdicts
MEGA_SOURCE vs MEGA_TARGET	JoinDiff	1M	10.8s	PASS — 996K unchanged, 2K updated, 2K exclusive each
MEGA_SOURCE vs MEGA_TARGET	Profile	1M	3.5s	PASS — correct verdicts

Example Prompts

Use --agent data-diff to enter data-diff mode. Example prompts to try:

Quick validation (JoinDiff — same database)

Compare SOURCE_DB.TEST.TABLE_A against SOURCE_DB.TEST.TABLE_B
using key column ID on the my-snowflake warehouse.
Include columns NAME and AMOUNT.

Cross-database validation (HashDiff)

Validate that staging.orders in my-postgres warehouse matches
analytics.orders in my-snowflake warehouse.
Key columns: order_id. Extra columns: customer_id, total_amount, status.
Use hashdiff algorithm since they're on different databases.

Profile comparison (column-level stats only)

Run a profile comparison of PROD.PUBLIC.CUSTOMERS vs STAGING.PUBLIC.CUSTOMERS
on my-snowflake warehouse. Key column: CUSTOMER_ID.
I just want column-level statistics, not a full row diff.

Cascade (progressive — stops early if counts differ)

Use cascade algorithm to compare WAREHOUSE_A.SCHEMA.LARGE_TABLE
against WAREHOUSE_B.SCHEMA.LARGE_TABLE on snowflake-prod.
Key: ID. This table has 50M rows so start cheap.

With WHERE filter

Compare sales_2024 vs sales_2024_backup on analytics warehouse.
Key: transaction_id. Extra columns: amount, currency, created_at.
Only check rows where created_at >= '2024-06-01'.

Schema discovery first

I need to validate two tables but I'm not sure about the column names.
First show me the schema of PROD.PUBLIC.ORDERS on my-snowflake,
then run a JoinDiff against STAGING.PUBLIC.ORDERS using the primary key.

Test Plan

🤖 Generated with Claude Code

- New `data-diff` primary agent mode for cross-database data validation with progressive checks: row counts → column profiles → segment checksums → row-level diffs - New `/data-validate` skill with dialect-specific SQL templates for Snowflake, Postgres, BigQuery, DuckDB, Databricks, ClickHouse, MySQL - Prompt covers 4 validation levels, cross-database checksum awareness, and structured PASS/FAIL reporting - Added `/data-validate` to migrator and validator skill lists so both modes can invoke it Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… data validation Adds the full pipeline: TypeScript tool → Bridge → Python orchestrator → Rust engine. - `data-diff-run.ts`: TypeScript tool wrapping `Bridge.call("data_diff.run")` - `data_diff.py`: Python orchestrator driving the cooperative state machine loop via `altimate_core.ReladiffSession` (start → execute SQL → step → repeat) - `server.py`: Added `data_diff.run` dispatch to JSON-RPC bridge - `protocol.ts`: `DataDiffRunParams`/`DataDiffRunResult` interfaces + bridge method - `registry.ts`: Registered `DataDiffRunTool` in tool registry - `agent.ts`: Added `data_diff: "allow"` to data-diff agent permissions - `data-diff.txt`: Rewrote prompt to use `data_diff` tool as primary approach, with manual SQL as fallback Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reflects altimate-core change: `column_lineage` and `track_lineage` now work without credentials. SDK logging activates when initialized. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-11T00:29:33Z

This PR doesn't fully meet our contributing guidelines and PR template.

What needs to be fixed:

PR description is missing required template sections. Please use the PR template.

Please edit this PR description to address the above within 2 hours, or it will be automatically closed.

If you believe this was flagged incorrectly, please let a maintainer know.

sentry · 2026-03-11T00:31:53Z

packages/altimate-engine/src/altimate_engine/sql/data_diff.py

+    """Execute a single SQL task against the given warehouse."""
+    result = execute_sql(
+        SqlExecuteParams(sql=task["sql"], warehouse=warehouse, limit=100_000)
+    )
+
+    # Convert SqlExecuteResult rows to the format expected by ReladiffSession.step()
+    rows: list[list[str | None]] = []
+    for row in result.rows:
+        rows.append([str(v) if v is not None else None for v in row])
+
+    return {"id": task["id"], "rows": rows}


Bug: SQL errors are returned as data rows and processed as valid results. The Rust engine parses the error string, defaults to 0, causing false-positive data diff outcomes.
_{Severity: CRITICAL}

Suggested Fix

Modify _execute_task to inspect the result from execute_sql. If the result's columns indicate an error (e.g., result.columns == ["error"]), an exception should be raised. This will allow the try/except block in the calling run_data_diff function to catch the failure and report it correctly, preventing incorrect data from being sent to the Rust engine.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: packages/altimate-engine/src/altimate_engine/sql/data_diff.py#L52-L63 Potential issue: The `execute_sql` function returns SQL errors, such as connection failures, as data rows instead of raising exceptions. The `_execute_task` function in `data_diff.py` does not check for this error format and processes the error message as if it were valid query data. This error data is then passed to the Rust engine, which attempts to parse the error string as an integer. The parse fails and the engine silently defaults the value to `0`. This causes the data diff to incorrectly report that tables match when a query has actually failed, leading to silent, false-positive validation results.

_{Did we get this right? 👍 / 👎 to inform future reviews.}

github-actions · 2026-03-11T02:54:06Z

This pull request has been automatically closed because it was not updated to meet our contributing guidelines within the 2-hour window.

Feel free to open a new pull request that follows our guidelines.

suryaiyer95 and others added 3 commits March 9, 2026 19:07

docs: update lineage guard docstrings — no longer requires API keys

1713848

Reflects altimate-core change: `column_lineage` and `track_lineage` now work without credentials. SDK logging activates when initialized. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot added needs:compliance contributor labels Mar 11, 2026

sentry bot reviewed Mar 11, 2026

View reviewed changes

github-actions bot removed the needs:compliance label Mar 11, 2026

github-actions bot closed this Mar 11, 2026

github-actions bot added the needs:compliance label Mar 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: wire data_diff tool through reladiff engine#104

feat: wire data_diff tool through reladiff engine#104
suryaiyer95 wants to merge 3 commits intomainfrom
feat/data-validation-mode

suryaiyer95 commented Mar 11, 2026 •

edited by anandgupta42

Loading

Uh oh!

github-actions bot commented Mar 11, 2026

Uh oh!

sentry bot Mar 11, 2026

Uh oh!

github-actions bot commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

suryaiyer95 commented Mar 11, 2026 • edited by anandgupta42 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Results (Snowflake, verified end-to-end)

Example Prompts

Quick validation (JoinDiff — same database)

Cross-database validation (HashDiff)

Profile comparison (column-level stats only)

Cascade (progressive — stops early if counts differ)

With WHERE filter

Schema discovery first

Test Plan

Uh oh!

github-actions bot commented Mar 11, 2026

Uh oh!

sentry bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

suryaiyer95 commented Mar 11, 2026 •

edited by anandgupta42

Loading