feat: wire data_diff tool through reladiff engine #102

Closed
suryaiyer95 wants to merge 3 commits into main from feat/data-validation-mode

@suryaiyer95
Contributor

Summary

  • Adds data_diff TypeScript tool that wraps the Rust reladiff engine via Bridge → Python orchestrator → altimate_core.ReladiffSession
  • Creates data_diff.py Python orchestrator that drives the cooperative state machine loop (start → execute SQL via ConnectionRegistry → step → repeat)
  • Registers data_diff.run method in the JSON-RPC bridge dispatcher
  • Adds DataDiffRunParams/DataDiffRunResult to the bridge protocol
  • Updates data-diff agent prompt to use data_diff tool as primary approach (deterministic Rust engine) with manual SQL as fallback
  • Depends on: AltimateAI/altimate-core-internal PR for the reladiff Rust module

Pipeline

LLM (data-diff mode) → data_diff tool (TS) → Bridge.call("data_diff.run")
→ JSON-RPC → server.py → run_data_diff() → altimate_core.ReladiffSession (Rust)
→ cooperative loop (SQL tasks ↔ ConnectionRegistry) → structured result
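
The cooperative loop at the end of this pipeline can be sketched roughly as follows. This is a minimal illustration, not the actual `data_diff.py`: `FakeSession` is a hypothetical stand-in for `altimate_core.ReladiffSession` (its real method signatures and return shapes are not shown in this PR), the `done`/`tasks`/`result` keys are assumed, and `run_sql` stands in for execution through the ConnectionRegistry.

```python
from typing import Callable, Optional

Row = list[Optional[str]]


class FakeSession:
    """Hypothetical stand-in for the engine session.

    It asks for one row count per side, then finishes on the first step.
    """

    def start(self) -> dict:
        # The engine hands back SQL tasks instead of running queries itself.
        return {
            "done": False,
            "tasks": [
                {"id": "src", "sql": "SELECT COUNT(*) FROM source_table"},
                {"id": "dst", "sql": "SELECT COUNT(*) FROM target_table"},
            ],
        }

    def step(self, responses: list[dict]) -> dict:
        # Compare the first cell of each response and report the outcome.
        counts = {r["id"]: r["rows"][0][0] for r in responses}
        return {
            "done": True,
            "tasks": [],
            "result": {"match": counts["src"] == counts["dst"]},
        }


def drive(session, run_sql: Callable[[str], list[Row]]) -> dict:
    """start -> execute SQL -> step -> repeat until the session is done."""
    state = session.start()
    while not state["done"]:
        responses = [
            {"id": task["id"], "rows": run_sql(task["sql"])}
            for task in state["tasks"]
        ]
        state = session.step(responses)
    return state["result"]


def fake_run_sql(sql: str) -> list[Row]:
    # Assumed executor: every query returns a single row containing "42".
    return [["42"]]


outcome = drive(FakeSession(), fake_run_sql)
```

The point of the shape is that the engine never opens a connection: the orchestrator owns all I/O and only feeds rows back, which is what lets the same engine run against any warehouse the registry knows about.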

Test plan

  • TypeScript type check passes (tsc --noEmit)
  • Build succeeds (bun run build — 11 platform targets)
  • End-to-end test with configured warehouse connections
  • Verify data_diff tool appears in data-diff mode tool list

🤖 Generated with Claude Code

suryaiyer95 and others added 2 commits March 9, 2026 19:07
- New `data-diff` primary agent mode for cross-database data validation
  with progressive checks: row counts → column profiles → segment
  checksums → row-level diffs
- New `/data-validate` skill with dialect-specific SQL templates for
  Snowflake, Postgres, BigQuery, DuckDB, Databricks, ClickHouse, MySQL
- Prompt covers 4 validation levels, cross-database checksum awareness,
  and structured PASS/FAIL reporting
- Added `/data-validate` to migrator and validator skill lists so both
  modes can invoke it

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… data validation

Adds the full pipeline: TypeScript tool → Bridge → Python orchestrator → Rust engine.

- `data-diff-run.ts`: TypeScript tool wrapping `Bridge.call("data_diff.run")`
- `data_diff.py`: Python orchestrator driving the cooperative state machine loop
  via `altimate_core.ReladiffSession` (start → execute SQL → step → repeat)
- `server.py`: Added `data_diff.run` dispatch to JSON-RPC bridge
- `protocol.ts`: `DataDiffRunParams`/`DataDiffRunResult` interfaces + bridge method
- `registry.ts`: Registered `DataDiffRunTool` in tool registry
- `agent.ts`: Added `data_diff: "allow"` to data-diff agent permissions
- `data-diff.txt`: Rewrote prompt to use `data_diff` tool as primary approach,
  with manual SQL as fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +53 to +63
    """Execute a single SQL task against the given warehouse."""
    result = execute_sql(
        SqlExecuteParams(sql=task["sql"], warehouse=warehouse, limit=100_000)
    )

    # Convert SqlExecuteResult rows to the format expected by ReladiffSession.step()
    rows: list[list[str | None]] = []
    for row in result.rows:
        rows.append([str(v) if v is not None else None for v in row])

    return {"id": task["id"], "rows": rows}

Bug: The _execute_task function doesn't check for error results from execute_sql, causing it to treat error messages as valid data rows and pass them to the Rust engine.
Severity: HIGH

Suggested Fix

In _execute_task, check if result.columns == ["error"]. If it is, propagate the error up to the caller (run_data_diff) so it can be handled properly, instead of processing the rows. This will prevent malformed data from reaching the Rust engine and ensure failures are reported correctly.
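
One way to apply this fix is sketched below. It is an illustration under assumptions, not code from this PR: `SqlExecuteResult` is reduced to a two-field stand-in, `SqlTaskError`, `run_tasks`, and the `error` key in the returned dict are hypothetical names, the error message is assumed to sit in the first cell of the first row, and `execute` is injected so the example runs without a warehouse. The only behaviour taken from the review is that failures come back as a result whose `columns` equal `["error"]`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class SqlExecuteResult:
    """Simplified stand-in with only the two fields the check needs."""
    columns: list[str]
    rows: list[list[Any]]


class SqlTaskError(RuntimeError):
    """Hypothetical exception raised when a SQL task fails."""


def execute_task(task: dict, execute: Callable[[str], SqlExecuteResult]) -> dict:
    result = execute(task["sql"])

    # Error results are signalled in-band, so check before touching the rows.
    if result.columns == ["error"]:
        has_message = bool(result.rows and result.rows[0])
        message = str(result.rows[0][0]) if has_message else "unknown SQL error"
        raise SqlTaskError(f"task {task['id']} failed: {message}")

    # Only genuine data rows are stringified for the engine.
    rows: list[list[Optional[str]]] = [
        [str(v) if v is not None else None for v in row] for row in result.rows
    ]
    return {"id": task["id"], "rows": rows}


def run_tasks(tasks: list[dict], execute: Callable[[str], SqlExecuteResult]) -> dict:
    """Caller-side handling: a failed task yields success=False, not bad rows."""
    try:
        responses = [execute_task(t, execute) for t in tasks]
    except SqlTaskError as exc:
        return {"success": False, "error": str(exc)}
    return {"success": True, "responses": responses}
```

Raising from the task function keeps the check in one place, while the caller decides how the failure is reported back over the bridge.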

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: packages/altimate-engine/src/altimate_engine/sql/data_diff.py#L52-L63

Potential issue: The `execute_sql` function returns errors, such as connection issues or
invalid SQL, as a special `SqlExecuteResult` object with `columns=["error"]` instead of
raising an exception. The `_execute_task` function in `data_diff.py` does not check for
this error state and processes the error message as if it were a valid data row. This
leads to malformed data being passed to the Rust `ReladiffSession` engine, which can
result in incorrect diffs or opaque crashes. The `run_data_diff` function will
incorrectly report `success: True` even when a SQL execution has failed.

Reflects altimate-core change: `column_lineage` and `track_lineage`
now work without credentials. SDK logging activates when initialized.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

This PR doesn't fully meet our contributing guidelines and PR template.

What needs to be fixed:

  • PR description is missing required template sections. Please use the PR template.

Please edit this PR description to address the above within 2 hours, or it will be automatically closed.

If you believe this was flagged incorrectly, please let a maintainer know.

@github-actions

This pull request has been automatically closed because it was not updated to meet our contributing guidelines within the 2-hour window.

Feel free to open a new pull request that follows our guidelines.
