feat: wire data_diff tool for deterministic data validation by suryaiyer95 · Pull Request #107 · AltimateAI/altimate-code

suryaiyer95 · 2026-03-11T06:14:46Z

What does this PR do?

Adds a data_diff tool and data-diff agent mode that wraps the Rust reladiff engine for deterministic table-to-table data validation. Tested end-to-end on Snowflake with up to 1M rows.

Pipeline:

LLM (data-diff mode) → data_diff tool (TS) → Bridge.call("data_diff.run")
→ JSON-RPC → server.py → run_data_diff() → ReladiffSession (Rust)
→ cooperative loop (SQL tasks ↔ ConnectionRegistry) → structured result

Files changed:

data-diff-run.ts — TypeScript tool calling Bridge.call("data_diff.run")
data_diff.py — Python orchestrator driving the cooperative state machine loop
server.py — Registers data_diff.run in JSON-RPC dispatcher
protocol.ts — DataDiffRunParams/DataDiffRunResult bridge protocol types
agent.ts — data-diff agent mode with SQL/warehouse tool permissions
data-diff.txt — System prompt for data-diff agent
SKILL.md — /data-validate skill for guided validation workflows
guard.py — Updated docstrings (no longer requires API keys)

Type of change

New feature (non-breaking change which adds functionality)

How did you verify your code works?

End-to-end tested on Snowflake across all 4 algorithms and at scale (up to 1M rows, <12s).

Issue for this PR

Internal feature — data validation mode for altimate-code CLI.

Checklist

I have tested my changes locally
I have not included unrelated changes

- Restore .trim() on models API JSON to prevent syntax error in generated models-snapshot.ts
- Fix archive path for scoped package names (@altimate/cli-*) in release tarball/zip creation
- Remove gh release upload from build.ts (handled by github-release job)
- Add CHANGELOG.md entry for v0.1.5

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

- Redesign M as 5-wide with visible V-valley to distinguish from A
- Change E top from full bar to open-right, distinguishing from T
- Fix T with full-width crossbar and I as narrow column
- Fix D shape in CODE
- Render CODE in theme.accent (purple) instead of theme.primary (peach)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- publish.ts: change glob from `*/package.json` to `**/package.json` to find scoped package directories (@altimate/cli-*) which are 2 levels deep
- release.yml: add skip-existing to PyPI publish so it doesn't fail when the engine version hasn't changed between releases

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…(#15780)

…ider (#15619)

@Altimate

The npm org is @AltimateAI, not @Altimate. Update all package names, workspace dependencies, imports, and documentation to use the correct scope so npm publish succeeds.

Name mapping:
- @altimate/cli → @altimateai/altimate-code
- @altimate/cli-sdk → @altimateai/altimate-code-sdk
- @altimate/cli-plugin → @altimateai/altimate-code-plugin
- @altimate/cli-util → @altimateai/altimate-code-util
- @altimate/cli-script → @altimateai/altimate-code-script

Also updates publish.ts to emit the wrapper package as @altimateai/altimate-code (no -ai suffix) and hardcodes the bin entry to altimate-code.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Two issues:
1. TypeScript permission-task tests: test fixture wrote config to `opencode.json` but the config loader only looks for `altimate-code.json`. Updated fixture to use correct filename.
2. Python tests: `pytest: command not found` because pyproject.toml had no `dev` optional dependency group. Added `dev` extras with pytest and ruff.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: rename opencode references to altimate-code in all test files

  Update test files to use the correct names after the config loader was renamed from opencode to altimate-code:
  - `opencode.json` → `altimate-code.json`
  - `.opencode/` → `.altimate-code/`
  - `.git/opencode` → `.git/altimate-code`
  - `OPENCODE_*` env vars → `ALTIMATE_CLI_*`
  - Cache dir `opencode` → `altimate-code`
  - Schema URL `opencode.ai` → `altimate-code.dev`

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve remaining test failures and build import issue

  - Fix build.ts solid-plugin import to use bare specifier for monorepo hoisting
  - Update agent tests: "build" → "builder", "plan" → "analyst" for disabled fallback
  - Fix well-known config mock URL in config.test.ts
  - Fix message-v2 test: "OpenCode" → "Altimate CLI"
  - Fix retry.test.ts: replace unsupported test.concurrent with test
  - Fix read.test.ts: update agent name to "builder"
  - Fix agent-color.test.ts: update config keys to "builder"
  - Fix registry.test.ts: remove unpublished plugin dep from test fixture
  - Skip adding plugin dependency in local dev mode (installDependencies)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address Sentry review comments and Python CI deps

  - Update theme schema URL from opencode.ai to altimate-code.dev (33 files)
  - Rename opencode references in ACP README.md and AGENTS.md docs
  - Update test fixture tmp dir prefix to altimate-code-test-
  - Install warehouse extras in Python CI for duckdb/boto3 test deps

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: Python CI — SqlGuardResult allows None data, restrict pytest to tests/

  - Allow SqlGuardResult.data to be None (fixes lineage.check Pydantic error)
  - Set testpaths = ["tests"] in pyproject.toml to exclude src/test_local.py from pytest collection (it's a source module, not a test)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve ruff lint errors in Python engine

  - Remove unused imports in server.py (duplicate imports, unused models)
  - Remove unused `json` import in schema/cache.py
  - Remove unused `os` import in sql/feedback_store.py
  - Add noqa for keyring availability check import

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Use import.meta.resolve to find the @opentui/core package directory instead of hardcoding node_modules path, which fails with monorepo hoisting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…aming

- Build: output binary as altimate-code instead of opencode
- Bin wrapper: look for @altimateai/altimate-code-* scoped packages
- Postinstall: resolve @AltimateAI scoped platform packages
- Publish: update Docker/AUR/Homebrew refs to AltimateAI/altimate-code
- Publish: make Docker/AUR/Homebrew non-fatal (infra not set up yet)
- Dockerfile: update binary paths and names

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-authored-by: Jérôme Benoit <jerome.benoit@piment-noir.org>

…sion (#15762) Co-authored-by: Test User <test@test.com> Co-authored-by: Shoubhit Dash <shoubhit2005@gmail.com>

Co-authored-by: Adam <2363879+adamdotdevin@users.noreply.github.com>

…ck required tools)

Co-Authored-By: Kai (Claude Opus 4.6) <noreply@anthropic.com>

…onfig fix: restore TUI crash after upstream merge

…rkflow

fix: correct TEAM_MEMBERS ref from 'dev' to 'main' in pr-standards workflow

- Add `AltimateApi` client for datamate CRUD and integration resolution
- Add `datamate` tool with 9 operations: list, show, create, update, delete, add (MCP connect), remove (MCP disconnect), list-integrations, status
- Extract shared MCP config utilities (`resolveConfigPath`, `addMcpToConfig`, `removeMcpFromConfig`, `listMcpInConfig`) to `mcp/config.ts`
- Add `/datamate-setup` skill for guided datamate onboarding
- Register datamate tool in tool registry and TUI sync context
- Add test suite for `AltimateApi` credential loading and API methods

feat: datamate manager — dynamic MCP server management

packages/altimate-engine/src/altimate_engine/sql/data_diff.py

+    if result.row_count > 0:
+        for row in result.rows:
+            rows.append([str(v) if v is not None else None for v in row])
+


Replace arc-runner-altimate-code with ubuntu-latest across all workflows to eliminate security risk on public repo. Closes #109 Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

- New `data-diff` primary agent mode for cross-database data validation with progressive checks: row counts → column profiles → segment checksums → row-level diffs
- New `/data-validate` skill with dialect-specific SQL templates for Snowflake, Postgres, BigQuery, DuckDB, Databricks, ClickHouse, MySQL
- Prompt covers 4 validation levels, cross-database checksum awareness, and structured PASS/FAIL reporting
- Added `/data-validate` to migrator and validator skill lists so both modes can invoke it

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… data validation

Adds the full pipeline: TypeScript tool → Bridge → Python orchestrator → Rust engine.

- `data-diff-run.ts`: TypeScript tool wrapping `Bridge.call("data_diff.run")`
- `data_diff.py`: Python orchestrator driving the cooperative state machine loop via `altimate_core.ReladiffSession` (start → execute SQL → step → repeat)
- `server.py`: Added `data_diff.run` dispatch to JSON-RPC bridge
- `protocol.ts`: `DataDiffRunParams`/`DataDiffRunResult` interfaces + bridge method
- `registry.ts`: Registered `DataDiffRunTool` in tool registry
- `agent.ts`: Added `data_diff: "allow"` to data-diff agent permissions
- `data-diff.txt`: Rewrote prompt to use `data_diff` tool as primary approach, with manual SQL as fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
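The "start → execute SQL → step → repeat" loop can be pictured with a small stand-in. This is only a sketch: `FakeSession`, its `start`/`step` methods, and the action dictionaries are hypothetical and do not reflect the real `altimate_core.ReladiffSession` API, which is not shown in this PR page.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeSession:
    """Hypothetical stand-in for the engine session (not the real API)."""
    pending: list
    collected: list = field(default_factory=list)

    def start(self) -> dict:
        return self._next_action()

    def step(self, rows: list) -> dict:
        # The engine receives the rows for the SQL it asked for, then decides
        # what it needs next.
        self.collected.append(rows)
        return self._next_action()

    def _next_action(self) -> dict:
        if self.pending:
            return {"type": "execute_sql", "sql": self.pending.pop(0)}
        return {"type": "done", "result": {"batches": len(self.collected)}}


def drive(session, execute_sql: Callable[[str], list]) -> dict:
    """Cooperative loop: the engine requests SQL, the host runs it and feeds rows back."""
    action = session.start()
    while action["type"] == "execute_sql":
        rows = execute_sql(action["sql"])
        action = session.step(rows)
    return action["result"]


session = FakeSession(pending=["SELECT COUNT(*) FROM a", "SELECT COUNT(*) FROM b"])
result = drive(session, execute_sql=lambda sql: [["42"]])
```

The point of this shape is that the engine never touches a database driver itself; the host owns all connections and simply answers the engine's requests.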

…xecutor

The executor returns `row_count=0` with a synthetic `["Query executed successfully"]` row when SQL returns no results. Without this guard, the Rust engine interprets the status row as actual data, causing false "duplicate key" errors in JoinDiff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
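A minimal sketch of that guard, built around the three lines quoted in the review comments. The `QueryResult` dataclass and the `to_engine_rows` name are assumptions made for illustration; only the `row_count > 0` check and the `str()` conversion come from the diff.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    """Hypothetical result shape; the real executor type is not shown here."""
    columns: list
    rows: list
    row_count: int


def to_engine_rows(result: QueryResult) -> list:
    """Convert executor rows to strings, ignoring the synthetic status row."""
    rows = []
    # row_count == 0 means "no data", even if a status row is present.
    if result.row_count > 0:
        for row in result.rows:
            rows.append([str(v) if v is not None else None for v in row])
    return rows


empty = QueryResult(["status"], [["Query executed successfully"]], row_count=0)
data = QueryResult(["id", "name"], [[1, None], [2, "b"]], row_count=2)
```

With this check, `to_engine_rows(empty)` yields an empty list, so the status message never reaches the diff engine as if it were a data row.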

…ut formatting

- Add `source_where_clause` and `target_where_clause` params to bridge protocol
- Update `run_data_diff` to pass per-table WHERE to reladiff engine
- Enhance tool output formatting with column-level match rates and sample mismatches
- Expand system prompt with progressive validation guidance

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add 519 integration tests (516 pass + 3 xfail) across 120 test classes
- Tests cover: DuckDB, Postgres, cross-warehouse, all 6 algorithms
- Edge cases: NULL semantics, numeric precision, reserved keywords, composite keys
- Add Docker Compose for Postgres 16 test environment
- Add 28 research documents (themes A-Z) covering data validation landscape

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>



…on` to run

- Add `integration` marker to `test_data_diff_integration.py`
- Configure `pyproject.toml` to skip integration tests by default
- Integration tests require Docker Postgres and matching `altimate-core` build

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…onfig

These files are internal development artifacts not appropriate for the open-source repository:
- `docs/research/` — 28 internal research documents (2.1MB)
- `test_data_diff_integration.py` — requires Docker Postgres + Rust engine
- `docker-compose.yml` — test infrastructure

Unit tests (`test_data_diff.py`) remain — they use mocks and need no external deps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

packages/opencode/src/altimate/tools/data-diff-run.ts

+        lines.push("")
+        lines.push("Column Match Rates:")
+        for (const col of matchRates) {
+          const pct = (col.match_percent as number).toFixed(1)


… diff engine

When `execute_sql` fails, it returns `columns=['error']` with the error message as a data row. Previously this was silently passed to the Rust engine as data, causing confusing downstream failures. Now raises `RuntimeError` immediately so the error propagates to the user.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
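The check described here could look roughly like the following. The function name and the exact error message are hypothetical; only the `columns=['error']` convention and the `RuntimeError` come from the commit message.

```python
def raise_on_error_result(columns, rows):
    """Fail fast when the executor signals an error via a lone 'error' column."""
    if list(columns) == ["error"]:
        # The error text is carried as the first cell of the first row.
        message = rows[0][0] if rows and rows[0] else "unknown error"
        raise RuntimeError(f"SQL execution failed: {message}")
    return rows
```

Raising at this boundary keeps the error message attached to the SQL that caused it, instead of surfacing later as an unrelated engine failure.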

- Add `_validate_where_clause()` to reject injection patterns (semicolons, comments)
- Validate all WHERE clause parameters before passing to Rust engine
- Remove unused `_SIDE_MAP` constant from `data_diff.py`
- Add null guards for `diff_percent` and `match_percent` in TypeScript to prevent NaN display
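A rough sketch of what a validator of this kind might do, assuming "comments" means `--` and `/* */` style SQL comments. The actual `_validate_where_clause()` implementation is not shown on this page, so the pattern and error text below are illustrative only.

```python
import re

# Assumed reject list: statement separators and SQL comment markers.
_FORBIDDEN = re.compile(r";|--|/\*|\*/")


def validate_where_clause(clause):
    """Return the clause unchanged, or raise if it contains a rejected pattern."""
    if clause is None:
        return None
    if _FORBIDDEN.search(clause):
        raise ValueError(f"WHERE clause contains a disallowed pattern: {clause!r}")
    return clause
```

A pattern check like this is deliberately blunt: it would also reject a harmless string literal that happens to contain `;` or `--`, which is a trade-off in favor of safety.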

- Delete `altimate_engine/sql/data_diff.py` — all Python orchestration now lives in `altimate_core.data_diff` (altimate-core-internal repo)
- Delete `tests/test_data_diff.py` — tests moved to altimate-core-internal
- Update `server.py` to import from `altimate_core.data_diff` with inline `_executor` and `_resolve_dialect` callbacks
- Add try/catch on import with install instructions
- Expand data-diff agent prompt with Cascade/Recon/Profile details

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

suryaiyer95 · 2026-03-14T01:53:58Z

Multi-Model Code Review: Data Validation Mode

Verdict: APPROVE (with recommended fixes)
Overall Quality: Very high — clean TypeScript tool definition, well-structured Python orchestrator, production-quality agent prompt.
Reviewed by: Claude Opus 4.6, Kimi K2.5, Grok, MiniMax M2.5, GLM-5 (5 independent AI reviewers)

This review also covers the linked PR altimate-core-internal#61. Rust engine findings are posted there.

Major Issues

1. TypeScript `formatOutcome` may never render structured output — serde mismatch

Category: Bug
Location: data-diff-run.ts:111

const mode = outcome.mode as string | undefined  // always undefined?

The TypeScript formatter checks outcome.mode === "diff", "profile", etc. But Rust's default serde serialization of enums produces {"Diff": {...}} (capitalized variant name as key), not {"mode": "diff", ...}. If ReladiffOutcome doesn't have #[serde(tag = "mode", rename_all = "lowercase")], the mode check always fails and users see raw JSON dumps instead of the structured Data Validation Report.

Fix: Either add #[serde(tag = "mode", rename_all = "lowercase")] to ReladiffOutcome in Rust, or change TypeScript to check variant keys:

const mode = "Diff" in outcome ? "diff" : "Profile" in outcome ? "profile" : "Cascade" in outcome ? "cascade" : undefined

Verify the actual serde configuration before merging.
Flagged by: GLM-5
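To make the two JSON shapes concrete, here is a small illustration of serde's default externally tagged form (`{"Diff": {...}}`) next to the internally tagged form that `#[serde(tag = "mode", rename_all = "lowercase")]` would produce. It is written in Python purely as a shape check; `detect_mode` and the `total_rows` field are hypothetical and are not part of the PR.

```python
def detect_mode(outcome: dict):
    """Return (mode, payload) for either serde enum representation."""
    # Internally tagged: {"mode": "diff", ...fields}
    mode = outcome.get("mode")
    if isinstance(mode, str):
        payload = {k: v for k, v in outcome.items() if k != "mode"}
        return mode.lower(), payload
    # Externally tagged (serde default): {"Diff": {...fields}}
    if len(outcome) == 1:
        ((variant, payload),) = outcome.items()
        if isinstance(payload, dict):
            return variant.lower(), payload
    return None, outcome


externally_tagged = {"Diff": {"total_rows": 3}}
internally_tagged = {"mode": "diff", "total_rows": 3}
```

Both inputs resolve to the same mode and payload here, which is the behavior a test against real engine output would need to confirm on the TypeScript side.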

Minor Issues

| # | Issue | Location | Fix | Flagged By |
|---|-------|----------|-----|------------|
| 2 | `conn.type` None → `KeyError` in dialect lookup | `server.py:_resolve_dialect()` | Use `.get(conn.type, "generic")` | MiniMax, GLM-5 |
| 3 | Recon mode unreachable from agent — no `rules` parameter in tool | `data-diff-run.ts` + `data_diff.py` | Wire up `recon_rules` or document as "coming soon" | Kimi |
| 4 | `str()` conversion loses Decimal precision (`"0.10"` → `"0.1"`) | `server.py:231-233` | Known trade-off of string-based checksumming. Document it. | Claude |
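Issue 2 comes down to the difference between indexing a dict and calling `.get` with a default. The sketch below assumes a simple type-to-dialect mapping; the `DIALECTS` contents and the function name are hypothetical, and only the `"generic"` fallback comes from the suggested fix.

```python
# Hypothetical mapping; the real table in server.py is not shown on this page.
DIALECTS = {"snowflake": "snowflake", "postgres": "postgres", "duckdb": "duckdb"}


def resolve_dialect(conn_type):
    """Look up the SQL dialect, falling back to 'generic' for unknown or None types."""
    return DIALECTS.get(conn_type, "generic")


def resolve_dialect_strict(conn_type):
    """Indexing variant, shown only to reproduce the KeyError."""
    return DIALECTS[conn_type]
```

Because `None` is a valid (hashable) dict key, `.get(None, "generic")` returns the fallback cleanly, while the indexing variant raises `KeyError`.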

Positive Observations

TypeScript tool definition is thorough: Zod schemas with descriptive parameter docs, proper error handling, well-structured formatOutcome() for human-readable reports
Production-quality agent prompt (data-diff.txt): Algorithm decision matrix, cost awareness, progressive strategy guide, concrete examples
Clean Python orchestrator: run_data_diff() with dependency-injected executor and dialect_resolver callbacks — zero coupling to DB drivers
Skill documentation (SKILL.md): Clear progressive validation workflow guide

Missing Tests

formatOutcome with actual Rust serde output shapes (verify mode detection works)
conn.type=None in dialect resolution
_executor with Decimal/datetime types
Recon mode end-to-end (currently unreachable from agent)
Empty key_columns array handling
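For the Decimal test in particular, it may help to pin down how `str()` behaves in Python, since the outcome depends on the type the driver returns. A `Decimal` keeps its scale, so numerically equal values can render differently, while a `float` drops the trailing zero. The `canonical` helper below is a hypothetical normalization, not something this PR implements.

```python
from decimal import Decimal

a, b = Decimal("0.10"), Decimal("0.1")
equal_as_numbers = a == b            # numerically equal
same_as_strings = str(a) == str(b)   # "0.10" vs "0.1" differ as text
float_text = str(0.10)               # a float renders without the trailing zero


def canonical(value, places=2):
    """Hypothetical helper: render a number with a fixed count of decimal places."""
    quantum = Decimal(10) ** -places
    return str(Decimal(str(value)).quantize(quantum))
```

A test along these lines would show whether string-based checksums on the two sides agree when one warehouse returns `0.10` and the other `0.1`.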

🤖 Multi-model review by Claude Opus 4.6, Kimi K2.5, Grok, MiniMax M2.5, GLM-5

dev-punia-altimate · 2026-03-17T14:23:05Z

✅ Tests — All Passed

TypeScript — passed

Python — passed

Tested at e42766d9 | Run log | Powered by QA Autopilot

anandgupta42 and others added 30 commits March 2, 2026 17:31

release: v0.1.6

340ce8f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

release: v0.1.7

0a08668

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(opencode): clone part data in Bus event to preserve token values …

fd6f713

…(#15780)

fix(provider): forward metadata options to cloudflare-ai-gateway prov…

96d6fb7

…ider (#15619)

chore: generate

e41b535

release: v0.1.8

dc61802

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(opencode): avoid gemini combiner schema sibling injection (#15318)

7e3e85b

chore: generate

9f150b0

release: v0.1.9

614c180

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: resolve @opentui/core parser.worker.js via import.meta.resolve

8738470

Use import.meta.resolve to find the @opentui/core package directory instead of hardcoding node_modules path, which fails with monorepo hoisting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

wip: zen

6aa4928

chore: generate

881ca86

wip: zen

1233ebc

wip: zen

b985ea3

zen: docs

6deb27e

release: v0.1.10

391f365

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: nix flake update for bun 1.3.10 (#15648)

48412f7

Co-authored-by: Jérôme Benoit <jerome.benoit@piment-noir.org>

fix(opencode): disable session navigation commands when no parent ses…

18850c4

…sion (#15762) Co-authored-by: Test User <test@test.com> Co-authored-by: Shoubhit Dash <shoubhit2005@gmail.com>

fix(app): timeline jank

5e8742f

fix(app): timeline jank

e4af1bb

chore: fix test

1e2da60

chore: cleanup

7305fc0

fix(app): stabilize project close navigation (#15817)

356b5d4

Co-authored-by: Adam <2363879+adamdotdevin@users.noreply.github.com>

govindpawa and others added 7 commits March 7, 2026 13:50

ci: revert Windows tests to windows-latest (ARC Windows containers la…

34524f0

…ck required tools)

fix: restore TUI crash after upstream merge

c113580

Co-Authored-By: Kai (Claude Opus 4.6) <noreply@anthropic.com>

Merge pull request #98 from AltimateAI/fix/restore-branding-and-tui-c…

5740133

…onfig fix: restore TUI crash after upstream merge

fix: correct TEAM_MEMBERS ref from 'dev' to 'main' in pr-standards wo…

ffd76bc

…rkflow

Merge pull request #101 from AltimateAI/fix/pr-standards-workflow

4ae73c0

fix: correct TEAM_MEMBERS ref from 'dev' to 'main' in pr-standards workflow

Merge pull request #99 from AltimateAI/feat/datamate-manager-clean

9a02f27

feat: datamate manager — dynamic MCP server management

github-actions bot added the contributor label Mar 11, 2026

sentry bot reviewed Mar 11, 2026

packages/altimate-engine/src/altimate_engine/sql/data_diff.py Outdated

Comment on lines +61 to +64

    if result.row_count > 0:
        for row in result.rows:
            rows.append([str(v) if v is not None else None for v in row])

This comment was marked as outdated.

govindpawa and others added 6 commits March 13, 2026 09:24

chore(ci): remove self-hosted runners from public repo (#110)

f668a67

Replace arc-runner-altimate-code with ubuntu-latest across all workflows to eliminate security risk on public repo. Closes #109 Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

suryaiyer95 force-pushed the feat/data-validation-mode branch from 0ab4371 to 289cdde Compare March 13, 2026 16:46

sentry bot reviewed Mar 13, 2026

packages/altimate-engine/src/altimate_engine/sql/data_diff.py Outdated

Comment on lines +61 to +64

    if result.row_count > 0:
        for row in result.rows:
            rows.append([str(v) if v is not None else None for v in row])

This comment was marked as outdated.

suryaiyer95 and others added 3 commits March 13, 2026 09:51

fix: disable data-diff agent in "all primary agents disabled" test

34cffcd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sentry bot reviewed Mar 13, 2026

packages/opencode/src/altimate/tools/data-diff-run.ts Outdated

        lines.push("")
        lines.push("Column Match Rates:")
        for (const col of matchRates) {
          const pct = (col.match_percent as number).toFixed(1)

This comment was marked as outdated.

suryaiyer95 and others added 3 commits March 13, 2026 13:21

anandgupta42 force-pushed the main branch from c86970e to d097682 Compare March 17, 2026 00:40

suryaiyer95 mentioned this pull request Mar 17, 2026

fix: propagate telemetry opt-out to Python engine sidecar #227

Merged

3 tasks

anandgupta42 changed the title ~~feat: wire data_diff tool through reladiff engine for deterministic data validation~~ feat: wire data_diff tool for deterministic data validation Mar 19, 2026

suryaiyer95 closed this Mar 19, 2026

suryaiyer95 wants to merge 10000 commits into main from feat/data-validation-mode

suryaiyer95 commented Mar 11, 2026 • edited by anandgupta42


20 participants
