feat: RCE recorded-trace verifier v0 #59
Implements the 4-phase verification order from RCE Profile v0.1:
- Phase 1: Episode Contract schema + DAG validation (cycles, uniqueness, terminal EMIT_OUTPUT)
- Phase 2: Proof pack integrity via existing verify_proof_pack()
- Phase 3: Receipt completeness + derived hash recomputation (episode_spec_hash, env_fingerprint_hash, inputs_hash, script_hash, outputs_hash from step receipt payloads)
- Phase 4: Recorded-trace replay comparison with per-step comparator tier resolution

Verdicts: MATCH (exit 0), DIVERGE (exit 1), INTEGRITY_FAIL (exit 2).
SKIPPED steps: output_hash=null, excluded from outputs_hash and Phase 4.
INTEGRITY_FAIL: claim_check=null (comparison not reached).
Hash format: sha256:<64-char-hex> throughout.
DIVERGE: exhaustive collection, at least one divergent step required.

New files:
- src/assay/rce_verify.py: verifier engine + receipt writer
- src/assay/schemas/rce_episode_contract.schema.json: Episode Contract schema
- src/assay/schemas/rce_replay_result_v0.1.schema.json: replay result schema
- tests/assay/test_rce_verify.py: 6 tests (match, diverge, integrity_fail, skipped, writer, CLI)

Modified:
- src/assay/__init__.py: export rce_verify public API
- src/assay/commands.py: hidden `assay rce-verify` command
- src/assay/proof_pack.py: accept slash-versioned RCE receipt types

Residual: dispute.replay_pack_root_sha256 emitted as null until replay-bundle packing surface exists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
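The Phase 1 DAG checks named in this commit (cycle detection and a terminal EMIT_OUTPUT step) could be sketched roughly as follows. This is an illustration only, not the code in `rce_verify.py`: the `validate_dag` name, the `op` field, and the reading of "terminal" as "last step in the list" are all assumptions; only `step_id` and `depends_on` appear in the reviewed snippets.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Any, Dict, List


def validate_dag(steps: List[Dict[str, Any]]) -> List[str]:
    """Return Phase 1 style errors; an empty list means no problem was found."""
    errors: List[str] = []

    # Map each step to the steps it depends on (its predecessors).
    graph: Dict[str, List[str]] = {
        str(step["step_id"]): list(step.get("depends_on") or [])
        for step in steps
    }

    try:
        # static_order() raises CycleError when the graph is not acyclic.
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        errors.append("replay_script.steps: dependency cycle detected")

    # Assumption: the hypothetical "op" field marks the step kind, and the
    # terminal step is the last entry in the list.
    if not steps or steps[-1].get("op") != "EMIT_OUTPUT":
        errors.append("replay_script.steps: final step must be EMIT_OUTPUT")

    return errors
```

Returning a list of errors rather than raising mirrors the error-list style used by the validators in this PR.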
AgentMesh Lineage Check
Lineage coverage: 0/3 commits (0%)
Pull request overview
Adds an initial “recorded-trace” Replay-Constrained Episode (RCE) verifier, including contract/result JSON Schemas, a hidden CLI entrypoint, and tests to cover the main verdict paths.
Changes:
- Introduces `assay.rce_verify`, implementing 4-phase verification (contract + DAG, proof-pack integrity, receipt/artifact completeness, recorded-trace comparison) and writing a replay-result receipt + details sidecar.
- Adds JSON Schemas for the RCE Episode Contract and the replay-result receipt format.
- Extends proof-pack receipt-type acceptance to allow namespaced types with `/...` suffixes and adds a hidden `assay rce-verify` CLI command with tests.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/assay/test_rce_verify.py | Adds focused tests for MATCH/DIVERGE/INTEGRITY_FAIL paths, skipped-step semantics, writer output, and CLI exit codes. |
| src/assay/rce_verify.py | Implements the RCE verifier, schema validation, phase logic, receipt emission, and disk writer. |
| src/assay/schemas/rce_episode_contract.schema.json | Defines the Episode Contract schema used in Phase 1. |
| src/assay/schemas/rce_replay_result_v0.1.schema.json | Defines the replay-result receipt schema used for verifier outputs. |
| src/assay/commands.py | Adds hidden assay rce-verify CLI command that invokes the verifier/writer. |
| src/assay/proof_pack.py | Broadens allowed namespaced receipt type regex to accept optional /... suffix (e.g., rce.*/v0). |
| src/assay/__init__.py | Re-exports the new verifier symbols from the top-level package. |
```python
steps_replayed = int(receipt.get("steps_replayed", 0) or 0)
steps_matched = int(receipt.get("steps_matched", 0) or 0)
steps_diverged = int(receipt.get("steps_diverged", 0) or 0)
divergent_step_ids = cast(List[str], receipt.get("divergent_step_ids") or [])
```
validate_rce_replay_result is meant to return a list of validation errors, but it can currently raise (e.g., int("abc") if a caller passes an invalid receipt). This makes the helper unsafe to use as a validator on untrusted input. Consider wrapping the numeric coercions (and any other type assumptions) in try/except and appending errors instead of throwing, so the function always returns List[str].
Suggested change:

```diff
-steps_replayed = int(receipt.get("steps_replayed", 0) or 0)
-steps_matched = int(receipt.get("steps_matched", 0) or 0)
-steps_diverged = int(receipt.get("steps_diverged", 0) or 0)
-divergent_step_ids = cast(List[str], receipt.get("divergent_step_ids") or [])
+try:
+    steps_replayed = int(receipt.get("steps_replayed", 0) or 0)
+except (TypeError, ValueError):
+    errors.append("steps_replayed: must be an integer")
+    steps_replayed = 0
+try:
+    steps_matched = int(receipt.get("steps_matched", 0) or 0)
+except (TypeError, ValueError):
+    errors.append("steps_matched: must be an integer")
+    steps_matched = 0
+try:
+    steps_diverged = int(receipt.get("steps_diverged", 0) or 0)
+except (TypeError, ValueError):
+    errors.append("steps_diverged: must be an integer")
+    steps_diverged = 0
+raw_divergent_step_ids = receipt.get("divergent_step_ids")
+if raw_divergent_step_ids is None:
+    divergent_step_ids = []
+elif isinstance(raw_divergent_step_ids, list):
+    divergent_step_ids = cast(List[str], raw_divergent_step_ids)
+else:
+    errors.append("divergent_step_ids: must be a list")
+    divergent_step_ids = []
```
```python
steps_replayed += 1
expected_input_hashes = [
    cast(str, step_payloads[dependency].get("output_hash"))
    for dependency in cast(List[str], step.get("depends_on") or [])
]
observed_output_hash = _canonical_sha256(parsed_traces[step_id])
reasons: List[str] = []
if cast(List[str], payload.get("input_hashes") or []) != expected_input_hashes:
    reasons.append("input hash chain mismatch")
if observed_output_hash != _payload_string(payload, "output_hash"):
    reasons.append("JCS output hash mismatch")
```
The Phase 4 divergence logic treats an input_hashes chain mismatch as a DIVERGE verdict (reasons.append("input hash chain mismatch")). This mismatch is independent of recorded-trace replay (it’s an internal receipt/contract consistency issue), and can cause packs whose step outputs match recorded traces to be labeled DIVERGE instead of INTEGRITY_FAIL. Consider moving this check into Phase 3 (receipt completeness) and failing verification with INTEGRITY_FAIL when the step receipt’s input_hashes don’t match the receipt-derived dependency output hashes (and/or when a step depends on a SKIPPED step).
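A minimal sketch of what the relocated Phase 3 check might look like, so that a chain mismatch surfaces as an integrity error before any trace comparison happens. The function name `check_input_hash_chain` and the `status` payload key are hypothetical; `step_payloads`, `depends_on`, `input_hashes`, and `output_hash` are taken from the snippet under review.

```python
from typing import Any, Dict, List


def check_input_hash_chain(
    steps: List[Dict[str, Any]],
    step_payloads: Dict[str, Dict[str, Any]],
) -> List[str]:
    """Return integrity errors for steps whose input_hashes disagree with
    the output_hash values recorded by their dependencies."""
    errors: List[str] = []
    for step in steps:
        step_id = step["step_id"]
        payload = step_payloads.get(step_id, {})
        # Assumption: a hypothetical "status" key marks SKIPPED steps, which
        # carry output_hash=null and are validated by a separate rule.
        if payload.get("status") == "SKIPPED":
            continue
        expected = [
            step_payloads.get(dep, {}).get("output_hash")
            for dep in step.get("depends_on") or []
        ]
        if None in expected:
            errors.append(f"{step_id}: dependency has no output_hash")
            continue
        if list(payload.get("input_hashes") or []) != expected:
            errors.append(f"{step_id}: input hash chain mismatch")
    return errors
```

A non-empty result would map to INTEGRITY_FAIL, leaving Phase 4 to report DIVERGE only for output hash disagreements.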
```json
        }
      },
      "replay_pack_root_sha256": {
        "type": ["string", "null"]
```
dispute.replay_pack_root_sha256 allows any string when non-null. Since the rest of the format consistently uses sha256:<64-hex>, it would be better for the schema to enforce the same pattern when a value is present (e.g., anyOf of {type: "string", pattern: ...} and {type: "null"}). This prevents accepting malformed hashes once replay-pack bundling is implemented.
Suggested change:

```diff
-"type": ["string", "null"]
+"anyOf": [
+  {
+    "type": "string",
+    "pattern": "^sha256:[0-9a-f]{64}$"
+  },
+  {
+    "type": "null"
+  }
+]
```
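For callers that want the same rule outside of JSON Schema validation, the pattern can be checked directly in Python. This is a sketch under assumptions: `is_valid_root_hash` is a hypothetical helper name, not something this PR adds.

```python
import hashlib
import re
from typing import Any

# Same shape the schema suggestion enforces: "sha256:" plus 64 lowercase hex digits.
SHA256_PATTERN = re.compile(r"sha256:[0-9a-f]{64}")


def is_valid_root_hash(value: Any) -> bool:
    """Accept null (None) or a well-formed sha256:<64-hex> string."""
    if value is None:
        return True
    return isinstance(value, str) and SHA256_PATTERN.fullmatch(value) is not None


# Build a well-formed value from a real digest for demonstration.
example = "sha256:" + hashlib.sha256(b"replay-pack").hexdigest()
print(is_valid_root_hash(example))
```

Using `fullmatch` plays the role of the `^...$` anchors in the schema pattern.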
```python
step_ids = [str(step.get("step_id") or "") for step in steps]
duplicates = sorted({step_id for step_id in step_ids if step_ids.count(step_id) > 1 and step_id})
if duplicates:
    errors.append(f"replay_script.steps: duplicate step_id values: {', '.join(duplicates)}")
```
Duplicate step_id detection is currently O(n^2) due to repeated step_ids.count(step_id) calls. For larger ReplayScripts this can become a noticeable bottleneck during Phase 1 validation. Consider using a single pass with a Counter/dict of counts to detect duplicates in O(n).
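One way the single-pass version could look, using `collections.Counter`. The wrapper name `find_duplicate_step_ids` is hypothetical; the behavior (sorted output, empty IDs ignored) follows the snippet above.

```python
from collections import Counter
from typing import Any, Dict, List


def find_duplicate_step_ids(steps: List[Dict[str, Any]]) -> List[str]:
    """Return the sorted step_id values that occur more than once.

    Counting happens in one pass, so the cost is O(n) plus the final sort,
    instead of calling list.count() once per element.
    """
    counts = Counter(str(step.get("step_id") or "") for step in steps)
    # Skip the empty string so steps with a missing step_id are not reported here.
    return sorted(step_id for step_id, n in counts.items() if n > 1 and step_id)
```

The result can be dropped into the existing `if duplicates:` branch unchanged.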
Phase 3 now validates the profile's failure-propagation invariant:
- Steps depending on FAIL or SKIPPED ancestors MUST be SKIPPED
- SKIPPED status is only valid when at least one dependency is FAIL/SKIPPED
- A dependent step marked PASS after an upstream FAIL is rejected as INTEGRITY_FAIL (the receipt chain is structurally invalid)

This is a proof-tier correction: without it, the receipt chain could look orderly while the episode semantics were lying. The schema cannot express cross-step execution semantics, so this check belongs in the verifier.

Added focused regression: _build_failed_dependency_pack constructs a pack where a dependent step is incorrectly marked PASS after its upstream FAIL, and the verifier correctly rejects it.

7/7 tests pass. 47/47 adjacent tests (replay_judge + episode) unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
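The failure-propagation invariant described in this commit can be illustrated with a small standalone check. This is a sketch only: `check_failure_propagation` and the flat `statuses` mapping are assumptions made for the example, while the PASS/FAIL/SKIPPED values and the two rules come from the commit message.

```python
from typing import Any, Dict, List

BLOCKING = ("FAIL", "SKIPPED")


def check_failure_propagation(
    steps: List[Dict[str, Any]],
    statuses: Dict[str, str],
) -> List[str]:
    """Return integrity errors for steps that violate failure propagation."""
    errors: List[str] = []
    for step in steps:
        step_id = step["step_id"]
        deps = step.get("depends_on") or []
        upstream_blocked = any(statuses.get(dep) in BLOCKING for dep in deps)
        status = statuses.get(step_id)
        if upstream_blocked and status != "SKIPPED":
            # Rule 1: a FAIL or SKIPPED dependency forces SKIPPED downstream.
            errors.append(f"{step_id}: must be SKIPPED after upstream FAIL/SKIPPED")
        elif status == "SKIPPED" and not upstream_blocked:
            # Rule 2: SKIPPED needs at least one FAIL or SKIPPED dependency.
            errors.append(f"{step_id}: SKIPPED without a FAIL/SKIPPED dependency")
    return errors
```

Because each step only inspects its direct dependencies, the rule still propagates through the whole chain: a skipped step blocks its own dependents in turn.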
…or; tighten schema
Merge-blocker fixes from PR review:
1. input_hashes chain mismatch → INTEGRITY_FAIL (Phase 3)
A mismatch between step.input_hashes and the dependency's
output_hash in the receipt graph is an internal receipt
inconsistency, not a replay-comparison disagreement. Moving this
check to Phase 3 gives it the correct INTEGRITY_FAIL verdict.
Phase 4 now only checks JCS output hash mismatch → DIVERGE.
New test: test_input_hash_chain_mismatch_is_integrity_fail
2. validate_rce_replay_result never raises on malformed input
int("abc") / int([]) raised ValueError/TypeError on malformed
steps_replayed, steps_matched, steps_diverged fields. Replaced
with _safe_int() helper that appends to the error list instead.
Callers can treat the return value as an exhaustive error surface.
New test: test_validate_replay_result_never_raises_on_malformed_input
3. replay_pack_root_sha256 schema pattern (forward-compat)
When non-null, the value must match sha256:<64hex>. Previously
typed as ["string", "null"] with no format constraint.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review items by reclassifying hash-chain mismatches as integrity failures, hardening replay-result validation to fail by returned errors instead of exceptions, and tightening replay-pack SHA schema validation. Commit |
Summary
- `rce_verify.py`: recorded-trace RCE verifier implementing the 4-phase verification order from RCE Profile v0.1
- `assay rce-verify <pack_dir> --out-dir <dir>` CLI command
- Proof-pack acceptance of slash-versioned `rce.*/v0` receipts

Verification phases
… `verify_proof_pack`)

Key contract surfaces
- SKIPPED steps: `output_hash=null`, excluded from `outputs_hash` and Phase 4
- INTEGRITY_FAIL: `claim_check=null` (comparison not reached)
- Hash format: `sha256:<64-char-hex>` throughout
- `comparator_tiers_by_step`
- `outputs_hash` recomputed from step receipt payloads (not replay artifacts)

Residual gap
`dispute.replay_pack_root_sha256` emitted as `null`: replay-bundle packing not yet implemented.

Test plan
🤖 Generated with Claude Code