Skip to content

feat(ahti-emitter): proto sync, Permanent classification, outcome health, constructor#109

Open
dimashev wants to merge 1 commit into
mainfrom
feat/ahti-emitter-polish
Open

feat(ahti-emitter): proto sync, Permanent classification, outcome health, constructor#109
dimashev wants to merge 1 commit into
mainfrom
feat/ahti-emitter-polish

Conversation

@dimashev
Copy link
Copy Markdown
Collaborator

Summary

Polish pass on polku-ahti-emitter so it slots cleanly under the new Retriability classification merged in #108. All bundled into one PR so the proto sync + classification mapping + tests land atomically — none of these are useful without the others.

Changes

  1. Re-vendor proto/ahti_append.proto from sibling Ahti. The copy was stale: APPEND_ERROR_CODE_SCHEMA_REGISTRY_UNCLASSIFIED shifted from 1415 to make room for APPEND_ERROR_CODE_UNKNOWN_NAMESPACE = 14, and AppendError gained a repeated string available_namespaces = 12 field. build.rs regenerated src/proto/ahti.append.v1.rs.
  2. scripts/check-ahti-proto-sync.sh — flags drift vs. the sibling ahti checkout. Skips cleanly when sibling absent (CI containers). Honors AHTI_PROTO_SOURCE for non-sibling layouts. Safe in a pre-commit hook.
  3. Permanent classification in error.rs::into_plugin_error. Permanent gRPC statuses (InvalidArgument, FailedPrecondition, OutOfRange, Unauthenticated, PermissionDenied, NotFound, AlreadyExists, Unimplemented, Aborted) → PluginError::Permanent. Transient/Unknown (Unavailable, DeadlineExceeded, ResourceExhausted, Internal, Unknown, Cancelled, DataLoss) → PluginError::Send. This is the contract that makes feat(core,gateway): Retriability classification + RetryEmitter fail-fast #108 useful for Ahti — schema rejections and bad-token cases now fail fast at retry AND are neutral at the circuit breaker, while real outages still retry and trip the breaker.
  4. Outcome-based health(). Tracks most-recent success/failure as epoch-millis AtomicI64s shared across clones. Unhealthy when the most recent observed batch failed; Healthy otherwise (including pre-first-emit so we don't flap on startup).
  5. AhtiEmitterConfig::for_occurrence(...) constructor — the common case in one call: Occurrence kind, http://localhost:50051, Observed evidence, Short retention. All nine fields stay public for field-update overrides.
  6. Occurrence-only boundary doc in docs/ahti-emitter.md. Product records (entity_version, relationship_claim, decision, transition, etc.) are written by Vartio's Elixir core via vartio_ahti_client, never through this crate. If a Rust adapter "needs" a non-Occurrence kind it has drifted into product judgment; the fix is to emit the underlying signal as an Occurrence and let Vartio interpret it. References Vartio invariants Add gRPC and Webhook emitters #5, Add Kubernetes integration via Seppo SDK #6, feat(gateway): Channel capacity tuning and Hub LockFreeBuffer wiring #11–13.

Test coverage

  • error::tests::rejected_status_classification_covers_each_permanent_code — exhaustive over tonic::Code
  • error::tests::rejected_status_classifies_retriability — end-to-end retriability for FailedPrecondition vs Unavailable
  • tests::health_reports_healthy_with_no_outcomes_yet / health_reports_unhealthy_after_failure / health_recovers_after_subsequent_success / health_clones_share_outcome_state
  • config::tests::for_occurrence_builds_validatable_config / for_occurrence_overrides_apply_after_construction / for_occurrence_still_rejects_reserved_namespace

cargo test -p polku-ahti-emitter: 24 passed, 0 failed. cargo test --all green workspace-wide.

Pre-existing warning, unchanged

clippy::manual_checked_ops fires on gateway/src/middleware/token_bucket.rs:41 under Rust 1.95 but not on CI's pinned 1.85. Same warning on main, unrelated to this PR. Verified by stashing the branch diff and re-running locally.

Test plan

  • cargo fmt --all --check
  • cargo clippy -p polku-ahti-emitter --all-targets -- -D warnings
  • cargo test -p polku-ahti-emitter (24 unit + 3 ignored live-Ahti)
  • cargo test --all
  • scripts/check-ahti-proto-sync.sh returns 0 with sibling, skips with AHTI_PROTO_SOURCE=/dev/null, exits 1 on synthetic drift
  • (CI) Sykli pipeline green on Rust 1.85
  • Manual: instantiate for_occurrence(\"vartio-otel\", \"vartio.otel.span.detected.v1\", \"1\", token) in the planned live integration test

🤖 Generated with Claude Code

…lth, constructor

Polish pass on polku-ahti-emitter so it's safe to wire under the resilience
stack with the new Retriability classification from polku#108:

* Re-vendor `proto/ahti_append.proto` from the sibling ahti repo. The
  vendored copy had drifted: `APPEND_ERROR_CODE_SCHEMA_REGISTRY_UNCLASSIFIED`
  moved from 14 to 15 to make room for `APPEND_ERROR_CODE_UNKNOWN_NAMESPACE`,
  and `AppendError` gained a `repeated string available_namespaces = 12`
  field. `build.rs` regenerated `src/proto/ahti.append.v1.rs` accordingly.
* `scripts/check-ahti-proto-sync.sh` flags drift between the vendored copy
  and the sibling Ahti checkout. Skips cleanly when the sibling is absent
  (CI containers); honors `AHTI_PROTO_SOURCE` for non-sibling layouts.
* Map permanent gRPC statuses (`InvalidArgument`, `FailedPrecondition`,
  `OutOfRange`, `Unauthenticated`, `PermissionDenied`, `NotFound`,
  `AlreadyExists`, `Unimplemented`, `Aborted`) to `PluginError::Permanent` so
  `RetryEmitter` fails fast and `CircuitBreakerEmitter` treats them as
  neutral. Transient codes still map to `PluginError::Send` for retry.
* Outcome-based `health()`: track most-recent success/failure as epoch-millis
  atomics shared across `Clone`s; report `Unhealthy` after the most recent
  Ahti append failed, `Healthy` otherwise (including the pre-emit state).
* `AhtiEmitterConfig::for_occurrence(...)` constructor for the common case
  (Occurrence kind, localhost, Observed evidence, Short retention). All nine
  fields are still public for overrides.
* Document the Occurrence-only boundary in `docs/ahti-emitter.md`: product
  records (entity_version, relationship_claim, decision, transition, etc.)
  are written by Vartio's Elixir core via `vartio_ahti_client`, not by Rust
  adapters. Adding a non-Occurrence kind here would mean a Rust adapter has
  drifted into product judgment — fix is to emit the underlying signal as an
  Occurrence and let Vartio interpret it.

Tests: 24 lib tests pass (3 new for retriability classification, 4 new for
outcome health, 3 new for `for_occurrence`); `cargo test --all` green.

Pre-existing `clippy::manual_checked_ops` warning in
`gateway/src/middleware/token_bucket.rs:41` fires on Rust 1.95 but not on
CI's pinned 1.85 — unchanged by this PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dimashev dimashev requested a review from yairfalse May 30, 2026 18:28
@yairfalse
Copy link
Copy Markdown
Collaborator

Review + fixes

Strong PR — the permanent-code classification is the contract that makes #108 useful for Ahti, the exhaustive tonic::Code test is exactly right, and the for_occurrence constructor + occurrence-only boundary doc are clean. Two robustness fixes committed to feat/ahti-emitter-polish:

1. Code::Aborted was mis-classified as permanent

The gRPC convention is that ABORTED (a concurrency / transaction conflict) "should be retried at a higher level." Classifying it permanent turns a retriable conflict into a poison pill — the batch is dropped without a retry that might have succeeded. Moved Aborted to the transient/unknown bucket (retried).

This does not weaken poison-pill handling: Ahti's atomic-batch rollback surfaces as FailedPrecondition in send_occurrences, which stays permanent. (InvalidArgument / FailedPrecondition / Unauthenticated / AlreadyExists / etc. unchanged.)

If Ahti actually uses ABORTED exclusively for content-determined rollback that will always re-fail, say so and we'll move it back — but the gRPC-spec-aligned default is the safer one.

2. health() wall-clock comparison could mask a failure

last_failure > last_success over two epoch-millis atomics has two problems: it's vulnerable to wall-clock skew (an NTP step can reorder outcomes), and on a same-millisecond tie it returns Healthy — so a failure landing in the same ms as a prior success is silently masked. (The health_recovers_after_subsequent_success test had to spin-loop until the clock advanced 1ms to dodge exactly this — a tell that the approach was fighting its own resolution.)

Replaced both timestamp atomics with a single last-writer-wins AtomicU8 (OUTCOME_NONE / OK / FAIL). "Most recent observed result" is exactly the last store — no clock, no skew, no tie. Dropped now_ms() and the spin-loop; added health_failure_after_success_is_unhealthy as a regression guard.

Not code — for the rollout

The proto enum renumber (SCHEMA_REGISTRY_UNCLASSIFIED 14→15, new UNKNOWN_NAMESPACE = 14) is wire-breaking: gateway and Ahti must deploy together, and any persisted/in-flight code-14 values now decode as UNKNOWN_NAMESPACE. Re-vendored from authoritative Ahti so this is expected — just call it out in the deploy notes.

cargo fmt + clippy --all-targets -D warnings + tests (28 passed) green.

Copy link
Copy Markdown
Collaborator

@yairfalse yairfalse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Branch feat/ahti-emitter-polish

  • Comment posted: #issuecomment-4588161369
  • Committed 5d15b8e with 2 fixes:
    a. Code::Aborted → retriable (gRPC says retry at higher level; atomic rollback still surfaces as FailedPrecondition = permanent)
    b. health() → single last-writer-wins AtomicU8, removing wall-clock skew + the same-ms masking bug; dropped now_ms()/spin-loop; added health_failure_after_success_is_unhealthy
  • ✅ fmt + clippy -D warnings + tests (28 passed)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants