fix(ci): require N consecutive HTTP successes in wait-for-grpc.sh (next)#63

Merged
WiktorStarczewski merged 1 commit into next from wiktor/fix-probe-stable-next
Apr 30, 2026
@WiktorStarczewski
Collaborator

Summary

Tighten scripts/wait-for-grpc.sh so a single transient HTTP success doesn't pass the readiness gate. The probe now requires N consecutive HTTP successes spaced PROBE_INTERVAL apart before declaring the gRPC server ready (defaults: 3 successes, 0.5s apart → ~1s of stable response).
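The streak logic described above can be sketched roughly as follows. This is not the actual contents of scripts/wait-for-grpc.sh: PROBE_INTERVAL is the only name taken from this PR, while REQUIRED_SUCCESSES, SLOW_INTERVAL, DEADLINE, probe_once and wait_for_stable are hypothetical names chosen for illustration, and the use of curl is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only; names other than PROBE_INTERVAL are hypothetical.
REQUIRED_SUCCESSES="${REQUIRED_SUCCESSES:-3}"  # successes needed in a row
PROBE_INTERVAL="${PROBE_INTERVAL:-0.5}"        # seconds between streak probes
SLOW_INTERVAL="${SLOW_INTERVAL:-1}"            # seconds between polls while down
DEADLINE="${DEADLINE:-90}"                     # overall budget in seconds

# Succeeds when the URL answers with any real HTTP status (1xx-5xx).
# curl reports 000 when no HTTP response was received (assumed probe tool).
probe_once() {
  local code
  code="$(curl -s -o /dev/null -m 2 -w '%{http_code}' "$1" 2>/dev/null)"
  case "$code" in
    [1-5][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_stable URL [PROBE_FN]
# Returns 0 after REQUIRED_SUCCESSES consecutive successes, 1 on deadline.
wait_for_stable() {
  local url="$1" probe="${2:-probe_once}" streak=0 deadline
  deadline=$(( $(date +%s) + DEADLINE ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if "$probe" "$url"; then
      streak=$(( streak + 1 ))
      if [ "$streak" -ge "$REQUIRED_SUCCESSES" ]; then
        return 0
      fi
      sleep "$PROBE_INTERVAL"
    else
      streak=0                # any failure discards the partial streak
      sleep "$SLOW_INTERVAL"
    fi
  done
  return 1
}

# Example call (the URL variable is a placeholder):
#   wait_for_stable "$GRPC_URL" || exit 1
```

Passing the probe as an optional second argument is a design choice of this sketch: it lets the streak logic be exercised with a fake probe, without a running server.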

Why

Observed flake mode: wait-for-grpc.sh declared the server responsive after the FIRST attempt that returned an HTTP code in [1-5][0-9][0-9]. Empirically, the testing-node-builder's HTTP layer can flicker up briefly during init (tonic-health, for example, comes online before the rest of the dispatcher is fully wired). The probe exits, tests start, and ALL tests fail with TypeError: Failed to fetch against the gRPC backend.
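To make the old gate concrete, the snippet below sketches the status-code test described above. It assumes the code comes from something like curl's `%{http_code}`, which reports 000 when no HTTP response arrives; is_http_code is a hypothetical helper name, not taken from the script. Note that error statuses such as 404 or 503 also match the pattern, so a single such response was enough to pass.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds for any three-digit status from 100 to 599.
is_http_code() {
  case "$1" in
    [1-5][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Show which codes the old single-probe gate would have accepted.
for code in 000 200 404 503; do
  if is_http_code "$code"; then
    echo "$code -> HTTP response (old gate: ready)"
  else
    echo "$code -> no HTTP response (keep polling)"
  fi
done
```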

Recent occurrences this would have caught:

  • web-sdk PR #23: ci-shard-4
  • PR #29: ci-shard-1 and ci-shard-4
  • PR #27: multiple shards

Behaviour change

| Before | After |
| --- | --- |
| First HTTP 1xx-5xx → ready | Must see the same kind of response 3× in a row, ~0.5s apart |
| Single transient blip can pass the gate | Streak resets to 0 on any 000 / connect-refused, so the slow-poll loop resumes |

The slow-poll sleep 1 and the 90s deadline are unchanged; the streak adds at most ~1s to the steady-state ready-detection time (two extra 0.5s waits at the default settings).
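The ~1s figure follows from the defaults: after the first success, the probe waits PROBE_INTERVAL before each of the remaining N-1 probes. A small sketch of that arithmetic, assuming the time spent in each request is negligible (streak_overhead is a hypothetical helper, not part of the script):

```shell
#!/usr/bin/env bash
# streak_overhead N INTERVAL
# Prints the waiting time the streak adds once the server is up:
# (N - 1) waits of INTERVAL seconds each. Request latency is ignored.
streak_overhead() {
  awk -v n="$1" -v i="$2" 'BEGIN { printf "%.1f\n", (n - 1) * i }'
}

streak_overhead 3 0.5   # defaults from this PR; prints 1.0
```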

Test plan

  • CI's Detect relevant changes flags this as non_docs (it's a script in scripts/).
  • All Test/Build jobs go green on this PR.
  • Watch the next 5–10 PR runs across the repo and confirm the gRPC fetch flake doesn't recur in shards that previously hit it.

If the flake persists after this lands, the next escalation is probing with a real grpc-web request (vs plain GET) so we exercise the actual dispatch path. Tracking that as a follow-up.
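One possible shape for that follow-up probe is sketched below. It is an illustration under several assumptions, not a planned implementation: it assumes curl is available, that the server accepts grpc-web over plain HTTP POST, and that some unary method path (for example a grpc.health.v1.Health/Check endpoint) is reachable. The function names are hypothetical. The request body is a single empty gRPC-Web data frame: one flag byte (0) followed by a 4-byte big-endian length of 0.

```shell
#!/usr/bin/env bash
# Sketch only; all function names here are hypothetical.

# Emit an empty gRPC-Web data frame: flag byte 0 + 4-byte length 0.
empty_grpc_web_frame() {
  printf '\000\000\000\000\000'
}

# Read HTTP response headers on stdin; succeed if the server answered
# with a grpc-web content type rather than a generic HTTP reply.
looks_like_grpc_web() {
  grep -qi '^content-type: *application/grpc-web'
}

# probe_grpc_web URL
# POSTs the empty frame and checks the response headers.
probe_grpc_web() {
  local url="$1"
  empty_grpc_web_frame |
    curl -s -m 2 -o /dev/null -D - \
      -H 'content-type: application/grpc-web+proto' \
      -H 'x-grpc-web: 1' \
      --data-binary @- \
      "$url" 2>/dev/null |
    looks_like_grpc_web
}

# Example call (base URL and method path are placeholders):
#   probe_grpc_web "$GRPC_URL/grpc.health.v1.Health/Check"
```

A content-type check only shows that the grpc-web layer answered; a stricter variant could also inspect the grpc-status value in the response headers or trailers.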

Observed flake: probe returns HTTP 200 once on the first attempt that
clears the connection-refused phase, exits, tests start, ALL tests fail
with 'TypeError: Failed to fetch' to the gRPC backend. The single-probe
gate isn't strict enough — a one-shot 200 (e.g. tonic-health responding
before the rest of the dispatcher is fully wired) currently passes.

Upgrade the readiness signal to N consecutive HTTP successes spaced
PROBE_INTERVAL apart (defaults: 3 successes, 0.5s apart), so the probe
only declares the server ready after ~1s of demonstrably-stable
response. Any non-success in the streak resets it to zero and the
slow-poll loop resumes, so a momentary blip during init cannot
contribute to a passing streak.

Tracked occurrences across recent PR runs: web-sdk PR #23 ci-shard-4,
PR #29 ci-shard-1 + ci-shard-4, PR #27 multiple shards.
@WiktorStarczewski added the "no changelog" label (PR doesn't need a CHANGELOG entry; trivial / non-user-visible) on Apr 30, 2026
@WiktorStarczewski merged commit 6834abd into next on Apr 30, 2026
30 of 31 checks passed
@WiktorStarczewski deleted the wiktor/fix-probe-stable-next branch on April 30, 2026 13:35
WiktorStarczewski added a commit that referenced this pull request Apr 30, 2026
…en configs

  - vitest.config.js: exclude crates/web-client/js/passkey-keystore.js
    from coverage scope. The new file depends on browser-only WebAuthn
    PRF APIs that aren't reachable from node, so it can't be unit-tested
    via vitest. It's covered end-to-end by
    crates/web-client/test/passkey-keystore.test.ts under Playwright
    instead. Without this exclusion, coverage was 73.67% (vs 95% gate).

  - knip.jsonc: remove 'dexie' from ignoreDependencies. dexie was
    previously only loaded transitively at test runtime (via rollup-bundled
    page.evaluate scripts that knip's static scan can't see), but
    passkey-keystore.js now imports it directly at the source level —
    making the ignore unnecessary. Knip flagged this with 'Remove from
    ignoreDependencies'.

  - Pulls in scripts/wait-for-grpc.sh's stricter probe (#62 / #63), which
    addresses the gRPC dispatch flake that previously hit several PRs'
    integration shards.

cargo check --workspace --target wasm32-unknown-unknown is clean (build
artifact compiles against the dep retarget at miden-client wiktor-storekeys
that this PR carries on Cargo.toml). Test stage may surface additional
issues now that the build clears — particularly the 'webclient_new
undefined' WASM-symbol issue we saw on prior runs of this branch — but
those need to be diagnosed against a fresh CI run rather than guessed at.