ci: probe /healthz, not /, when waiting for wash dev#2

Open: ericgregory wants to merge 2 commits into main from fix-ci-readiness-probe

ericgregory (Contributor) commented May 11, 2026

Summary

The integration runner (tests/api/_runner.sh) waited for the dev server by polling GET /. The api-gateway has an explicit early-return for / in components/api-gateway/src/routes.rs::dispatch that returns 200 "ocelaudit booting" before AppState::startup() has finished — every other path returns 503 in that window. Result: the runner declared ready as soon as / came up, then tests/api/m*.sh hit /healthz and /api/v1/* and got 503 across the board (the symptom you've been seeing on main since 2026-05-01, commit 3da49e4 — the M14 split into csl-service + api-gateway).

This PR:

  1. Switches the runner's readiness probe to /healthz (only 200 once AppState is Ok — storage initialized, signer loaded).
  2. On readiness timeout, dumps the final /healthz status + body and the full wash dev log. The 503 body carries the exact AppState::startup() error via RouteResponse::err — discarding it was hiding the real bug.
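
A minimal sketch of what such a ready-loop could look like, for readers who do not have tests/api/_runner.sh open. The function names (probe_status, wait_ready) and the argument order are illustrative assumptions, not the runner's actual code; only the diagnostic layout mirrors the CI output quoted in this PR.

```shell
#!/bin/sh
# Hypothetical sketch; probe_status and wait_ready are illustrative names,
# not taken from tests/api/_runner.sh.

# probe_status URL: print the HTTP status code for URL.
# curl prints 000 when the connection itself fails.
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$1" 2>/dev/null || true
}

# wait_ready URL TIMEOUT_SECONDS LOG_FILE
# Polls URL once per second until it answers 200. On timeout, prints the
# final status and body plus the full log to stderr, then returns 1.
# Note: plain sh has no function-local variables, so these are globals.
wait_ready() {
  url=$1
  timeout=$2
  log=$3
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if [ "$(probe_status "$url")" = "200" ]; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done

  echo "!! wash dev did not become ready within ${timeout}s." >&2
  echo "-- final /healthz status + body --" >&2
  body_file=$(mktemp)
  status=$(curl -s --max-time 2 -o "$body_file" -w '%{http_code}' "$url" 2>/dev/null || true)
  echo "  status=${status}" >&2
  echo "  body:" >&2
  # Indent the body so it stands out from the surrounding log lines.
  sed 's/^/    /' "$body_file" >&2
  rm -f "$body_file"

  echo "-- full wash dev log --" >&2
  if [ -f "$log" ]; then
    cat "$log" >&2
  fi
  return 1
}
```

A caller would invoke it as something like `wait_ready "http://127.0.0.1:$PORT/healthz" 60 "$WASH_LOG" || exit 1`, where PORT and WASH_LOG are placeholders for whatever the runner already uses. Fetching the body once more after the loop, rather than on every poll, keeps the happy path quiet while still surfacing the startup error on failure.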

What the new diagnostics reveal

With this PR applied, CI still fails — but now with a clear pointer to the underlying issue. Latest run:

!! wash dev did not become ready within 60s.
-- final /healthz status + body --
  status=503
  body:
    {"error":"io: No such file or directory (os error 44)"}

WASI errno 44 = ENOENT. So AppState::startup() is failing because some filesystem op against /data returns "no such file or directory" inside the wasm sandbox. The likely failure points are in components/api-gateway/src/state.rs:

  • JsonFsStorage::open("/data") → fs::create_dir_all
  • storage.users_seed_if_empty() → write /data/users.json
  • SessionSigner::from_env_or_keyfile → write /data/session.key

The runner pre-stages .cache/ocelaudit-data before booting wash dev and .wash/config.yaml maps it to /data via the volumes block, so the host directory exists at boot. My guess (not verified) is that the M14 introduction of service_file alongside the existing volumes block changes how wash dev wires preopens for the main component vs. the service. Worth looking at:

  • Are /data preopens being applied to both the api-gateway component and the csl-service in wash 2.0.5? (Code path: wash-runtime/src/engine/workload.rs ~line 1126, the for (host_path, mount) in components.values().flat_map(...) loop.)
  • Is the relative ./.cache/ocelaudit-data host_path being resolved against the wrong cwd somewhere downstream?
  • Would an absolute host_path (set by the runner via temp config) sidestep the issue?
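
To make the last question concrete, here is a rough sketch of how the runner could produce a temporary config with an absolute host path. It is untested against wash dev: abs_path and rewrite_host_path are hypothetical helpers, the rewrite is a plain text substitution that assumes nothing about the config schema, and how wash dev would be pointed at the generated file is deliberately left out.

```shell
#!/bin/sh
# Hypothetical helpers; neither exists in the repo today.

# abs_path DIR: print the absolute, symlink-resolved path of an existing directory.
abs_path() {
  (cd "$1" && pwd -P)
}

# rewrite_host_path SRC DST REL ABS
# Copies SRC to DST, replacing every literal occurrence of REL with ABS.
# awk's index() is used instead of sed so that "." and "/" in the paths
# are treated as literal text rather than regex or delimiter characters.
# Caveat: awk -v interprets backslash escapes, so paths containing
# backslashes are not handled by this sketch.
rewrite_host_path() {
  src=$1
  dst=$2
  rel=$3
  abs=$4
  awk -v rel="$rel" -v abs="$abs" '
    {
      out = ""
      while ((i = index($0, rel)) > 0) {
        out = out substr($0, 1, i - 1) abs
        $0 = substr($0, i + length(rel))
      }
      print out $0
    }
  ' "$src" > "$dst"
}
```

In the runner this might be used as `rewrite_host_path .wash/config.yaml "$tmp_config" "./.cache/ocelaudit-data" "$(abs_path .cache/ocelaudit-data)"`, with tmp_config being a scratch file created by the runner. If /healthz turns 200 with the absolute path, that would point at cwd resolution rather than at the preopen wiring.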

A maintainer with wash-dev internals fluency could probably localize this faster than I can from CI logs alone.

Test plan

  • CI exercises the new probe + diagnostic on every run.
  • Once the underlying preopen / startup issue is fixed, expect /healthz to flip 200 quickly and all tests/api/m*.sh to pass.

The integration runner waits for the dev server by polling GET /. The
api-gateway has an explicit early-return for / in
components/api-gateway/src/routes.rs::dispatch that returns 200
"ocelaudit booting" *before* AppState::startup() has finished — every
other path returns 503 in that window. Result: the runner declared
ready as soon as / came up, then the tests/api/m*.sh scripts immediately
hit /healthz and /api/v1/* and got 503 across the board.

Switch the probe to /healthz, which is only 200 once AppState is Ok
(storage initialized, signer loaded). All the m*.sh scripts already
wait_for /healthz themselves, so this aligns the runner with what the
tests expect.

Refs: tests/api/_runner.sh ready-loop; routes.rs dispatch Err-arm.

The /healthz probe is the right semantic gate but it surfaced a deeper
issue: AppState::startup() never succeeds in CI, so /healthz stays 503
for the full 60s. The 503 body carries the actual startup error per
RouteResponse::err in routes.rs, and the runner was discarding it.

Capture one final /healthz response (status + body) and dump the full
wash dev log (not just the last 50 lines) so the next CI failure shows
the underlying error message.