Skip to content

ci: fix Windows caching + re-enable windows-latest#118

Open
borisbat wants to merge 3 commits into
masterfrom
bbatkin/ci-windows-caching
Open

ci: fix Windows caching + re-enable windows-latest#118
borisbat wants to merge 3 commits into
masterfrom
bbatkin/ci-windows-caching

Conversation

@borisbat

@borisbat borisbat commented Jun 3, 2026

Copy link
Copy Markdown
Owner

Part 1 of the dasImgui CI work: fix the Windows build caching, and re-enable windows-latest so the test crash surfaces on CI. Crash diagnostics are deliberately the next step, not this PR.

Build caching (mirrors daslang build.yml)

This workflow's Windows recipe was meant to follow daslang's build.yml but never picked up its caching:

  • vcpkg/openssl — wrap vcpkg/ in actions/cache@v4 + skip-guard. A hit skips clone + bootstrap + the from-source OpenSSL build, which was ~8 min on the Windows runner every run.
  • sccache — switch from the GHA backend (one cache item per TU → quota fragmentation → ~33% hit rate observed) to build.yml's local-disk backend persisted as a single tarball. Restore always; save only on master with delete-first (GHA caches are immutable). The launcher stays passed as -D on the configure (not env) so sccache scopes to daslang's /Z7 TUs and doesn't leak into the GLFW/libhv /Zi sub-builds (glfw.pdb C1041). Adds actions: write for the cache-delete.

Re-enable windows-latest

The Windows build is green; the test step fastfails (0xC0000409) only on the CI runner — not reproducible locally or on POSIX. It's kept in the matrix so the crash surfaces on CI. The sccache save steps run before the test step, so master still seeds caches despite the red Windows test leg (per the agreed "red master, save before test" approach).

Expected on this PR's run

  • vcpkg/sccache are cold on this PR (PRs are read-only; the new local-disk slots seed only once this merges and master runs). So this PR's Windows build won't show the speedup — the win lands on the next PR after merge.
  • The Windows test step will fail with the fastfail. That's the point — it's how we get the crash onto CI to diagnose.

Deliberately NOT in this PR

Caching daslang-src/bin by the daslang master SHA (skip the daslang rebuild entirely when master is unchanged — the headline win). It needs the exact shared-module output layout pinned down first; a wrong cached path would skip the build and surface incomplete-cache test failures that mimic the crash we're about to diagnose. Follow-up once Windows is characterized.

🤖 Generated with Claude Code

@borisbat borisbat force-pushed the bbatkin/ci-windows-caching branch from cbae9b4 to 3262d3a Compare June 3, 2026 08:13
borisbat added a commit that referenced this pull request Jun 5, 2026
The live-API client (live_api_transport's POST /command, plus wait_until_ready /
get_text / post_signal) used dasHV's convenience GET/POST, which inherit libhv's
default (effectively unbounded) response timeout. The host serves the live API on
the GLFW main thread that also advances frames, so when a synth command wedges that
thread the host stops answering AND stops rendering — and the client blocks on the
POST with no timeout, running the test body all the way to the 120s popen watchdog
(DEFAULT_TEST_TIMEOUT_SEC) instead of failing in seconds.

Seen on #118 windows-latest: test_imgui_synth_ctrl_shortcuts wedged for 159s and
tripped "daslang-live timed out (>120s)".

Add http_get / http_post helpers in imgui_transport that build an HttpRequest via the
existing request() rail with a hard per-request timeout (DEFAULT_REQUEST_TIMEOUT_SEC =
10s, clamped to 1..65535 so misuse can't wrap the uint16 field), and route every client
call through them. A wedged host now yields a null / status-0 reply in ~10s, so the
wait_until budgets actually bound the failure and the diagnostic is legible instead of
a mute 120s hang.

Client-side only — no host or daslang change. The convenience GET/POST stay as-is; this
uses dasHV's existing request(req) rail (HttpRequest.timeout). The new public http_get /
http_post are grouped under "Raw HTTP" in utils/imgui2rst.das to satisfy the
Uncategorized doc gate. Functional validation is the #118 rebase on CI (the
windows-headless env where the wedge reproduces).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
borisbat and others added 3 commits June 5, 2026 16:39
Mirrors the daslang build.yml caching that this workflow's Windows recipe
was supposed to follow but didn't:

- vcpkg/openssl: wrap vcpkg/ in actions/cache + skip-guard so a hit skips
  clone + bootstrap + the from-source OpenSSL build (~8 min on the Windows
  runner, paid every run before this).
- sccache: switch from the GHA backend (one cache item per TU -> quota
  fragmentation -> ~33% hit rate observed) to build.yml's local-disk
  backend persisted as a single tarball. Restore always; save only on
  master with delete-first (GHA caches are immutable). The launcher stays
  passed as -D on the configure (not env) so sccache scopes to daslang's
  /Z7 TUs and does not leak into the GLFW/libhv /Zi sub-builds (glfw.pdb
  C1041). Adds actions:write for the gh-cache-delete step.

- Re-enable windows-latest. The build is green; the test step fastfails
  (0xC0000409) only on the CI runner and is not reproducible locally or on
  POSIX. Kept in the matrix so the crash surfaces on CI. The sccache save
  steps run BEFORE the test step, so master still seeds caches despite the
  red Windows test leg.

Deliberately NOT in this PR: caching daslang-src/bin by the daslang master
SHA (skip the daslang rebuild entirely when master is unchanged). It needs
the exact shared-module output layout pinned down first; getting the cached
paths wrong would skip the build and surface incomplete-cache test failures
that mimic the crash we are about to diagnose. Follow-up once Windows is
characterized.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…round)

The 0xC0000409 /GS fast-fail in the docking reset() path only triggers under
the full 4-worker concurrent load (test_docking_basic). Running Windows at 3
workers stays under the threshold and goes green, unblocking this caching PR
while the root cause is investigated (ASan). Restore to 4 once fixed.
Dropping to 3 isolated workers did NOT avoid the test_docking_basic
0xC0000409 /GS stack-smash fast-fail — it still crashes at 3. So the
worker-count workaround buys nothing; restore the standard 4 workers
across all platforms and keep chasing the root cause via ASan.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant