fix: eliminate Docker test flakiness and cut CI time #10

Merged: marconae merged 2 commits into main from fix/docker-speed-and-reliability on May 5, 2026

marconae commented May 5, 2026

Summary

  • Cache Docker image in GitHub Actions (type=gha) so apt-get + Claude CLI + Codex CLI installs are not re-downloaded on every run — image build becomes a near-instant cache hit after the first push
  • Parallelize all 4 integration test containers — each is fully isolated, so wall-clock time drops from the sum of all four to the slowest single container
  • Skip redundant docker build in test-docker.sh when the image is already loaded by the CI pre-build step; local dev still builds on demand
  • Remove || true on Claude CLI install in Dockerfile so a failed install surfaces immediately instead of silently producing a broken test environment
  • Drop the 3-attempt retry loop — the primary flakiness source was the fresh Docker image pull on every run, which caching eliminates
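
The build-skip behaviour from the third bullet could look roughly like the sketch below. It assumes the check is done with `docker image inspect`, which exits non-zero when the image is not present locally. The function name `ensure_image` and the tag `my-test-image` are hypothetical; the actual logic in test-docker.sh may differ.

```shell
# Hypothetical image tag; CI would pre-build and load an image under this name.
IMAGE="${IMAGE:-my-test-image}"

# Build the image only when it is not already present in the local daemon.
ensure_image() {
  if docker image inspect "$IMAGE" >/dev/null 2>&1; then
    echo "image $IMAGE already loaded, skipping build"
  else
    echo "image $IMAGE not found, building"
    docker build -t "$IMAGE" .
  fi
}
```

Under these assumptions, CI hits the first branch because the pre-build step has already loaded the image, while a fresh local checkout falls through to `docker build`.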

Root causes addressed

| Source | Type | Impact |
|---|---|---|
| Fresh Docker image pull (apt + Claude CLI + Codex CLI) on every run | Slowness | 8–12 min saved per run |
| 4 sequential container runs | Slowness | 3–5 min saved by parallelizing |
| Retry re-ran full pipeline including slow Docker build | Slowness + waste | Eliminated |
| `\|\| true` on Claude CLI install | Silent flakiness | Fails explicitly now |
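
A minimal sketch of the parallelization, assuming each suite runs in its own container and that plain shell background jobs are used. The helper names (`run_suite`, `run_all_parallel`), the image tag, and the script paths are hypothetical and not taken from the repository.

```shell
# Hypothetical defaults; adjust to the real image tag and log location.
IMAGE="${IMAGE:-my-test-image}"
LOG_DIR="${LOG_DIR:-$(mktemp -d)}"

# Hypothetical per-suite runner: one throwaway container per suite.
run_suite() {
  docker run --rm "$IMAGE" "./tests/$1.sh"
}

# Start every suite in the background, then wait on each PID so that a
# single failing suite makes the whole run fail.
# Usage: run_all_parallel RUNNER SUITE...
run_all_parallel() {
  runner=$1
  shift
  pids=""
  for suite in "$@"; do
    "$runner" "$suite" >"$LOG_DIR/$suite.log" 2>&1 &
    pids="$pids $suite:$!"
  done
  failed=0
  for entry in $pids; do
    suite=${entry%%:*}
    pid=${entry##*:}
    if wait "$pid"; then
      echo "PASS $suite"
    else
      echo "FAIL $suite (see $LOG_DIR/$suite.log)"
      failed=1
    fi
  done
  return "$failed"
}
```

A call such as `run_all_parallel run_suite install codex-plugin update uninstall` would then take roughly as long as the slowest suite. Writing each suite's output to its own log file keeps the interleaved output readable.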

Expected CI times

| Scenario | Before | After |
|---|---|---|
| Cache hit (no Dockerfile change) | 30 min | ~10 min |
| Dockerfile changed (cold cache) | 30 min | ~18 min |

Test plan

  • CI run on this branch completes in < 15 min
  • A second push (no Dockerfile change) shows a GHA cache hit on the Docker build step and completes in < 12 min
  • All 4 test suites (install, codex-plugin, update, uninstall) still pass

marconae added 2 commits May 5, 2026 09:23
- Cache Docker image layers in GitHub Actions (type=gha) so apt-get +
  Claude CLI + Codex CLI installs are not re-downloaded on every run;
  image build becomes a near-instant cache hit after the first run
- Run all 4 integration test containers in parallel; each is isolated so
  there is no ordering dependency — wall-clock time drops to the slowest
  single container instead of the sum of all four
- Skip `docker build` in test-docker.sh when the image is already loaded
  (CI pre-builds it via build-push-action); local dev still builds on demand
- Remove `|| true` on Claude CLI install in Dockerfile so a failed install
  surfaces immediately rather than silently producing a broken test env
- Drop the 3-attempt retry loop; the primary flakiness source was the
  fresh Docker image pull, which caching eliminates
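
To illustrate why removing `|| true` matters, the sketch below shows the shell semantics involved. `install_cli` is a hypothetical stand-in for an installer that fails; it is not the actual install command from the Dockerfile.

```shell
# Hypothetical stand-in for an installer that fails (for example, on a
# network error).
install_cli() {
  return 1
}

# With `|| true`, the step reports success even though the install failed,
# so a Dockerfile RUN line written this way would still produce an image.
masked_install() {
  install_cli || true
}

# Without it, the non-zero exit status propagates and the step fails.
strict_install() {
  install_cli
}
```

Here `masked_install` exits with status 0 and `strict_install` exits with status 1. In a Dockerfile `RUN` instruction, that non-zero status is what aborts the build.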
marconae merged commit 865fcea into main on May 5, 2026
4 checks passed
marconae deleted the fix/docker-speed-and-reliability branch on May 13, 2026 16:15