test: live e2e tests + Test - Live workflow (port of pytest -m live)#34

Merged: dacorvo merged 5 commits into main from rust-live-tests on Jun 25, 2026

dacorvo commented Jun 25, 2026

Ports agentcap's live tier to Rust — the prerequisite to removing the Python client. Replaces test_cli_live.py + test_drivers_live.py + linux-live-tests.yml.

tests/live.rs

Per-agent end-to-end tests that run the real agentcap run binary (via CARGO_BIN_EXE_agentcap) against a live OpenAI-compatible server and assert the wire path (not task quality):

  • run.json shape (agent/model/upstream/turns), completed_turns == 1, a minted session_id;
  • captures landed in <run>/captures/;
  • for pi, the streamed .jsonl trace.

This single e2e per agent subsumes both Python live files (the CLI run test and the per-driver tests — completed_turns==1 + captures ⇒ the agent reached the model through the proxy and the turn succeeded). Covers pi, hermes, goose; opencode is omitted for the same reason it's @pytest.mark.skip'd (1.15.x doesn't pick up the baked agent.minimal).
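For illustration only, the wire-path checks listed above might look roughly like the following. This is a minimal std-only sketch, not the code in tests/live.rs: the helper names are hypothetical, the location of pi's `.jsonl` trace directly under the run directory is an assumption, and the substring check on `run.json` stands in for real JSON parsing.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Count regular files directly under `<run>/captures/`.
fn capture_count(run_dir: &Path) -> io::Result<usize> {
    let mut n = 0;
    for entry in fs::read_dir(run_dir.join("captures"))? {
        if entry?.file_type()?.is_file() {
            n += 1;
        }
    }
    Ok(n)
}

/// True if the run directory holds at least one non-empty `.jsonl` file.
/// (Assumption: the streamed trace sits directly under the run directory.)
fn has_jsonl_trace(run_dir: &Path) -> io::Result<bool> {
    for entry in fs::read_dir(run_dir)? {
        let entry = entry?;
        let path = entry.path();
        let is_jsonl = path.extension().and_then(|e| e.to_str()) == Some("jsonl");
        if is_jsonl && entry.metadata()?.len() > 0 {
            return Ok(true);
        }
    }
    Ok(false)
}

/// Crude shape check: does `run.json` mention each expected key?
/// A real test would parse the JSON and compare values instead.
fn run_json_has_keys(run_dir: &Path) -> io::Result<bool> {
    let text = fs::read_to_string(run_dir.join("run.json"))?;
    let keys = ["agent", "model", "upstream", "turns", "completed_turns", "session_id"];
    Ok(keys.iter().all(|k| text.contains(&format!("\"{k}\""))))
}
```

A per-agent test would then assert `run_json_has_keys(..)`, `capture_count(..) > 0`, and, for pi only, `has_jsonl_trace(..)` against the run directory produced by the binary.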

Gating: #[ignore]d so cargo test stays hermetic; resolves the server from AGENTCAP_TEST_LLM_URL (else a :8000/:8080 probe) and skips (passes) when none is reachable.
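A rough sketch of that gating logic, under stated assumptions: the function names, the `live_pi` test name, the 127.0.0.1 probe address, the 300 ms connect timeout, and the shape of the returned base URL are all hypothetical; only the env var name and the two ports come from this PR.

```rust
use std::env;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Env var named in the PR description.
const URL_VAR: &str = "AGENTCAP_TEST_LLM_URL";

/// Pick the server URL: an explicit, non-empty override wins; otherwise the
/// first probed port that `is_open` reports as accepting connections.
fn resolve_llm_url(explicit: Option<String>, is_open: impl Fn(u16) -> bool) -> Option<String> {
    if let Some(url) = explicit.filter(|u| !u.trim().is_empty()) {
        return Some(url);
    }
    [8000u16, 8080]
        .iter()
        .copied()
        .find(|&port| is_open(port))
        .map(|port| format!("http://127.0.0.1:{port}"))
}

/// Cheap reachability check: can a TCP connection be opened to localhost:port?
fn port_is_open(port: u16) -> bool {
    let addr = SocketAddr::from(([127, 0, 0, 1], port));
    TcpStream::connect_timeout(&addr, Duration::from_millis(300)).is_ok()
}

#[test]
#[ignore = "needs a live OpenAI-compatible server"]
fn live_pi() {
    let Some(url) = resolve_llm_url(env::var(URL_VAR).ok(), port_is_open) else {
        eprintln!("no live server reachable; skipping");
        return; // an early return counts as a pass
    };
    // ... run the agent against `url` and assert on the run directory ...
    let _ = url;
}
```

Passing the probe in as a closure keeps the resolution order checkable without a network; `cargo test` skips the test because of `#[ignore]`, and `cargo test --test live -- --ignored` opts in.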

.github/workflows/live.yml — "Test - Live"

Ports the proven Python live setup: install podman, cache + download the pinned Qwen3-1.7B GGUF, cache the rootless image store, spawn the pinned llama.cpp:server container (wait for /v1/models), set AGENTCAP_TEST_LLM_URL, then cargo test --test live -- --ignored --test-threads=1. Per-agent sandbox images build on demand via the binary (cached across runs). Triggers on push/PR/dispatch like the Python one.

Validated locally

fmt/clippy green; full suite hermetic (live shows as ignored); cargo test --test live -- --ignored skip-passes with no server.

⚠️ The live workflow itself (podman + GGUF + real inference) can only be exercised in CI — I can't run podman/GGUF here. This PR's Test - Live run is its first real execution; I'll watch it and fix anything that trips. Once it's green, the Python live tier can be removed in the cutover.

🤖 Generated with Claude Code

dacorvo and others added 5 commits June 25, 2026 08:34
Prerequisite to removing the Python client: replace the Python live tier
(linux-live-tests.yml, test_cli_live.py, test_drivers_live.py) with a Rust port.

- tests/live.rs: per-agent end-to-end tests that run the real `agentcap run`
  binary (via CARGO_BIN_EXE) against a live OpenAI-compatible server, asserting
  the wire path — run.json shape, completed_turns, captures landed, and pi's
  streamed JSONL trace. Subsumes both Python live files (CLI e2e + per-driver).
  `#[ignore]`d so `cargo test` stays hermetic; each test skips (passes) when no
  server is reachable. opencode omitted (same reason as the Python skip).
- .github/workflows/live.yml ("Test - Live"): ports the proven Python live
  setup — install podman, cache + download the Qwen3-1.7B GGUF, cache the
  rootless image store, spawn the pinned llama.cpp server, then
  `cargo test --test live -- --ignored` (serial). Sandbox images build on
  demand via the binary.

Gated by AGENTCAP_TEST_LLM_URL (else a :8000/:8080 probe). Verified locally:
fmt/clippy green, full suite hermetic (live ignored), live tests skip-pass with
no server. The live workflow itself needs podman + GGUF, so it's validated in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…default

reqwest::blocking defaults to a 30s total-request timeout; a slow streamed
generation (e.g. an agent turn on a CPU runner) blows past it, so the proxy's
upstream read errors mid-stream and the agent's turn never completes. Cap at a
generous-but-finite 900s instead (synth follow-up: 300s). Restores the live
workflow to all agents.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s not on `run`)

hermes' base prompt exceeds the tiny CI model's budget and bails before any
model call; the Python suite only ran hermes at the driver level with
ignore_rules/toolsets, which `agentcap run` doesn't expose. Keep pi + goose,
which cover the full stack across both trace mechanisms.

dacorvo merged commit 4eab149 into main on Jun 25, 2026
9 checks passed
dacorvo deleted the rust-live-tests branch on June 26, 2026 08:11