feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge#337
Open
easel wants to merge 1 commit into
Open
feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge#337easel wants to merge 1 commit into
easel wants to merge 1 commit into
Conversation
5ae6880 to
421f852
Compare
This was referenced Jun 4, 2026
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
## What Containerization stack for lucebox-hub. Dockerfile + docker-bake.hcl build the lucebox-hub image (build-env and runtime stages); scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh emits IMAGE_INFO / HOST_INFO sidecars consumed by /props. GitHub Actions add .github/workflows/docker.yml (build & publish), update ci.yml, and add release-luce-bench.yml for tagging. Workspace-root files (pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README) live here because the Dockerfile uv-syncs the workspace at build time. ## Why Provides the reproducible image and CI pipeline every other split PR deploys into. Centralizing build/publish here keeps Dockerfile, entrypoint, and workspace-root pinning in one reviewable change. ## Dependencies - Luce-Org#335 (lucebox-cli): Dockerfile COPYs lucebox/ into the image - Luce-Org#337 (lucebench-harness): Dockerfile COPYs luce-bench/ into the image
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
da4e316 to
c20a0f2
Compare
Contributor
There was a problem hiding this comment.
16 issues found across 116 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="scripts/check_card_bundle_drift.sh">
<violation number="1" location="scripts/check_card_bundle_drift.sh:18">
P2: Drift guard is one-way and misses extra/stale files in the bundled wheel, so CI can report success even when bundle contents do not exactly match `share/model_cards`.</violation>
</file>
<file name="luce-bench/src/lucebench/fixtures/agent_prompts/codex_apply_patch.md">
<violation number="1" location="luce-bench/src/lucebench/fixtures/agent_prompts/codex_apply_patch.md:325">
P2: Grammar for `Hunk` does not allow multiple `@@` context-scoping lines, contradicting the documented feature and examples. The production `Hunk := "@@" [ header ] NEWLINE { HunkLine }` expects only HunkLines (starting with space, `-`, or `+`) after the `@@` header, but the text explicitly says "use multiple `@@` statements to jump to the right context" and shows consecutive `@@` lines (e.g., `@@ class BaseClass` then `@@ def method():`). This inconsistency will confuse agent implementors or cause strict parsers to reject valid patch syntax.</violation>
</file>
Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
On a pro plan you can use ultrareview for larger PRs.
Tip: cubic used a learning from your PR history. Let your coding agent read cubic learnings directly with the cubic MCP.
Re-trigger cubic
c20a0f2 to
6f27131
Compare
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
## What Containerization stack for lucebox-hub. Dockerfile + docker-bake.hcl build the lucebox-hub image (build-env and runtime stages); scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh emits IMAGE_INFO / HOST_INFO sidecars consumed by /props. GitHub Actions add .github/workflows/docker.yml (build & publish), update ci.yml, and add release-luce-bench.yml for tagging. Workspace-root files (pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README) live here because the Dockerfile uv-syncs the workspace at build time. ## Why Provides the reproducible image and CI pipeline every other split PR deploys into. Centralizing build/publish here keeps Dockerfile, entrypoint, and workspace-root pinning in one reviewable change. ## Dependencies - Luce-Org#335 (lucebox-cli): Dockerfile COPYs lucebox/ into the image - Luce-Org#337 (lucebench-harness): Dockerfile COPYs luce-bench/ into the image
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
## What Containerization stack for lucebox-hub. Dockerfile + docker-bake.hcl build the lucebox-hub image (build-env and runtime stages); scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh emits IMAGE_INFO / HOST_INFO sidecars consumed by /props. GitHub Actions add .github/workflows/docker.yml (build & publish), update ci.yml, and add release-luce-bench.yml for tagging. Workspace-root files (pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README) live here because the Dockerfile uv-syncs the workspace at build time. ## Why Provides the reproducible image and CI pipeline every other split PR deploys into. Centralizing build/publish here keeps Dockerfile, entrypoint, and workspace-root pinning in one reviewable change. ## Dependencies - Luce-Org#335 (lucebox-cli): Dockerfile COPYs lucebox/ into the image - Luce-Org#337 (lucebench-harness): Dockerfile COPYs luce-bench/ into the image
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
6f27131 to
27dd7f1
Compare
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 5, 2026
…i-turn agent_recorded + LLM judge
…ge grading ## What Adds the luce-bench/ Python package as a standalone bench harness: - Core areas: smoke, gsm8k, hellaswag, humaneval, truthfulqa_mc1, longctx, agent, agent_recorded (multi-turn replay), forge, ds4_eval. - Multi-turn agent_recorded replay with an LLM-judge grader and per-turn metrics; forge_eval fixture imported under fixtures/forge_eval/_forge/. - Card sampling, snapshot/submit_baseline, model-card schema, thinking-budget client, normalize+regrade pipeline, hostinfo, and CLI entrypoint. - Fixtures: agent_cases, agent_recorded (single + multi_turn), ds4_eval_cases, agent_prompts (codex variants). - Tests: ~25 pytest modules covering every area plus thinking control, normalize/regrade, snapshot, runner, and host-info paths. - scripts/extract-agentic-fixture.py (loaded by test_extract_agentic_fixture.py via path) and the scripts/check_card_bundle_drift.sh CI gate. ## Why Splits the bench harness out of lucebox-hub's main tree so it can ship as its own installable package and be consumed by lucebox bench without pulling in the C++ server build context. Multi-turn replay + LLM-judge grading is what unblocks the coding-agent-loop sweep workflow. ## Dependencies None - this PR is independent.
27dd7f1 to
ac972b7
Compare
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 5, 2026
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds the standalone
luce-bench/Python eval harness package (vendored into the repo), plus a multi-turnagent_recordedreplay area graded by a Claude Sonnet LLM judge. Also wires up a card-bundle drift check script and CI workflow updates.Files
luce-bench/— new top-level package (pyproject, src layout, tests, fixtures, docs)src/lucebench/areas/— eval areas: smoke, agent, agent_recorded, ds4_eval, forge, gsm8k, hellaswag, humaneval, longctx, truthfulqa_mc1src/lucebench/grading/llm_judge.py— Sonnet judge with on-disk cachesrc/lucebench/fixtures/— vendored eval fixtures (incl. forge_eval scenarios)tests/— ~20 test modules covering graders, runners, snapshot, cardsscripts/check_card_bundle_drift.sh— new drift CI helper.github/workflows/ci.yml— wires luce-bench tests + card bundle drift checkDependencies
None. This PR is self-contained: pure-Python harness with no source-level references to the server, lucebox CLI, or docker-stack PRs. Independent of #334, #335, #336, #338, #339, #340, #341.
Test plan
cd luce-bench && pytestpasses locallyagent_recordedagainst a live server with judge mocked--areas alldoes not enable the LLM-judge path (cost containment)Judge cost estimate: ~$0.30-$1.50 per full 48-case pass on Anthropic Sonnet.
Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com