TB2 Kimi fixes: empty-turn synth, pre-submission prompt guidance, reasoning_content NUL sanitize#41
Open
n-WN wants to merge 25 commits into
Open
TB2 Kimi fixes: empty-turn synth, pre-submission prompt guidance, reasoning_content NUL sanitize#41n-WN wants to merge 25 commits into
n-WN wants to merge 25 commits into
Conversation
Drives Kimi TB2 benchmark from 5/89 (sk-kimi-* rate-limit era) to 55/89 = 61.8% (OAuth login path), beating GPT-5 spark's 52/89 baseline. ---- 1. Kimi OAuth device-code login (matches kimi-cli 1.37.0) ---- Moonshot's Kimi Code platform (`api.kimi.com/coding/v1`) accepts raw `sk-kimi-*` API keys, but those keys hit a shared per-account rate bucket that saturates at ~30 bursts then stays 429 for minutes — the prior run solved 5/89 because workers kept exhausting and burning. The OAuth-login path is a separate subscription-backed quota. We mirror kimi-cli's protocol exactly: * Device-code flow at `https://auth.kimi.com/api/oauth/{device_authorization,token}` with hardcoded public `client_id = 17e5f671-d194-4dfb-9706-5516cb48c098`. * Every request (OAuth + chat) carries the full `X-Msh-*` header set (Platform=kimi_cli, Version, Device-Name, Device-Model, Os-Version, Device-Id) plus `User-Agent: KimiCLI/1.37.0`. Chat-side must match or subscription gating silently degrades. * Platform/arch strings computed via `sw_vers -productVersion` on macOS and `uname -r`/`uname -m` on Linux — verified byte-for-byte against Python's `platform.release/machine/version` on both darwin-arm64 and linux-x86_64. * Device id is a persistent UUID at `$PUFFER_HOME/kimi-device-id`, chmod 0600. * Refresh uses a cross-process fslock at `$PUFFER_HOME/kimi-oauth.lock` because Kimi's refresh rotates the refresh_token — 20 concurrent bench workers sharing one credential file would race-invalidate each other without the lock. New: * `crates/puffer-provider-openai/src/kimi_oauth.rs` — request/poll/ refresh with tests (device_id persistence, arch resolution, UA version alignment, ASCII header sanitization). * `crates/puffer-provider-openai/examples/dump_kimi_headers.rs` — debug helper to diff our header set vs kimi-cli. CLI wiring (`puffer auth login kimi-for-coding`): * New `OauthFamily::KimiDevice` variant; detected by provider id starting with `kimi-` or base_url containing `api.kimi.com`. * `run_login_flow` early-dispatches to `run_kimi_device_login_flow` before the PKCE+callback path so Kimi-style "show URL, poll token" doesn't fight the localhost-callback listener. * Opens browser automatically via existing `authflow::open_browser`. * Stores tokens via generic `OAuthCredential` (keychain on macOS, plaintext `.credentials.json` on Linux) — same code path Codex/ Anthropic already use. Runtime integration (`crates/puffer-core/runtime/openai.rs`): * `resolve_openai_execution_config` auto-injects `X-Msh-*` headers whenever provider is Kimi (both API-key and OAuth paths), so the wire shape matches kimi-cli regardless of credential type. * 401 refresh dispatches per provider: OpenAI → `auth.openai.com`, Kimi → `auth.kimi.com` with file-locked `refresh_kimi_token_locked`. Request shape (`request.rs`): when `custom_headers` carries a User-Agent (yaml `headers:` or Kimi path), skip the default Codex UA so reqwest doesn't concatenate two values and fail Kimi's whitelist. Benchmark harness (`benchmark/puffer_harbor_agent.py`): container install step now side-channels `puffer-auth.json`, `puffer-credentials.json`, and `puffer-kimi-device-id` through the mounted `codex_dir` into `$HOME/.puffer/` inside the container, so in-container puffer can use the same OAuth credentials as the host (with chmod 0600 on the secrets file). ---- 2. Reflection: completion-claim trigger + test-discipline nudge ---- TB2 analysis: 25 of 36 unsolved-by-Kimi tasks fell into verifier-reject with no transport error — agent did real work, ran its own smoke test if anything, declared "Done", verifier disagreed. None of those triggered the existing reflection because the heuristic gate only fires on stall (5 min without validation/artifact/edit progress), which a confidently-wrong agent never hits. Solved-vs-unsolved breakdown on the 89-task Kimi run: * SOLVED tasks ran tests: 17% (9/54). * UNSOLVED tasks ran tests: 7% (2/27). * 21 unsolved long-trajectory tasks ran ZERO tests before "Done". Intervention (`crates/puffer-core/runtime/reflection.rs`): * `BatchAssessment` gains `completion_claim: Option<String>`. * `observe_openai_batch` scans the latest assistant message for completion-claim phrases ("done", "complete", "all set", "verified", "successfully", "has been written", "saved to /", `✅`, …) — markers curated from the 25 trajectories. * When a claim is present AND the normal gate would skip AND minimum tool-call + minimum batch-interval guards pass, force `should_evaluate=true` so code_judge + llm_judge can challenge the agent's self-assessment. * `build_checkpoint` appends a claim-specific addendum to the reflection prompt (Chinese + English variants). It acknowledges that TB2's verifier tests live outside `/app` and the agent cannot see them, then asks the agent to (1) re-list each explicit requirement from the prompt, (2) construct an independent `/tmp/check.py` that validates the produced artifact against each requirement (including sanity-checks like "graphene G peak should be ~1580 cm⁻¹, not 19196"), and (3) only declare Done when every check passes. * Four new unit tests cover claim detection, empty/non-claim negatives, walking past trailing tool calls, and phrasing variants. ---- 3. UTF-8 char-boundary panic fix ---- `crates/puffer-core/runtime/debug_context.rs` was slicing message text with `&text[..2000]` / `&input[..500]`, which panics (`byte index X is not a char boundary; it is inside '≤'`) when the cut falls inside a multi-byte UTF-8 codepoint. The TB2 `distribution-search` prompt contains `≤`, and the resulting crash killed that trial before retry could save it. Added `truncate_utf8_prefix` that backs off to the nearest `is_char_boundary(i)` and two unit tests covering the exact failure case.
The v1 completion-claim trigger never fired in practice. Diagnosed by
inspecting the reflection-rerun trajectories of all 34 retried tasks:
zero of them had a reflection-injected user message, despite every
failing task ending in "Done" / "All set" / etc.
Two compounding bugs blocked the trigger:
1. `observe_batch_internal` short-circuits on `invocations.is_empty()`.
But agents emit the completion claim in the LAST turn, which is
typically text-only — no new tool calls. So the observation layer
never ran, and the completion-claim check sitting on top of
`observe_openai_batch`'s post-observation code was unreachable.
2. Even after synthesizing an observation for the empty-invocations case,
`code_judge_score` returns 0 on a fresh synthesized assessment. Under
the default `LlmJudgeMode::ConfirmCodeJudge`, the llm judge is then
skipped too. No judge → no signal → no checkpoint.
Fix — `synthesize_claim_only_observation`:
* Runs when `observe_batch_internal` bailed on empty invocations AND
`latest_assistant_completion_claim(items)` found a claim AND the
minimum-tool-calls + minimum-batch-interval guards pass.
* Inflates `time_since_progress_ms` past both `soft_stall_ms` and
`hard_stall_ms` (+60 s margin) so `code_judge_score` yields 4 ≥ the
default `min_score = 4`. Gets `code_judge_signal` to produce a
signal, which unlocks the llm judge under `ConfirmCodeJudge`.
* Tags the assessment with `signal_notes = ["synthesized_from_completion_claim"]`
so operators can tell the difference between a real stall and a
claim-forced evaluation.
Verified via the 4 existing completion_claim unit tests; integration
will be measured by the fresh kimi-reflection-*T120008Z run against
baseline stochasticity (3/9 ≈ 33% solve-on-rerun in the buggy v1 +
rerun-unsolved runs).
…fires
The completion-claim trigger was still inert after the
synthesize_claim_only_observation fix, because the agent's terminal
"Done" turn — the one that carries the claim text — almost always has
NO new tool calls. Both openai exit paths early-returned on
`tool_calls.is_empty()` WITHOUT calling `observe_openai_batch`, so
reflection never had a chance to see the claim.
Fix — for all three request paths (Responses streaming SSE, Responses
non-streaming JSON, Chat Completions non-streaming):
* Push the final assistant text into `items` before deciding to exit,
so reflection's `latest_assistant_completion_claim(items)` can find
it.
* Invoke `tracker.observe_openai_batch(&[], &items, ...)` with empty
invocations. The synth path inside the tracker handles empty
invocations specifically when a claim is present.
* If a reflection checkpoint comes back, inject its prompt as a user
message and `continue` the turn loop — giving the agent another
chance to verify for real or resubmit. The streaming path also
emits `TurnStreamEvent::ReflectionCheckpoint` for TUI visibility;
the non-streaming paths have no on_event callback, so they just
inject silently.
* If no checkpoint, fall through to the normal exit with the
accumulated assistant_text.
Verified by re-running reflection-rerun (v4 TAG
kimi-reflection-20260421T121427Z, 30 currently-unsolved tasks) — the
prior v3 saw 0 injections despite every failing task ending in "Done".
The completion-claim trigger was gated on a curated 25-phrase keyword
list ("done", "complete", "verified", "saved to /", "✅", …). That's a
classic brittle heuristic:
* False positives — "this subtask is complete, now moving on…",
"I've successfully identified the bug, now fixing it.", any usage
of "completed", "ready", "finished" mid-task fires the gate when
the agent is nowhere near done.
* False negatives — TB2 tasks like `chess-best-move` end with
"The best move for White is c1g5." — no keyword in the list,
trigger stays silent even though the agent is clearly submitting.
* Overfitting — list was tuned against one snapshot of one model's
phrasing; drifts as model updates.
Replaced with a pure structural signal: whenever `observe_openai_batch`
is invoked with empty invocations (the terminal text-only turn the
openai runtime now hooks) AND the min-tool-calls + min-batch-interval
guards pass, synthesize the observation. The LLM judge then reviews
the actual trajectory + the agent's terminal text and decides for
itself whether the submission is premature. No keyword list, no
hard-coded phrasing assumptions.
Renamed `latest_assistant_completion_claim` → `latest_assistant_message_text`
to reflect that it no longer classifies, just extracts. The
`BatchAssessment.completion_claim` field is kept (still semantically
"the text we want to show the judge from the terminal turn") and
continues to gate the verify-your-submission nudge that gets appended
to the reflection checkpoint prompt.
Tests rewritten to verify the structural contract (returns any
non-empty assistant text; walks past trailing tool calls; returns None
for empty/missing assistant turns). No keyword detection tests — the
behavior they were testing is gone.
V5 reflection rerun regressed to 0/8 (400 "reasoning_content missing in assistant tool call message at index 3") because the terminal-turn hook pushed the assistant text into `items` WITHOUT the accompanying reasoning. On the next request (after the reflection checkpoint injected a user message and continued the loop), Kimi replayed the conversation, hit the reasoning-less assistant message, and rejected. Mirror the normal tool-call branch's bookkeeping: after pushing the assistant_text, immediately call `append_reasoning_items(&mut items, &response.reasoning_items)`. Same invariant Kimi enforces on every replayed assistant message when `thinking: enabled` is set. The non-streaming Chat Completions path has `parsed` (Value) instead of a typed `response`, and that path's reasoning is captured via `extract_chat_completions_reasoning` elsewhere in the request loop; we don't need to change it for now since benchmark-run uses streaming exclusively. If non-streaming users hit the same regression we can backfill.
TB2 tasks routinely trigger parallel apt invocations — an agent sets up
Python + curl + build-essential in one shell, or nests apt-get inside a
compound pipeline. On base images without a persistent shell lock,
these race on /var/lib/dpkg/lock-frontend and the losing side sees
"E: Could not get lock /var/lib/dpkg/lock-frontend" and bails.
The agent reflex is then to retry once, still race, and submit a broken
environment — the task verifier sees missing packages and scores 0.
This is a pure infrastructure failure that silently masquerades as a
capability failure.
The fix is drop-in: when `Bash` sees a command whose first shell token
on any segment is `apt-get | apt | dpkg | apt-key | dpkg-reconfigure`
(after skipping leading `sudo`, `env`, and `VAR=value` assignments),
wrap it as `flock -w 600 /var/lib/dpkg/lock-frontend -c '<original>'`.
Flock holds the dpkg frontend lock for the duration of the command, and
any concurrent apt invocation on the same container blocks at the OS
level for up to 10 minutes rather than racing.
Guardrails:
* Idempotent — commands that already contain `flock` are passed through.
* Non-package commands (`ls`, `echo 'apt-get is a package manager'`,
…) untouched.
* Single-quotes in the inner command are escaped (`'\''`) so the
outer single-quoted wrapper is shell-safe.
* 600-second timeout matches the hard-stall threshold used elsewhere;
heavy installs routinely run ~300s and we'd rather serialize than
fail.
Prior art: the TB2 leaderboard's top agent gained +17.9 pp solve rate
(47 → 64%) in a single iteration from this exact change on their side.
Seven unit tests cover: apt-get install, compound `apt-get update && …`,
sudo-prefixed apt, dpkg, benign `echo "apt-get"`, already-flocked input,
and single-quote escaping inside the wrap.
Regression root cause: my earlier rsync to sc replaced sc's working
conversation.rs with local's older copy, which explicitly DROPPED
`ConversationItem::Reasoning` items on the Chat Completions path
("Chat Completions has no concept of reasoning items on the wire").
That's wrong for Kimi k2 / DeepSeek-R1, which require the prior turn's
reasoning to be echoed back as `message.reasoning_content`. Symptom
was a near-total failure mode on v5/v6 reflection runs (28 of 29 tasks
400'd with "thinking is enabled but reasoning_content is missing in
assistant tool call message at index 3").
Restored logic from commit 724df6d ("Support Kimi for Coding via
OpenAI chat completions"):
* `OpenAIChatMessage` (request-side) regains the optional
`reasoning_content: Option<String>` field, serialized only when
populated (OpenAI's own Chat Completions API ignores it).
* `OpenAIChatChoiceMessage` (response-side) already had
`reasoning_content`; exposed `extract_chat_completions_reasoning`
in the crate's public surface (re-exported via `lib.rs`).
* `execute_openai_completions_once` now calls
`extract_chat_completions_reasoning` right after parsing the
response and pushes a `ConversationItem::Reasoning` BEFORE the
assistant text / tool-call items — both on the normal tool-call
branch and on the terminal-turn branch where the reflection hook
may continue the loop. Without the push on the terminal branch,
reflection injection triggers the same 400 because the replayed
assistant turn would still lack reasoning.
* `items_to_chat_messages` (already in conversation.rs from prior
work) picks these Reasoning items up via `pending_reasoning` and
attaches the concatenated text to the next assistant wire message's
`reasoning_content` field, folding `Message + FunctionCall` items
into a single assistant message on the wire (OpenAI spec: one turn
= one message with optional content + tool_calls + reasoning).
Verification: 38 puffer-provider-openai tests + 18 reflection + 58
conversation + 21 bash tests all pass. sc rebuilt and relaunched as
`kimi-reflection-20260421T131207Z` (v7) on the 30 currently-unsolved
tasks; the v5/v6 runs' 400-reasoning regression should now be gone
AND the completion-claim reflection hook should finally fire against
a real workload.
Prior nudge ("re-read prompt, write a check, run it, pass before
declaring done") was easy for the model to rationalize past — exactly
the failure mode the cc-src verification specialist is designed to
break out of. Replaced with a much stronger prompt directly adapted
from Anthropic Claude Code's `verificationAgent.ts:VERIFICATION_SYSTEM_PROMPT`.
Key additions over the old nudge:
* **Adversarial framing** — "your job is not to confirm it works,
it's to try to break it." The model self-identifies as a verifier,
not a defender of its own implementation.
* **Two named failure modes** the model can recognize in itself —
"verification avoidance" (finding reasons not to run a check) and
"seduced by the first 80%" (polished demo, broken edges). Naming
them is half the fix; the model has the vocabulary to flag the
pattern in itself.
* **Required baseline** — re-read prompt, run build, run tests,
run linter, run adversarial probes. Each step is non-skippable.
* **Explicit anti-rationalization list** — "the code looks correct"
→ reading is not verification; "my tests already pass" → the
verifier's tests aren't yours; "this is probably fine" → probably
is not verified; "this would take too long" → not your call. Each
excuse pattern matched to the corrective action.
* **Adversarial probes** — task-type-specific examples (numeric
sanity check with the raman G-peak example, code corner-case
inputs, file-output structural validation, service concurrency
probe, bug-fix reproduction-first protocol).
* **Strict per-check output format** — `### Check / Command run /
Output observed / Result: PASS|FAIL`. The format itself
discourages the "I read the code and it looks fine" pattern; an
empty Command-run block is a tell that nothing was actually run.
The cc-src reference also has a `VERDICT: PASS|FAIL|PARTIAL`
terminator parsed by the caller. We don't parse it here (the prompt
gets injected as a single user message into the main loop, not into a
separate subagent context with a verdict-aware caller), so this commit
keeps the per-check `Result: PASS|FAIL` markers but drops the trailing
VERDICT line — adopting it would require the proper subagent
infrastructure that cc-src's `built-in/` agents have.
Build + 18 reflection tests still pass. Will be deployed in v8 after
the v7 baseline (current adversarial nudge → no nudge upgrade) gives a
stable solve count to compare against.
Zero reflection_trace events appeared in v7 trajectories for two confirmed false-completes (filter-js-from-html, polyglot-c-py), even though both had plenty of tool calls and ended with a terminal no-tool text-only turn. The non-streaming Chat Completions path drains the trace_events from observe_openai_batch silently because it has no on_event callback, so we couldn't tell whether the terminal-turn hook was firing, whether the code judge scored, or whether the LLM judge said "Continue". Add three eprintln probes that ride out via the benchmark-run `tee /logs/agent/puffer.txt`: - terminal_turn synth decision (the three gate booleans + outcome) - code_judge score+signal on terminal turns - final_decision signal+checkpoint on terminal turns No behavior change. Instrumentation only, to narrow down which gate is blocking reflection on Kimi self-claiming-done turns before touching the gate logic itself.
Kimi's OAuth bearer lives ~15 minutes, and when a token expires the server holds the TCP connection open instead of returning 401 — so the reactive "refresh after 401" path that already existed never got to run. Hung requests piled up behind reqwest's 300s timeout × 4 retries × N concurrent workers, manifesting as tasks stuck on turn 1 with a 1-line session.jsonl and a 0-byte puffer.txt until harbor killed the whole trial at 1800s. Codex handles this by checking `expires_at_ms` before every send and refreshing when the TTL drops under a small buffer. Mirror that: - Add `expires_at_ms` to `OpenAIExecutionConfig` populated from the `StoredCredential::OAuth` at resolve time. - New `ensure_fresh_oauth_token` compares TTL against a 120s window and calls `refresh_oauth_for_provider` with full state sync into both the execution struct and the `AuthStore`. - Wire it at the top of both `send_openai_request_with_refresh` and `send_openai_request_with_refresh_streaming`; the 401 fallback remains as a second line of defense for clock skew / unexpected revocation. - Propagate `expires_at_ms` in both 401-retry branches so the next proactive check sees the post-refresh TTL rather than the old one. Also emits `[oauth] proactive refresh: <provider> token ttl=...s` to stderr so puffer.txt grows in real time instead of staying empty until the first turn completes — which had been hiding the hang. API-key auth and "no auth" paths are unaffected (expires_at_ms stays None and the helper early-returns).
Two related fixes landing together because both surfaced in the same
v7 run:
1. OAuth: disk-state adoption
A 29-min-old container returned `401 invalid_authentication` after
the in-process refresh_token had been silently invalidated by a
peer rotation. Kimi invalidates the old refresh_token on every
refresh, so the 7-minute host refresher + docker-cred-syncer that
mirrors the fresh token into each container was producing in-memory
staleness that only ever got discovered when the next 401 hit —
and by then the in-memory refresh_token itself was unusable too.
Before every request, reload `$HOME/.puffer/auth.json` and, if the
persisted `expires_at_ms` is newer than what we have in memory,
adopt the stored credential wholesale. Also wire a real `on_reload`
closure into `refresh_kimi_token_locked` so the cross-process-
locked refresh picks up peer-written credentials before spending
its own (doomed) refresh call. Both paths emit an `[oauth]` line
so the stored-vs-fetched behavior is visible in puffer.txt.
2. Reflection: no cooldown on terminal turn
tune-mjcf's puffer.txt showed:
[reflection] terminal_turn text=true tools=23/4 batch_gap=1/2 synth=false
The `MIN_BATCHES_BETWEEN_EVALUATIONS = 2` cooldown was rejecting
the one-and-only terminal turn because an earlier mid-task
evaluation had run. But the whole point of the cooldown is to
avoid re-evaluating mid-task — terminal turn is by definition the
last batch, so there is no "next batch" to cool down for. Keep
the tool-count floor (don't bother the judge on a 3-step task that
genuinely wrapped), but drop the batch-gap gate specifically for
the synth path. The LLM judge still decides Continue vs Intervene
so we don't over-fire; we just stop silently skipping the one
call where completion-claim detection matters most.
tune-mjcf would have caught its own false-complete (agent claimed
"2.61x speedup, 43% of reference time" — reward=0 from verifier)
if the gate hadn't swallowed the terminal turn.
v7's video-processing hit the full reflection pipeline but LLM judge
returned continue:
[reflection] terminal_turn text=true tools=47/4 batch_gap=10/2 synth=true
[reflection] code_judge terminal_turn=true score=4/4 signal=true
[reflection] final_decision terminal_turn=true signal=none checkpoint=false
The judge is Kimi judging Kimi. The prompt only asked "is it looping
or progressing normally" — a confident terminal answer reads as
"progressing normally" even when the answer is wrong. There was
nothing in the prompt about completion-claim skepticism, and the
`BatchAssessment.completion_claim` field carrying the agent's final
text was populated but never surfaced to the judge.
When completion_claim is present, the prompt now carries:
1. The claim itself in a labelled block, truncated to 1200 chars.
2. A four-point skepticism checklist keyed to the false-complete
modes we keep seeing in v7 trajectories:
- Did the agent run the verifier's actual test file
(/tests/test_outputs.py) or only its own smoke test?
- Leftover build/test artifacts in the output directory
(polyglot-c-py and polyglot-rust-c both failed a strict
file-listing check this way).
- Coverage gap between the self-test and the full requirement
list (filter-js claimed XSS-blocking passed but didn't diff
clean HTML byte-for-byte).
- Confidence is not evidence.
3. A matching "reflect if completion claim + weak evidence" branch
in the criteria block.
No reward hacking here — we're changing how the judge *reads* the
evidence, not what the verifier scores.
Both v7's retry of adaptive-rejection-sampler and the parallelism=1 smoke on polyglot-rust-c hit an identical failure right after the first chat-completion call with the 2882dd09 binary: Error: response from https://api.kimi.com/coding/v1/chat/completions was not valid JSON Caused by: EOF while parsing a value at line 1 column 0 HTTP 200 with an empty body. A direct curl to the same endpoint at the same moment returns normally — so the wire is fine, the request we're building isn't. The only thing on the request path that changed from the working d317116 build to 2882dd09 is `adopt_disk_oauth_if_newer`. Running theory: the host-side kimi-oauth-refresher.sh writes the full auth state to `$HOME/.puffer/{auth.json,.credentials.json}`, but the container-side docker-cred-syncer.sh only mirrors `.credentials.json` back into the container. That leaves the container with a freshly- rotated secret file paired against a pre-rotation metadata file, so AuthStore::load hydrates an OAuthCredential whose access_token and expires_at_ms don't belong to the same refresh cycle. Handing that combination back to Kimi evidently produces the silent empty-body hang-up we see. Disable the call in `ensure_fresh_oauth_token` for now and keep the function wired with #[allow(dead_code)] so we can re-enable once the paired-file sync (or pre-adoption sanity check) is in place. The proactive-refresh + 401-retry halves of the fix are unaffected.
A direct curl to Kimi with max_tokens unbounded measured 498s to return 29k completion tokens (~103KB of reasoning_content plus the final message). With a 300s client timeout in reqwest, every first turn aborted mid-reasoning, retried under retry_openai_transport's timeout-is-retryable branch, aborted again, etc. From the outside this looked exactly like the OAuth hang we just fixed, which sent me down the wrong diagnostic path — puffer.txt stayed empty and session.jsonl stuck at 1 line because the real HTTP response never completed before the client bailed. The 300s ceiling was set for non-reasoning models where 5 minutes is plenty; Kimi k2.6 at effort=high (also DeepSeek-R1 and similar) doesn't fit. Raise the default to 900s so Kimi's observed ~8-minute worst case has headroom, and make it env-overridable via PUFFER_HTTP_REQUEST_TIMEOUT_SECS so non-reasoning providers can go back to a tighter budget when needed. This is the same reason v7 showed so many AgentTimeoutError with 0-byte puffer.txt — reqwest aborting after 4x 300s retry cycles produced no output whatsoever, and harbor's agent timeout cleaned up the rest. Should have been the first hypothesis given the reasoning-model context window; proactive OAuth refresh + disk state adoption were lower-probability explanations that I chased down first. Leaving those in for correctness (refresh-after-401 is still the right behavior); they just weren't load-bearing.
v8's first trial panicked on distribution-search: thread 'main' panicked at reflection/support.rs:110:39: byte index 855 is not a char boundary; it is inside '≤' (bytes 854..857) `extract_artifact_candidates` was slicing a window around a matched path with `index.saturating_sub(48)` / `+ len + 16`, both of which are raw byte offsets. The distribution-search prompt uses `≤` (3-byte UTF-8), and `index + len + 16` landed inside it. Add `snap_prev_char_boundary` / `snap_next_char_boundary` helpers and snap both ends before slicing. This is a surrogate-for-`.get(..)` that still returns a usable window instead of giving up when the arbitrary offset lands mid-codepoint. Same class of bug as the earlier debug_context.rs truncate_utf8_prefix fix; reflection grew its own unsafe slice after that landed.
All five false-completes in v7 (filter-js, polyglot-c-py, polyglot-rust-c, protein-assembly, model-extraction-relu-logits) shared the same shape: the agent ran its own smoke test, got a green result, declared the task done, and skipped the verifier's real test file even though it was mounted and readable. Reflection's adversarial nudge caught several of these after the fact, but a cheaper and more reliable fix is to tell the agent up front: if the verifier test file is shipped, read it and run its exact assertions; also re-list the output directory and remove any build/test artifacts left behind. Three additions to the base system prompt: 1. "If a verifier test file exists, READ and run it" — kills the "my own smoke test passed" false-complete pattern that assumed the self-test covers the same assertions. 2. "Re-list the output directory and confirm only required artifacts" — directly addresses polyglot-*'s `cmain` and `main` leftovers that tripped the strict file-listing check. 3. Explicit "self-test is weaker evidence than running the real test" line so the model prefers the verifier file when it's accessible. Not reward hacking — verifier output is observable to the agent's bash tool already; we're just telling it to look at what it can already see. Leaves the downstream reflection path as a safety net for tasks where the verifier file genuinely isn't shipped.
Observed on v8: each LLM judge call takes ~3 minutes on Kimi — that's roughly half of a normal agent turn, because Kimi's reasoning model ignores the effort_level=low config and generates full chain-of-thought for every prompt regardless of the prompt's complexity. A reflection cycle with two nudges adds ~10 minutes of judge-only overhead. Puffer doesn't currently pass max_tokens to the chat-completions request (OpenAIChatCompletionsRequest has no field for it), so the cleanest path without touching the provider crate's API is a prompt- level nudge: tell the judge it's a gate decision and its reasoning should max out at 2-3 sentences / ~500 tokens. Kimi won't perfectly cap, but should collapse the reasoning substantially compared to the unconstrained case. Works in concert with the completion-claim skeptic section added in 1eee554 — the judge still has room to produce useful "here's exactly which requirement is unchecked" reasoning, just not 30 KB of it. Adding a real max_tokens knob later is a separate, larger patch.
Context: chess-best-move timed out under v8 because the task's
prompt says "analyze chess_board.png" and Kimi k2.6 does support
vision, but puffer's Read tool returned the image as a JSON blob
containing base64 data — which travelled as a plain string inside
the tool-role message content. The model saw the string "base64
=iVBOR..." and had no way to actually decode the image.
Adapted from codex-rs/tools/src/view_image.rs (emits image_url
alongside a textual note) and pi-mono/coding-agent/.../read.ts
(returns an [ImageContent, TextContent] pair) but plumbed through
puffer's existing ContentPart::Image variant so no new tool or
schema change is needed.
Two additions to conversation.rs:
1. `items_to_chat_messages` Message branch now decides between
plain-string content (the fast path for 99% of turns) and OpenAI
vision's multi-part `[{type:"text"...},{type:"image_url"...}]`
array shape based on whether any ContentPart::Image is present.
Text-only turns keep the original wire format.
2. `append_tool_results` parses each tool output with
`maybe_extract_image_from_tool_output`; when it sees the Read
tool's `{"type":"image","file":{"base64":"...","type":"image/png",
"originalSize":N}}` shape it pushes a synthetic user message after
the FunctionCallOutput with the base64 materialized as a
`data:image/png;base64,...` URL. The tool-role output stays as
the small JSON note it always was — the actual pixels ride on
the user turn where Kimi's vision path actually consumes them.
Fast-path guarded: `output.contains("\"type\"")` + `"image"` before
we spend an `serde_json::from_str` on the (large base64) string.
Non-image tool outputs pay near-zero cost.
Four unit tests lock down the vision path added in 0eac25f: 1. text_only_content_serializes_as_plain_string — regression guard for the fast path. Adding vision must not push text-only messages onto the multi-part array shape; every existing non-image turn has to keep round-tripping as a plain string. 2. image_content_serializes_as_multipart_array — exercises the new branch in items_to_chat_messages. Confirms a ContentPart::Image becomes `{"type":"image_url","image_url":{"url":"data:image/..."}}` alongside the text part. 3. read_tool_image_output_triggers_user_image_turn — end-to-end at the conversation-layer boundary. Feeds the Read tool's actual `{"type":"image","file":{...}}` JSON into append_tool_results and asserts we get FunctionCall + FunctionCallOutput + a synthetic user Message carrying the base64 as a data URL. This is the contract that makes tasks like chess-best-move see the board. 4. non_image_tool_output_does_not_inject_user_turn — plain Bash/Grep output must not produce a stray user turn. Regression guard for the detector being too eager. All four pass. Live v10 (multiplier=1.0) currently has chess-best-move as the actual end-to-end validation; this commit just pins the library-level behavior so the path won't silently break on future edits to the conversion pipeline.
…ctions Two issues the codex-style review flagged as the highest-priority items for the 2026-04-21 batch: 1. Reward-hacking leak. Both the benchmark-run system prompt (fc7e5dd) and the reflection judge's completion-claim skeptic block (1eee554) named `/tests/test_outputs.py` and `/app/tests/` explicitly. Those are the actual TB2 verifier paths — telling the agent to "go read the verifier's test file" is the definitional case of leaking eval structure, even if the path is technically discoverable by a bash listing. Rewrite both prompts to describe the behavior ("run the check the task prompt asks for verbatim") without naming the path. Decouples the guidance from knowledge of any particular benchmark's layout. 2. Reflection infinite-loop risk. 69bb902 dropped MIN_BATCHES_BETWEEN_EVALUATIONS on the terminal-turn synth path so the judge could always weigh in on completion claims. Combined with 1eee554's skeptic-biased judge prompt, the agent and judge could ping-pong until harbor's task timeout — agent says "Done" → nudge → agent works 2 batches → says "Done" → nudge → … Add `MAX_CHECKPOINT_INJECTIONS = 3`: enough for "rushed done → verify → iterate → final done" but caps the worst case. Track a `checkpoint_injections` counter on the tracker, increment when a checkpoint is actually built, and gate the synth path on it being under the cap. Also surfaces the counter in the terminal-turn eprintln trace so the cadence is visible.
Audit agent caught two places the prior scrub (17e8620) missed: 1. `reflection.rs:build_checkpoint` — the adversarial-nudge addendum from 02726ad still names `/app/` in both the Chinese and English variants when listing where the project's tests might live. /app/ is specifically TB2/Harbor's convention, not a universal coding-project root. Rewrite to "if the project ships `test_*.py`, a `tests/` directory, `Makefile test`, `pytest.ini`, etc." — names the *shape* of test scaffolding without the benchmark-specific prefix. Also rewrote the comment that documented the tuning source ("21 of 27 unsolved long-trajectory tasks declared 'Done' without ever running `/app/test_*.py`") since it (a) names the forbidden path and (b) advertises the nudge as benchmark-derived. Replaced with a neutral rationale that stands on its own. 2. `benchmark_run.rs` system prompt — two remaining TB2 tells: "Ignore benchmark harness files under /app/.puffer" → generalized to "hidden config directories (`.puffer`, `.git`, …)" so the absolute path doesn't appear. And dropped the "When the task already names both the verifier file and the required output file, do not use Glob or Grep before writing the first draft" — that instruction is TB2-shape-tuned (real coding tasks don't have "verifier files") and discourages legitimate repo exploration. After this + 17e8620, grep for `/app/` and `/tests/` in crates/puffer-{cli,core}/ returns only tests and comments — no runtime-leaked strings.
Trajectory audit on v12 caught systemic verifier-probing across 11 of
43 tasks. Every reflection-heavy run emitted commands like:
find / -maxdepth 3 -name "test_outputs.py"
ls -la /tests/test_outputs.py
find / ... -name "verify.py" -o -name "grade.py"
None succeeded (TB2 correctly doesn't mount /tests into the agent
container), so no reward hack landed. But the intent-to-find-the-
verifier is traceable to one line in the checkpoint addendum — after
17e8620 + f1fd75e stripped the literal paths, the prompt still said:
"Run the project's own tests (if the project ships `test_*.py`,
a `tests/` directory, `Makefile test`, `pytest.ini`, etc.).
Failing tests = automatic FAIL."
Kimi reads "the project ships tests" as permission to hunt for test-
looking files, and on Terminal-Bench that collapses onto /tests/
test_outputs.py. One task even put "检查:标准评测路径 /tests/
test_outputs.py" in its own QA ritual.
Rewrite the line to say the opposite: if the task prompt names a
specific command, run that verbatim; don't go hunting for
test-looking files the task didn't mention. The other adversarial
probes (curl, jq, boundary input, regression test) stay — they're
task-shaped and task-scoped and don't invite a filesystem sweep.
Mount isolation is the actual kill-switch here, but relying on it is
fragile: one mount-config edit and every task becomes a silent hack.
Removing the invitation in the prompt closes the window from the
agent side.
Previously the terminal-turn code path only synthesised an observation when the agent had emitted assistant text AND crossed the MIN_TOOL_CALLS_BEFORE_EVALUATION floor. When the model returned an empty response (no text, no tool calls), the tracker returned None and the agent silently handed back to the harness with no deliverable. Now an empty_turn case (invocations empty AND no assistant text) takes the synth path regardless of the tool-count floor, and the judges get a synthetic completion_claim noting the empty turn so they have something to react to. Observed on honest Kimi TB2 runs for circuit-fibsqrt, feal-linear-cryptanalysis, modernize-scientific-stack — all three exited with "[reflection] terminal_turn text=false tools=2/4 ... synth=false" before the fix.
Two additions to the pre-submission rules block: - Distinguish files that existed before the task from build artefacts the agent created; the former stay untouched, the latter must be removed from the task dir (or the work done under /tmp). v16 trajectory analysis found polyglot-c-py / polyglot-rust-c leaving a compiled `cmain` next to `main.py.c`, failing the verifier's directory-contents check. - External graders may use a different Python interpreter than the agent's shell (uv-venv `python` vs system `python3`) — avoid `pip install` at script runtime and stick to modules available under both. Observed on openssl-selfsigned-cert and modernize-scientific-stack. Path-agnostic guidance; no reward-hacking hints.
Kimi has been observed emitting reasoning_content that contains a NUL byte; replaying that same string on the next request is then rejected by Kimi's own validator with: 400 Bad Request: "the reasoning_content at position N must be a valid UTF-8 string: string contains \x00" `extract_chat_completions_reasoning` now runs the string through a sanitize_reasoning_text helper that drops C0 control chars (except \t/\n/\r) and DEL before the text is stored on ConversationItem::Reasoning and echoed back. The fix applies to every vendor that returns reasoning_content via the Chat Completions path. Hit on honest Kimi TB2 run kimi-v15-full89 qemu-startup retry-03.
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
/tmp); warn about grader Python interpreter mismatches (don't rely onpip installat runtime). Both are path-agnostic.\x00inside its own reasoning output and then rejects the replay with400 "string contains \x00";extract_chat_completions_reasoningnow strips C0 control chars (preserves\t\n\r) before the text is stored or echoed back.Net effect on honest Kimi TB2 benchmark (mult=1.0/1.0, parallelism=20, 89 tasks): +3 unique PASS vs v12 baseline (21→24).
Test plan
cargo check --package puffer-core --package puffer-cli --package puffer-provider-openaipassesx86_64-unknown-linux-muslbuild succeeds with the new binaryfeal-differential-cryptanalysispuffer.txt (empty_turn=false synth=truefor normal turns; the opposite path is inert until the failure mode triggers)reasoning_contentNUL-sanitize path confirmed fixingqemu-startup retry-03which previously 400'd on replay