Skip to content

TB2 Kimi fixes: empty-turn synth, pre-submission prompt guidance, reasoning_content NUL sanitize#41

Open
n-WN wants to merge 25 commits into
masterfrom
n-WN/kimi-tb2-empty-turn-nul-sanitize
Open

TB2 Kimi fixes: empty-turn synth, pre-submission prompt guidance, reasoning_content NUL sanitize#41
n-WN wants to merge 25 commits into
masterfrom
n-WN/kimi-tb2-empty-turn-nul-sanitize

Conversation

@n-WN
Copy link
Copy Markdown
Collaborator

@n-WN n-WN commented Apr 22, 2026

Summary

  • reflection: synth observation on empty terminal turn — the terminal-turn path now fires the judge pipeline when the model returns neither text nor tool calls, instead of silently letting the agent exit with no deliverable (observed on circuit-fibsqrt, feal-linear-cryptanalysis, modernize-scientific-stack during honest TB2 runs).
  • benchmark: cleanup + python-interp guidance in pre-submission prompt — two new bullets: distinguish pre-existing files (don't touch) from agent-created build artefacts (clean up or build under /tmp); warn about grader Python interpreter mismatches (don't rely on pip install at runtime). Both are path-agnostic.
  • openai: sanitize NUL / C0 control bytes from reasoning_content — Kimi occasionally emits a \x00 inside its own reasoning output and then rejects the replay with 400 "string contains \x00"; extract_chat_completions_reasoning now strips C0 control chars (preserves \t\n\r) before the text is stored or echoed back.

Net effect on honest Kimi TB2 benchmark (mult=1.0/1.0, parallelism=20, 89 tasks): +3 unique PASS vs v12 baseline (21→24).

Test plan

  • cargo check --package puffer-core --package puffer-cli --package puffer-provider-openai passes
  • Release x86_64-unknown-linux-musl build succeeds with the new binary
  • Re-ran the full 89-task Kimi TB2 suite end-to-end on sc; 24/89 honest pass with the new binary, no regressions vs v12's 21/89 honest baseline
  • Empty-turn synth confirmed firing in feal-differential-cryptanalysis puffer.txt (empty_turn=false synth=true for normal turns; the opposite path is inert until the failure mode triggers)
  • reasoning_content NUL-sanitize path confirmed fixing qemu-startup retry-03 which previously 400'd on replay

n-WN added 25 commits April 21, 2026 19:13
Drives Kimi TB2 benchmark from 5/89 (sk-kimi-* rate-limit era) to
55/89 = 61.8% (OAuth login path), beating GPT-5 spark's 52/89 baseline.

---- 1. Kimi OAuth device-code login (matches kimi-cli 1.37.0) ----

Moonshot's Kimi Code platform (`api.kimi.com/coding/v1`) accepts raw
`sk-kimi-*` API keys, but those keys hit a shared per-account rate
bucket that saturates at ~30 bursts then stays 429 for minutes — the
prior run solved 5/89 because workers kept exhausting and burning.

The OAuth-login path is a separate subscription-backed quota. We mirror
kimi-cli's protocol exactly:

  * Device-code flow at `https://auth.kimi.com/api/oauth/{device_authorization,token}`
    with hardcoded public `client_id = 17e5f671-d194-4dfb-9706-5516cb48c098`.
  * Every request (OAuth + chat) carries the full `X-Msh-*` header set
    (Platform=kimi_cli, Version, Device-Name, Device-Model, Os-Version,
    Device-Id) plus `User-Agent: KimiCLI/1.37.0`. Chat-side must match
    or subscription gating silently degrades.
  * Platform/arch strings computed via `sw_vers -productVersion` on
    macOS and `uname -r`/`uname -m` on Linux — verified byte-for-byte
    against Python's `platform.release/machine/version` on both
    darwin-arm64 and linux-x86_64.
  * Device id is a persistent UUID at `$PUFFER_HOME/kimi-device-id`,
    chmod 0600.
  * Refresh uses a cross-process fslock at `$PUFFER_HOME/kimi-oauth.lock`
    because Kimi's refresh rotates the refresh_token — 20 concurrent
    bench workers sharing one credential file would race-invalidate
    each other without the lock.

New:
  * `crates/puffer-provider-openai/src/kimi_oauth.rs` — request/poll/
    refresh with tests (device_id persistence, arch resolution, UA
    version alignment, ASCII header sanitization).
  * `crates/puffer-provider-openai/examples/dump_kimi_headers.rs` —
    debug helper to diff our header set vs kimi-cli.

CLI wiring (`puffer auth login kimi-for-coding`):
  * New `OauthFamily::KimiDevice` variant; detected by provider id
    starting with `kimi-` or base_url containing `api.kimi.com`.
  * `run_login_flow` early-dispatches to `run_kimi_device_login_flow`
    before the PKCE+callback path so Kimi-style "show URL, poll token"
    doesn't fight the localhost-callback listener.
  * Opens browser automatically via existing `authflow::open_browser`.
  * Stores tokens via generic `OAuthCredential` (keychain on macOS,
    plaintext `.credentials.json` on Linux) — same code path Codex/
    Anthropic already use.

Runtime integration (`crates/puffer-core/runtime/openai.rs`):
  * `resolve_openai_execution_config` auto-injects `X-Msh-*` headers
    whenever provider is Kimi (both API-key and OAuth paths), so the
    wire shape matches kimi-cli regardless of credential type.
  * 401 refresh dispatches per provider: OpenAI → `auth.openai.com`,
    Kimi → `auth.kimi.com` with file-locked `refresh_kimi_token_locked`.

Request shape (`request.rs`): when `custom_headers` carries a
User-Agent (yaml `headers:` or Kimi path), skip the default Codex UA
so reqwest doesn't concatenate two values and fail Kimi's whitelist.

Benchmark harness (`benchmark/puffer_harbor_agent.py`): container
install step now side-channels `puffer-auth.json`, `puffer-credentials.json`,
and `puffer-kimi-device-id` through the mounted `codex_dir` into
`$HOME/.puffer/` inside the container, so in-container puffer can
use the same OAuth credentials as the host (with chmod 0600 on the
secrets file).

---- 2. Reflection: completion-claim trigger + test-discipline nudge ----

TB2 analysis: 25 of 36 unsolved-by-Kimi tasks fell into verifier-reject
with no transport error — agent did real work, ran its own smoke test
if anything, declared "Done", verifier disagreed. None of those
triggered the existing reflection because the heuristic gate only
fires on stall (5 min without validation/artifact/edit progress),
which a confidently-wrong agent never hits.

Solved-vs-unsolved breakdown on the 89-task Kimi run:
  * SOLVED tasks ran tests: 17% (9/54).
  * UNSOLVED tasks ran tests: 7% (2/27).
  * 21 unsolved long-trajectory tasks ran ZERO tests before "Done".

Intervention (`crates/puffer-core/runtime/reflection.rs`):
  * `BatchAssessment` gains `completion_claim: Option<String>`.
  * `observe_openai_batch` scans the latest assistant message for
    completion-claim phrases ("done", "complete", "all set",
    "verified", "successfully", "has been written", "saved to /",
    `✅`, …) — markers curated from the 25 trajectories.
  * When a claim is present AND the normal gate would skip AND minimum
    tool-call + minimum batch-interval guards pass, force
    `should_evaluate=true` so code_judge + llm_judge can challenge
    the agent's self-assessment.
  * `build_checkpoint` appends a claim-specific addendum to the
    reflection prompt (Chinese + English variants). It acknowledges
    that TB2's verifier tests live outside `/app` and the agent
    cannot see them, then asks the agent to (1) re-list each explicit
    requirement from the prompt, (2) construct an independent
    `/tmp/check.py` that validates the produced artifact against
    each requirement (including sanity-checks like "graphene G peak
    should be ~1580 cm⁻¹, not 19196"), and (3) only declare Done
    when every check passes.
  * Four new unit tests cover claim detection, empty/non-claim
    negatives, walking past trailing tool calls, and phrasing
    variants.

---- 3. UTF-8 char-boundary panic fix ----

`crates/puffer-core/runtime/debug_context.rs` was slicing message
text with `&text[..2000]` / `&input[..500]`, which panics (`byte
index X is not a char boundary; it is inside '≤'`) when the cut
falls inside a multi-byte UTF-8 codepoint. The TB2 `distribution-search`
prompt contains `≤`, and the resulting crash killed that trial
before retry could save it. Added `truncate_utf8_prefix` that backs
off to the nearest `is_char_boundary(i)` and two unit tests covering
the exact failure case.
The v1 completion-claim trigger never fired in practice. Diagnosed by
inspecting the reflection-rerun trajectories of all 34 retried tasks:
zero of them had a reflection-injected user message, despite every
failing task ending in "Done" / "All set" / etc.

Two compounding bugs blocked the trigger:

1. `observe_batch_internal` short-circuits on `invocations.is_empty()`.
   But agents emit the completion claim in the LAST turn, which is
   typically text-only — no new tool calls. So the observation layer
   never ran, and the completion-claim check sitting on top of
   `observe_openai_batch`'s post-observation code was unreachable.

2. Even after synthesizing an observation for the empty-invocations case,
   `code_judge_score` returns 0 on a fresh synthesized assessment. Under
   the default `LlmJudgeMode::ConfirmCodeJudge`, the llm judge is then
   skipped too. No judge → no signal → no checkpoint.

Fix — `synthesize_claim_only_observation`:
  * Runs when `observe_batch_internal` bailed on empty invocations AND
    `latest_assistant_completion_claim(items)` found a claim AND the
    minimum-tool-calls + minimum-batch-interval guards pass.
  * Inflates `time_since_progress_ms` past both `soft_stall_ms` and
    `hard_stall_ms` (+60 s margin) so `code_judge_score` yields 4 ≥ the
    default `min_score = 4`. Gets `code_judge_signal` to produce a
    signal, which unlocks the llm judge under `ConfirmCodeJudge`.
  * Tags the assessment with `signal_notes = ["synthesized_from_completion_claim"]`
    so operators can tell the difference between a real stall and a
    claim-forced evaluation.

Verified via the 4 existing completion_claim unit tests; integration
will be measured by the fresh kimi-reflection-*T120008Z run against
baseline stochasticity (3/9 ≈ 33% solve-on-rerun in the buggy v1 +
rerun-unsolved runs).
…fires

The completion-claim trigger was still inert after the
synthesize_claim_only_observation fix, because the agent's terminal
"Done" turn — the one that carries the claim text — almost always has
NO new tool calls. Both openai exit paths early-returned on
`tool_calls.is_empty()` WITHOUT calling `observe_openai_batch`, so
reflection never had a chance to see the claim.

Fix — for all three request paths (Responses streaming SSE, Responses
non-streaming JSON, Chat Completions non-streaming):
  * Push the final assistant text into `items` before deciding to exit,
    so reflection's `latest_assistant_completion_claim(items)` can find
    it.
  * Invoke `tracker.observe_openai_batch(&[], &items, ...)` with empty
    invocations. The synth path inside the tracker handles empty
    invocations specifically when a claim is present.
  * If a reflection checkpoint comes back, inject its prompt as a user
    message and `continue` the turn loop — giving the agent another
    chance to verify for real or resubmit. The streaming path also
    emits `TurnStreamEvent::ReflectionCheckpoint` for TUI visibility;
    the non-streaming paths have no on_event callback, so they just
    inject silently.
  * If no checkpoint, fall through to the normal exit with the
    accumulated assistant_text.

Verified by re-running reflection-rerun (v4 TAG
kimi-reflection-20260421T121427Z, 30 currently-unsolved tasks) — the
prior v3 saw 0 injections despite every failing task ending in "Done".
The completion-claim trigger was gated on a curated 25-phrase keyword
list ("done", "complete", "verified", "saved to /", "✅", …). That's a
classic brittle heuristic:

  * False positives — "this subtask is complete, now moving on…",
    "I've successfully identified the bug, now fixing it.", any usage
    of "completed", "ready", "finished" mid-task fires the gate when
    the agent is nowhere near done.
  * False negatives — TB2 tasks like `chess-best-move` end with
    "The best move for White is c1g5." — no keyword in the list,
    trigger stays silent even though the agent is clearly submitting.
  * Overfitting — list was tuned against one snapshot of one model's
    phrasing; drifts as model updates.

Replaced with a pure structural signal: whenever `observe_openai_batch`
is invoked with empty invocations (the terminal text-only turn the
openai runtime now hooks) AND the min-tool-calls + min-batch-interval
guards pass, synthesize the observation. The LLM judge then reviews
the actual trajectory + the agent's terminal text and decides for
itself whether the submission is premature. No keyword list, no
hard-coded phrasing assumptions.

Renamed `latest_assistant_completion_claim` → `latest_assistant_message_text`
to reflect that it no longer classifies, just extracts. The
`BatchAssessment.completion_claim` field is kept (still semantically
"the text we want to show the judge from the terminal turn") and
continues to gate the verify-your-submission nudge that gets appended
to the reflection checkpoint prompt.

Tests rewritten to verify the structural contract (returns any
non-empty assistant text; walks past trailing tool calls; returns None
for empty/missing assistant turns). No keyword detection tests — the
behavior they were testing is gone.
V5 reflection rerun regressed to 0/8 (400 "reasoning_content missing
in assistant tool call message at index 3") because the terminal-turn
hook pushed the assistant text into `items` WITHOUT the accompanying
reasoning. On the next request (after the reflection checkpoint
injected a user message and continued the loop), Kimi replayed the
conversation, hit the reasoning-less assistant message, and rejected.

Mirror the normal tool-call branch's bookkeeping: after pushing the
assistant_text, immediately call `append_reasoning_items(&mut items,
&response.reasoning_items)`. Same invariant Kimi enforces on every
replayed assistant message when `thinking: enabled` is set.

The non-streaming Chat Completions path has `parsed` (Value) instead
of a typed `response`, and that path's reasoning is captured via
`extract_chat_completions_reasoning` elsewhere in the request loop;
we don't need to change it for now since benchmark-run uses streaming
exclusively. If non-streaming users hit the same regression we can
backfill.
TB2 tasks routinely trigger parallel apt invocations — an agent sets up
Python + curl + build-essential in one shell, or nests apt-get inside a
compound pipeline.  On base images without a persistent shell lock,
these race on /var/lib/dpkg/lock-frontend and the losing side sees
"E: Could not get lock /var/lib/dpkg/lock-frontend" and bails.

The agent reflex is then to retry once, still race, and submit a broken
environment — the task verifier sees missing packages and scores 0.
This is a pure infrastructure failure that silently masquerades as a
capability failure.

The fix is drop-in: when `Bash` sees a command whose first shell token
on any segment is `apt-get | apt | dpkg | apt-key | dpkg-reconfigure`
(after skipping leading `sudo`, `env`, and `VAR=value` assignments),
wrap it as `flock -w 600 /var/lib/dpkg/lock-frontend -c '<original>'`.
Flock holds the dpkg frontend lock for the duration of the command, and
any concurrent apt invocation on the same container blocks at the OS
level for up to 10 minutes rather than racing.

Guardrails:
  * Idempotent — commands that already contain `flock` are passed through.
  * Non-package commands (`ls`, `echo 'apt-get is a package manager'`,
    …) untouched.
  * Single-quotes in the inner command are escaped (`'\''`) so the
    outer single-quoted wrapper is shell-safe.
  * 600-second timeout matches the hard-stall threshold used elsewhere;
    heavy installs routinely run ~300s and we'd rather serialize than
    fail.

Prior art: the TB2 leaderboard's top agent gained +17.9 pp solve rate
(47 → 64%) in a single iteration from this exact change on their side.

Seven unit tests cover: apt-get install, compound `apt-get update && …`,
sudo-prefixed apt, dpkg, benign `echo "apt-get"`, already-flocked input,
and single-quote escaping inside the wrap.
Regression root cause: my earlier rsync to sc replaced sc's working
conversation.rs with local's older copy, which explicitly DROPPED
`ConversationItem::Reasoning` items on the Chat Completions path
("Chat Completions has no concept of reasoning items on the wire").
That's wrong for Kimi k2 / DeepSeek-R1, which require the prior turn's
reasoning to be echoed back as `message.reasoning_content`. Symptom
was a near-total failure mode on v5/v6 reflection runs (28 of 29 tasks
400'd with "thinking is enabled but reasoning_content is missing in
assistant tool call message at index 3").

Restored logic from commit 724df6d ("Support Kimi for Coding via
OpenAI chat completions"):

 * `OpenAIChatMessage` (request-side) regains the optional
   `reasoning_content: Option<String>` field, serialized only when
   populated (OpenAI's own Chat Completions API ignores it).
 * `OpenAIChatChoiceMessage` (response-side) already had
   `reasoning_content`; exposed `extract_chat_completions_reasoning`
   in the crate's public surface (re-exported via `lib.rs`).
 * `execute_openai_completions_once` now calls
   `extract_chat_completions_reasoning` right after parsing the
   response and pushes a `ConversationItem::Reasoning` BEFORE the
   assistant text / tool-call items — both on the normal tool-call
   branch and on the terminal-turn branch where the reflection hook
   may continue the loop. Without the push on the terminal branch,
   reflection injection triggers the same 400 because the replayed
   assistant turn would still lack reasoning.
 * `items_to_chat_messages` (already in conversation.rs from prior
   work) picks these Reasoning items up via `pending_reasoning` and
   attaches the concatenated text to the next assistant wire message's
   `reasoning_content` field, folding `Message + FunctionCall` items
   into a single assistant message on the wire (OpenAI spec: one turn
   = one message with optional content + tool_calls + reasoning).

Verification: 38 puffer-provider-openai tests + 18 reflection + 58
conversation + 21 bash tests all pass. sc rebuilt and relaunched as
`kimi-reflection-20260421T131207Z` (v7) on the 30 currently-unsolved
tasks; the v5/v6 runs' 400-reasoning regression should now be gone
AND the completion-claim reflection hook should finally fire against
a real workload.
Prior nudge ("re-read prompt, write a check, run it, pass before
declaring done") was easy for the model to rationalize past — exactly
the failure mode the cc-src verification specialist is designed to
break out of. Replaced with a much stronger prompt directly adapted
from Anthropic Claude Code's `verificationAgent.ts:VERIFICATION_SYSTEM_PROMPT`.

Key additions over the old nudge:

  * **Adversarial framing** — "your job is not to confirm it works,
    it's to try to break it." The model self-identifies as a verifier,
    not a defender of its own implementation.
  * **Two named failure modes** the model can recognize in itself —
    "verification avoidance" (finding reasons not to run a check) and
    "seduced by the first 80%" (polished demo, broken edges). Naming
    them is half the fix; the model has the vocabulary to flag the
    pattern in itself.
  * **Required baseline** — re-read prompt, run build, run tests,
    run linter, run adversarial probes. Each step is non-skippable.
  * **Explicit anti-rationalization list** — "the code looks correct"
    → reading is not verification; "my tests already pass" → the
    verifier's tests aren't yours; "this is probably fine" → probably
    is not verified; "this would take too long" → not your call. Each
    excuse pattern matched to the corrective action.
  * **Adversarial probes** — task-type-specific examples (numeric
    sanity check with the raman G-peak example, code corner-case
    inputs, file-output structural validation, service concurrency
    probe, bug-fix reproduction-first protocol).
  * **Strict per-check output format** — `### Check / Command run /
    Output observed / Result: PASS|FAIL`. The format itself
    discourages the "I read the code and it looks fine" pattern; an
    empty Command-run block is a tell that nothing was actually run.

The cc-src reference also has a `VERDICT: PASS|FAIL|PARTIAL`
terminator parsed by the caller. We don't parse it here (the prompt
gets injected as a single user message into the main loop, not into a
separate subagent context with a verdict-aware caller), so this commit
keeps the per-check `Result: PASS|FAIL` markers but drops the trailing
VERDICT line — adopting it would require the proper subagent
infrastructure that cc-src's `built-in/` agents have.

Build + 18 reflection tests still pass. Will be deployed in v8 after
the v7 baseline (current adversarial nudge → no nudge upgrade) gives a
stable solve count to compare against.
Zero reflection_trace events appeared in v7 trajectories for two
confirmed false-completes (filter-js-from-html, polyglot-c-py), even
though both had plenty of tool calls and ended with a terminal no-tool
text-only turn. The non-streaming Chat Completions path drains the
trace_events from observe_openai_batch silently because it has no
on_event callback, so we couldn't tell whether the terminal-turn hook
was firing, whether the code judge scored, or whether the LLM judge
said "Continue".

Add three eprintln probes that ride out via the benchmark-run
`tee /logs/agent/puffer.txt`:
- terminal_turn synth decision (the three gate booleans + outcome)
- code_judge score+signal on terminal turns
- final_decision signal+checkpoint on terminal turns

No behavior change. Instrumentation only, to narrow down which gate is
blocking reflection on Kimi self-claiming-done turns before touching
the gate logic itself.
Kimi's OAuth bearer lives ~15 minutes, and when a token expires the
server holds the TCP connection open instead of returning 401 — so the
reactive "refresh after 401" path that already existed never got to
run. Hung requests piled up behind reqwest's 300s timeout × 4 retries
× N concurrent workers, manifesting as tasks stuck on turn 1 with a
1-line session.jsonl and a 0-byte puffer.txt until harbor killed the
whole trial at 1800s.

Codex handles this by checking `expires_at_ms` before every send and
refreshing when the TTL drops under a small buffer. Mirror that:

- Add `expires_at_ms` to `OpenAIExecutionConfig` populated from the
  `StoredCredential::OAuth` at resolve time.
- New `ensure_fresh_oauth_token` compares TTL against a 120s window
  and calls `refresh_oauth_for_provider` with full state sync into
  both the execution struct and the `AuthStore`.
- Wire it at the top of both `send_openai_request_with_refresh` and
  `send_openai_request_with_refresh_streaming`; the 401 fallback
  remains as a second line of defense for clock skew / unexpected
  revocation.
- Propagate `expires_at_ms` in both 401-retry branches so the next
  proactive check sees the post-refresh TTL rather than the old one.

Also emits `[oauth] proactive refresh: <provider> token ttl=...s`
to stderr so puffer.txt grows in real time instead of staying empty
until the first turn completes — which had been hiding the hang.

API-key auth and "no auth" paths are unaffected (expires_at_ms
stays None and the helper early-returns).
Two related fixes landing together because both surfaced in the same
v7 run:

1. OAuth: disk-state adoption
   A 29-min-old container returned `401 invalid_authentication` after
   the in-process refresh_token had been silently invalidated by a
   peer rotation. Kimi invalidates the old refresh_token on every
   refresh, so the 7-minute host refresher + docker-cred-syncer that
   mirrors the fresh token into each container was producing in-memory
   staleness that only ever got discovered when the next 401 hit —
   and by then the in-memory refresh_token itself was unusable too.

   Before every request, reload `$HOME/.puffer/auth.json` and, if the
   persisted `expires_at_ms` is newer than what we have in memory,
   adopt the stored credential wholesale. Also wire a real `on_reload`
   closure into `refresh_kimi_token_locked` so the cross-process-
   locked refresh picks up peer-written credentials before spending
   its own (doomed) refresh call. Both paths emit an `[oauth]` line
   so the stored-vs-fetched behavior is visible in puffer.txt.

2. Reflection: no cooldown on terminal turn
   tune-mjcf's puffer.txt showed:
     [reflection] terminal_turn text=true tools=23/4 batch_gap=1/2 synth=false
   The `MIN_BATCHES_BETWEEN_EVALUATIONS = 2` cooldown was rejecting
   the one-and-only terminal turn because an earlier mid-task
   evaluation had run. But the whole point of the cooldown is to
   avoid re-evaluating mid-task — terminal turn is by definition the
   last batch, so there is no "next batch" to cool down for. Keep
   the tool-count floor (don't bother the judge on a 3-step task that
   genuinely wrapped), but drop the batch-gap gate specifically for
   the synth path. The LLM judge still decides Continue vs Intervene
   so we don't over-fire; we just stop silently skipping the one
   call where completion-claim detection matters most.

tune-mjcf would have caught its own false-complete (agent claimed
"2.61x speedup, 43% of reference time" — reward=0 from verifier)
if the gate hadn't swallowed the terminal turn.
v7's video-processing hit the full reflection pipeline but LLM judge
returned continue:
  [reflection] terminal_turn text=true tools=47/4 batch_gap=10/2 synth=true
  [reflection] code_judge terminal_turn=true score=4/4 signal=true
  [reflection] final_decision terminal_turn=true signal=none checkpoint=false

The judge is Kimi judging Kimi. The prompt only asked "is it looping
or progressing normally" — a confident terminal answer reads as
"progressing normally" even when the answer is wrong. There was
nothing in the prompt about completion-claim skepticism, and the
`BatchAssessment.completion_claim` field carrying the agent's final
text was populated but never surfaced to the judge.

When completion_claim is present, the prompt now carries:
1. The claim itself in a labelled block, truncated to 1200 chars.
2. A four-point skepticism checklist keyed to the false-complete
   modes we keep seeing in v7 trajectories:
   - Did the agent run the verifier's actual test file
     (/tests/test_outputs.py) or only its own smoke test?
   - Leftover build/test artifacts in the output directory
     (polyglot-c-py and polyglot-rust-c both failed a strict
     file-listing check this way).
   - Coverage gap between the self-test and the full requirement
     list (filter-js claimed XSS-blocking passed but didn't diff
     clean HTML byte-for-byte).
   - Confidence is not evidence.
3. A matching "reflect if completion claim + weak evidence" branch
   in the criteria block.

No reward hacking here — we're changing how the judge *reads* the
evidence, not what the verifier scores.
Both v7's retry of adaptive-rejection-sampler and the parallelism=1
smoke on polyglot-rust-c hit an identical failure right after the
first chat-completion call with the 2882dd09 binary:

  Error: response from https://api.kimi.com/coding/v1/chat/completions
  was not valid JSON
  Caused by: EOF while parsing a value at line 1 column 0

HTTP 200 with an empty body. A direct curl to the same endpoint at
the same moment returns normally — so the wire is fine, the request
we're building isn't.

The only thing on the request path that changed from the working
d317116 build to 2882dd09 is `adopt_disk_oauth_if_newer`. Running
theory: the host-side kimi-oauth-refresher.sh writes the full auth
state to `$HOME/.puffer/{auth.json,.credentials.json}`, but the
container-side docker-cred-syncer.sh only mirrors `.credentials.json`
back into the container. That leaves the container with a freshly-
rotated secret file paired against a pre-rotation metadata file, so
AuthStore::load hydrates an OAuthCredential whose access_token and
expires_at_ms don't belong to the same refresh cycle. Handing that
combination back to Kimi evidently produces the silent empty-body
hang-up we see.

Disable the call in `ensure_fresh_oauth_token` for now and keep the
function wired with #[allow(dead_code)] so we can re-enable once the
paired-file sync (or pre-adoption sanity check) is in place. The
proactive-refresh + 401-retry halves of the fix are unaffected.
A direct curl to Kimi with max_tokens unbounded measured 498s to
return 29k completion tokens (~103KB of reasoning_content plus the
final message). With a 300s client timeout in reqwest, every first
turn aborted mid-reasoning, retried under retry_openai_transport's
timeout-is-retryable branch, aborted again, etc. From the outside
this looked exactly like the OAuth hang we just fixed, which sent
me down the wrong diagnostic path — puffer.txt stayed empty and
session.jsonl stuck at 1 line because the real HTTP response
never completed before the client bailed. The 300s ceiling was
set for non-reasoning models where 5 minutes is plenty; Kimi k2.6
at effort=high (also DeepSeek-R1 and similar) doesn't fit.

Raise the default to 900s so Kimi's observed ~8-minute worst case
has headroom, and make it env-overridable via
PUFFER_HTTP_REQUEST_TIMEOUT_SECS so non-reasoning providers can go
back to a tighter budget when needed.

This is the same reason v7 showed so many AgentTimeoutError with
0-byte puffer.txt — reqwest aborting after 4x 300s retry cycles
produced no output whatsoever, and harbor's agent timeout cleaned
up the rest. Should have been the first hypothesis given the
reasoning-model context window; proactive OAuth refresh + disk
state adoption were lower-probability explanations that I chased
down first. Leaving those in for correctness (refresh-after-401
is still the right behavior); they just weren't load-bearing.
v8's first trial panicked on distribution-search:

  thread 'main' panicked at reflection/support.rs:110:39:
  byte index 855 is not a char boundary; it is inside '≤'
  (bytes 854..857)

`extract_artifact_candidates` was slicing a window around a matched
path with `index.saturating_sub(48)` / `+ len + 16`, both of which are
raw byte offsets. The distribution-search prompt uses `≤` (3-byte
UTF-8), and `index + len + 16` landed inside it.

Add `snap_prev_char_boundary` / `snap_next_char_boundary` helpers and
snap both ends before slicing. This is a surrogate-for-`.get(..)` that
still returns a usable window instead of giving up when the arbitrary
offset lands mid-codepoint.

Same class of bug as the earlier debug_context.rs truncate_utf8_prefix
fix; reflection grew its own unsafe slice after that landed.
All five false-completes in v7 (filter-js, polyglot-c-py, polyglot-rust-c,
protein-assembly, model-extraction-relu-logits) shared the same shape:
the agent ran its own smoke test, got a green result, declared the task
done, and skipped the verifier's real test file even though it was
mounted and readable. Reflection's adversarial nudge caught several of
these after the fact, but a cheaper and more reliable fix is to tell
the agent up front: if the verifier test file is shipped, read it and
run its exact assertions; also re-list the output directory and remove
any build/test artifacts left behind.

Three additions to the base system prompt:
1. "If a verifier test file exists, READ and run it" — kills the "my
   own smoke test passed" false-complete pattern that assumed the
   self-test covers the same assertions.
2. "Re-list the output directory and confirm only required artifacts"
   — directly addresses polyglot-*'s `cmain` and `main` leftovers
   that tripped the strict file-listing check.
3. Explicit "self-test is weaker evidence than running the real test"
   line so the model prefers the verifier file when it's accessible.

Not reward hacking — verifier output is observable to the agent's
bash tool already; we're just telling it to look at what it can
already see. Leaves the downstream reflection path as a safety net
for tasks where the verifier file genuinely isn't shipped.
Observed on v8: each LLM judge call takes ~3 minutes on Kimi — that's
roughly half of a normal agent turn, because Kimi's reasoning model
ignores the effort_level=low config and generates full chain-of-thought
for every prompt regardless of the prompt's complexity. A reflection
cycle with two nudges adds ~10 minutes of judge-only overhead.

Puffer doesn't currently pass max_tokens to the chat-completions
request (OpenAIChatCompletionsRequest has no field for it), so the
cleanest path without touching the provider crate's API is a prompt-
level nudge: tell the judge it's a gate decision and its reasoning
should max out at 2-3 sentences / ~500 tokens. Kimi won't perfectly
cap, but should collapse the reasoning substantially compared to
the unconstrained case.

Works in concert with the completion-claim skeptic section added in
1eee554 — the judge still has room to produce useful "here's exactly
which requirement is unchecked" reasoning, just not 30 KB of it.
Adding a real max_tokens knob later is a separate, larger patch.
Context: chess-best-move timed out under v8 because the task's
prompt says "analyze chess_board.png" and Kimi k2.6 does support
vision, but puffer's Read tool returned the image as a JSON blob
containing base64 data — which travelled as a plain string inside
the tool-role message content. The model saw the string "base64
=iVBOR..." and had no way to actually decode the image.

Adapted from codex-rs/tools/src/view_image.rs (emits image_url
alongside a textual note) and pi-mono/coding-agent/.../read.ts
(returns an [ImageContent, TextContent] pair) but plumbed through
puffer's existing ContentPart::Image variant so no new tool or
schema change is needed.

Two additions to conversation.rs:

1. `items_to_chat_messages` Message branch now decides between
   plain-string content (the fast path for 99% of turns) and OpenAI
   vision's multi-part `[{type:"text"...},{type:"image_url"...}]`
   array shape based on whether any ContentPart::Image is present.
   Text-only turns keep the original wire format.

2. `append_tool_results` parses each tool output with
   `maybe_extract_image_from_tool_output`; when it sees the Read
   tool's `{"type":"image","file":{"base64":"...","type":"image/png",
   "originalSize":N}}` shape it pushes a synthetic user message after
   the FunctionCallOutput with the base64 materialized as a
   `data:image/png;base64,...` URL. The tool-role output stays as
   the small JSON note it always was — the actual pixels ride on
   the user turn where Kimi's vision path actually consumes them.

Fast-path guarded: `output.contains("\"type\"")` + `"image"` before
we spend an `serde_json::from_str` on the (large base64) string.
Non-image tool outputs pay near-zero cost.
Four unit tests lock down the vision path added in 0eac25f:

1. text_only_content_serializes_as_plain_string — regression guard
   for the fast path. Adding vision must not push text-only messages
   onto the multi-part array shape; every existing non-image turn has
   to keep round-tripping as a plain string.

2. image_content_serializes_as_multipart_array — exercises the new
   branch in items_to_chat_messages. Confirms a ContentPart::Image
   becomes `{"type":"image_url","image_url":{"url":"data:image/..."}}`
   alongside the text part.

3. read_tool_image_output_triggers_user_image_turn — end-to-end at
   the conversation-layer boundary. Feeds the Read tool's actual
   `{"type":"image","file":{...}}` JSON into append_tool_results and
   asserts we get FunctionCall + FunctionCallOutput + a synthetic
   user Message carrying the base64 as a data URL. This is the
   contract that makes tasks like chess-best-move see the board.

4. non_image_tool_output_does_not_inject_user_turn — plain Bash/Grep
   output must not produce a stray user turn. Regression guard for
   the detector being too eager.

All four pass. Live v10 (multiplier=1.0) currently has chess-best-move
as the actual end-to-end validation; this commit just pins the
library-level behavior so the path won't silently break on future
edits to the conversion pipeline.
…ctions

Two issues the codex-style review flagged as the highest-priority
items for the 2026-04-21 batch:

1. Reward-hacking leak. Both the benchmark-run system prompt
   (fc7e5dd) and the reflection judge's completion-claim skeptic
   block (1eee554) named `/tests/test_outputs.py` and `/app/tests/`
   explicitly. Those are the actual TB2 verifier paths — telling the
   agent to "go read the verifier's test file" is the definitional
   case of leaking eval structure, even if the path is technically
   discoverable by a bash listing. Rewrite both prompts to describe
   the behavior ("run the check the task prompt asks for verbatim")
   without naming the path. Decouples the guidance from knowledge of
   any particular benchmark's layout.

2. Reflection infinite-loop risk. 69bb902 dropped
   MIN_BATCHES_BETWEEN_EVALUATIONS on the terminal-turn synth path so
   the judge could always weigh in on completion claims. Combined
   with 1eee554's skeptic-biased judge prompt, the agent and judge
   could ping-pong until harbor's task timeout — agent says "Done"
   → nudge → agent works 2 batches → says "Done" → nudge → …
   Add `MAX_CHECKPOINT_INJECTIONS = 3`: enough for "rushed done →
   verify → iterate → final done" but caps the worst case. Track a
   `checkpoint_injections` counter on the tracker, increment when a
   checkpoint is actually built, and gate the synth path on it being
   under the cap. Also surfaces the counter in the terminal-turn
   eprintln trace so the cadence is visible.
Audit agent caught two places the prior scrub (17e8620) missed:

1. `reflection.rs:build_checkpoint` — the adversarial-nudge
   addendum from 02726ad still names `/app/` in both the Chinese and
   English variants when listing where the project's tests might live.
   /app/ is specifically TB2/Harbor's convention, not a universal
   coding-project root. Rewrite to "if the project ships `test_*.py`,
   a `tests/` directory, `Makefile test`, `pytest.ini`, etc." — names
   the *shape* of test scaffolding without the benchmark-specific
   prefix.

   Also rewrote the comment that documented the tuning source ("21 of
   27 unsolved long-trajectory tasks declared 'Done' without ever
   running `/app/test_*.py`") since it (a) names the forbidden path
   and (b) advertises the nudge as benchmark-derived. Replaced with
   a neutral rationale that stands on its own.

2. `benchmark_run.rs` system prompt — two remaining TB2 tells:
   "Ignore benchmark harness files under /app/.puffer" → generalized
   to "hidden config directories (`.puffer`, `.git`, …)" so the
   absolute path doesn't appear. And dropped the "When the task
   already names both the verifier file and the required output
   file, do not use Glob or Grep before writing the first draft" —
   that instruction is TB2-shape-tuned (real coding tasks don't have
   "verifier files") and discourages legitimate repo exploration.

After this + 17e8620, grep for `/app/` and `/tests/` in
crates/puffer-{cli,core}/ returns only tests and comments — no
runtime-leaked strings.
Trajectory audit on v12 caught systemic verifier-probing across 11 of
43 tasks. Every reflection-heavy run emitted commands like:

    find / -maxdepth 3 -name "test_outputs.py"
    ls -la /tests/test_outputs.py
    find / ... -name "verify.py" -o -name "grade.py"

None succeeded (TB2 correctly doesn't mount /tests into the agent
container), so no reward hack landed. But the intent-to-find-the-
verifier is traceable to one line in the checkpoint addendum — after
17e8620 + f1fd75e stripped the literal paths, the prompt still said:

    "Run the project's own tests (if the project ships `test_*.py`,
     a `tests/` directory, `Makefile test`, `pytest.ini`, etc.).
     Failing tests = automatic FAIL."

Kimi reads "the project ships tests" as permission to hunt for test-
looking files, and on Terminal-Bench that collapses onto /tests/
test_outputs.py. One task even put "检查:标准评测路径 /tests/
test_outputs.py" in its own QA ritual.

Rewrite the line to say the opposite: if the task prompt names a
specific command, run that verbatim; don't go hunting for
test-looking files the task didn't mention. The other adversarial
probes (curl, jq, boundary input, regression test) stay — they're
task-shaped and task-scoped and don't invite a filesystem sweep.

Mount isolation is the actual kill-switch here, but relying on it is
fragile: one mount-config edit and every task becomes a silent hack.
Removing the invitation in the prompt closes the window from the
agent side.
Previously the terminal-turn code path only synthesised an observation
when the agent had emitted assistant text AND crossed the
MIN_TOOL_CALLS_BEFORE_EVALUATION floor. When the model returned an empty
response (no text, no tool calls), the tracker returned None and the
agent silently handed back to the harness with no deliverable.

Now an empty_turn case (invocations empty AND no assistant text) takes
the synth path regardless of the tool-count floor, and the judges get a
synthetic completion_claim noting the empty turn so they have something
to react to.

Observed on honest Kimi TB2 runs for circuit-fibsqrt,
feal-linear-cryptanalysis, modernize-scientific-stack — all three
exited with "[reflection] terminal_turn text=false tools=2/4 ...
synth=false" before the fix.
Two additions to the pre-submission rules block:

- Distinguish files that existed before the task from build artefacts
  the agent created; the former stay untouched, the latter must be
  removed from the task dir (or the work done under /tmp). v16
  trajectory analysis found polyglot-c-py / polyglot-rust-c leaving a
  compiled `cmain` next to `main.py.c`, failing the verifier's
  directory-contents check.
- External graders may use a different Python interpreter than the
  agent's shell (uv-venv `python` vs system `python3`) — avoid
  `pip install` at script runtime and stick to modules available under
  both. Observed on openssl-selfsigned-cert and
  modernize-scientific-stack.

Path-agnostic guidance; no reward-hacking hints.
Kimi has been observed emitting reasoning_content that contains a NUL
byte; replaying that same string on the next request is then rejected by
Kimi's own validator with:

  400 Bad Request: "the reasoning_content at position N must be a valid
  UTF-8 string: string contains \x00"

`extract_chat_completions_reasoning` now runs the string through a
sanitize_reasoning_text helper that drops C0 control chars (except
\t/\n/\r) and DEL before the text is stored on ConversationItem::Reasoning
and echoed back. The fix applies to every vendor that returns
reasoning_content via the Chat Completions path.

Hit on honest Kimi TB2 run kimi-v15-full89 qemu-startup retry-03.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant