embed-perf-quality codex review follow-ups (PR #105-#115)

## Context

The `embed-perf-quality` show shipped 11 PRs (#105–#115) for tokenizer parity + embed quality fixes + HF parity regression gate. khive's ship-criterion models (BGE-small, MultilingualE5-small, all-MiniLM-L6-v2, paraphrase-multilingual-MiniLM-L12-v2) all pass HF parity at **cosine ≥ 0.999** on the 5 measured inputs.

Codex adversarial review (round 1) was run on all 11 PRs in parallel via `li agent -a reviewer --effort high`. Verdicts were REQUEST CHANGES across the board, but findings are predominantly **edge cases or systemic infra gaps that don't affect khive's actual production embeddings**.

Per Ocean's directive (2026-05-25): ship lattice with documented limitations, file follow-ups, launch khive. This issue tracks the deferred work.

## High-priority follow-ups (blockers for clean codex APPROVE, not for khive ship)

### F1 — CI silent-skip gate gap (PR #105, #111, #114)

**Severity**: High / Blocker (per codex)
**Files**: `crates/inference/tests/audit_tokenizer_parity.rs:74,132,192`, `crates/embed/tests/embed_parity_vs_hf.rs:151,158,223,230`, `crates/embed/tests/tokenizer_parity_e2e.rs`, `scripts/ci.sh:20`, `.github/workflows/ci.yml:42`

**Issue**: Audit parity tests and HF parity regression test all `return` early (silently SKIP) when HF snapshots / `~/.lattice/models/` weights are absent. CI does not provision these — `scripts/ci.sh` runs `cargo test` without downloading any models. On a clean CI runner, all parity tests "pass" without exercising a single case. **The advertised regression gate is not actually gating CI.**

**Fix options**:
1. Commit minimal tokenizer fixtures (~2-5 MB total) into `crates/inference/tests/fixtures/` and reroute `load_tokenizer()` test path to read from the repo, not `~/.cache/huggingface/`. Weights tests can stay skippable.
2. Add a CI step (`huggingface-cli download` or similar) that provisions the snapshots before tests run. Costs ~5-10 min of CI time per build.
3. Make missing snapshots a hard failure in CI (`LATTICE_REQUIRE_FIXTURES=1` env or similar), with local opt-out.

Option 1 is cleanest for tokenizer-only tests. Option 2 needed if vector parity tests must run in CI.

### F2 — Qwen3 BPE `\s+(?!\S)` whitespace regex semantics (PR #107)

**Severity**: Major (per codex)
**Files**: `crates/inference/src/tokenizer/bpe.rs:974,977`

**Issue**: The hand-coded GPT-4 regex implementation only treats `\s+(?!\S)` as "trailing whitespace at end of input". HF's regex permits matching when followed by ANY whitespace. Result: lattice diverges on inputs with repeated internal whitespace.

**Measured divergences** (codex):
- `"foo  bar"`: HF `[7975, 220, 3619, 151643]` vs lattice `[7975, 256, 2257, 151643]`
- `"foo   bar"`: HF `[7975, 256, 3619, 151643]` vs lattice `[7975, 262, 2257, 151643]`
- `"foo\t\tbar"`: HF `[7975, 197, 90709, 151643]` vs lattice `[7975, 298, 2257, 151643]`
- `"foo \tbar"`: HF `[7975, 220, 90709, 151643]` vs lattice `[7975, 7018, 2257, 151643]`

**khive impact**: Khive uses MiniLM/E5, not Qwen3-Embedding (it's `#[ignore]`d in the parity gate). No production impact today.

**Fix**: Implement actual `\s+(?!\S)` semantics — emit all but the last whitespace char as one piece, leaving the final whitespace for the following letter/punctuation branch. Add `\"foo  bar\"`, `\"foo   bar\"`, `\"foo\\t\\tbar\"`, `\"foo \\tbar\"` to `audit_tokenizer_parity.rs::qwen3_embedding_0_6b_bpe_parity`.

### F3 — WordPiece AddedToken structured fields not enforced (PR #108)

**Severity**: Major (per codex)
**Files**: `crates/inference/src/tokenizer/common.rs:729`, `crates/inference/src/tokenizer/wordpiece.rs:527,676`

**Issue**: PR #108 added longest-match scan for added tokens but `parse_added_tokens` still returns `HashMap<String, u32>` (content+id only). The structured fields `single_word`, `lstrip`, `rstrip`, `normalized` from HF's `AddedToken` are parsed away. Also, `tokenize_pair()` bypasses the new scanner entirely.

**Concrete behavior**: HF's `AddedToken("ing", single_word=True)` does NOT match inside `"tokenizing"`. Lattice would match it. None of the shipping models (BGE/E5/MiniLM/paraphrase) have non-default AddedToken metadata, so this is edge-case.

**Fix**: Introduce internal `AddedToken { id, content, single_word, lstrip, rstrip, normalized, special }` struct. Parse all fields. Enforce in match logic. Refactor `tokenize_pair` to share the added-token-aware payload scanner.

### F4 — WordPiece NFD voicing mark stripping (PR #109)

**Severity**: Major (per codex)
**Files**: `crates/inference/src/tokenizer/wordpiece.rs:722`

**Issue**: PR #109 handles precomposed kana (`で` → `て`) via `fold_diacritic`, but doesn't strip U+3099 / U+309A combining marks themselves. Already-decomposed NFD input (`て\u{3099}`) still survives normalization with the standalone combining mark as a separate token.

**Fix**: Add `0x3099..=0x309A` to `is_combining_mark`. Keep `U+309B/U+309C` and `U+FF9E/U+FF9F` excluded (HF BertNormalizer NFD leaves spacing/halfwidth marks intact).

**khive impact**: None — khive's content arrives in NFC form, not NFD.

### F5 — SentencePiece global whitespace stripping + literal `▁` data loss (PR #110)

**Severity**: 2 Major (per codex)
**Files**: `crates/inference/src/tokenizer/sentencepiece.rs` (the trailing-ws strip logic)

**Issue**: PR #110's `META_SPACE` strip is too aggressive — it strips trailing `▁` unconditionally, but HF's Metaspace allows literal `▁` characters as part of legitimate user content (rare but valid). Also, the `escape_whitespaces=false` ASCII path has different semantics than the metaspace path that aren't fully aligned.

**khive impact**: None — khive text doesn't include literal `▁`. But future Unicode-heavy use could expose this.

**Fix**: Distinguish "trailing whitespace as space character" from "trailing literal `▁` as content". Per-codepoint check, not blanket strip.

### F6 — Service-path tokenizer test coverage overstated (PR #111)

**Severity**: Major (per codex)
**Files**: `crates/embed/tests/tokenizer_parity_e2e.rs:7,11,111,226`

**Issue**: The e2e test calls `load_tokenizer(&dir)` directly, NOT through `NativeEmbeddingService`. The Qwen path's actual service flow goes through `load_qwen_model()` + `QwenModel::tokenize_for_embedding()` which double-appends EOS — the e2e test's "expected IDs include EOS at end" misses this double-EOS path.

**Fix**: Refactor the e2e test to spin up `NativeEmbeddingService` and assert tokens through `.embed_query()` / `.embed_passage()` instrumentation. May require adding an `internal_tokenize_for_test()` hook to the service.

### F7 — Generic cache key migration not preserved (PR #112)

**Severity**: Major (per codex)
**Files**: `crates/embed/src/cache.rs` (CacheKey signature), `crates/embed/src/service/cached.rs`

**Issue**: PR #112 added role-aware cache keys but the implementation **changes** the `compute_key()` signature in a way that makes existing on-disk caches with `role:generic` not match the new Generic computation. Migration claim in PR description ("Old caches remain readable under Generic") may not actually hold.

**Fix**: Verify with a test that loads a v0-format cache file and confirms Generic role hits. If broken, either preserve old hash format under Generic or add a cache schema version bump + migration step.

### F8 — BGE query_instruction omitted (PR #112)

**Severity**: Major (per codex)
**Files**: `crates/embed/src/model.rs::query_instruction()`

**Issue**: BGE model card specifies a retrieval query instruction (`\"Represent this sentence for searching relevant passages: \"`). PR #112's `query_instruction()` returns `None` for all BGE variants. `embed_query()` for BGE produces no-prefix embeddings, which is wrong per HF model card.

**khive impact**: khive uses MiniLM/E5/paraphrase. BGE has no prompt OR mean-pool currently, only CLS-pool — instruction is for retrieval queries specifically. Defer until BGE retrieval is actually used.

**Fix**: Either (a) return BGE retrieval query instruction for `BgeSmallEnV15`/`BgeBaseEnV15`/`BgeLargeEnV15`, or (b) explicitly document that BGE query prompting is caller-owned and remove `embed_query()` semantics for BGE.

### F9 — BertModel default-mean fallthrough (PR #113)

**Severity**: Major (per codex)
**Files**: `crates/inference/src/model/bert.rs`

**Issue**: Direct `BertModel` callers (outside the embed service) still get `Mean` as default pooling. If a downstream consumer constructs a `BertModel` directly for BGE, they'll get wrong-pool semantics unless they remember to call `set_pooling(BertPooling::CLS)`.

**Fix**: Either change the default to `None` (require explicit set_pooling), or have `BertModel::from_pretrained_with_config()` infer pooling from model_id at construction time.

### F10 — E5 golden idempotency + max_abs_diff not enforced (PR #114)

**Severity**: Blocker 2 + Medium (per codex)
**Files**: `crates/embed/tests/embed_parity_vs_hf.rs`, `scripts/gen_embed_parity_goldens.py`

**Issue**: The `MAX_ABS_DIFF_F32 = 1e-3` constant is documented as a tolerance but never asserted in the test. Also, E5 golden regeneration is NOT byte-idempotent between identical runs of the generator (observed during the show — fixture mtime changed but content actually was identical, suggesting some non-determinism worth tracking down).

**Fix**: Enforce max_abs_diff in `run_parity_for_model()`. Add a CI step that regenerates a fixture and asserts byte-identity to the committed version (or document why drift is expected).

### F11 — Qwen `#[ignore]` masks the 2 passing cases (PR #114)

**Severity**: Major (per codex)

**Issue**: `#[ignore]` on the whole `qwen3_embedding_0_6b_parity_vs_hf` test disables coverage for the 2 inputs that DO pass at cosine ≥ 0.995, not just the 3 failing ones.

**Fix**: Split Qwen test into 2 functions — one for the 2 known-passing inputs (active), one for the 3 known-failing inputs (`#[ignore]` pending lattice#103).

### F12 — HF snapshot "latest" selection is alphabetical, not chronological (PR #115)

**Severity**: Medium (per codex)
**Files**: `scripts/gen_embed_parity_goldens.py::find_hf_cache_snapshot()`

**Issue**: `sorted(snapshots_dir.iterdir())[-1]` picks alphabetically last snapshot hash, not chronologically latest. If HF cache has multiple snapshots, regeneration may pick an old one silently.

**Fix**: Sort by `mtime`, not by name.

## Low-priority follow-ups

### F13 — Required PR attribution line missing (all 11 PRs)

Per `AGENTS.md:11`, agent-authored PR bodies must start with `> _PR description authored by Claude (Anthropic agent) on behalf of @ohdearquant._`. All 11 PRs miss this — they have the "Generated with Claude Code" footer instead.

### F14 — `make bench-compare` evidence missing (multiple PRs)

Per `CLAUDE.md:7`, every PR touching `crates/inference/` or `crates/embed/` must include bench-compare output. PRs #106, #107, #108, #110 lack it (tokenizer-only correctness fixes). Suggest amending CLAUDE.md to exempt non-perf changes, OR add bench-compare evidence to PR descriptions.

### F15 — Stale embed docs after API changes (PR #112)

`crates/embed/CONFIG.md:82`, `crates/embed/DESIGN.md:18`, `crates/embed/README.md:123` still describe the pre-#112 API (single embed(), no roles, no role-aware cache).

**Fix**: Refresh after merging the role-aware stack.

## Ship rationale

Per Ocean's directive 2026-05-25: ship lattice v0.2.5 with these follow-ups documented. khive's recall pipeline needs the new role-aware embed API + parity regression infrastructure to start its work, and the production model set (MiniLM/paraphrase) is correct. Address F1-F12 in a follow-up sweep after khive launch validates the new pipeline end-to-end.

---
Filed by: Ocean (via Claude Code) as part of the embed-perf-quality show. Umbrella PR: #104.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

embed-perf-quality codex review follow-ups (PR #105-#115) #116

Context

High-priority follow-ups (blockers for clean codex APPROVE, not for khive ship)

F1 — CI silent-skip gate gap (PR #105, #111, #114)

F2 — Qwen3 BPE `\s+(?!\S)` whitespace regex semantics (PR #107)

F3 — WordPiece AddedToken structured fields not enforced (PR #108)

F4 — WordPiece NFD voicing mark stripping (PR #109)

F5 — SentencePiece global whitespace stripping + literal `▁` data loss (PR #110)

F6 — Service-path tokenizer test coverage overstated (PR #111)

F7 — Generic cache key migration not preserved (PR #112)

F8 — BGE query_instruction omitted (PR #112)

F9 — BertModel default-mean fallthrough (PR #113)

F10 — E5 golden idempotency + max_abs_diff not enforced (PR #114)

F11 — Qwen `#[ignore]` masks the 2 passing cases (PR #114)

F12 — HF snapshot "latest" selection is alphabetical, not chronological (PR #115)

Low-priority follow-ups

F13 — Required PR attribution line missing (all 11 PRs)

F14 — `make bench-compare` evidence missing (multiple PRs)

F15 — Stale embed docs after API changes (PR #112)

Ship rationale

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

embed-perf-quality codex review follow-ups (PR #105-#115) #116

Description

Context

High-priority follow-ups (blockers for clean codex APPROVE, not for khive ship)

F1 — CI silent-skip gate gap (PR #105, #111, #114)

F2 — Qwen3 BPE \s+(?!\S) whitespace regex semantics (PR #107)

F3 — WordPiece AddedToken structured fields not enforced (PR #108)

F4 — WordPiece NFD voicing mark stripping (PR #109)

F5 — SentencePiece global whitespace stripping + literal ▁ data loss (PR #110)

F6 — Service-path tokenizer test coverage overstated (PR #111)

F7 — Generic cache key migration not preserved (PR #112)

F8 — BGE query_instruction omitted (PR #112)

F9 — BertModel default-mean fallthrough (PR #113)

F10 — E5 golden idempotency + max_abs_diff not enforced (PR #114)

F11 — Qwen #[ignore] masks the 2 passing cases (PR #114)

F12 — HF snapshot "latest" selection is alphabetical, not chronological (PR #115)

Low-priority follow-ups

F13 — Required PR attribution line missing (all 11 PRs)

F14 — make bench-compare evidence missing (multiple PRs)

F15 — Stale embed docs after API changes (PR #112)

Ship rationale

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

F2 — Qwen3 BPE `\s+(?!\S)` whitespace regex semantics (PR #107)

F5 — SentencePiece global whitespace stripping + literal `▁` data loss (PR #110)

F11 — Qwen `#[ignore]` masks the 2 passing cases (PR #114)

F14 — `make bench-compare` evidence missing (multiple PRs)