Skip to content

embed-perf-quality codex review follow-ups (PR #105-#115) #116

@ohdearquant

Description

@ohdearquant

Context

The embed-perf-quality show shipped 11 PRs (#105#115) for tokenizer parity + embed quality fixes + HF parity regression gate. khive's ship-criterion models (BGE-small, MultilingualE5-small, all-MiniLM-L6-v2, paraphrase-multilingual-MiniLM-L12-v2) all pass HF parity at cosine ≥ 0.999 on the 5 measured inputs.

Codex adversarial review (round 1) was run on all 11 PRs in parallel via li agent -a reviewer --effort high. Verdicts were REQUEST CHANGES across the board, but findings are predominantly edge cases or systemic infra gaps that don't affect khive's actual production embeddings.

Per Ocean's directive (2026-05-25): ship lattice with documented limitations, file follow-ups, launch khive. This issue tracks the deferred work.

High-priority follow-ups (blockers for clean codex APPROVE, not for khive ship)

F1 — CI silent-skip gate gap (PR #105, #111, #114)

Severity: High / Blocker (per codex)
Files: crates/inference/tests/audit_tokenizer_parity.rs:74,132,192, crates/embed/tests/embed_parity_vs_hf.rs:151,158,223,230, crates/embed/tests/tokenizer_parity_e2e.rs, scripts/ci.sh:20, .github/workflows/ci.yml:42

Issue: Audit parity tests and HF parity regression test all return early (silently SKIP) when HF snapshots / ~/.lattice/models/ weights are absent. CI does not provision these — scripts/ci.sh runs cargo test without downloading any models. On a clean CI runner, all parity tests "pass" without exercising a single case. The advertised regression gate is not actually gating CI.

Fix options:

  1. Commit minimal tokenizer fixtures (~2-5 MB total) into crates/inference/tests/fixtures/ and reroute load_tokenizer() test path to read from the repo, not ~/.cache/huggingface/. Weights tests can stay skippable.
  2. Add a CI step (huggingface-cli download or similar) that provisions the snapshots before tests run. Costs ~5-10 min of CI time per build.
  3. Make missing snapshots a hard failure in CI (LATTICE_REQUIRE_FIXTURES=1 env or similar), with local opt-out.

Option 1 is cleanest for tokenizer-only tests. Option 2 needed if vector parity tests must run in CI.

F2 — Qwen3 BPE \s+(?!\S) whitespace regex semantics (PR #107)

Severity: Major (per codex)
Files: crates/inference/src/tokenizer/bpe.rs:974,977

Issue: The hand-coded GPT-4 regex implementation only treats \s+(?!\S) as "trailing whitespace at end of input". HF's regex permits matching when followed by ANY whitespace. Result: lattice diverges on inputs with repeated internal whitespace.

Measured divergences (codex):

  • "foo bar": HF [7975, 220, 3619, 151643] vs lattice [7975, 256, 2257, 151643]
  • "foo bar": HF [7975, 256, 3619, 151643] vs lattice [7975, 262, 2257, 151643]
  • "foo\t\tbar": HF [7975, 197, 90709, 151643] vs lattice [7975, 298, 2257, 151643]
  • "foo \tbar": HF [7975, 220, 90709, 151643] vs lattice [7975, 7018, 2257, 151643]

khive impact: Khive uses MiniLM/E5, not Qwen3-Embedding (it's #[ignore]d in the parity gate). No production impact today.

Fix: Implement actual \s+(?!\S) semantics — emit all but the last whitespace char as one piece, leaving the final whitespace for the following letter/punctuation branch. Add \"foo bar\", \"foo bar\", \"foo\\t\\tbar\", \"foo \\tbar\" to audit_tokenizer_parity.rs::qwen3_embedding_0_6b_bpe_parity.

F3 — WordPiece AddedToken structured fields not enforced (PR #108)

Severity: Major (per codex)
Files: crates/inference/src/tokenizer/common.rs:729, crates/inference/src/tokenizer/wordpiece.rs:527,676

Issue: PR #108 added longest-match scan for added tokens but parse_added_tokens still returns HashMap<String, u32> (content+id only). The structured fields single_word, lstrip, rstrip, normalized from HF's AddedToken are parsed away. Also, tokenize_pair() bypasses the new scanner entirely.

Concrete behavior: HF's AddedToken("ing", single_word=True) does NOT match inside "tokenizing". Lattice would match it. None of the shipping models (BGE/E5/MiniLM/paraphrase) have non-default AddedToken metadata, so this is edge-case.

Fix: Introduce internal AddedToken { id, content, single_word, lstrip, rstrip, normalized, special } struct. Parse all fields. Enforce in match logic. Refactor tokenize_pair to share the added-token-aware payload scanner.

F4 — WordPiece NFD voicing mark stripping (PR #109)

Severity: Major (per codex)
Files: crates/inference/src/tokenizer/wordpiece.rs:722

Issue: PR #109 handles precomposed kana () via fold_diacritic, but doesn't strip U+3099 / U+309A combining marks themselves. Already-decomposed NFD input (て\u{3099}) still survives normalization with the standalone combining mark as a separate token.

Fix: Add 0x3099..=0x309A to is_combining_mark. Keep U+309B/U+309C and U+FF9E/U+FF9F excluded (HF BertNormalizer NFD leaves spacing/halfwidth marks intact).

khive impact: None — khive's content arrives in NFC form, not NFD.

F5 — SentencePiece global whitespace stripping + literal data loss (PR #110)

Severity: 2 Major (per codex)
Files: crates/inference/src/tokenizer/sentencepiece.rs (the trailing-ws strip logic)

Issue: PR #110's META_SPACE strip is too aggressive — it strips trailing unconditionally, but HF's Metaspace allows literal characters as part of legitimate user content (rare but valid). Also, the escape_whitespaces=false ASCII path has different semantics than the metaspace path that aren't fully aligned.

khive impact: None — khive text doesn't include literal . But future Unicode-heavy use could expose this.

Fix: Distinguish "trailing whitespace as space character" from "trailing literal as content". Per-codepoint check, not blanket strip.

F6 — Service-path tokenizer test coverage overstated (PR #111)

Severity: Major (per codex)
Files: crates/embed/tests/tokenizer_parity_e2e.rs:7,11,111,226

Issue: The e2e test calls load_tokenizer(&dir) directly, NOT through NativeEmbeddingService. The Qwen path's actual service flow goes through load_qwen_model() + QwenModel::tokenize_for_embedding() which double-appends EOS — the e2e test's "expected IDs include EOS at end" misses this double-EOS path.

Fix: Refactor the e2e test to spin up NativeEmbeddingService and assert tokens through .embed_query() / .embed_passage() instrumentation. May require adding an internal_tokenize_for_test() hook to the service.

F7 — Generic cache key migration not preserved (PR #112)

Severity: Major (per codex)
Files: crates/embed/src/cache.rs (CacheKey signature), crates/embed/src/service/cached.rs

Issue: PR #112 added role-aware cache keys but the implementation changes the compute_key() signature in a way that makes existing on-disk caches with role:generic not match the new Generic computation. Migration claim in PR description ("Old caches remain readable under Generic") may not actually hold.

Fix: Verify with a test that loads a v0-format cache file and confirms Generic role hits. If broken, either preserve old hash format under Generic or add a cache schema version bump + migration step.

F8 — BGE query_instruction omitted (PR #112)

Severity: Major (per codex)
Files: crates/embed/src/model.rs::query_instruction()

Issue: BGE model card specifies a retrieval query instruction (\"Represent this sentence for searching relevant passages: \"). PR #112's query_instruction() returns None for all BGE variants. embed_query() for BGE produces no-prefix embeddings, which is wrong per HF model card.

khive impact: khive uses MiniLM/E5/paraphrase. BGE has no prompt OR mean-pool currently, only CLS-pool — instruction is for retrieval queries specifically. Defer until BGE retrieval is actually used.

Fix: Either (a) return BGE retrieval query instruction for BgeSmallEnV15/BgeBaseEnV15/BgeLargeEnV15, or (b) explicitly document that BGE query prompting is caller-owned and remove embed_query() semantics for BGE.

F9 — BertModel default-mean fallthrough (PR #113)

Severity: Major (per codex)
Files: crates/inference/src/model/bert.rs

Issue: Direct BertModel callers (outside the embed service) still get Mean as default pooling. If a downstream consumer constructs a BertModel directly for BGE, they'll get wrong-pool semantics unless they remember to call set_pooling(BertPooling::CLS).

Fix: Either change the default to None (require explicit set_pooling), or have BertModel::from_pretrained_with_config() infer pooling from model_id at construction time.

F10 — E5 golden idempotency + max_abs_diff not enforced (PR #114)

Severity: Blocker 2 + Medium (per codex)
Files: crates/embed/tests/embed_parity_vs_hf.rs, scripts/gen_embed_parity_goldens.py

Issue: The MAX_ABS_DIFF_F32 = 1e-3 constant is documented as a tolerance but never asserted in the test. Also, E5 golden regeneration is NOT byte-idempotent between identical runs of the generator (observed during the show — fixture mtime changed but content actually was identical, suggesting some non-determinism worth tracking down).

Fix: Enforce max_abs_diff in run_parity_for_model(). Add a CI step that regenerates a fixture and asserts byte-identity to the committed version (or document why drift is expected).

F11 — Qwen #[ignore] masks the 2 passing cases (PR #114)

Severity: Major (per codex)

Issue: #[ignore] on the whole qwen3_embedding_0_6b_parity_vs_hf test disables coverage for the 2 inputs that DO pass at cosine ≥ 0.995, not just the 3 failing ones.

Fix: Split Qwen test into 2 functions — one for the 2 known-passing inputs (active), one for the 3 known-failing inputs (#[ignore] pending lattice#103).

F12 — HF snapshot "latest" selection is alphabetical, not chronological (PR #115)

Severity: Medium (per codex)
Files: scripts/gen_embed_parity_goldens.py::find_hf_cache_snapshot()

Issue: sorted(snapshots_dir.iterdir())[-1] picks alphabetically last snapshot hash, not chronologically latest. If HF cache has multiple snapshots, regeneration may pick an old one silently.

Fix: Sort by mtime, not by name.

Low-priority follow-ups

F13 — Required PR attribution line missing (all 11 PRs)

Per AGENTS.md:11, agent-authored PR bodies must start with > _PR description authored by Claude (Anthropic agent) on behalf of @ohdearquant._. All 11 PRs miss this — they have the "Generated with Claude Code" footer instead.

F14 — make bench-compare evidence missing (multiple PRs)

Per CLAUDE.md:7, every PR touching crates/inference/ or crates/embed/ must include bench-compare output. PRs #106, #107, #108, #110 lack it (tokenizer-only correctness fixes). Suggest amending CLAUDE.md to exempt non-perf changes, OR add bench-compare evidence to PR descriptions.

F15 — Stale embed docs after API changes (PR #112)

crates/embed/CONFIG.md:82, crates/embed/DESIGN.md:18, crates/embed/README.md:123 still describe the pre-#112 API (single embed(), no roles, no role-aware cache).

Fix: Refresh after merging the role-aware stack.

Ship rationale

Per Ocean's directive 2026-05-25: ship lattice v0.2.5 with these follow-ups documented. khive's recall pipeline needs the new role-aware embed API + parity regression infrastructure to start its work, and the production model set (MiniLM/paraphrase) is correct. Address F1-F12 in a follow-up sweep after khive launch validates the new pipeline end-to-end.


Filed by: Ocean (via Claude Code) as part of the embed-perf-quality show. Umbrella PR: #104.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions