Context
The embed-perf-quality show shipped 11 PRs (#105–#115) for tokenizer parity + embed quality fixes + HF parity regression gate. khive's ship-criterion models (BGE-small, MultilingualE5-small, all-MiniLM-L6-v2, paraphrase-multilingual-MiniLM-L12-v2) all pass HF parity at cosine ≥ 0.999 on the 5 measured inputs.
Codex adversarial review (round 1) was run on all 11 PRs in parallel via li agent -a reviewer --effort high. Verdicts were REQUEST CHANGES across the board, but findings are predominantly edge cases or systemic infra gaps that don't affect khive's actual production embeddings.
Per Ocean's directive (2026-05-25): ship lattice with documented limitations, file follow-ups, launch khive. This issue tracks the deferred work.
High-priority follow-ups (blockers for clean codex APPROVE, not for khive ship)
F1 — CI silent-skip gate gap (PR #105, #111, #114)
Severity: High / Blocker (per codex)
Files: crates/inference/tests/audit_tokenizer_parity.rs:74,132,192, crates/embed/tests/embed_parity_vs_hf.rs:151,158,223,230, crates/embed/tests/tokenizer_parity_e2e.rs, scripts/ci.sh:20, .github/workflows/ci.yml:42
Issue: Audit parity tests and HF parity regression test all return early (silently SKIP) when HF snapshots / ~/.lattice/models/ weights are absent. CI does not provision these — scripts/ci.sh runs cargo test without downloading any models. On a clean CI runner, all parity tests "pass" without exercising a single case. The advertised regression gate is not actually gating CI.
Fix options:
- Commit minimal tokenizer fixtures (~2-5 MB total) into
crates/inference/tests/fixtures/ and reroute load_tokenizer() test path to read from the repo, not ~/.cache/huggingface/. Weights tests can stay skippable.
- Add a CI step (
huggingface-cli download or similar) that provisions the snapshots before tests run. Costs ~5-10 min of CI time per build.
- Make missing snapshots a hard failure in CI (
LATTICE_REQUIRE_FIXTURES=1 env or similar), with local opt-out.
Option 1 is cleanest for tokenizer-only tests. Option 2 needed if vector parity tests must run in CI.
F2 — Qwen3 BPE \s+(?!\S) whitespace regex semantics (PR #107)
Severity: Major (per codex)
Files: crates/inference/src/tokenizer/bpe.rs:974,977
Issue: The hand-coded GPT-4 regex implementation only treats \s+(?!\S) as "trailing whitespace at end of input". HF's regex permits matching when followed by ANY whitespace. Result: lattice diverges on inputs with repeated internal whitespace.
Measured divergences (codex):
"foo bar": HF [7975, 220, 3619, 151643] vs lattice [7975, 256, 2257, 151643]
"foo bar": HF [7975, 256, 3619, 151643] vs lattice [7975, 262, 2257, 151643]
"foo\t\tbar": HF [7975, 197, 90709, 151643] vs lattice [7975, 298, 2257, 151643]
"foo \tbar": HF [7975, 220, 90709, 151643] vs lattice [7975, 7018, 2257, 151643]
khive impact: Khive uses MiniLM/E5, not Qwen3-Embedding (it's #[ignore]d in the parity gate). No production impact today.
Fix: Implement actual \s+(?!\S) semantics — emit all but the last whitespace char as one piece, leaving the final whitespace for the following letter/punctuation branch. Add \"foo bar\", \"foo bar\", \"foo\\t\\tbar\", \"foo \\tbar\" to audit_tokenizer_parity.rs::qwen3_embedding_0_6b_bpe_parity.
F3 — WordPiece AddedToken structured fields not enforced (PR #108)
Severity: Major (per codex)
Files: crates/inference/src/tokenizer/common.rs:729, crates/inference/src/tokenizer/wordpiece.rs:527,676
Issue: PR #108 added longest-match scan for added tokens but parse_added_tokens still returns HashMap<String, u32> (content+id only). The structured fields single_word, lstrip, rstrip, normalized from HF's AddedToken are parsed away. Also, tokenize_pair() bypasses the new scanner entirely.
Concrete behavior: HF's AddedToken("ing", single_word=True) does NOT match inside "tokenizing". Lattice would match it. None of the shipping models (BGE/E5/MiniLM/paraphrase) have non-default AddedToken metadata, so this is edge-case.
Fix: Introduce internal AddedToken { id, content, single_word, lstrip, rstrip, normalized, special } struct. Parse all fields. Enforce in match logic. Refactor tokenize_pair to share the added-token-aware payload scanner.
F4 — WordPiece NFD voicing mark stripping (PR #109)
Severity: Major (per codex)
Files: crates/inference/src/tokenizer/wordpiece.rs:722
Issue: PR #109 handles precomposed kana (で → て) via fold_diacritic, but doesn't strip U+3099 / U+309A combining marks themselves. Already-decomposed NFD input (て\u{3099}) still survives normalization with the standalone combining mark as a separate token.
Fix: Add 0x3099..=0x309A to is_combining_mark. Keep U+309B/U+309C and U+FF9E/U+FF9F excluded (HF BertNormalizer NFD leaves spacing/halfwidth marks intact).
khive impact: None — khive's content arrives in NFC form, not NFD.
F5 — SentencePiece global whitespace stripping + literal ▁ data loss (PR #110)
Severity: 2 Major (per codex)
Files: crates/inference/src/tokenizer/sentencepiece.rs (the trailing-ws strip logic)
Issue: PR #110's META_SPACE strip is too aggressive — it strips trailing ▁ unconditionally, but HF's Metaspace allows literal ▁ characters as part of legitimate user content (rare but valid). Also, the escape_whitespaces=false ASCII path has different semantics than the metaspace path that aren't fully aligned.
khive impact: None — khive text doesn't include literal ▁. But future Unicode-heavy use could expose this.
Fix: Distinguish "trailing whitespace as space character" from "trailing literal ▁ as content". Per-codepoint check, not blanket strip.
F6 — Service-path tokenizer test coverage overstated (PR #111)
Severity: Major (per codex)
Files: crates/embed/tests/tokenizer_parity_e2e.rs:7,11,111,226
Issue: The e2e test calls load_tokenizer(&dir) directly, NOT through NativeEmbeddingService. The Qwen path's actual service flow goes through load_qwen_model() + QwenModel::tokenize_for_embedding() which double-appends EOS — the e2e test's "expected IDs include EOS at end" misses this double-EOS path.
Fix: Refactor the e2e test to spin up NativeEmbeddingService and assert tokens through .embed_query() / .embed_passage() instrumentation. May require adding an internal_tokenize_for_test() hook to the service.
F7 — Generic cache key migration not preserved (PR #112)
Severity: Major (per codex)
Files: crates/embed/src/cache.rs (CacheKey signature), crates/embed/src/service/cached.rs
Issue: PR #112 added role-aware cache keys but the implementation changes the compute_key() signature in a way that makes existing on-disk caches with role:generic not match the new Generic computation. Migration claim in PR description ("Old caches remain readable under Generic") may not actually hold.
Fix: Verify with a test that loads a v0-format cache file and confirms Generic role hits. If broken, either preserve old hash format under Generic or add a cache schema version bump + migration step.
F8 — BGE query_instruction omitted (PR #112)
Severity: Major (per codex)
Files: crates/embed/src/model.rs::query_instruction()
Issue: BGE model card specifies a retrieval query instruction (\"Represent this sentence for searching relevant passages: \"). PR #112's query_instruction() returns None for all BGE variants. embed_query() for BGE produces no-prefix embeddings, which is wrong per HF model card.
khive impact: khive uses MiniLM/E5/paraphrase. BGE has no prompt OR mean-pool currently, only CLS-pool — instruction is for retrieval queries specifically. Defer until BGE retrieval is actually used.
Fix: Either (a) return BGE retrieval query instruction for BgeSmallEnV15/BgeBaseEnV15/BgeLargeEnV15, or (b) explicitly document that BGE query prompting is caller-owned and remove embed_query() semantics for BGE.
F9 — BertModel default-mean fallthrough (PR #113)
Severity: Major (per codex)
Files: crates/inference/src/model/bert.rs
Issue: Direct BertModel callers (outside the embed service) still get Mean as default pooling. If a downstream consumer constructs a BertModel directly for BGE, they'll get wrong-pool semantics unless they remember to call set_pooling(BertPooling::CLS).
Fix: Either change the default to None (require explicit set_pooling), or have BertModel::from_pretrained_with_config() infer pooling from model_id at construction time.
F10 — E5 golden idempotency + max_abs_diff not enforced (PR #114)
Severity: Blocker 2 + Medium (per codex)
Files: crates/embed/tests/embed_parity_vs_hf.rs, scripts/gen_embed_parity_goldens.py
Issue: The MAX_ABS_DIFF_F32 = 1e-3 constant is documented as a tolerance but never asserted in the test. Also, E5 golden regeneration is NOT byte-idempotent between identical runs of the generator (observed during the show — fixture mtime changed but content actually was identical, suggesting some non-determinism worth tracking down).
Fix: Enforce max_abs_diff in run_parity_for_model(). Add a CI step that regenerates a fixture and asserts byte-identity to the committed version (or document why drift is expected).
F11 — Qwen #[ignore] masks the 2 passing cases (PR #114)
Severity: Major (per codex)
Issue: #[ignore] on the whole qwen3_embedding_0_6b_parity_vs_hf test disables coverage for the 2 inputs that DO pass at cosine ≥ 0.995, not just the 3 failing ones.
Fix: Split Qwen test into 2 functions — one for the 2 known-passing inputs (active), one for the 3 known-failing inputs (#[ignore] pending lattice#103).
F12 — HF snapshot "latest" selection is alphabetical, not chronological (PR #115)
Severity: Medium (per codex)
Files: scripts/gen_embed_parity_goldens.py::find_hf_cache_snapshot()
Issue: sorted(snapshots_dir.iterdir())[-1] picks alphabetically last snapshot hash, not chronologically latest. If HF cache has multiple snapshots, regeneration may pick an old one silently.
Fix: Sort by mtime, not by name.
Low-priority follow-ups
F13 — Required PR attribution line missing (all 11 PRs)
Per AGENTS.md:11, agent-authored PR bodies must start with > _PR description authored by Claude (Anthropic agent) on behalf of @ohdearquant._. All 11 PRs miss this — they have the "Generated with Claude Code" footer instead.
F14 — make bench-compare evidence missing (multiple PRs)
Per CLAUDE.md:7, every PR touching crates/inference/ or crates/embed/ must include bench-compare output. PRs #106, #107, #108, #110 lack it (tokenizer-only correctness fixes). Suggest amending CLAUDE.md to exempt non-perf changes, OR add bench-compare evidence to PR descriptions.
F15 — Stale embed docs after API changes (PR #112)
crates/embed/CONFIG.md:82, crates/embed/DESIGN.md:18, crates/embed/README.md:123 still describe the pre-#112 API (single embed(), no roles, no role-aware cache).
Fix: Refresh after merging the role-aware stack.
Ship rationale
Per Ocean's directive 2026-05-25: ship lattice v0.2.5 with these follow-ups documented. khive's recall pipeline needs the new role-aware embed API + parity regression infrastructure to start its work, and the production model set (MiniLM/paraphrase) is correct. Address F1-F12 in a follow-up sweep after khive launch validates the new pipeline end-to-end.
Filed by: Ocean (via Claude Code) as part of the embed-perf-quality show. Umbrella PR: #104.
Context
The
embed-perf-qualityshow shipped 11 PRs (#105–#115) for tokenizer parity + embed quality fixes + HF parity regression gate. khive's ship-criterion models (BGE-small, MultilingualE5-small, all-MiniLM-L6-v2, paraphrase-multilingual-MiniLM-L12-v2) all pass HF parity at cosine ≥ 0.999 on the 5 measured inputs.Codex adversarial review (round 1) was run on all 11 PRs in parallel via
li agent -a reviewer --effort high. Verdicts were REQUEST CHANGES across the board, but findings are predominantly edge cases or systemic infra gaps that don't affect khive's actual production embeddings.Per Ocean's directive (2026-05-25): ship lattice with documented limitations, file follow-ups, launch khive. This issue tracks the deferred work.
High-priority follow-ups (blockers for clean codex APPROVE, not for khive ship)
F1 — CI silent-skip gate gap (PR #105, #111, #114)
Severity: High / Blocker (per codex)
Files:
crates/inference/tests/audit_tokenizer_parity.rs:74,132,192,crates/embed/tests/embed_parity_vs_hf.rs:151,158,223,230,crates/embed/tests/tokenizer_parity_e2e.rs,scripts/ci.sh:20,.github/workflows/ci.yml:42Issue: Audit parity tests and HF parity regression test all
returnearly (silently SKIP) when HF snapshots /~/.lattice/models/weights are absent. CI does not provision these —scripts/ci.shrunscargo testwithout downloading any models. On a clean CI runner, all parity tests "pass" without exercising a single case. The advertised regression gate is not actually gating CI.Fix options:
crates/inference/tests/fixtures/and rerouteload_tokenizer()test path to read from the repo, not~/.cache/huggingface/. Weights tests can stay skippable.huggingface-cli downloador similar) that provisions the snapshots before tests run. Costs ~5-10 min of CI time per build.LATTICE_REQUIRE_FIXTURES=1env or similar), with local opt-out.Option 1 is cleanest for tokenizer-only tests. Option 2 needed if vector parity tests must run in CI.
F2 — Qwen3 BPE
\s+(?!\S)whitespace regex semantics (PR #107)Severity: Major (per codex)
Files:
crates/inference/src/tokenizer/bpe.rs:974,977Issue: The hand-coded GPT-4 regex implementation only treats
\s+(?!\S)as "trailing whitespace at end of input". HF's regex permits matching when followed by ANY whitespace. Result: lattice diverges on inputs with repeated internal whitespace.Measured divergences (codex):
"foo bar": HF[7975, 220, 3619, 151643]vs lattice[7975, 256, 2257, 151643]"foo bar": HF[7975, 256, 3619, 151643]vs lattice[7975, 262, 2257, 151643]"foo\t\tbar": HF[7975, 197, 90709, 151643]vs lattice[7975, 298, 2257, 151643]"foo \tbar": HF[7975, 220, 90709, 151643]vs lattice[7975, 7018, 2257, 151643]khive impact: Khive uses MiniLM/E5, not Qwen3-Embedding (it's
#[ignore]d in the parity gate). No production impact today.Fix: Implement actual
\s+(?!\S)semantics — emit all but the last whitespace char as one piece, leaving the final whitespace for the following letter/punctuation branch. Add\"foo bar\",\"foo bar\",\"foo\\t\\tbar\",\"foo \\tbar\"toaudit_tokenizer_parity.rs::qwen3_embedding_0_6b_bpe_parity.F3 — WordPiece AddedToken structured fields not enforced (PR #108)
Severity: Major (per codex)
Files:
crates/inference/src/tokenizer/common.rs:729,crates/inference/src/tokenizer/wordpiece.rs:527,676Issue: PR #108 added longest-match scan for added tokens but
parse_added_tokensstill returnsHashMap<String, u32>(content+id only). The structured fieldssingle_word,lstrip,rstrip,normalizedfrom HF'sAddedTokenare parsed away. Also,tokenize_pair()bypasses the new scanner entirely.Concrete behavior: HF's
AddedToken("ing", single_word=True)does NOT match inside"tokenizing". Lattice would match it. None of the shipping models (BGE/E5/MiniLM/paraphrase) have non-default AddedToken metadata, so this is edge-case.Fix: Introduce internal
AddedToken { id, content, single_word, lstrip, rstrip, normalized, special }struct. Parse all fields. Enforce in match logic. Refactortokenize_pairto share the added-token-aware payload scanner.F4 — WordPiece NFD voicing mark stripping (PR #109)
Severity: Major (per codex)
Files:
crates/inference/src/tokenizer/wordpiece.rs:722Issue: PR #109 handles precomposed kana (
で→て) viafold_diacritic, but doesn't strip U+3099 / U+309A combining marks themselves. Already-decomposed NFD input (て\u{3099}) still survives normalization with the standalone combining mark as a separate token.Fix: Add
0x3099..=0x309Atois_combining_mark. KeepU+309B/U+309CandU+FF9E/U+FF9Fexcluded (HF BertNormalizer NFD leaves spacing/halfwidth marks intact).khive impact: None — khive's content arrives in NFC form, not NFD.
F5 — SentencePiece global whitespace stripping + literal
▁data loss (PR #110)Severity: 2 Major (per codex)
Files:
crates/inference/src/tokenizer/sentencepiece.rs(the trailing-ws strip logic)Issue: PR #110's
META_SPACEstrip is too aggressive — it strips trailing▁unconditionally, but HF's Metaspace allows literal▁characters as part of legitimate user content (rare but valid). Also, theescape_whitespaces=falseASCII path has different semantics than the metaspace path that aren't fully aligned.khive impact: None — khive text doesn't include literal
▁. But future Unicode-heavy use could expose this.Fix: Distinguish "trailing whitespace as space character" from "trailing literal
▁as content". Per-codepoint check, not blanket strip.F6 — Service-path tokenizer test coverage overstated (PR #111)
Severity: Major (per codex)
Files:
crates/embed/tests/tokenizer_parity_e2e.rs:7,11,111,226Issue: The e2e test calls
load_tokenizer(&dir)directly, NOT throughNativeEmbeddingService. The Qwen path's actual service flow goes throughload_qwen_model()+QwenModel::tokenize_for_embedding()which double-appends EOS — the e2e test's "expected IDs include EOS at end" misses this double-EOS path.Fix: Refactor the e2e test to spin up
NativeEmbeddingServiceand assert tokens through.embed_query()/.embed_passage()instrumentation. May require adding aninternal_tokenize_for_test()hook to the service.F7 — Generic cache key migration not preserved (PR #112)
Severity: Major (per codex)
Files:
crates/embed/src/cache.rs(CacheKey signature),crates/embed/src/service/cached.rsIssue: PR #112 added role-aware cache keys but the implementation changes the
compute_key()signature in a way that makes existing on-disk caches withrole:genericnot match the new Generic computation. Migration claim in PR description ("Old caches remain readable under Generic") may not actually hold.Fix: Verify with a test that loads a v0-format cache file and confirms Generic role hits. If broken, either preserve old hash format under Generic or add a cache schema version bump + migration step.
F8 — BGE query_instruction omitted (PR #112)
Severity: Major (per codex)
Files:
crates/embed/src/model.rs::query_instruction()Issue: BGE model card specifies a retrieval query instruction (
\"Represent this sentence for searching relevant passages: \"). PR #112'squery_instruction()returnsNonefor all BGE variants.embed_query()for BGE produces no-prefix embeddings, which is wrong per HF model card.khive impact: khive uses MiniLM/E5/paraphrase. BGE has no prompt OR mean-pool currently, only CLS-pool — instruction is for retrieval queries specifically. Defer until BGE retrieval is actually used.
Fix: Either (a) return BGE retrieval query instruction for
BgeSmallEnV15/BgeBaseEnV15/BgeLargeEnV15, or (b) explicitly document that BGE query prompting is caller-owned and removeembed_query()semantics for BGE.F9 — BertModel default-mean fallthrough (PR #113)
Severity: Major (per codex)
Files:
crates/inference/src/model/bert.rsIssue: Direct
BertModelcallers (outside the embed service) still getMeanas default pooling. If a downstream consumer constructs aBertModeldirectly for BGE, they'll get wrong-pool semantics unless they remember to callset_pooling(BertPooling::CLS).Fix: Either change the default to
None(require explicit set_pooling), or haveBertModel::from_pretrained_with_config()infer pooling from model_id at construction time.F10 — E5 golden idempotency + max_abs_diff not enforced (PR #114)
Severity: Blocker 2 + Medium (per codex)
Files:
crates/embed/tests/embed_parity_vs_hf.rs,scripts/gen_embed_parity_goldens.pyIssue: The
MAX_ABS_DIFF_F32 = 1e-3constant is documented as a tolerance but never asserted in the test. Also, E5 golden regeneration is NOT byte-idempotent between identical runs of the generator (observed during the show — fixture mtime changed but content actually was identical, suggesting some non-determinism worth tracking down).Fix: Enforce max_abs_diff in
run_parity_for_model(). Add a CI step that regenerates a fixture and asserts byte-identity to the committed version (or document why drift is expected).F11 — Qwen
#[ignore]masks the 2 passing cases (PR #114)Severity: Major (per codex)
Issue:
#[ignore]on the wholeqwen3_embedding_0_6b_parity_vs_hftest disables coverage for the 2 inputs that DO pass at cosine ≥ 0.995, not just the 3 failing ones.Fix: Split Qwen test into 2 functions — one for the 2 known-passing inputs (active), one for the 3 known-failing inputs (
#[ignore]pending lattice#103).F12 — HF snapshot "latest" selection is alphabetical, not chronological (PR #115)
Severity: Medium (per codex)
Files:
scripts/gen_embed_parity_goldens.py::find_hf_cache_snapshot()Issue:
sorted(snapshots_dir.iterdir())[-1]picks alphabetically last snapshot hash, not chronologically latest. If HF cache has multiple snapshots, regeneration may pick an old one silently.Fix: Sort by
mtime, not by name.Low-priority follow-ups
F13 — Required PR attribution line missing (all 11 PRs)
Per
AGENTS.md:11, agent-authored PR bodies must start with> _PR description authored by Claude (Anthropic agent) on behalf of @ohdearquant._. All 11 PRs miss this — they have the "Generated with Claude Code" footer instead.F14 —
make bench-compareevidence missing (multiple PRs)Per
CLAUDE.md:7, every PR touchingcrates/inference/orcrates/embed/must include bench-compare output. PRs #106, #107, #108, #110 lack it (tokenizer-only correctness fixes). Suggest amending CLAUDE.md to exempt non-perf changes, OR add bench-compare evidence to PR descriptions.F15 — Stale embed docs after API changes (PR #112)
crates/embed/CONFIG.md:82,crates/embed/DESIGN.md:18,crates/embed/README.md:123still describe the pre-#112 API (single embed(), no roles, no role-aware cache).Fix: Refresh after merging the role-aware stack.
Ship rationale
Per Ocean's directive 2026-05-25: ship lattice v0.2.5 with these follow-ups documented. khive's recall pipeline needs the new role-aware embed API + parity regression infrastructure to start its work, and the production model set (MiniLM/paraphrase) is correct. Address F1-F12 in a follow-up sweep after khive launch validates the new pipeline end-to-end.
Filed by: Ocean (via Claude Code) as part of the embed-perf-quality show. Umbrella PR: #104.