Context
Issue #84 covers cross-framework throughput benchmarks (tok/s). This issue covers generation quality — does lattice produce the same token distributions as PyTorch given identical model weights + inputs?
Without this, we can claim speed but cannot prove correctness at the generation level. A subtle bug in attention masking, RoPE, or sampling could produce fast nonsense.
Goal
For Qwen3.5-0.8B (focus model), show that lattice's generation matches PyTorch's within the bounds listed under Acceptance criteria.
Test dimensions
| Dimension | Values |
| --- | --- |
| Model | Qwen3.5-0.8B (FP16 and Q4 QuaRot) |
| Prompt set | 50-100 representative prompts (instruction-following, code, reasoning) |
| Sampling | Greedy (deterministic comparison), temp=0.7 (distribution comparison) |
| Metric | Per-token KL divergence, exact-match rate at greedy, MMLU subset accuracy |
Deliverables
- `scripts/quality_pytorch_compare.py`: loads HF Qwen3.5-0.8B, generates from the prompt set, dumps `(prompt, generated_tokens, logits[0:100])` per prompt
- `scripts/quality_lattice_compare.sh`: runs lattice on the same prompts, same dump format
- `scripts/quality_compare.py`: reads both dumps, computes:
  - Greedy match rate (% of prompts where first 50 tokens are identical)
  - Mean per-token KL divergence on logits
  - First-divergence-position histogram (where does lattice start drifting)
  - Output: markdown report with per-prompt diffs for the worst cases, summary stats at top
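The comparison metrics listed above can be sketched in plain Python. This is a minimal illustration, not the contents of `scripts/quality_compare.py`: the function names are hypothetical, each logits argument is assumed to be a full vocabulary-sized vector for one token position, and the KL direction is assumed to be KL(PyTorch || lattice), which the issue does not specify.

```python
import math
from collections import Counter


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for a single token position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))


def first_divergence(ref_tokens, test_tokens):
    """Index of the first differing token, or None if the shared prefix matches."""
    for i, (a, b) in enumerate(zip(ref_tokens, test_tokens)):
        if a != b:
            return i
    return None


def greedy_match_rate(pairs, n):
    """Fraction of prompts whose first n generated tokens are identical."""
    hits = sum(1 for ref, test in pairs if ref[:n] == test[:n])
    return hits / len(pairs)


def divergence_histogram(pairs):
    """Count of prompts per first-divergence position (None = no divergence)."""
    return Counter(first_divergence(ref, test) for ref, test in pairs)


# Toy token sequences (reference, lattice); made-up data for illustration only.
pairs = [([1, 2, 3, 4], [1, 2, 3, 4]), ([1, 2, 3, 4], [1, 2, 9, 4])]
rate_first_2 = greedy_match_rate(pairs, 2)
rate_first_4 = greedy_match_rate(pairs, 4)
histogram = divergence_histogram(pairs)
```

On the toy data, both prompts agree on the first two tokens and one diverges at position 2, so the match rate drops once the window covers that position. Averaging `kl_divergence` over token positions and prompts would give the mean per-token KL.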
Acceptance criteria
For Qwen3.5-0.8B FP16:
- Greedy match rate ≥ 99% for first 10 tokens
- Mean per-token KL ≤ 0.01 nats over first 100 logits

For Qwen3.5-0.8B Q4 QuaRot:
- Greedy match rate ≥ 95% for first 10 tokens (quantization error allowed)
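These thresholds could be turned into a pass/fail gate along the following lines. The numbers come from the criteria above; the dictionary layout, the variant keys, and the `check_acceptance` function are hypothetical names chosen for this sketch, and the summary values are assumed to be computed elsewhere.

```python
# Thresholds copied from the acceptance criteria; structure and names are illustrative.
THRESHOLDS = {
    "fp16": {"min_greedy_match_10": 0.99, "max_mean_kl_nats": 0.01},
    "q4_quarot": {"min_greedy_match_10": 0.95},  # no KL bound is stated for Q4
}


def check_acceptance(variant, greedy_match_10, mean_kl_nats=None):
    """Return a list of failure messages; an empty list means the variant passes."""
    limits = THRESHOLDS[variant]
    failures = []

    min_match = limits["min_greedy_match_10"]
    if greedy_match_10 < min_match:
        failures.append(
            f"{variant}: greedy match {greedy_match_10:.3f} < {min_match:.2f}"
        )

    max_kl = limits.get("max_mean_kl_nats")
    if max_kl is not None:
        if mean_kl_nats is None:
            failures.append(f"{variant}: mean KL not provided")
        elif mean_kl_nats > max_kl:
            failures.append(
                f"{variant}: mean KL {mean_kl_nats:.4f} nats > {max_kl:.2f}"
            )

    return failures


# Made-up summary values, for illustration only.
fp16_ok = check_acceptance("fp16", greedy_match_10=0.995, mean_kl_nats=0.004)
fp16_bad = check_acceptance("fp16", greedy_match_10=0.98, mean_kl_nats=0.02)
q4_ok = check_acceptance("q4_quarot", greedy_match_10=0.96)
```

Returning a list of messages rather than a single boolean lets the report show every failed criterion at once.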
Why this matters
Issue #88 (LoRA fine-tuning) needs this harness to validate that trained adapters work — if a LoRA-tuned lattice diverges from a LoRA-tuned PyTorch on the same adapter, we know the inference path has bugs (not the training).
Priority
P1 — blocking confident claims about lattice quality. Can be tackled in parallel with #88.
References
- `scripts/bench_apples_to_apples.sh`: existing throughput comparison