bench(quality): PyTorch generation quality comparison harness #89

@ohdearquant

Context

Issue #84 covers cross-framework throughput benchmarks (tok/s). This issue covers generation quality — does lattice produce the same token distributions as PyTorch given identical model weights + inputs?

Without this, we can claim speed but cannot prove correctness at the generation level. A subtle bug in attention masking, RoPE, or sampling could produce fast nonsense.

Goal

For Qwen3.5-0.8B (focus model), prove that lattice's generation matches PyTorch's within acceptable bounds.

Test dimensions

| Dimension | Values |
|---|---|
| Model | Qwen3.5-0.8B (FP16 and Q4 QuaRot) |
| Prompt set | 50-100 representative prompts (instruction-following, code, reasoning) |
| Sampling | Greedy (deterministic comparison), temp=0.7 (distribution comparison) |
| Metric | Per-token KL divergence, exact-match rate at greedy, MMLU subset accuracy |
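
The metric row names per-token KL divergence without fixing a direction or a normalization. One plausible reading, stated here as an assumption rather than as the issue's definition, treats PyTorch as the reference distribution and applies softmax over the full vocabulary at each position. The symbols $z_t$, $P_t$, $Q_t$, $V$, and $T$ are introduced for illustration only:

```latex
% Assumption: PyTorch is the reference (P), lattice is the candidate (Q).
\[
P_t = \operatorname{softmax}\!\left(z_t^{\text{pytorch}}\right), \qquad
Q_t = \operatorname{softmax}\!\left(z_t^{\text{lattice}}\right)
\]
\[
D_{\mathrm{KL}}\!\left(P_t \,\|\, Q_t\right) = \sum_{v \in V} P_t(v)\,\log\frac{P_t(v)}{Q_t(v)},
\qquad
\overline{D}_{\mathrm{KL}} = \frac{1}{T}\sum_{t=1}^{T} D_{\mathrm{KL}}\!\left(P_t \,\|\, Q_t\right)
\]
```

With the natural logarithm the result is in nats, which matches the unit used in the acceptance criteria.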

Deliverables

  1. scripts/quality_pytorch_compare.py — loads HF Qwen3.5-0.8B, generates from prompt set, dumps (prompt, generated_tokens, logits[0:100]) per prompt
  2. scripts/quality_lattice_compare.sh — runs lattice on same prompts, same dump format
  3. scripts/quality_compare.py — reads both dumps, computes:
    • Greedy match rate (% of prompts where first 50 tokens are identical)
    • Mean per-token KL divergence on logits
    • First-divergence-position histogram (where does lattice start drifting)
  4. Output: markdown report with per-prompt diffs for the worst cases, summary stats at top
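
The three statistics listed for scripts/quality_compare.py could be computed roughly as follows. This is a minimal sketch, not the script itself: it assumes each dump has already been parsed into plain Python lists of token ids and logit vectors, it takes PyTorch as the KL reference, and every function name here is hypothetical.

```python
import math
from collections import Counter


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats. Assumption: ref is PyTorch, test is lattice."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def first_divergence(ref_tokens, test_tokens):
    """Index of the first differing token, or None if the sequences agree."""
    for i, (a, b) in enumerate(zip(ref_tokens, test_tokens)):
        if a != b:
            return i
    if len(ref_tokens) != len(test_tokens):
        # One sequence is a strict prefix of the other.
        return min(len(ref_tokens), len(test_tokens))
    return None


def greedy_match_rate(ref_runs, test_runs, n_tokens):
    """Fraction of prompts whose first n_tokens generated tokens are identical."""
    matches = sum(
        1 for r, t in zip(ref_runs, test_runs) if r[:n_tokens] == t[:n_tokens]
    )
    return matches / len(ref_runs)


def divergence_histogram(ref_runs, test_runs):
    """Count of prompts per first-divergence position (matching prompts skipped)."""
    positions = (first_divergence(r, t) for r, t in zip(ref_runs, test_runs))
    return Counter(p for p in positions if p is not None)
```

The mean per-token KL is then the average of `kl_divergence` over all compared positions. Keeping `n_tokens` as a parameter lets the same function serve both the 50-token report figure and the 10-token acceptance check.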

Acceptance criteria

For Qwen3.5-0.8B FP16:

  • Greedy match rate ≥ 99% for the first 10 tokens
  • Mean per-token KL ≤ 0.01 nats over the first 100 logits

For Qwen3.5-0.8B Q4 QuaRot:

  • Greedy match rate ≥ 95% for the first 10 tokens (quantization error allowed)
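
These thresholds can be turned into a pass/fail gate at the end of the comparison run. The sketch below is one possible shape for that check: the variant keys, `Thresholds`, and `check_acceptance` are hypothetical names, and it assumes no KL bound applies to Q4 QuaRot because none is stated above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Thresholds:
    min_greedy_match: float        # minimum match rate over the first 10 tokens
    max_mean_kl: Optional[float]   # nats; None means no KL criterion is stated


# Values copied from the acceptance criteria; the dictionary keys are hypothetical.
THRESHOLDS = {
    "fp16": Thresholds(min_greedy_match=0.99, max_mean_kl=0.01),
    "q4_quarot": Thresholds(min_greedy_match=0.95, max_mean_kl=None),
}


def check_acceptance(variant: str, greedy_match_rate: float, mean_kl: float) -> List[str]:
    """Return human-readable failures; an empty list means the variant passes."""
    limits = THRESHOLDS[variant]
    failures = []
    if greedy_match_rate < limits.min_greedy_match:
        failures.append(
            f"{variant}: greedy match rate {greedy_match_rate:.3f} "
            f"is below {limits.min_greedy_match:.2f}"
        )
    if limits.max_mean_kl is not None and mean_kl > limits.max_mean_kl:
        failures.append(
            f"{variant}: mean per-token KL {mean_kl:.4f} nats "
            f"exceeds {limits.max_mean_kl:.2f}"
        )
    return failures
```

Returning a list of failures instead of a single boolean lets the markdown report state exactly which criterion was missed.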

Why this matters

Issue #88 (LoRA fine-tuning) needs this harness to validate that trained adapters work: if lattice and PyTorch diverge when running the same LoRA adapter, the bug is in the inference path, not in training.

Priority

P1 — blocking confident claims about lattice quality. Can be tackled in parallel with #88.
