bench(quality): PyTorch generation quality comparison harness #89

## Context

Issue #84 covers cross-framework throughput benchmarks (tok/s). This issue covers **generation quality** — does lattice produce the same token distributions as PyTorch given identical model weights + inputs?

Without this, we can claim speed but cannot prove correctness at the generation level. A subtle bug in attention masking, RoPE, or sampling could produce fast nonsense.

## Goal

For Qwen3.5-0.8B (focus model), prove that lattice's generation matches PyTorch's within acceptable bounds.

## Test dimensions

| Dimension | Values |
|---|---|
| Model | Qwen3.5-0.8B (FP16 and Q4 QuaRot) |
| Prompt set | 50-100 representative prompts (instruction-following, code, reasoning) |
| Sampling | Greedy (deterministic comparison), temp=0.7 (distribution comparison) |
| Metric | Per-token KL divergence, exact-match rate at greedy, MMLU subset accuracy |

## Deliverables

1. `scripts/quality_pytorch_compare.py` — loads HF Qwen3.5-0.8B, generates from prompt set, dumps `(prompt, generated_tokens, logits[0:100])` per prompt
2. `scripts/quality_lattice_compare.sh` — runs lattice on same prompts, same dump format
3. `scripts/quality_compare.py` — reads both dumps, computes:
   - Greedy match rate (% of prompts where first 50 tokens are identical)
   - Mean per-token KL divergence on logits
   - First-divergence-position histogram (where does lattice start drifting)
4. Output: markdown report with per-prompt diffs for the worst cases, summary stats at top
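
The three statistics in deliverable 3 could be computed roughly as follows. This is a minimal sketch, not the harness itself: the function names are hypothetical, the KL direction (PyTorch as the reference distribution) is an assumption, and it takes plain Python lists rather than whatever dump format the scripts end up using.

```python
import math
from collections import Counter


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for a single token position.

    Assumption: PyTorch is the reference (ref), lattice is the test.
    """
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def first_divergence(ref_tokens, test_tokens, window):
    """Index of the first mismatch in the first `window` tokens, else None."""
    ref, test = ref_tokens[:window], test_tokens[:window]
    for i, (a, b) in enumerate(zip(ref, test)):
        if a != b:
            return i
    # One sequence stopped early: count that as a divergence at the cut-off.
    if len(ref) != len(test):
        return min(len(ref), len(test))
    return None


def summarize(pairs, window=50):
    """pairs: list of (ref_tokens, test_tokens), one entry per prompt.

    Returns (greedy_match_rate, first_divergence_histogram).
    """
    positions = [first_divergence(r, t, window) for r, t in pairs]
    matches = sum(1 for p in positions if p is None)
    histogram = Counter(p for p in positions if p is not None)
    return matches / len(pairs), histogram
```

The `window` parameter is left configurable because the deliverable compares the first 50 tokens while the acceptance criteria refer to the first 10.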

## Acceptance criteria

For Qwen3.5-0.8B FP16:
- Greedy match rate ≥ 99% for first 10 tokens
- Mean per-token KL ≤ 0.01 nats over first 100 logits

For Qwen3.5-0.8B Q4 QuaRot:
- Greedy match rate ≥ 95% for first 10 tokens (quantization error allowed)
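
These thresholds could be encoded as a pass/fail gate at the end of the comparison script. The sketch below is illustrative only: the variant keys, the `check_acceptance` name, and the choice to leave Q4 QuaRot without a KL bound (none is stated above) are assumptions.

```python
# Thresholds taken from the acceptance criteria; the keys are hypothetical names.
THRESHOLDS = {
    "fp16": {"min_greedy_match": 0.99, "max_mean_kl": 0.01},
    # No KL bound is stated for Q4 QuaRot, so none is enforced here.
    "q4_quarot": {"min_greedy_match": 0.95, "max_mean_kl": None},
}


def check_acceptance(variant, greedy_match, mean_kl):
    """Return a list of failure messages; an empty list means the run passes."""
    limits = THRESHOLDS[variant]
    failures = []
    if greedy_match < limits["min_greedy_match"]:
        failures.append(
            f"greedy match {greedy_match:.3f} < {limits['min_greedy_match']:.2f}"
        )
    if limits["max_mean_kl"] is not None and mean_kl > limits["max_mean_kl"]:
        failures.append(
            f"mean KL {mean_kl:.4f} nats > {limits['max_mean_kl']:.2f}"
        )
    return failures
```

Returning messages rather than a bare boolean lets the markdown report list which criterion failed for each model variant.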

## Why this matters

Issue #88 (LoRA fine-tuning) needs this harness to validate that trained adapters work: if lattice and PyTorch diverge when running the same LoRA adapter on the same base weights, the bug is in lattice's inference path, not in training.

## Priority

P1 — blocking confident claims about lattice quality. Can be tackled in parallel with #88.

## References

- Issue #84 — cross-framework throughput
- ADR-001 — pure Rust transformer
- `scripts/bench_apples_to_apples.sh` — existing throughput comparison
