post-training-lab

Controlled comparison of LLM post-training methods (SFT, DPO, KTO, RLHF/PPO, GRPO) on pretrained Qwen/Qwen2.5-0.5B with one eval harness. For scope, phase plan, datasets, compute budget, and anti-goals, see PROJECT.md.

📖 Docs: agaonker.github.io/post-training-lab

Status: Phase 2 (DPO) closed. Phase 3 (Reward Model + PPO) next. Three method rows in the comparison table below; each phase adds one. The README is updated as each phase lands.

Quickstart

uv sync --extra dev --extra eval

Results

All runs: vLLM/bf16, Modal L4. IFEval = prompt_level_strict_acc,none. GSM8K = exact_match,strict-match. Each row carries its own config_hash — same fingerprint means apples-to-apples; different fingerprints mean the model / tokenizer / adapter config differs (deliberately).

Method	config_hash	MMLU	GSM8K	TruthfulQA	IFEval (prompt strict)	Judge win-rate vs SFT
base (pretrained Qwen2.5-0.5B)	`fde0720e`	0.4813	0.3389	0.3988	0.1238	—
sft_v2 (UltraChat-200k, 5k, QLoRA r=16, 1 epoch)	`b133712d`	0.4713	0.3450	0.3893	0.1201	—
dpo_v1 (UltraFeedback, 5k pairs, β=0.1, 1 epoch)	`d53fd258`	0.4802	0.3495	0.3958	0.1275	Phase 6
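To illustrate how a fingerprint like config_hash can make rows comparable, here is a minimal sketch. It is not the repo's implementation: the function name `config_fingerprint`, the config keys, and the choice of SHA-256 truncated to 8 hex characters are all assumptions made for illustration.

```python
import hashlib
import json


def config_fingerprint(config: dict, length: int = 8) -> str:
    """Return a short, order-independent fingerprint of a config dict.

    Hypothetical helper: the real config_hash may be computed differently.
    """
    # Canonical JSON (sorted keys, fixed separators) so that key order
    # does not change the hash.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# Illustrative configs only; the key names are assumptions.
base = {"model": "Qwen/Qwen2.5-0.5B", "dtype": "bfloat16", "adapter": None}
sft = {**base, "adapter": {"type": "qlora", "r": 16}}

print(config_fingerprint(base))
print(config_fingerprint(sft))
```

Under this scheme two runs share a fingerprint only if every hashed field matches, which is the property the table relies on when it treats equal hashes as apples-to-apples.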

Deltas

Metric	sft_v2 − base	dpo_v1 − sft_v2	dpo_v1 − base
MMLU	−1.00pp	+0.89pp	−0.11pp
GSM8K strict	+0.61pp	+0.45pp	+1.06pp
TruthfulQA	−0.95pp	+0.65pp	−0.30pp
IFEval prompt-strict	−0.37pp (flat)	+0.74pp	+0.37pp
IFEval inst-strict	+1.20pp	+0.24pp	+1.44pp
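The deltas are plain differences of the scores in the Results table, expressed in percentage points. A small sketch that recomputes them, with the scores copied from that table (IFEval inst-strict is omitted because its raw scores are not listed there); the `scores` dict and `delta_pp` helper are illustrative names, not part of the repo:

```python
# Scores copied from the Results table above.
scores = {
    "base": {"MMLU": 0.4813, "GSM8K": 0.3389, "TruthfulQA": 0.3988, "IFEval": 0.1238},
    "sft_v2": {"MMLU": 0.4713, "GSM8K": 0.3450, "TruthfulQA": 0.3893, "IFEval": 0.1201},
    "dpo_v1": {"MMLU": 0.4802, "GSM8K": 0.3495, "TruthfulQA": 0.3958, "IFEval": 0.1275},
}


def delta_pp(method: str, reference: str, metric: str) -> float:
    """Difference method - reference in percentage points, rounded to 2 places."""
    return round((scores[method][metric] - scores[reference][metric]) * 100, 2)


for metric in ("MMLU", "GSM8K", "TruthfulQA", "IFEval"):
    row = [
        delta_pp("sft_v2", "base", metric),
        delta_pp("dpo_v1", "sft_v2", metric),
        delta_pp("dpo_v1", "base", metric),
    ]
    print(metric, " ".join(f"{d:+.2f}pp" for d in row))
```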

The headline finding so far: SFT on a 0.5B model with 5k UltraChat rows was flat on IFEval prompt-strict, the hardest IFEval metric (−0.37pp vs base). DPO on top of that SFT checkpoint improved every lm-eval metric relative to sft_v2 and put dpo_v1 slightly above the pretrained base on IFEval prompt-strict (+0.37pp), making it the first policy in this lab to clear that bar. The Phase 2 formal success criterion (LLM-judge pairwise win-rate) is gated on Phase 6 and has not yet been measured. See experiments/003_dpo_qwen05b.md.
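For readers unfamiliar with the role of β in the dpo_v1 row, the standard per-pair DPO loss is −log σ(β · [(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]). The sketch below evaluates it in plain Python for one preference pair. It is a textbook illustration with made-up log-probabilities, not the training code used in this lab; only β=0.1 comes from the table.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss from summed sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up numbers: a policy identical to the reference sits at log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# A policy that raises the chosen response relative to the reference scores lower.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

A smaller β means the same log-ratio gap produces a smaller margin, so the policy can drift further from the reference before the loss saturates.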

Learnings

LESSONS.md — terse one-liners with dates, grouped by category (base model, tokenizer, training framework, HF Hub, wandb, process).
