A hands-on comparison of LLM preference optimization methods, implemented with TinyLlama (1.1B) and LoRA fine-tuning on a single consumer GPU.
All methods trained on Intel/orca_dpo_pairs (12,859 preference pairs, GPT-4 vs GPT-3.5 responses) for 2 epochs. Evaluated using a trained reward model as judge on 200 held-out prompts.
| Method | Avg Reward | vs Chosen | vs Rejected |
|---|---|---|---|
| ORPO (chat) | 0.7200 | +0.1484 | +0.4108 |
| ORPO (base) | 0.6724 | +0.1008 | +0.3631 |
| KTO (chat) | 0.6617 | +0.0901 | +0.3525 |
| Chosen baseline | 0.5717 | -- | +0.2624 |
| DPO (chat) | 0.4824 | -0.0893 | +0.1731 |
| Rejected baseline | 0.3093 | -0.2624 | -- |
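The two delta columns are simple differences against the baseline averages. A minimal sketch of the aggregation (function name, defaults, and rounding are illustrative, not the repo's actual evaluation code):

```python
def summarize(rewards, chosen_avg=0.5717, rejected_avg=0.3093):
    """Collapse per-prompt judge rewards into the three table columns."""
    avg = sum(rewards) / len(rewards)
    return {
        "avg_reward": round(avg, 4),
        "vs_chosen": round(avg - chosen_avg, 4),      # positive = beats GPT-4 responses
        "vs_rejected": round(avg - rejected_avg, 4),  # positive = beats GPT-3.5 responses
    }
```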
- ORPO performed best overall. Its combined SFT + preference optimization approach was the most effective, and even the base model variant was competitive with fine-tuned chat models.
- KTO showed strong results despite using only binary feedback (good/bad) rather than pairwise comparisons, making it practical for real-world thumbs-up/down data.
- DPO scored below the chosen baseline, suggesting it may be too conservative or mode-seeking with these hyperparameters.
- PPO (RLHF) was not included in the final comparison: fitting all four models (policy, reference, value, reward) into 8GB VRAM required 4-bit quantization of each, which hit dtype compatibility issues with TRL's experimental PPO trainer.
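The conservatism gap between DPO and ORPO is easiest to see in the objectives themselves. A toy scalar sketch, using sequence-level average log-probs; the `beta` and `lam` defaults are illustrative, not the repo's settings:

```python
import math


def _sigmoid(x):
    return 1 / (1 + math.exp(-x))


def dpo_loss(pol_c, pol_r, ref_c, ref_r, beta=0.1):
    """-log sigmoid(beta * [(pol_c - ref_c) - (pol_r - ref_r)]): the policy
    is penalized for drifting from the reference unless the drift widens the
    chosen-vs-rejected margin, which keeps it anchored to the reference."""
    return -math.log(_sigmoid(beta * ((pol_c - ref_c) - (pol_r - ref_r))))


def orpo_loss(nll_c, logp_c, logp_r, lam=0.1):
    """SFT NLL on the chosen response plus a log-odds-ratio penalty on the
    rejected one; no reference model, which is why ORPO trains in one pass."""
    log_odds = lambda lp: lp - math.log(1 - math.exp(lp))  # log(p / (1 - p))
    return nll_c - lam * math.log(_sigmoid(log_odds(logp_c) - log_odds(logp_r)))
```

DPO's loss is pure margin against a frozen reference, while ORPO keeps a supervised term on the chosen response, which is one plausible reading of why DPO lands below the chosen baseline here while ORPO lands above it.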
| Method | Description | Reference Model | Data Format |
|---|---|---|---|
| DPO | Direct preference optimization, no reward model | Implicit (LoRA base) | Pairwise (chosen/rejected) |
| ORPO | Combined SFT + preference in one training pass | None needed | Pairwise (chosen/rejected) |
| KTO | Binary feedback based on prospect theory | Implicit (LoRA base) | Binary (good/bad) |
| RLHF | Classic PPO with trained reward model | Separate model | Pairwise + reward model |
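The data-format difference matters in practice: because KTO only needs a per-example good/bad label, a pairwise set like orca_dpo_pairs can be flattened into twice as many binary examples. A sketch (the field names follow TRL's prompt/completion/label convention for KTO; the helper name is mine):

```python
def pairwise_to_binary(rows):
    """Flatten chosen/rejected pairs into binary (good/bad) examples,
    the format KTO consumes."""
    out = []
    for row in rows:
        out.append({"prompt": row["prompt"], "completion": row["chosen"], "label": True})
        out.append({"prompt": row["prompt"], "completion": row["rejected"], "label": False})
    return out
```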
llm-preference-learning/
├── src/preference_learning/ # Shared library (config, data loaders, utils)
├── experiments/
│ ├── 01_dpo/ # DPO training script + configs
│ ├── 02_orpo/ # ORPO training script + configs
│ ├── 03_kto/ # KTO training script + configs
│ └── 04_rlhf/ # Reward model + PPO scripts + configs
├── scripts/
│ ├── download_datasets.py # Download datasets for offline use
│ ├── evaluate_models.py # Compare all models with reward model judge
│ ├── run_comparison.py # Run full comparison pipeline overnight
│ └── sanity_check.py # Pre-flight VRAM + timing validation
├── notebooks/ # Interactive walkthroughs
├── data/ # Cached datasets (not tracked)
└── models/ # Cached models (not tracked)
- Python 3.10+
- NVIDIA GPU with CUDA support (tested on RTX 4070 Laptop, 8GB VRAM)
- uv package manager
```bash
git clone https://github.com/<your-username>/llm-preference-learning.git
cd llm-preference-learning
uv sync
uv run python scripts/download_datasets.py
```

This downloads TinyLlama and the Orca DPO dataset for offline training.
```bash
# DPO
uv run python experiments/01_dpo/train_dpo.py

# ORPO (chat model)
uv run python experiments/02_orpo/train_orpo.py

# KTO
uv run python experiments/03_kto/train_kto.py

# RLHF (reward model, then PPO)
uv run python experiments/04_rlhf/train_reward_model.py
uv run python experiments/04_rlhf/train_ppo.py
```

Each script reads from its `config.yaml` by default. For full dataset runs, pass the full config:

```bash
uv run python experiments/01_dpo/train_dpo.py experiments/01_dpo/config_full.yaml
```

```bash
# Validate VRAM and estimate timing first
uv run python scripts/sanity_check.py

# Run all methods sequentially (overnight)
uv run python scripts/run_comparison.py

# Evaluate all trained models
uv run python scripts/evaluate_models.py
```

All methods share the same base configuration:
| Parameter | Value |
|---|---|
| Base model | TinyLlama-1.1B-Chat-v1.0 |
| Quantization | 4-bit NF4 (QLoRA) |
| LoRA rank | 8 |
| LoRA targets | q_proj, v_proj, k_proj, o_proj |
| Max length | 512 tokens |
| Learning rate | 5e-5 |
| Effective batch size | 16 |
| Epochs | 2 |
| Precision | BFloat16 |
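A `config.yaml` fragment mirroring this table might look as follows (the key names are illustrative; check each experiment's `config.yaml` for the actual schema):

```yaml
model_name: TinyLlama/TinyLlama-1.1B-Chat-v1.0
quantization:
  load_in_4bit: true
  bnb_4bit_quant_type: nf4        # QLoRA-style NF4 quantization
lora:
  r: 8
  target_modules: [q_proj, v_proj, k_proj, o_proj]
max_length: 512
learning_rate: 5.0e-5
per_device_train_batch_size: 2
gradient_accumulation_steps: 8    # effective batch size 16
num_train_epochs: 2
bf16: true
```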
Tested on RTX 4070 Laptop GPU (8GB VRAM). Key findings for memory-constrained setups:
- DPO/ORPO: `batch_size=2` with `gradient_accumulation=8` stays within 4GB VRAM
- KTO: uses roughly 2x the VRAM of DPO at the same batch size (4 forward passes per step for the KL estimate). Requires a `CudaCacheCallback` to clear the CUDA cache every step and keep memory fragmentation from growing
- PPO: loading 4 models (policy, reference, value, reward) exceeds 8GB even with 4-bit quantization
- Windows Task Manager's "GPU Memory" combines dedicated and shared memory -- use `nvidia-smi` for accurate VRAM readings
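The cache-clearing callback mentioned above is not shown in this README; here is a minimal standalone sketch of the idea. The real version would subclass `transformers.TrainerCallback`; the Trainer integration is stubbed here so the example stays dependency-free:

```python
import gc


class CudaCacheCallback:
    """Return cached CUDA allocator blocks to the pool after every
    optimizer step, so fragmented blocks don't accumulate across steps."""

    def on_step_end(self, args=None, state=None, control=None, **kwargs):
        gc.collect()  # drop dangling Python references first
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass  # CPU-only environment: nothing to clear
        return control
```

With a real `transformers.Trainer`, this would be registered via `callbacks=[CudaCacheCallback()]`.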
Core packages: transformers, trl, datasets, peft, accelerate, bitsandbytes, torch, wandb, pydantic, pyyaml
See pyproject.toml for the full list.
MIT