A hands-on comparison of LLM preference optimization methods, implemented with TinyLlama (1.1B) and LoRA fine-tuning on a single consumer GPU.
All methods trained on Intel/orca_dpo_pairs (12,859 preference pairs, GPT-4 vs GPT-3.5 responses) for 2 epochs. Evaluated using a trained reward model as judge on 200 held-out prompts.
| Method | Avg Reward | vs Chosen | vs Rejected |
|---|---|---|---|
| ORPO (chat) | 0.7200 | +0.1484 | +0.4108 |
| ORPO (base) | 0.6724 | +0.1008 | +0.3631 |
| KTO (chat) | 0.6617 | +0.0901 | +0.3525 |
| Chosen baseline | 0.5717 | -- | +0.2624 |
| DPO (chat) | 0.4824 | -0.0893 | +0.1731 |
| Rejected baseline | 0.3093 | -0.2624 | -- |
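The two delta columns are simple differences against the baseline averages. A minimal sketch of the aggregation (function name, defaults, and rounding are illustrative, not the repo's actual evaluation code):

```python
def summarize(rewards, chosen_avg=0.5717, rejected_avg=0.3093):
    """Collapse per-prompt judge rewards into the three table columns."""
    avg = sum(rewards) / len(rewards)
    return {
        "avg_reward": round(avg, 4),
        "vs_chosen": round(avg - chosen_avg, 4),      # positive = beats GPT-4 responses
        "vs_rejected": round(avg - rejected_avg, 4),  # positive = beats GPT-3.5 responses
    }
```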
- ORPO performed best overall. Its combined SFT + preference optimization approach was the most effective, and even the base model variant was competitive with fine-tuned chat models.
- KTO showed strong results despite using only binary feedback (good/bad) rather than pairwise comparisons, making it practical for real-world thumbs-up/down data.
- DPO scored below the chosen baseline, suggesting it may be too conservative or mode-seeking with these hyperparameters.
- PPO (RLHF) was not included in the final comparison: fitting all four models (policy, reference, value, reward) into 8GB VRAM required 4-bit quantization of each, which hit dtype compatibility issues with TRL's experimental PPO trainer.
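The conservatism gap between DPO and ORPO is easiest to see in the objectives themselves. A toy scalar sketch, using sequence-level average log-probs; the `beta` and `lam` defaults are illustrative, not the repo's settings:

```python
import math


def _sigmoid(x):
    return 1 / (1 + math.exp(-x))


def dpo_loss(pol_c, pol_r, ref_c, ref_r, beta=0.1):
    """-log sigmoid(beta * [(pol_c - ref_c) - (pol_r - ref_r)]): the policy
    is penalized for drifting from the reference unless the drift widens the
    chosen-vs-rejected margin, which keeps it anchored to the reference."""
    return -math.log(_sigmoid(beta * ((pol_c - ref_c) - (pol_r - ref_r))))


def orpo_loss(nll_c, logp_c, logp_r, lam=0.1):
    """SFT NLL on the chosen response plus a log-odds-ratio penalty on the
    rejected one; no reference model, which is why ORPO trains in one pass."""
    log_odds = lambda lp: lp - math.log(1 - math.exp(lp))  # log(p / (1 - p))
    return nll_c - lam * math.log(_sigmoid(log_odds(logp_c) - log_odds(logp_r)))
```

DPO's loss is pure margin against a frozen reference, while ORPO keeps a supervised term on the chosen response, which is one plausible reading of why DPO lands below the chosen baseline here while ORPO lands above it.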
| Method | Description | Reference Model | Data Format |
|---|---|---|---|
| DPO | Direct preference optimization, no reward model | Implicit (LoRA base) | Pairwise (chosen/rejected) |
| ORPO | Combined SFT + preference in one training pass | None needed | Pairwise (chosen/rejected) |
| KTO | Binary feedback based on prospect theory | Implicit (LoRA base) | Binary (good/bad) |
| RLHF | Classic PPO with trained reward model | Separate model | Pairwise + reward model |
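The data-format difference matters in practice: because KTO only needs a per-example good/bad label, a pairwise set like orca_dpo_pairs can be flattened into twice as many binary examples. A sketch (the field names follow TRL's prompt/completion/label convention for KTO; the helper name is mine):

```python
def pairwise_to_binary(rows):
    """Flatten chosen/rejected pairs into binary (good/bad) examples,
    the format KTO consumes."""
    out = []
    for row in rows:
        out.append({"prompt": row["prompt"], "completion": row["chosen"], "label": True})
        out.append({"prompt": row["prompt"], "completion": row["rejected"], "label": False})
    return out
```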
llm-preference-learning/
├── src/preference_learning/ # Shared library (config, data loaders, utils)
├── experiments/
│ ├── 01_dpo/ # DPO training script + configs
│ ├── 02_orpo/ # ORPO training script + configs
│ ├── 03_kto/ # KTO training script + configs
│ └── 04_rlhf/ # Reward model + PPO scripts + configs
├── scripts/
│ ├── download_datasets.py # Download datasets for offline use
│ ├── evaluate_models.py # Compare all models with reward model judge
│ ├── run_comparison.py # Run full comparison pipeline overnight
│ └── sanity_check.py # Pre-flight VRAM + timing validation
├── notebooks/ # Interactive walkthroughs
├── data/ # Cached datasets (not tracked)
└── models/ # Cached models (not tracked)
- Python 3.10+
- NVIDIA GPU with CUDA support (tested on RTX 4070 Laptop, 8GB VRAM)
- uv package manager
```bash
git clone https://github.com/<your-username>/llm-preference-learning.git
cd llm-preference-learning
uv sync
uv run python scripts/download_datasets.py
```

This downloads TinyLlama and the Orca DPO dataset for offline training.
```bash
# DPO
uv run python experiments/01_dpo/train_dpo.py

# ORPO (chat model)
uv run python experiments/02_orpo/train_orpo.py

# KTO
uv run python experiments/03_kto/train_kto.py

# RLHF (reward model, then PPO)
uv run python experiments/04_rlhf/train_reward_model.py
uv run python experiments/04_rlhf/train_ppo.py
```

Each script reads from its `config.yaml` by default. For full dataset runs, pass the full config:

```bash
uv run python experiments/01_dpo/train_dpo.py experiments/01_dpo/config_full.yaml
```

```bash
# Validate VRAM and estimate timing first
uv run python scripts/sanity_check.py

# Run all methods sequentially (overnight)
uv run python scripts/run_comparison.py

# Evaluate all trained models
uv run python scripts/evaluate_models.py
```

All methods share the same base configuration:
| Parameter | Value |
|---|---|
| Base model | TinyLlama-1.1B-Chat-v1.0 |
| Quantization | 4-bit NF4 (QLoRA) |
| LoRA rank | 8 |
| LoRA targets | q_proj, v_proj, k_proj, o_proj |
| Max length | 512 tokens |
| Learning rate | 5e-5 |
| Effective batch size | 16 |
| Epochs | 2 |
| Precision | BFloat16 |
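A `config.yaml` fragment mirroring this table might look as follows (the key names are illustrative; check each experiment's `config.yaml` for the actual schema):

```yaml
model_name: TinyLlama/TinyLlama-1.1B-Chat-v1.0
quantization:
  load_in_4bit: true
  bnb_4bit_quant_type: nf4        # QLoRA-style NF4 quantization
lora:
  r: 8
  target_modules: [q_proj, v_proj, k_proj, o_proj]
max_length: 512
learning_rate: 5.0e-5
per_device_train_batch_size: 2
gradient_accumulation_steps: 8    # effective batch size 16
num_train_epochs: 2
bf16: true
```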
Tested on RTX 4070 Laptop GPU (8GB VRAM). Key findings for memory-constrained setups:
- DPO/ORPO: `batch_size=2` with `gradient_accumulation=8` stays within 4GB VRAM
- KTO: uses roughly 2x the VRAM of DPO at the same batch size (4 forward passes per step for the KL estimate). Requires a `CudaCacheCallback` to clear the CUDA cache every step and keep memory fragmentation from growing
- PPO: loading 4 models (policy, reference, value, reward) exceeds 8GB even with 4-bit quantization
- Windows Task Manager's "GPU Memory" combines dedicated and shared memory -- use `nvidia-smi` for accurate VRAM readings
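The cache-clearing callback mentioned above is not shown in this README; here is a minimal standalone sketch of the idea. The real version would subclass `transformers.TrainerCallback`; the Trainer integration is stubbed here so the example stays dependency-free:

```python
import gc


class CudaCacheCallback:
    """Return cached CUDA allocator blocks to the pool after every
    optimizer step, so fragmented blocks don't accumulate across steps."""

    def on_step_end(self, args=None, state=None, control=None, **kwargs):
        gc.collect()  # drop dangling Python references first
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass  # CPU-only environment: nothing to clear
        return control
```

With a real `transformers.Trainer`, this would be registered via `callbacks=[CudaCacheCallback()]`.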
Core packages: transformers, trl, datasets, peft, accelerate, bitsandbytes, torch, wandb, pydantic, pyyaml
See pyproject.toml for the full list.
MIT