# LLM Preference Learning

A hands-on comparison of LLM preference optimization methods, implemented with TinyLlama (1.1B) and LoRA fine-tuning on a single consumer GPU.

## Results

All methods were trained on `Intel/orca_dpo_pairs` (12,859 preference pairs; GPT-4 vs. GPT-3.5 responses) for 2 epochs, then evaluated on 200 held-out prompts using a trained reward model as judge.

| Method | Avg Reward | vs Chosen | vs Rejected |
|---|---|---|---|
| ORPO (chat) | 0.7200 | +0.1484 | +0.4108 |
| ORPO (base) | 0.6724 | +0.1008 | +0.3631 |
| KTO (chat) | 0.6617 | +0.0901 | +0.3525 |
| Chosen baseline | 0.5717 | -- | +0.2624 |
| DPO (chat) | 0.4824 | -0.0893 | +0.1731 |
| Rejected baseline | 0.3093 | -0.2624 | -- |

### Key Findings

- **ORPO performed best overall.** Its combined SFT + preference-optimization objective was the most effective, and even the base-model variant was competitive with fine-tuned chat models.
- **KTO showed strong results** despite using only binary feedback (good/bad) rather than pairwise comparisons, making it practical for real-world thumbs-up/down data.
- **DPO scored below the chosen baseline**, suggesting it may be too conservative or mode-seeking with these hyperparameters.
- **PPO (RLHF) was not included in the final comparison** due to VRAM constraints: 4-bit quantization of all four models (policy, reference, value, reward) hit dtype compatibility issues with TRL's experimental PPO trainer.
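To make the odds-ratio idea behind ORPO concrete, here is a minimal pure-Python sketch of a per-example ORPO-style objective. The `lam` weight and the averaged log-probabilities are illustrative placeholders, not values taken from this repo:

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically direct log(sigmoid(x))."""
    return -math.log1p(math.exp(-x))

def orpo_loss(nll_chosen: float, avg_logp_chosen: float,
              avg_logp_rejected: float, lam: float = 0.1) -> float:
    """Sketch of ORPO: SFT loss on the chosen response plus an
    odds-ratio penalty. avg_logp_* are mean per-token log-probs
    (must be < 0 so the implied probability is < 1)."""
    def log_odds(avg_logp: float) -> float:
        p = math.exp(avg_logp)  # implied sequence-level probability
        return math.log(p / (1.0 - p))
    # Penalize the policy when the chosen response's odds are not
    # higher than the rejected response's odds.
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    return nll_chosen - lam * log_sigmoid(log_odds_ratio)
```

Note that no reference model appears anywhere in the computation, which is why ORPO can train in a single pass.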

## Methods Covered

| Method | Description | Reference Model | Data Format |
|---|---|---|---|
| DPO | Direct preference optimization, no reward model | Implicit (LoRA base) | Pairwise (chosen/rejected) |
| ORPO | Combined SFT + preference in one training pass | None needed | Pairwise (chosen/rejected) |
| KTO | Binary feedback based on prospect theory | Implicit (LoRA base) | Binary (good/bad) |
| RLHF | Classic PPO with trained reward model | Separate model | Pairwise + reward model |
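The "implicit reference model" in the DPO row can be seen directly in its loss. A minimal pure-Python sketch, assuming summed sequence log-probabilities as inputs (the `beta` value here is illustrative):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from summed sequence log-probabilities.
    The frozen reference model only enters through its log-probs,
    so no explicit reward model is needed."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin): small when the policy moves toward the
    # chosen response relative to the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With LoRA, "implicit (LoRA base)" means the reference log-probs can come from the same model with adapters disabled, avoiding a second full copy in memory.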

## Project Structure

```
llm-preference-learning/
├── src/preference_learning/     # Shared library (config, data loaders, utils)
├── experiments/
│   ├── 01_dpo/                  # DPO training script + configs
│   ├── 02_orpo/                 # ORPO training script + configs
│   ├── 03_kto/                  # KTO training script + configs
│   └── 04_rlhf/                 # Reward model + PPO scripts + configs
├── scripts/
│   ├── download_datasets.py     # Download datasets for offline use
│   ├── evaluate_models.py       # Compare all models with reward model judge
│   ├── run_comparison.py        # Run full comparison pipeline overnight
│   └── sanity_check.py          # Pre-flight VRAM + timing validation
├── notebooks/                   # Interactive walkthroughs
├── data/                        # Cached datasets (not tracked)
└── models/                      # Cached models (not tracked)
```

## Setup

### Prerequisites

- Python 3.10+
- NVIDIA GPU with CUDA support (tested on an RTX 4070 Laptop GPU, 8 GB VRAM)
- The `uv` package manager

### Installation

```bash
git clone https://github.com/<your-username>/llm-preference-learning.git
cd llm-preference-learning
uv sync
```

### Download Models and Data

```bash
uv run python scripts/download_datasets.py
```

This downloads TinyLlama and the Orca DPO dataset for offline training.

## Usage

### Train a Single Method

```bash
# DPO
uv run python experiments/01_dpo/train_dpo.py

# ORPO (chat model)
uv run python experiments/02_orpo/train_orpo.py

# KTO
uv run python experiments/03_kto/train_kto.py

# RLHF (reward model, then PPO)
uv run python experiments/04_rlhf/train_reward_model.py
uv run python experiments/04_rlhf/train_ppo.py
```

Each script reads its `config.yaml` by default. For full-dataset runs, pass the full config explicitly:

```bash
uv run python experiments/01_dpo/train_dpo.py experiments/01_dpo/config_full.yaml
```
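For orientation, such a config file might look like the sketch below. The key names and values are illustrative guesses based on the Training Configuration section, not the repo's actual schema:

```yaml
# Hypothetical config.yaml sketch -- the real schema may differ.
model_name: TinyLlama/TinyLlama-1.1B-Chat-v1.0
dataset_name: Intel/orca_dpo_pairs
max_length: 512
learning_rate: 5.0e-5
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
num_train_epochs: 2
```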

### Run Full Comparison

```bash
# Validate VRAM and estimate timing first
uv run python scripts/sanity_check.py

# Run all methods sequentially (overnight)
uv run python scripts/run_comparison.py

# Evaluate all trained models
uv run python scripts/evaluate_models.py
```

## Training Configuration

All methods share the same base configuration:

| Parameter | Value |
|---|---|
| Base model | TinyLlama-1.1B-Chat-v1.0 |
| Quantization | 4-bit NF4 (QLoRA) |
| LoRA rank | 8 |
| LoRA targets | `q_proj`, `v_proj`, `k_proj`, `o_proj` |
| Max length | 512 tokens |
| Learning rate | 5e-5 |
| Effective batch size | 16 |
| Epochs | 2 |
| Precision | BFloat16 |

## Hardware Notes

Tested on an RTX 4070 Laptop GPU (8 GB VRAM). Key findings for memory-constrained setups:

- **DPO/ORPO**: `batch_size=2` with `gradient_accumulation=8` stays within 4 GB of VRAM.
- **KTO**: uses roughly 2x the VRAM of DPO at the same batch size (four forward passes per step for KL estimation), and needs a `CudaCacheCallback` that clears the CUDA cache every step to keep memory fragmentation from growing.
- **PPO**: loading four models (policy, reference, value, reward) exceeds 8 GB even with 4-bit quantization.
- Windows Task Manager's "GPU Memory" combines dedicated and shared memory; use `nvidia-smi` for accurate VRAM readings.
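The repo's `CudaCacheCallback` is not shown here; a minimal version, assuming the standard `transformers` `TrainerCallback` hook API, might look like:

```python
import torch
from transformers import TrainerCallback

class CudaCacheCallback(TrainerCallback):
    """Release cached CUDA blocks after every optimizer step to keep
    allocator fragmentation from growing across a long training run."""

    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available():
            # Frees unused cached memory back to the driver; memory
            # held by live tensors is unaffected.
            torch.cuda.empty_cache()
        return control
```

It would be passed via the trainer's `callbacks` list, e.g. `KTOTrainer(..., callbacks=[CudaCacheCallback()])`. Frequent `empty_cache()` calls cost some throughput, so this trade-off only makes sense on tight-VRAM setups.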

## Dependencies

Core packages: `transformers`, `trl`, `datasets`, `peft`, `accelerate`, `bitsandbytes`, `torch`, `wandb`, `pydantic`, `pyyaml`.

See `pyproject.toml` for the full list.

## License

MIT

## About

End-to-end LLM preference learning pipeline: training, evaluation, and comparison of DPO, ORPO, KTO, and RLHF with 4-bit quantization, LoRA, and memory-efficient training on a single 8GB GPU.
