
Unsloth RLHF Experiments (DPO, ORPO, SimPO, KTO)

This repository contains practical, minimal scripts and notebooks to fine‑tune and evaluate large language models (LLMs) using several RLHF-style algorithms with the Unsloth stack: DPO, ORPO, SimPO, and KTO. It also includes quick evaluation utilities and a consolidated comparison of results.

Repository Structure

  • DPO/
    • Zephyr_(7B)-DPO.ipynb — Notebook for DPO training on Zephyr 7B
    • checkpoint-117/ — Example trained adapter (LoRA) checkpoint and tokenizer assets
    • quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
    • q_eval.md — Example evaluation run logs for DPO
  • ORPO/
    • Llama3_(8B)_ORPO.ipynb — Notebook for ORPO training on Llama 3 8B
    • lora_model/ — Example adapter weights saved after training
    • quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
    • q_eval.md — Example evaluation run logs for ORPO (summary at root)
  • SimPO/
    • SimPO_colab_notebook.ipynb — Notebook for SimPO training
    • lora_model/ — Example adapter weights saved after training
    • quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
    • q_eval.md — Example evaluation run logs for SimPO (summary at root)
  • KTO/
    • kto_colab_notebook.ipynb — Notebook for KTO training
    • lora_model/ — Example adapter weights saved after training
    • quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
    • q_eval.md — Example evaluation run logs for KTO (summary at root)
  • rlhf/
    • rl_dpo.py — DPO training script with Unsloth patches on UltraFeedback
    • unsloth/ — Vendored Unsloth project components used by notebooks/examples
  • dpo_rlhf.py — Minimal DPO training script (Gemma 3 4B IT, 4‑bit)
  • evaluation_summary.md — Comparison across algorithms from quick eval runs
  • requirements.txt — Pinned dependencies (Transformers, TRL, Unsloth, Torch, etc.)

Environment and Requirements

  • GPU: NVIDIA GPU recommended (VRAM >= 12 GB for the 4‑bit examples; adjust batch sizes if needed)
  • Python: A recent Python (3.10–3.12 is typical with the pinned packages)
  • CUDA: The pinned Torch build targets CUDA 12.x. Use a CUDA‑compatible driver/runtime.
  • Dependencies: Install from requirements.txt

Quick start (Windows PowerShell):

  1. Create a virtual environment (optional but recommended)
  • python -m venv .venv
  • .\.venv\Scripts\Activate.ps1
  2. Install dependencies
  • pip install -r requirements.txt
  3. (Optional) Set a specific GPU
  • $env:CUDA_VISIBLE_DEVICES = "0"
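
To verify that the pinned Torch build can actually see the GPU before running anything heavy, a quick sanity check (standard PyTorch calls only):

```python
# Sanity check: confirm Torch detects a CUDA device and report its VRAM.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Device:", props.name)
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
```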

Training Recipes

  1. Minimal DPO (Gemma 3 4B IT, 4‑bit)
  • Script: dpo_rlhf.py
  • Notes:
    • Uses Unsloth FastLanguageModel with 4‑bit loading and LoRA patching
    • Samples a small fraction of Dahoas/rm-static for quick prototyping
    • A condensed sketch of this training flow appears after this list
  • Run:
  • python dpo_rlhf.py
  2. DPO on UltraFeedback (Zephyr SFT 4‑bit)
  • Script: rlhf/rl_dpo.py
  • Notes:
    • Patches the DPO trainer via Unsloth (PatchDPOTrainer)
    • Loads trl-lib/ultrafeedback_binarized
    • Uses LoRA with gradient checkpointing
  • Run:
  • python rlhf/rl_dpo.py
  3. Algorithm Notebooks
  • DPO: DPO/Zephyr_(7B)-DPO.ipynb
  • ORPO: ORPO/Llama3_(8B)_ORPO.ipynb
  • SimPO: SimPO/SimPO_colab_notebook.ipynb
  • KTO: KTO/kto_colab_notebook.ipynb

Open the notebook you are interested in and run cells top‑to‑bottom. Most notebooks save a LoRA adapter into lora_model/ or outputs/.
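
For orientation, both DPO scripts follow roughly the flow sketched below. This is a condensed, illustrative sketch: the model name, LoRA hyperparameters, and training arguments are examples, and the exact values live in dpo_rlhf.py and rlhf/rl_dpo.py.

```python
# Condensed DPO flow (illustrative; see dpo_rlhf.py and rlhf/rl_dpo.py for
# the exact models, datasets, and hyperparameters used in this repo).
from unsloth import FastLanguageModel, PatchDPOTrainer
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

PatchDPOTrainer()  # apply Unsloth's memory/speed patches to TRL's DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/zephyr-sft-bnb-4bit",  # example 4-bit base model
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
)

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=100,
        output_dir="outputs",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,  # tokenizer=tokenizer on older TRL versions
)
trainer.train()
```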

Quick Evaluation (Base vs Trained)

Each algorithm folder includes a quick_evaluation.py that:

  • Loads a base model and a trained adapter/checkpoint
  • Generates responses to an identical set of prompts
  • Prints latency and responses, and summarizes at the end
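
In sketch form (paths and the prompt are placeholders; the per‑folder quick_evaluation.py files are the reference):

```python
# Base-vs-trained comparison pattern (placeholder paths; the real scripts
# live in each algorithm folder as quick_evaluation.py).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "HuggingFaceH4/mistral-7b-sft-beta"  # example base model
trained_model_path = "DPO/checkpoint-117"              # example LoRA checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, trained_model_path)

inputs = tokenizer("Explain LoRA in one paragraph.",
                   return_tensors="pt").to(model.device)

def timed_generate():
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128)
    return time.perf_counter() - start, tokenizer.decode(out[0], skip_special_tokens=True)

with model.disable_adapter():  # adapter off -> base model behaviour
    base_latency, base_text = timed_generate()
trained_latency, trained_text = timed_generate()  # adapter on -> trained model

print(f"base: {base_latency:.2f}s | trained: {trained_latency:.2f}s")
```

Toggling the adapter in place avoids loading the base weights twice; the repo scripts may instead load two separate models, but the comparison is the same.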

Usage:

  • Update trained_model_path to point to your adapter (e.g., for DPO on Windows: f:/unsloth-rlhf/DPO/checkpoint-117)
  • Optionally update base_model_path to match the base model you trained from
  • Run the script for the matching algorithm folder, e.g.:
  • python DPO/quick_evaluation.py

Evaluation logs are captured in per‑algorithm q_eval.md files and summarized in evaluation_summary.md.

Results Snapshot (from evaluation_summary.md)

  • DPO: Base 14.48s vs Trained 14.41s (≈ 0.5% faster), slightly more focused responses
  • ORPO: Base 11.88s vs Trained 16.73s (≈ 40.8% slower), mixed quality; sometimes blank/irrelevant
  • SimPO: Base 12.47s vs Trained 14.10s (≈ 13.1% slower), mixed results
  • KTO: Base 10.80s vs Trained 12.97s (≈ 20.1% slower), mixed results

See evaluation_summary.md for details and a visual comparison.

Tips & Troubleshooting

  • Out of Memory (OOM):
    • Reduce per_device_train_batch_size
    • Increase gradient_accumulation_steps
    • Ensure load_in_4bit=True for large models (a config sketch illustrating these knobs follows this list)
  • GPU selection:
    • PowerShell: $env:CUDA_VISIBLE_DEVICES = "0"
  • Speed features:
    • Flash Attention is disabled in quick evals by default (use_flash_attention_2=False); enable only if your environment supports it.
  • Model paths:
    • Replace any hardcoded Linux paths (e.g., /home/ubuntu/...) with your local OS paths.
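
As a concrete example of the OOM knobs above, a minimal sketch using TRL's DPOConfig (the same fields exist on TrainingArguments for the other trainers); the values are illustrative, not repo defaults:

```python
# Illustrative memory-friendly settings (values are examples, not repo defaults).
from trl import DPOConfig

args = DPOConfig(
    per_device_train_batch_size=1,  # smaller per-step memory footprint
    gradient_accumulation_steps=8,  # keeps the effective batch size at 8
    gradient_checkpointing=True,    # trade compute for activation memory
    bf16=True,                      # or fp16=True on GPUs without bfloat16
    output_dir="outputs",
)
```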

Acknowledgements

  • Built on top of the Unsloth ecosystem (FastLanguageModel, trainer patches, and utilities)
  • Uses Hugging Face Transformers, TRL, Datasets, and related tooling
