
Unsloth RLHF Experiments (DPO, ORPO, SimPO, KTO)

This repository contains practical, minimal scripts and notebooks to fine‑tune and evaluate large language models (LLMs) using several RLHF-style algorithms with the Unsloth stack: DPO, ORPO, SimPO, and KTO. It also includes quick evaluation utilities and a consolidated comparison of results.

Repository Structure

  • DPO/
    • Zephyr_(7B)-DPO.ipynb — Notebook for DPO training on Zephyr 7B
    • checkpoint-117/ — Example trained adapter (LoRA) checkpoint and tokenizer assets
    • quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
    • q_eval.md — Example evaluation run logs for DPO
  • ORPO/
    • Llama3_(8B)_ORPO.ipynb — Notebook for ORPO training on Llama 3 8B
    • lora_model/ — Example adapter weights saved after training
    • quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
    • q_eval.md — Example evaluation run logs for ORPO (summary at root)
  • SimPO/
    • SimPO_colab_notebook.ipynb — Notebook for SimPO training
    • lora_model/ — Example adapter weights saved after training
    • quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
    • q_eval.md — Example evaluation run logs for SimPO (summary at root)
  • KTO/
    • kto_colab_notebook.ipynb — Notebook for KTO training
    • lora_model/ — Example adapter weights saved after training
    • quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
    • q_eval.md — Example evaluation run logs for KTO (summary at root)
  • rlhf/
    • rl_dpo.py — DPO training script with Unsloth patches on UltraFeedback
    • unsloth/ — Vendored Unsloth project components used by notebooks/examples
  • dpo_rlhf.py — Minimal DPO training script (Gemma 3 4B IT, 4‑bit)
  • evaluation_summary.md — Comparison across algorithms from quick eval runs
  • requirements.txt — Pinned dependencies (Transformers, TRL, Unsloth, Torch, etc.)

Environment and Requirements

  • GPU: NVIDIA GPU recommended (VRAM >= 12 GB for the 4‑bit examples; adjust batch sizes if needed)
  • Python: A recent Python (3.10–3.12 is typical with the pinned packages)
  • CUDA: The pinned Torch build targets CUDA 12.x. Use a CUDA‑compatible driver/runtime.
  • Dependencies: Install from requirements.txt

Quick start (Windows PowerShell):

  1. Create a virtual environment (optional but recommended)
  • python -m venv .venv
  • .\.venv\Scripts\Activate.ps1
  2. Install dependencies
  • pip install -r requirements.txt
  3. (Optional) Set a specific GPU
  • $env:CUDA_VISIBLE_DEVICES = "0"
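
To verify that the pinned Torch build can actually see the GPU before running anything heavy, a quick sanity check (standard PyTorch calls only):

```python
# Sanity check: confirm Torch detects a CUDA device and report its VRAM.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Device:", props.name)
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
```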

Training Recipes

  1. Minimal DPO (Gemma 3 4B IT, 4‑bit)
  • Script: dpo_rlhf.py
  • Notes:
    • Uses Unsloth FastLanguageModel with 4‑bit loading and LoRA patching
    • Samples a small fraction of Dahoas/rm-static for quick prototyping
    • A condensed sketch of this training flow appears after this list
  • Run:
  • python dpo_rlhf.py
  2. DPO on UltraFeedback (Zephyr SFT 4‑bit)
  • Script: rlhf/rl_dpo.py
  • Notes:
    • Patches the DPO trainer via Unsloth (PatchDPOTrainer)
    • Loads trl-lib/ultrafeedback_binarized
    • Uses LoRA with gradient checkpointing
  • Run:
  • python rlhf/rl_dpo.py
  3. Algorithm Notebooks
  • DPO: DPO/Zephyr_(7B)-DPO.ipynb
  • ORPO: ORPO/Llama3_(8B)_ORPO.ipynb
  • SimPO: SimPO/SimPO_colab_notebook.ipynb
  • KTO: KTO/kto_colab_notebook.ipynb

Open the notebook you are interested in and run cells top‑to‑bottom. Most notebooks save a LoRA adapter into lora_model/ or outputs/.
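
For orientation, both DPO scripts follow roughly the flow sketched below. This is a condensed, illustrative sketch: the model name, LoRA hyperparameters, and training arguments are examples, and the exact values live in dpo_rlhf.py and rlhf/rl_dpo.py.

```python
# Condensed DPO flow (illustrative; see dpo_rlhf.py and rlhf/rl_dpo.py for
# the exact models, datasets, and hyperparameters used in this repo).
from unsloth import FastLanguageModel, PatchDPOTrainer
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

PatchDPOTrainer()  # apply Unsloth's memory/speed patches to TRL's DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/zephyr-sft-bnb-4bit",  # example 4-bit base model
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
)

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=100,
        output_dir="outputs",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,  # tokenizer=tokenizer on older TRL versions
)
trainer.train()
```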

Quick Evaluation (Base vs Trained)

Each algorithm folder includes a quick_evaluation.py that:

  • Loads a base model and a trained adapter/checkpoint
  • Generates responses to an identical set of prompts
  • Prints latency and responses, and summarizes at the end
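
In sketch form (paths and the prompt are placeholders; the per‑folder quick_evaluation.py files are the reference):

```python
# Base-vs-trained comparison pattern (placeholder paths; the real scripts
# live in each algorithm folder as quick_evaluation.py).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "HuggingFaceH4/mistral-7b-sft-beta"  # example base model
trained_model_path = "DPO/checkpoint-117"              # example LoRA checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, trained_model_path)

inputs = tokenizer("Explain LoRA in one paragraph.",
                   return_tensors="pt").to(model.device)

def timed_generate():
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128)
    return time.perf_counter() - start, tokenizer.decode(out[0], skip_special_tokens=True)

with model.disable_adapter():  # adapter off -> base model behaviour
    base_latency, base_text = timed_generate()
trained_latency, trained_text = timed_generate()  # adapter on -> trained model

print(f"base: {base_latency:.2f}s | trained: {trained_latency:.2f}s")
```

Toggling the adapter in place avoids loading the base weights twice; the repo scripts may instead load two separate models, but the comparison is the same.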

Usage:

  • Update trained_model_path to point to your adapter (e.g., for DPO on Windows: f:/unsloth-rlhf/DPO/checkpoint-117)
  • Optionally update base_model_path to match the base model you trained from
  • Run the script for the matching algorithm folder, e.g.:
  • python DPO/quick_evaluation.py

Evaluation logs are captured in per‑algorithm q_eval.md files and summarized in evaluation_summary.md.

Results Snapshot (from evaluation_summary.md)

  • DPO: Base 14.48s vs Trained 14.41s (≈ 0.5% faster), slightly more focused responses
  • ORPO: Base 11.88s vs Trained 16.73s (≈ 40.8% slower), mixed quality; sometimes blank/irrelevant
  • SimPO: Base 12.47s vs Trained 14.10s (≈ 13.1% slower), mixed results
  • KTO: Base 10.80s vs Trained 12.97s (≈ 20.1% slower), mixed results

See evaluation_summary.md for details and a visual comparison.

Tips & Troubleshooting

  • Out of Memory (OOM):
    • Reduce per_device_train_batch_size
    • Increase gradient_accumulation_steps
    • Ensure load_in_4bit=True for large models (a config sketch illustrating these knobs follows this list)
  • GPU selection:
    • PowerShell: $env:CUDA_VISIBLE_DEVICES = "0"
  • Speed features:
    • Flash Attention is disabled in quick evals by default (use_flash_attention_2=False); enable only if your environment supports it.
  • Model paths:
    • Replace any hardcoded Linux paths (e.g., /home/ubuntu/...) with your local OS paths.
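
As a concrete example of the OOM knobs above, a minimal sketch using TRL's DPOConfig (the same fields exist on TrainingArguments for the other trainers); the values are illustrative, not repo defaults:

```python
# Illustrative memory-friendly settings (values are examples, not repo defaults).
from trl import DPOConfig

args = DPOConfig(
    per_device_train_batch_size=1,  # smaller per-step memory footprint
    gradient_accumulation_steps=8,  # keeps the effective batch size at 8
    gradient_checkpointing=True,    # trade compute for activation memory
    bf16=True,                      # or fp16=True on GPUs without bfloat16
    output_dir="outputs",
)
```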

Acknowledgements

  • Built on top of the Unsloth ecosystem (FastLanguageModel, trainer patches, and utilities)
  • Uses Hugging Face Transformers, TRL, Datasets, and related tooling
