This repository contains practical, minimal scripts and notebooks to fine‑tune and evaluate large language models (LLMs) using several RLHF-style algorithms with the Unsloth stack: DPO, ORPO, SimPO, and KTO. It also includes quick evaluation utilities and a consolidated comparison of results.
Repository structure:
- DPO/
  - Zephyr_(7B)-DPO.ipynb — Notebook for DPO training on Zephyr 7B
  - checkpoint-117/ — Example trained adapter (LoRA) checkpoint and tokenizer assets
  - quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
  - q_eval.md — Example evaluation run logs for DPO
- ORPO/
  - Llama3_(8B)_ORPO.ipynb — Notebook for ORPO training on Llama 3 8B
  - lora_model/ — Example adapter weights saved after training
  - quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
  - q_eval.md — Example evaluation run logs for ORPO (summary at root)
- SimPO/
  - SimPO_colab_notebook.ipynb — Notebook for SimPO training
  - lora_model/ — Example adapter weights saved after training
  - quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
  - q_eval.md — Example evaluation run logs for SimPO (summary at root)
- KTO/
  - kto_colab_notebook.ipynb — Notebook for KTO training
  - lora_model/ — Example adapter weights saved after training
  - quick_evaluation.py — Compare base vs trained checkpoint on curated prompts
  - q_eval.md — Example evaluation run logs for KTO (summary at root)
- rlhf/
  - rl_dpo.py — DPO training script with Unsloth patches on UltraFeedback
- unsloth/ — Vendored Unsloth project components used by notebooks/examples
- dpo_rlhf.py — Minimal DPO training script (Gemma 3 4B IT, 4‑bit)
- evaluation_summary.md — Comparison across algorithms from quick eval runs
- requirements.txt — Pinned dependencies (Transformers, TRL, Unsloth, Torch, etc.)
Prerequisites:
- GPU: NVIDIA GPU recommended (VRAM ≥ 12 GB for the 4‑bit examples; adjust batch sizes if needed)
- Python: A recent Python (3.10–3.12 is typical with the pinned packages)
- CUDA: The pinned Torch build targets CUDA 12.x. Use a CUDA‑compatible driver/runtime.
- Dependencies: Install from requirements.txt
Quick start (Windows PowerShell):
- Create a virtual environment (optional but recommended)
  - python -m venv .venv
  - .\.venv\Scripts\Activate.ps1
- Install dependencies
  - pip install -r requirements.txt
- (Optional) Set a specific GPU
  - $env:CUDA_VISIBLE_DEVICES = "0"
Training scripts:
- Minimal DPO (Gemma 3 4B IT, 4‑bit)
  - Script: dpo_rlhf.py
  - Notes:
    - Uses Unsloth FastLanguageModel with 4‑bit loading and LoRA patching
    - Samples a small fraction of Dahoas/rm-static for quick prototyping
  - Run:
    - python dpo_rlhf.py
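For reference, a minimal sketch of the pattern this script follows. The model id and hyperparameters below are illustrative assumptions; dpo_rlhf.py itself is authoritative.

```python
# Minimal DPO sketch. Model id, sample fraction, and hyperparameters
# are assumptions; check dpo_rlhf.py for the real values.
from unsloth import FastLanguageModel  # import unsloth before transformers/trl

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Load the base model in 4-bit and attach LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",  # assumed id
    max_seq_length=1024,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Sample a small fraction of the preference data for quick prototyping.
dataset = load_dataset("Dahoas/rm-static", split="train[:1%]")
dataset = dataset.select_columns(["prompt", "chosen", "rejected"])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=5e-6,
        beta=0.1,  # DPO temperature
    ),
    train_dataset=dataset,
    processing_class=tokenizer,  # tokenizer=tokenizer on older TRL versions
)
trainer.train()
```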
- DPO on UltraFeedback (Zephyr SFT 4‑bit)
  - Script: rlhf/rl_dpo.py
  - Notes:
    - Patches DPO trainer via Unsloth (PatchDPOTrainer)
    - Loads trl-lib/ultrafeedback_binarized
    - Uses LoRA with gradient checkpointing
  - Run:
    - python rlhf/rl_dpo.py
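A sketch of the distinguishing parts of this variant. The Zephyr SFT model id and hyperparameters are placeholders; see rlhf/rl_dpo.py for the real values.

```python
# UltraFeedback variant sketch; model id and settings are assumptions.
from unsloth import FastLanguageModel, PatchDPOTrainer

PatchDPOTrainer()  # must run before the trainer is constructed

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/zephyr-sft-bnb-4bit",  # assumed Zephyr SFT 4-bit id
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    use_gradient_checkpointing="unsloth",  # gradient checkpointing, as noted above
)

train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

DPOTrainer(
    model=model,
    args=DPOConfig(
        output_dir="outputs",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        max_steps=100,
    ),
    train_dataset=train_dataset,
    processing_class=tokenizer,  # tokenizer=tokenizer on older TRL versions
).train()
```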
Algorithm notebooks:
- DPO: DPO/Zephyr_(7B)-DPO.ipynb
- ORPO: ORPO/Llama3_(8B)_ORPO.ipynb
- SimPO: SimPO/SimPO_colab_notebook.ipynb
- KTO: KTO/kto_colab_notebook.ipynb
Open the notebook you are interested in and run cells top‑to‑bottom. Most notebooks save a LoRA adapter into lora_model/ or outputs/.
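For reference, saving and later reloading an adapter with Unsloth typically looks like the sketch below (directory names follow the layout above).

```python
# Saving an adapter at the end of a notebook (Unsloth/PEFT convention):
model.save_pretrained("lora_model")      # LoRA adapter weights only
tokenizer.save_pretrained("lora_model")  # tokenizer assets alongside

# Reloading later: Unsloth resolves a saved adapter directory directly
# and pulls in the base model recorded in its adapter config.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to faster inference mode
```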
Each algorithm folder includes a quick_evaluation.py that:
- Loads a base model and a trained adapter/checkpoint
- Generates responses to an identical set of prompts
- Prints latency and responses, and summarizes at the end
Usage:
- Update trained_model_path to point to your adapter (e.g., for DPO on Windows: f:/unsloth-rlhf/DPO/checkpoint-117)
- Optionally update base_model_path to match the base model you trained from
- Run the script for the algorithm you trained, e.g.:
  - python DPO/quick_evaluation.py
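The per-algorithm scripts differ in details; the sketch below shows the general shape of the comparison loop, with placeholder paths and prompts.

```python
# Representative sketch of the comparison loop in quick_evaluation.py.
# Paths and prompts are placeholders; edit them as described above.
import time

from unsloth import FastLanguageModel


def run_eval(model_path, prompts, max_new_tokens=128):
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_path, max_seq_length=2048, load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)
    results = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        start = time.perf_counter()
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        elapsed = time.perf_counter() - start
        results.append((elapsed, tokenizer.decode(output[0], skip_special_tokens=True)))
    return results


prompts = ["Explain LoRA fine-tuning in one paragraph."]  # placeholder prompt set
# Loading both models back-to-back needs enough VRAM; free the first if constrained.
base = run_eval("unsloth/zephyr-sft-bnb-4bit", prompts)            # base_model_path
trained = run_eval("f:/unsloth-rlhf/DPO/checkpoint-117", prompts)  # trained_model_path
for (t_base, _), (t_trained, _) in zip(base, trained):
    print(f"base {t_base:.2f}s vs trained {t_trained:.2f}s")
```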
Evaluation logs are captured in per‑algorithm q_eval.md files and summarized in evaluation_summary.md.
- DPO: Base 14.48s vs Trained 14.41s (≈0.5% faster), slightly more focused responses
- ORPO: Base 11.88s vs Trained 16.73s (≈ 40.8% slower), mixed quality; sometimes blank/irrelevant
- SimPO: Base 12.47s vs Trained 14.10s (≈ 13% slower), mixed results
- KTO: Base 10.80s vs Trained 12.97s (≈ 20.1% slower), mixed results
See evaluation_summary.md for details and a visual comparison.
Troubleshooting:
- Out of Memory (OOM):
  - Reduce per_device_train_batch_size
  - Increase gradient_accumulation_steps
  - Ensure load_in_4bit=True for large models
  - See the config sketch after this list
- GPU selection:
  - PowerShell: $env:CUDA_VISIBLE_DEVICES = "0"
- Speed features:
  - Flash Attention is disabled in quick evals by default (use_flash_attention_2=False); enable it only if your environment supports it.
- Model paths:
  - Replace any hardcoded Linux paths (e.g., /home/ubuntu/...) with your local OS paths.
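As referenced under Out of Memory above, a sketch of where those knobs live in a TRL config (values are illustrative, not tuned):

```python
# Illustrative memory-saving settings (example values only):
from trl import DPOConfig

args = DPOConfig(
    output_dir="outputs",
    per_device_train_batch_size=1,   # smaller per-step memory footprint
    gradient_accumulation_steps=16,  # preserves the effective batch size
    gradient_checkpointing=True,     # trades compute for activation memory
)
# load_in_4bit=True belongs in FastLanguageModel.from_pretrained(...), not here.
```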
Acknowledgments:
- Built on top of the Unsloth ecosystem (FastLanguageModel, trainer patches, and utilities)
- Uses Hugging Face Transformers, TRL, Datasets, and related tooling