Now, we support the hybrid model in our Olmo-core code. #1713

Open
finbarrtimbers wants to merge 37 commits into main from finbarr/oc-hybrid-dpo

@finbarrtimbers (Collaborator) commented Jun 2, 2026

Summary

Adds Olmo-Hybrid (Gated Delta Net, GDN) support to the OLMo-core DPO trainer (dpo.py) and substantially improves its model FLOPs utilization (MFU):

  • Bump olmo-core to a commit with the olmo3_hybrid_7B config preset and HF→olmo-core hybrid weight conversion (convert_hybrid_state_from_hf).
  • Pack DPO microbatches to the max_seq_length token budget instead of capping them at per_device_train_batch_size sequences (the cap was the root cause of the low, ~7% MFU).
  • Yield rectangular batches of stacked packed rows (stack_packed_rows/unstack_packed_rows) so that OLMo-core's dict batch contract, pre_train batch-size validation, and token/FLOPs accounting all work natively. Gradient accumulation now equals the number of packed rows per rank per step, and rank_microbatch_size is 2 × max_seq_length tokens per packed row.
  • Move LR/step/epoch metric recording into DPOMetricsCallback (standard OLMo-core callback pattern) with ReduceType.sum numerator/denominator reduction.
  • Add a selected_modules activation checkpointing mode so torch.compile and AC coexist with GDN.
  • Make ModelDims FLOPs/memory GDN-aware for correct MFU reporting.
  • Add an OLMo-core hybrid DPO sweep script.
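
To illustrate why the sequence cap hurts utilization, here is a minimal sketch contrasting cap-based packing with token-budget packing. It is not the code in dpo.py: the function names (`pack_with_sequence_cap`, `pack_by_token_budget`, `padding_fraction`) and the greedy first-fit strategy are assumptions made for illustration, and each example is treated as a single length rather than a chosen/rejected pair.

```python
from typing import List


def pack_with_sequence_cap(lengths: List[int], max_seq_length: int, max_sequences: int) -> List[List[int]]:
    """Close a row once it holds max_sequences sequences or the next one would overflow."""
    rows: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, n in enumerate(lengths):
        if current and (len(current) >= max_sequences or used + n > max_seq_length):
            rows.append(current)
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        rows.append(current)
    return rows


def pack_by_token_budget(lengths: List[int], max_seq_length: int) -> List[List[int]]:
    """Greedy first-fit: place each sequence in the first row with enough free tokens."""
    rows: List[List[int]] = []
    free: List[int] = []  # free[r] = unused tokens left in rows[r]
    for idx, n in enumerate(lengths):
        if n > max_seq_length:
            raise ValueError(f"sequence {idx} has {n} tokens, budget is {max_seq_length}")
        for r, space in enumerate(free):
            if n <= space:
                rows[r].append(idx)
                free[r] -= n
                break
        else:  # no existing row had room, so open a new one
            rows.append([idx])
            free.append(max_seq_length - n)
    return rows


def padding_fraction(lengths: List[int], rows: List[List[int]], max_seq_length: int) -> float:
    """Share of token slots that hold padding when every row is padded to max_seq_length."""
    real_tokens = sum(lengths[i] for row in rows for i in row)
    return 1.0 - real_tokens / (len(rows) * max_seq_length)


lengths = [1000] * 16   # sixteen short examples
budget = 16384          # token budget per packed row

capped = pack_with_sequence_cap(lengths, budget, max_sequences=2)
packed = pack_by_token_budget(lengths, budget)

print(len(capped), f"{padding_fraction(lengths, capped, budget):.3f}")  # 8 0.878
print(len(packed), f"{padding_fraction(lengths, packed, budget):.3f}")  # 1 0.023
```

With a cap of two sequences per row, most of each 16k row is padding, so the forward and backward passes spend most of their FLOPs on tokens that do not contribute to the loss. Filling rows up to the token budget removes that waste.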

MFU on the multi-node debug config (OLMo-2-7B, 16k seq, packing, TP=2) rises from 20.8% with the previous cap-based packing to 30.5% at an identical config (step time 2.78 s → 1.87 s).
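
As a quick sanity check on these numbers: if the FLOPs counted per step are the same in both runs (an assumption, since the PR only states that the config is identical), MFU should scale inversely with step time. The sketch below compares the two ratios using only the figures quoted above.

```python
# Figures quoted in the PR description.
old_mfu, new_mfu = 20.8, 30.5          # percent
old_step_s, new_step_s = 2.78, 1.87    # seconds per step

mfu_gain = new_mfu / old_mfu
step_speedup = old_step_s / new_step_s

print(f"MFU gain:     {mfu_gain:.3f}x")      # 1.466x
print(f"Step speedup: {step_speedup:.3f}x")  # 1.487x

# Relative disagreement between the two ratios.
gap = abs(step_speedup - mfu_gain) / step_speedup
print(f"Gap: {gap:.1%}")
```

The two ratios agree to within about 2%, which is consistent with the reported MFU being an average over steps rather than a single-step value.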

Runs:

  1. Multi-node packed DPO (2×8 GPU, OLMo-2-7B, 16k, rectangular batches; mfu_avg 30.5%, padding_fraction ~0.16): Beaker
  2. Single-GPU non-packed DPO (OLMo-2-1B, 3 epochs, epoch/LR metrics via callback verified): Beaker
  3. Hybrid 7B DPO with token-budget packing (list-based predecessor, mfu_avg 30.4%): wandb run 45sigjmq
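
Metrics such as padding_fraction are ratios, which is why the summary mentions reducing a numerator and a denominator with ReduceType.sum rather than averaging per-rank values. The sketch below simulates that reduction in plain Python without calling OLMo-core; the helper names and the per-rank numbers are made up for illustration.

```python
from typing import List, Tuple

# Each rank reports (numerator, denominator), e.g. (padding tokens, total token slots).
PerRank = List[Tuple[float, float]]


def ratio_of_sums(per_rank: PerRank) -> float:
    """Sum numerators and denominators across ranks, then divide once."""
    numerator = sum(n for n, _ in per_rank)
    denominator = sum(d for _, d in per_rank)
    return numerator / denominator


def mean_of_ratios(per_rank: PerRank) -> float:
    """Divide on each rank first, then average the per-rank ratios."""
    return sum(n / d for n, d in per_rank) / len(per_rank)


# Hypothetical values: rank 1 contributes three times as many slots as rank 0.
per_rank = [(100.0, 1000.0), (900.0, 3000.0)]

print(ratio_of_sums(per_rank))              # 0.25
print(f"{mean_of_ratios(per_rank):.2f}")    # 0.20
```

The two results differ whenever ranks contribute unequal denominators. Summing both parts before dividing gives the global ratio, which is the quantity one usually wants to log.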

GPU_TESTS=01KTCG94JXFMJQES1DERQR1JRM

🤖 Generated with Claude Code

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces support for hybrid models featuring linear attention layers (such as Gated Delta Net) within the model dimension and FLOPs calculation utilities, along with corresponding unit tests. It also updates the DPO training sweep scripts to use public SFT models and adds a new sweep script utilizing OLMo-core.

Feedback on these changes highlights two issues:

  1. The removal of the SFT_LR variable in 7b_instruct_dpo_sweep.sh leaves a broken reference in the experiment description.
  2. Direct attribute access on the configuration object in utils.py should be replaced with getattr to prevent potential AttributeErrors when optional attributes are missing.
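
The getattr suggestion can be sketched as follows. This is not the code in utils.py: `count_gdn_layers` and the attribute name `num_gdn_layers` are hypothetical, chosen only to show how a default avoids an AttributeError on configs that lack an optional field.

```python
from types import SimpleNamespace


def count_gdn_layers(config) -> int:
    """Return the number of GDN layers, or 0 for configs that do not define the field."""
    # `num_gdn_layers` is a hypothetical attribute name used only for illustration.
    return getattr(config, "num_gdn_layers", 0)


# A dense config without the optional field, and a hybrid config with it.
dense_config = SimpleNamespace(num_hidden_layers=32)
hybrid_config = SimpleNamespace(num_hidden_layers=32, num_gdn_layers=24)

print(count_gdn_layers(dense_config))   # 0
print(count_gdn_layers(hybrid_config))  # 24

# Direct access, by contrast, fails on the dense config.
try:
    dense_config.num_gdn_layers
except AttributeError as err:
    print(f"direct access failed: {err}")
```

A default of 0 keeps FLOPs accounting for purely dense models unchanged, while hybrid configs that define the field are handled by the same code path.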

Comment thread scripts/train/olmo-hybrid/7b_instruct_dpo_sweep.sh Outdated
Comment thread open_instruct/utils.py
@finbarrtimbers changed the title from "Finbarr/oc hybrid dpo" to "Now, we support the hybrid model in our Olmo-core code." Jun 2, 2026
… tests Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…pport Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…matching ZeRO-3 reference Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… GDN at 16k seq Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eckpoint of GDN op fails recompute metadata check Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…GDN (checkpoint only compile-safe MLPs, leave opaque GDN mixer activations live)
…selected_modules AC Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rminism check Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e checkpoint (avoids full-mode inductor stride guard failure) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Triton>=3.4 Hopper kernel (fla #640) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ain/* and perf/* keys, add learning_rate/epoch/training_step) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions (bot, Contributor) commented Jun 3, 2026

Documentation Changes Detected

📄 sitemap.xml
--- site-base/sitemap.xml	2026-06-03 14:15:50.355697873 +0000
+++ site-pr/sitemap.xml	2026-06-03 14:15:43.530894820 +0000
@@ -13,6 +13,10 @@
          <lastmod>2026-06-03</lastmod>
     </url>
     <url>
+         <loc>https://github.com/allenai/open-instruct/dpo-mfu-optimization/</loc>
+         <lastmod>2026-06-03</lastmod>
+    </url>
+    <url>
📄 sitemap.xml.gz
Binary files site-base/sitemap.xml.gz and site-pr/sitemap.xml.gz differ

Showing first 10 lines of diff for each changed file (up to 5 files, excluding search indices).

…_length) and wire HSDP knobs to cut padding-FLOP waste Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…capping at per_device_batch×GAS sequences (fixes padding-FLOP MFU waste); revert bucketing approach Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ulation (microbatches_per_step); add train/padding_fraction and train/sequences_per_step metrics Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sample_cap doesn't load the dataset Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions (bot, Contributor) commented Jun 4, 2026

Documentation Changes Detected

📄 sitemap.xml
--- site-base/sitemap.xml	2026-06-04 16:16:44.513516458 +0000
+++ site-pr/sitemap.xml	2026-06-04 16:16:38.739226163 +0000
@@ -13,6 +13,10 @@
          <lastmod>2026-06-04</lastmod>
     </url>
     <url>
+         <loc>https://github.com/allenai/open-instruct/dpo-mfu-optimization/</loc>
+         <lastmod>2026-06-04</lastmod>
+    </url>
+    <url>
📄 sitemap.xml.gz
Binary files site-base/sitemap.xml.gz and site-pr/sitemap.xml.gz differ

Showing first 10 lines of diff for each changed file (up to 5 files, excluding search indices).

@github-actions (bot, Contributor) commented Jun 4, 2026

Documentation Changes Detected

📄 sitemap.xml
--- site-base/sitemap.xml	2026-06-04 17:22:44.705540248 +0000
+++ site-pr/sitemap.xml	2026-06-04 17:22:39.036262278 +0000
@@ -13,6 +13,10 @@
          <lastmod>2026-06-04</lastmod>
     </url>
     <url>
+         <loc>https://github.com/allenai/open-instruct/dpo-mfu-optimization/</loc>
+         <lastmod>2026-06-04</lastmod>
+    </url>
+    <url>
📄 sitemap.xml.gz
Binary files site-base/sitemap.xml.gz and site-pr/sitemap.xml.gz differ

Showing first 10 lines of diff for each changed file (up to 5 files, excluding search indices).

…mpute MFU (metric refactor moved it into the deferred callback, breaking get_metric) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ipt and extend CHANGELOG entry to cover the MFU work (token-budget packing, grad accumulation, selected_modules AC, GDN-aware ModelDims) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…er.global_num_tokens_in_batch and unify the collator packing probe behind _collator_max_seq_length Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	pyproject.toml
#	requirements.txt
#	uv.lock
…ked_rows) so OLMo-core's dict batch contract, pre_train validation, and token accounting work natively; rank_microbatch_size = 2*max_seq_length tokens per packed row; drop microbatches_per_step and list-batch handling Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… (microbatches_per_step removed) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… for non-packed batches), removing the None fallbacks in train_batch and PerfCallback.pre_step and the now-unused per_device_train_batch_size field Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	open_instruct/dpo.py
#	open_instruct/olmo_core_utils.py
#	open_instruct/utils.py
#	pyproject.toml
#	requirements.txt
#	uv.lock
@finbarrtimbers finbarrtimbers marked this pull request as ready for review June 9, 2026 22:54