Now, we support the hybrid model in our Olmo-core code. #1713
finbarrtimbers wants to merge 37 commits into
…emory Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code Review
This pull request adds support for hybrid models with linear attention layers (such as Gated Delta Net) to the model-dimension and FLOPs calculation utilities, along with corresponding unit tests. It also updates the DPO training sweep scripts to use public SFT models and adds a new sweep script that uses OLMo-core. Feedback on these changes highlights two issues. First, removing the SFT_LR variable in 7b_instruct_dpo_sweep.sh leaves a broken reference in the experiment description. Second, direct attribute access on the configuration object in utils.py should be replaced with getattr to prevent an AttributeError when optional attributes are missing.
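The getattr suggestion can be illustrated with a minimal sketch. The attribute name `num_linear_attention_layers` and the helper `linear_attention_layers` are hypothetical stand-ins, not the actual fields or functions in utils.py; only the `getattr(obj, name, default)` pattern is the point.

```python
from types import SimpleNamespace


def linear_attention_layers(config):
    """Return the number of linear-attention layers, or 0 if the field is absent.

    `num_linear_attention_layers` is a hypothetical attribute name used only
    for illustration; the real config may use a different field.
    """
    # getattr with a default returns 0 instead of raising AttributeError
    # when the config object does not define the optional attribute.
    return getattr(config, "num_linear_attention_layers", 0)


# A dense-model config without the optional attribute.
dense_config = SimpleNamespace(num_hidden_layers=32)
# A hybrid-model config that defines it.
hybrid_config = SimpleNamespace(num_hidden_layers=32, num_linear_attention_layers=24)

dense_count = linear_attention_layers(dense_config)
hybrid_count = linear_attention_layers(hybrid_config)
```

With direct access (`config.num_linear_attention_layers`), the dense config would raise; with the default, the dense path simply reports zero linear-attention layers.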
… tests Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…pport Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…matching ZeRO-3 reference Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… GDN at 16k seq Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eckpoint of GDN op fails recompute metadata check Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…GDN (checkpoint only compile-safe MLPs, leave opaque GDN mixer activations live)
…selected_modules AC Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rminism check Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e checkpoint (avoids full-mode inductor stride guard failure) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Triton>=3.4 Hopper kernel (fla #640) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ain/* and perf/* keys, add learning_rate/epoch/training_step) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Documentation Changes Detected📄
…_length) and wire HSDP knobs to cut padding-FLOP waste Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…capping at per_device_batch×GAS sequences (fixes padding-FLOP MFU waste); revert bucketing approach Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ulation (microbatches_per_step); add train/padding_fraction and train/sequences_per_step metrics Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sample_cap doesn't load the dataset Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mpute MFU (metric refactor moved it into the deferred callback, breaking get_metric) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ipt and extend CHANGELOG entry to cover the MFU work (token-budget packing, grad accumulation, selected_modules AC, GDN-aware ModelDims) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…er.global_num_tokens_in_batch and unify the collator packing probe behind _collator_max_seq_length Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts: # CHANGELOG.md # pyproject.toml # requirements.txt # uv.lock
…ked_rows) so OLMo-core's dict batch contract, pre_train validation, and token accounting work natively; rank_microbatch_size = 2*max_seq_length tokens per packed row; drop microbatches_per_step and list-batch handling Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… (microbatches_per_step removed) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… for non-packed batches), removing the None fallbacks in train_batch and PerfCallback.pre_step and the now-unused per_device_train_batch_size field Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts: # CHANGELOG.md # open_instruct/dpo.py # open_instruct/olmo_core_utils.py # open_instruct/utils.py # pyproject.toml # requirements.txt # uv.lock
Summary
Adds Olmo-Hybrid (GDN) support to the OLMo-core DPO trainer (dpo.py) and substantially improves its MFU:

- olmo3_hybrid_7B config preset and HF→olmo-core hybrid weight conversion (convert_hybrid_state_from_hf).
- Packing fills a max_seq_length token budget instead of capping at per_device_train_batch_size sequences (the cap was the root cause of ~7% MFU).
- Packed rows are stacked (stack_packed_rows/unstack_packed_rows) so OLMo-core's dict batch contract, pre_train batch-size validation, and token/FLOPs accounting all work natively; gradient accumulation = packed rows per rank per step, rank_microbatch_size = 2 × max_seq_length tokens per packed row.
- Metrics go through DPOMetricsCallback (standard OLMo-core callback pattern) with ReduceType.sum numerator/denominator reduction.
- selected_modules activation checkpointing mode so torch.compile and AC coexist with GDN.
- ModelDims FLOPs/memory made GDN-aware for correct MFU reporting.

MFU on the multi-node debug config (OLMo-2-7B, 16k seq, packing, TP=2): 20.8% → 30.5% vs the previous cap-based packing at identical config (1.87 s/step vs 2.78 s/step).
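To make the packing change concrete, here is a minimal sketch of why a sequence-count cap wastes padding FLOPs compared with filling a token budget. It is a simplified greedy packer written for illustration, not the collator in dpo.py; `pack_rows`, `padding_fraction`, and the example lengths are all hypothetical.

```python
def pack_rows(lengths, max_seq_length, max_sequences=None):
    """Greedily pack sequence lengths into rows of at most max_seq_length tokens.

    If max_sequences is given, a row is also closed once it holds that many
    sequences (the cap-based behaviour); otherwise only the token budget applies.
    """
    rows, current, used = [], [], 0
    for n in lengths:
        if n > max_seq_length:
            raise ValueError(f"sequence of {n} tokens exceeds budget {max_seq_length}")
        over_budget = used + n > max_seq_length
        over_cap = max_sequences is not None and len(current) >= max_sequences
        if current and (over_budget or over_cap):
            rows.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        rows.append(current)
    return rows


def padding_fraction(rows, max_seq_length):
    """Fraction of token slots that are padding when every row is padded to the budget."""
    if not rows:
        return 0.0
    total_slots = len(rows) * max_seq_length
    real_tokens = sum(sum(row) for row in rows)
    return 1.0 - real_tokens / total_slots


# Eight short sequences and a 1000-token budget (illustrative numbers only).
lengths = [100] * 8
capped_rows = pack_rows(lengths, max_seq_length=1000, max_sequences=2)
budget_rows = pack_rows(lengths, max_seq_length=1000)

capped_padding = padding_fraction(capped_rows, 1000)
budget_padding = padding_fraction(budget_rows, 1000)
```

In this toy case the cap produces four mostly empty rows while the token budget produces one mostly full row, so the same real tokens cost a quarter of the padded compute. The actual gain depends on the length distribution of the dataset.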
Runs:
GPU_TESTS=01KTCG94JXFMJQES1DERQR1JRM
🤖 Generated with Claude Code