feat(packing): add opt-in packing_strategy (binpack | sequential) for order-preserving packing#9598
Conversation
The packing path uses binpacking.to_constant_volume (best-fit-decreasing), which reorders samples and changes pack composition. Recipes relying on a sequential sampler (sample order / pack boundaries are part of the recipe, e.g. domain weighting encoded in sample order) cannot be reproduced faithfully. Add packing_strategy: - 'binpack' (default, unchanged): best-fit-decreasing bin packing. - 'sequential': order-preserving greedy packing (next-fit) - single open pack, flush on overflow, carry trailing pack to next batch; sample order and pack boundaries follow input order (use packing_num_proc=1 for global order). Threaded through PackingDataset / IterablePackingDataset and BaseArguments; default behavior unchanged. Adds zh/en docs. Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a new packing_strategy parameter, allowing users to choose between the default 'binpack' (best-fit-decreasing) and a new 'sequential' (order-preserving greedy next-fit) packing algorithm. This parameter is documented, added to the base arguments, and integrated into the dataset packing process. The review feedback suggests validating the strategy parameter in calculate_matched_group to raise a ValueError for unsupported strategies instead of silently falling back to 'binpack'.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| def calculate_matched_group(sequences, packing_length: int, is_finished: bool = True, strategy: str = 'binpack'): | ||
| if len(sequences) == 0: | ||
| return [], [] | ||
| # https://arxiv.org/pdf/2404.10830 | ||
| if strategy == 'sequential': |
There was a problem hiding this comment.
The strategy parameter is not validated. If an invalid strategy name is passed, the function will silently fall back to the default 'binpack' strategy. It is safer to explicitly validate the strategy and raise a ValueError if an unsupported strategy is provided.
| def calculate_matched_group(sequences, packing_length: int, is_finished: bool = True, strategy: str = 'binpack'): | |
| if len(sequences) == 0: | |
| return [], [] | |
| # https://arxiv.org/pdf/2404.10830 | |
| if strategy == 'sequential': | |
| def calculate_matched_group(sequences, packing_length: int, is_finished: bool = True, strategy: str = 'binpack'): | |
| if strategy not in ('binpack', 'sequential'): | |
| raise ValueError(f"Unknown packing strategy: {strategy}. Supported strategies are 'binpack' and 'sequential'.") | |
| if len(sequences) == 0: | |
| return [], [] | |
| if strategy == 'sequential': |
What
The packing path uses
binpacking.to_constant_volume(best-fit-decreasing), which reorders samples (by length) and changes which samples land in each pack. For recipes that rely on a sequential sampler — where the sample order and pack boundaries are themselves part of the recipe (e.g. domain weighting encoded in the sample order, as in StepFun Step-3.5) — this reordering breaks faithful reproduction.This adds an opt-in
packing_strategy:binpack(default, unchanged): best-fit-decreasing bin packing.sequential: order-preserving greedy packing (next-fit) — keep a single open pack, flush it when the next sample doesn't fit, and carry the trailing unfinished pack to the next batch. Sample order and pack boundaries follow the input order (usepacking_num_proc=1for a single global ordering).Change
calculate_matched_group(..., strategy='binpack'): add thesequential(next-fit) branch; the defaultbinpackpath is unchanged.PackingDataset/IterablePackingDataset: acceptpacking_strategyand pass it through tocalculate_matched_group.BaseArguments.packing_strategy: Literal['binpack', 'sequential'] = 'binpack', threaded throughsft.py.Default behavior is unchanged (
binpack).Testing
Verified syntactically (
py_compile) and traced the call path (args →sft.py→PackingDataset/IterablePackingDataset→calculate_matched_group). I could not run the full training suite locally; relying on CI. Thesequentialalgorithm (order-preserving next-fit with carry-over) was used in real Qwen3.5-35B SFT to reproduce a sequential-sampler recipe.Note
Named
sequential(notfirst_fit) because the algorithm keeps a single open pack and flushes on overflow — i.e. next-fit — and because the user-facing property is order preservation. Happy to rename if maintainers prefer.