Rewrite layer-wise ONNX export as an API -> adds CustomLoader and Loop inside export #1048
Open
ochougul wants to merge 6 commits into
Replace the env-var (LAYERWISE_EXPORT) + monkeypatch driven layer-wise export flow with a generic, first-class API:

- Add a `layerwise` flag to from_pretrained for QEFFAutoModelForCausalLM and QEFFAutoModelForImageTextToText. When True, the model is built on the meta device and weights are streamed one decoder-layer window at a time during export.
- Add CustomLoader (QEfficient/utils/custom_loader.py): loads a window of decoder layers via HF from_pretrained while restricting sharded checkpoints to the window's layers (handles checkpoint->module weight conversion such as fused-MoE experts). Supports one or more layer prefixes (CausalLM and VLM language paths).
- Move the per-window loop (load -> apply transforms -> export -> split -> add prefix -> merge) into export(); compile() only forwards layerwise_window_size (default 1) and compiles the merged ONNX unchanged.
- Add layerwise_utils.py (window tiling, meta-model build, text-model resolution, window-state setattr; no monkeypatching).
- Support VLM dual-QPC layer-wise: vision encoder exported once, language decoder windowed and merged.
- Add total_layers compile kwarg (must be > 1) on both Auto classes to override the exported decoder-layer count.
- Auto-force use_onnx_subfunctions=True (with a warning) when layerwise.
- Remove all LAYERWISE_EXPORT env usage.
- Update layer-wise example scripts to the new flow.
- Add tests/utils/test_layerwise_utils.py.

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>
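To make the "window tiling" step concrete, here is a minimal sketch of how a decoder stack might be split into consecutive export windows. The helper name `tile_windows` and its signature are assumptions for illustration only; the actual helper in layerwise_utils.py is not shown in this PR text and may differ.

```python
from typing import List, Tuple


def tile_windows(total_layers: int, window_size: int = 1) -> List[Tuple[int, int]]:
    """Split [0, total_layers) into consecutive half-open (start, end) windows.

    Hypothetical helper: illustrates window tiling only, not the PR's code.
    The last window is shorter when window_size does not divide total_layers.
    """
    if total_layers < 1:
        raise ValueError("total_layers must be positive")
    if window_size < 1:
        raise ValueError("window_size must be positive")
    return [
        (start, min(start + window_size, total_layers))
        for start in range(0, total_layers, window_size)
    ]


if __name__ == "__main__":
    # Five decoder layers exported two at a time.
    for start, end in tile_windows(5, 2):
        print(f"export layers [{start}, {end})")
```

Each (start, end) pair would then drive one pass of the load -> apply transforms -> export -> split -> add prefix -> merge loop described above.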
Replace set_window_state class-level attribute mutation (type(text_model)._start, QEFFBaseModel._start) with instance attributes set on the relevant module instances.

- set_window_state now sets _start/_end/_total_layers as instance attrs on the text_model, propagates _start to child attention submodules, and optionally mirrors onto the qeff_wrapper instance.
- Model forwards read getattr(self, '_start', 0) instead of ClassName._start.
- Child attention modules read getattr(self, '_start', 0) instead of class lookup.
- QEffQwen3_5MoeDynamicCache.from_legacy_cache accepts start_layer parameter.
- Export loop call sites pass qeff_wrapper=self to set_window_state/reset_window_state.
- Class-level defaults (_start=0, _end=0, _total_layers=None) remain as inert fallbacks; never mutated at runtime.

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>
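The difference between mutating class attributes and setting instance attributes can be shown with plain Python objects. This is a simplified stand-in, not the PR's implementation: the classes `ToyAttention` and `ToyTextModel`, the `attention_modules` field, and the function signatures are assumptions, and the real helpers operate on torch modules.

```python
class ToyAttention:
    """Stand-in for a child attention submodule (hypothetical)."""

    def layer_offset(self) -> int:
        # Reads the instance attribute if set, else falls back to 0.
        return getattr(self, "_start", 0)


class ToyTextModel:
    """Stand-in for a text model (hypothetical)."""

    # Class-level defaults act as inert fallbacks and are never mutated.
    _start = 0
    _end = 0
    _total_layers = None

    def __init__(self) -> None:
        self.attention_modules = [ToyAttention(), ToyAttention()]


def set_window_state(text_model, start, end, total_layers):
    """Set window state on this instance only (assumed signature)."""
    text_model._start = start
    text_model._end = end
    text_model._total_layers = total_layers
    for child in text_model.attention_modules:
        child._start = start  # propagate the offset to child modules


def reset_window_state(text_model):
    """Drop instance attributes so the class-level defaults apply again."""
    for name in ("_start", "_end", "_total_layers"):
        text_model.__dict__.pop(name, None)
    for child in text_model.attention_modules:
        child.__dict__.pop("_start", None)


if __name__ == "__main__":
    a, b = ToyTextModel(), ToyTextModel()
    set_window_state(a, 2, 4, 8)
    print(a._start, b._start, ToyTextModel._start)
```

Because only the instance dictionary is touched, a second model object (or the class itself) never observes another export's window state.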
Switch qwen3moe and qwen3_vl_moe layerwise example scripts to use tiny-random models (yujiepan/qwen3-moe-tiny-random, tiny-random/qwen3-vl-moe) as defaults so they can run out-of-the-box without needing large model weights. The qwen3_5_moe scripts remain pointed at Qwen/Qwen3.5-397B-A17B (a pre-existing embedding-dtype bug in the qwen3_5 decoder wrapper prevents export with any available model).

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>
Update QEffQwen3_5DecoderWrapper.forward to:

- Accept inputs_embeds kwarg (skip embedding for non-first windows)
- Gate vision-merge logic on _start == 0
- Gate lm_head on _end == total_layers

This aligns with the pattern used in qwen3_5_moe and qwen3_vl_moe decoder wrappers, enabling the generic layerwise export flow.

Note: a separate issue remains where the hybrid linear-attention cache (conv_state) is not properly initialized for windowed export. This is a pre-existing gap in qwen3_5's cache construction, not related to this fix.

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>
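The gating pattern described here (embed only on the first window, apply the head only on the last) can be sketched with plain callables. This is an illustration under stated assumptions, not the wrapper's code: `windowed_forward` and its parameters are hypothetical, and vision merging and KV-cache handling are omitted.

```python
from typing import Any, Callable, Optional, Sequence


def windowed_forward(
    layers: Sequence[Callable[[Any], Any]],
    start: int,
    end: int,
    *,
    embed: Callable[[Any], Any],
    head: Callable[[Any], Any],
    input_ids: Optional[Any] = None,
    inputs_embeds: Optional[Any] = None,
) -> Any:
    """Run layers[start:end]; embed only when start == 0, head only at the end."""
    total_layers = len(layers)
    if start == 0:
        # First window: compute embeddings from token ids.
        hidden = embed(input_ids)
    elif inputs_embeds is None:
        raise ValueError("non-first windows need inputs_embeds from the previous window")
    else:
        # Later windows: resume from the previous window's hidden states.
        hidden = inputs_embeds
    for layer in layers[start:end]:
        hidden = layer(hidden)
    if end == total_layers:
        # Last window: apply the output head.
        return head(hidden)
    return hidden


if __name__ == "__main__":
    layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    kwargs = dict(embed=lambda ids: ids * 10, head=lambda h: -h)
    h = windowed_forward(layers, 0, 1, input_ids=1, **kwargs)
    h = windowed_forward(layers, 1, 2, inputs_embeds=h, **kwargs)
    print(windowed_forward(layers, 2, 3, inputs_embeds=h, **kwargs))
```

Chaining the three single-layer windows gives the same result as one pass over all layers, which is the property the merged ONNX relies on.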
Add layerwise export support to QEffQwen3_5TextModel and its decoder wrapper, mirroring the pattern already used in qwen3_5_moe/qwen3_vl_moe:

- QEffQwen3_5DynamicCache.from_legacy_cache: add start_layer param for correct cache offset indexing during windowed export.
- QEffQwen3_5TextModel.forward: add _start/_end layer skipping, gate norm on _end==total_layers, return single-layer legacy cache.
- QEffQwen3_5DecoderWrapper.forward: return variable outputs per window (vision_embeds/image_idx only on first window; lm_head only on last).
- get_dummy_inputs: populate cache for all layers so _export_layerwise can index any window.
- Example scripts default to Qwen/Qwen3.5-0.8B with total_layers=2.

Validated: layerwise export produces merged ONNX successfully. Compile fails due to a qaic-compile subfunction naming issue (unrelated).

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>
Summary
Replace the env-var (`LAYERWISE_EXPORT`) + monkeypatch driven layer-wise export with a generic, first-class API.

Key changes

- `layerwise` flag on `from_pretrained`: builds the model on the `meta` device; weights are streamed per decoder-layer window during export.
- `CustomLoader` (`QEfficient/utils/custom_loader.py`): HF-assisted window loader that restricts sharded checkpoints to the active layer window (handles fused-MoE etc.). Supports one or more layer prefixes.
- Per-window loop moved into `export()`: load → apply transforms → export → split → add-prefix → merge; `compile()` only forwards `layerwise_window_size` and compiles the merged ONNX unchanged.
- `layerwise_utils.py`: window tiling, meta-model build, text-model resolution, window-state setattr (no monkeypatching).
- `total_layers` kwarg on both Auto classes (must be > 1) to override the exported decoder-layer count.
- `use_onnx_subfunctions=True` is forced when `layerwise=True` (with a warning).
- Removed all `LAYERWISE_EXPORT` env-var usage from the package.
- Example scripts updated to use `from_pretrained(layerwise=True)` + `compile(layerwise_window_size=...)`.
- Added `tests/utils/test_layerwise_utils.py`; validated on AIC compiler for CausalLM and VLM dual-QPC.
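As a rough illustration of "restricting sharded checkpoints to the active layer window", the sketch below selects which shard files a window needs from a Hugging Face style `weight_map` (the parameter-name to shard-file mapping found in a sharded checkpoint's index JSON). The function `shards_for_window`, the prefix `model.layers`, and the parameter names in the example are assumptions; the real CustomLoader also performs checkpoint->module weight conversion, which is not modelled here.

```python
import re
from typing import Dict, Iterable, Set


def shards_for_window(
    weight_map: Dict[str, str],
    layer_prefixes: Iterable[str],
    start: int,
    end: int,
) -> Set[str]:
    """Return shard files holding weights for decoder layers in [start, end).

    Hypothetical helper. Keys are assumed to look like
    '<prefix>.<layer_index>.<rest>', e.g. 'model.layers.3.mlp.up_proj.weight'.
    """
    patterns = [re.compile(re.escape(p) + r"\.(\d+)\.") for p in layer_prefixes]
    needed: Set[str] = set()
    for name, shard in weight_map.items():
        for pattern in patterns:
            match = pattern.match(name)
            # Compare the parsed index numerically so layer 1 never matches layer 10.
            if match and start <= int(match.group(1)) < end:
                needed.add(shard)
                break
    return needed


if __name__ == "__main__":
    weight_map = {
        "model.embed_tokens.weight": "shard-1.safetensors",
        "model.layers.0.mlp.up_proj.weight": "shard-1.safetensors",
        "model.layers.1.mlp.up_proj.weight": "shard-2.safetensors",
        "model.layers.10.mlp.up_proj.weight": "shard-3.safetensors",
    }
    print(sorted(shards_for_window(weight_map, ["model.layers"], 1, 2)))
```

Accepting several prefixes mirrors the "one or more layer prefixes" wording, which covers both CausalLM and VLM language paths.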