Rewrite layer-wise ONNX export as an API -> adds CustomLoader and Loop inside export by ochougul · Pull Request #1048 · quic/efficient-transformers

ochougul · 2026-06-05T22:48:27Z

Summary

Replace the env-var (LAYERWISE_EXPORT) + monkeypatch driven layer-wise export with a generic, first-class API.

Key changes

layerwise flag on from_pretrained — builds model on meta device; weights streamed per decoder-layer window during export.
CustomLoader (QEfficient/utils/custom_loader.py) — HF-assisted window loader that restricts sharded checkpoints to the active layer window (handles fused-MoE etc.). Supports one or more layer prefixes.
Window loop in export() — load → apply transforms → export → split → add-prefix → merge; compile() only forwards layerwise_window_size and compiles merged ONNX unchanged.
layerwise_utils.py — window tiling, meta-model build, text-model resolution, window-state setattr (no monkeypatching).
VLM dual-QPC support — vision encoder exported once; language decoder windowed and merged.
total_layers kwarg on both Auto classes (must be > 1) to override exported decoder-layer count.
Auto-force use_onnx_subfunctions=True when layerwise=True (with warning).
Remove all LAYERWISE_EXPORT env-var usage from the package.
Updated example scripts to new flow: from_pretrained(layerwise=True) + compile(layerwise_window_size=...).
Tests added in tests/utils/test_layerwise_utils.py; validated on AIC compiler for CausalLM and VLM dual-QPC.
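
The window tiling mentioned for layerwise_utils.py can be pictured with a small sketch. The helper name `tile_windows`, its signature, and the choice to let the last window be shorter when the layer count is not a multiple of the window size are assumptions made for illustration, not taken from the PR:

```python
from typing import List, Tuple


def tile_windows(total_layers: int, window_size: int = 1) -> List[Tuple[int, int]]:
    """Split [0, total_layers) into consecutive half-open (start, end) windows.

    Hypothetical helper for illustration only; the real layerwise_utils.py
    may use a different name, signature, or remainder policy.
    """
    if total_layers < 1:
        raise ValueError("total_layers must be positive")
    if window_size < 1:
        raise ValueError("window_size must be positive")
    return [
        (start, min(start + window_size, total_layers))
        for start in range(0, total_layers, window_size)
    ]


# Five decoder layers exported two at a time; the last window holds one layer.
windows = tile_windows(total_layers=5, window_size=2)
print(windows)
```

Each (start, end) pair would then drive one pass of the load, transform, export, split, add-prefix, merge loop described above.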

Replace the env-var (LAYERWISE_EXPORT) + monkeypatch driven layer-wise export flow with a generic, first-class API:

- Add a `layerwise` flag to from_pretrained for QEFFAutoModelForCausalLM and QEFFAutoModelForImageTextToText. When True, the model is built on the meta device and weights are streamed one decoder-layer window at a time during export.
- Add CustomLoader (QEfficient/utils/custom_loader.py): loads a window of decoder layers via HF from_pretrained while restricting sharded checkpoints to the window's layers (handles checkpoint->module weight conversion such as fused-MoE experts). Supports one or more layer prefixes (CausalLM and VLM language paths).
- Move the per-window loop (load -> apply transforms -> export -> split -> add prefix -> merge) into export(); compile() only forwards layerwise_window_size (default 1) and compiles the merged ONNX unchanged.
- Add layerwise_utils.py (window tiling, meta-model build, text-model resolution, window-state setattr; no monkeypatching).
- Support VLM dual-QPC layer-wise: vision encoder exported once, language decoder windowed and merged.
- Add total_layers compile kwarg (must be > 1) on both Auto classes to override the exported decoder-layer count.
- Auto-force use_onnx_subfunctions=True (with a warning) when layerwise.
- Remove all LAYERWISE_EXPORT env usage.
- Update layer-wise example scripts to the new flow.
- Add tests/utils/test_layerwise_utils.py.

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>
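
One way to picture "restricting sharded checkpoints to the window's layers" is to filter the `weight_map` of a Hugging Face sharded-checkpoint index by layer prefix and layer number. This is only a sketch of the idea: the function `window_weight_map`, the sample parameter names, and the shard file names are hypothetical, and CustomLoader itself may work quite differently (for example, it also has to handle non-layer weights and fused-MoE conversion):

```python
import re
from typing import Dict, Sequence


def window_weight_map(
    weight_map: Dict[str, str],
    layer_prefixes: Sequence[str],
    start: int,
    end: int,
) -> Dict[str, str]:
    """Keep only parameters of layers [start, end) under any given prefix.

    Hypothetical helper; not the CustomLoader implementation.
    """
    # "<prefix>.<index>." with the index captured as an integer, so that
    # layer 1 is never confused with layer 10.
    patterns = [re.compile(rf"^{re.escape(p)}\.(\d+)\.") for p in layer_prefixes]
    selected: Dict[str, str] = {}
    for name, shard in weight_map.items():
        for pattern in patterns:
            match = pattern.match(name)
            if match and start <= int(match.group(1)) < end:
                selected[name] = shard
                break
    return selected


# Made-up index contents in the shape of a sharded-checkpoint "weight_map".
weight_map = {
    "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
    "model.layers.1.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
    "model.layers.10.mlp.gate.weight": "model-00003-of-00003.safetensors",
}

window = window_weight_map(weight_map, ["model.layers"], start=1, end=2)
shards = sorted(set(window.values()))  # shard files this window needs
print(window, shards)
```

Passing several prefixes covers the case where decoder layers live under more than one module path, as with the CausalLM and VLM language paths.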

Replace set_window_state class-level attribute mutation (type(text_model)._start, QEFFBaseModel._start) with instance attributes set on the relevant module instances.

- set_window_state now sets _start/_end/_total_layers as instance attrs on the text_model, propagates _start to child attention submodules, and optionally mirrors onto the qeff_wrapper instance.
- Model forwards read getattr(self, '_start', 0) instead of ClassName._start.
- Child attention modules read getattr(self, '_start', 0) instead of class lookup.
- QEffQwen3_5MoeDynamicCache.from_legacy_cache accepts start_layer parameter.
- Export loop call sites pass qeff_wrapper=self to set_window_state/reset_window_state.
- Class-level defaults (_start=0, _end=0, _total_layers=None) remain as inert fallbacks; never mutated at runtime.

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>
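
The difference between mutating class attributes and setting instance attributes can be shown with plain Python objects. The sketch below reuses the names set_window_state and reset_window_state from this commit, but the signatures and bodies are assumptions, and `ToyTextModel` is a stand-in for the real modules:

```python
class ToyTextModel:
    # Inert class-level defaults, used only as fallbacks.
    _start = 0
    _end = 0
    _total_layers = None

    def window(self):
        # Instance lookup first; falls back to the class default.
        return getattr(self, "_start", 0), getattr(self, "_end", 0)


def set_window_state(model, start, end, total_layers):
    # Assumed signature: plain instance attributes that shadow the
    # class defaults for this object only.
    model._start = start
    model._end = end
    model._total_layers = total_layers


def reset_window_state(model):
    # Dropping the instance attributes re-exposes the class defaults.
    for name in ("_start", "_end", "_total_layers"):
        model.__dict__.pop(name, None)


a, b = ToyTextModel(), ToyTextModel()
set_window_state(a, start=2, end=4, total_layers=8)
print(a.window(), b.window(), ToyTextModel._start)  # a changed; b and the class did not
reset_window_state(a)
print(a.window())
```

Because nothing is written to the class, two models of the same type can hold different window state at the same time, which class-level mutation would not allow.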

Switch qwen3moe and qwen3_vl_moe layerwise example scripts to use tiny-random models (yujiepan/qwen3-moe-tiny-random, tiny-random/qwen3-vl-moe) as defaults so they can run out-of-the-box without needing large model weights. The qwen3_5_moe scripts remain pointed at Qwen/Qwen3.5-397B-A17B (a pre-existing embedding-dtype bug in the qwen3_5 decoder wrapper prevents export with any available model).

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>

Update QEffQwen3_5DecoderWrapper.forward to:

- Accept inputs_embeds kwarg (skip embedding for non-first windows)
- Gate vision-merge logic on _start == 0
- Gate lm_head on _end == total_layers

This aligns with the pattern used in qwen3_5_moe and qwen3_vl_moe decoder wrappers, enabling the generic layerwise export flow.

Note: a separate issue remains where the hybrid linear-attention cache (conv_state) is not properly initialized for windowed export; this is a pre-existing gap in qwen3_5's cache construction, not related to this fix.

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>
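
The gating pattern above can be sketched without any model code. `ToyDecoderWrapper` below is a hypothetical stand-in, not QEffQwen3_5DecoderWrapper: "embedding" is a float cast, each "layer" adds 1.0, and the returned `steps` list just records which stages ran for a given window:

```python
from typing import List, Optional, Tuple


class ToyDecoderWrapper:
    """Illustrative only; mimics the _start/_end gating, not the real forward."""

    def __init__(self, total_layers: int):
        self.total_layers = total_layers
        self._start = 0
        self._end = total_layers

    def forward(
        self, input_ids: List[int], inputs_embeds: Optional[List[float]] = None
    ) -> Tuple[List[float], List[str]]:
        steps: List[str] = []
        if self._start == 0:
            # First window: embed tokens (vision merge would be gated here too).
            hidden = [float(t) for t in input_ids]
            steps.append("embed")
        else:
            # Later windows consume the previous window's hidden states.
            if inputs_embeds is None:
                raise ValueError("inputs_embeds is required when _start > 0")
            hidden = list(inputs_embeds)
        for layer_idx in range(self._start, self._end):
            hidden = [h + 1.0 for h in hidden]  # stand-in for a decoder layer
            steps.append(f"layer{layer_idx}")
        if self._end == self.total_layers:
            steps.append("lm_head")  # only the last window projects to logits
        return hidden, steps


wrapper = ToyDecoderWrapper(total_layers=3)
wrapper._start, wrapper._end = 0, 2
hidden_1, steps_1 = wrapper.forward([1, 2])
wrapper._start, wrapper._end = 2, 3
hidden_2, steps_2 = wrapper.forward([1, 2], inputs_embeds=hidden_1)
print(steps_1, steps_2, hidden_2)
```

Chaining the windows this way applies every layer exactly once, while the embedding runs only in the first window and lm_head only in the last.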

Add layerwise export support to QEffQwen3_5TextModel and its decoder wrapper, mirroring the pattern already used in qwen3_5_moe/qwen3_vl_moe:

- QEffQwen3_5DynamicCache.from_legacy_cache: add start_layer param for correct cache offset indexing during windowed export.
- QEffQwen3_5TextModel.forward: add _start/_end layer skipping, gate norm on _end==total_layers, return single-layer legacy cache.
- QEffQwen3_5DecoderWrapper.forward: return variable outputs per window (vision_embeds/image_idx only on first window; lm_head only on last).
- get_dummy_inputs: populate cache for all layers so _export_layerwise can index any window.
- Example scripts default to Qwen/Qwen3.5-0.8B with total_layers=2.

Validated: layerwise export produces merged ONNX successfully. Compile fails due to a qaic-compile subfunction naming issue (unrelated).

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>

ochougul added 2 commits June 6, 2026 04:01

ran linter

059c75e

ochougul changed the title ~~Rewrite layer-wise ONNX export as a first-class API~~ Rewrite layer-wise ONNX export as an API -> adds CustomLoader and Loop inside export Jun 5, 2026

ochougul added 4 commits June 6, 2026 10:48

ochougul wants to merge 6 commits into release/v1.22.0_tmp from layerwise_rewrite