Rewrite layer-wise ONNX export as an API -> adds CustomLoader and Loop inside export #1048

Open
ochougul wants to merge 6 commits into release/v1.22.0_tmp from layerwise_rewrite

Contributor

ochougul commented Jun 5, 2026

Summary

Replace the layer-wise export flow driven by the LAYERWISE_EXPORT env var and monkeypatching with a generic, first-class API.

Key changes

  • layerwise flag on from_pretrained — builds model on meta device; weights streamed per decoder-layer window during export.
  • CustomLoader (QEfficient/utils/custom_loader.py) — HF-assisted window loader that restricts sharded checkpoints to the active layer window (handles fused-MoE etc.). Supports one or more layer prefixes.
  • Window loop in export() — load → apply transforms → export → split → add-prefix → merge; compile() only forwards layerwise_window_size and compiles merged ONNX unchanged.
  • layerwise_utils.py — window tiling, meta-model build, text-model resolution, window-state setattr (no monkeypatching).
  • VLM dual-QPC support — vision encoder exported once; language decoder windowed and merged.
  • total_layers kwarg on both Auto classes (must be > 1) to override exported decoder-layer count.
  • Auto-force use_onnx_subfunctions=True when layerwise=True (with warning).
  • Remove all LAYERWISE_EXPORT env-var usage from the package.
  • Updated example scripts to new flow: from_pretrained(layerwise=True) + compile(layerwise_window_size=...).
  • Tests added in tests/utils/test_layerwise_utils.py; validated on AIC compiler for CausalLM and VLM dual-QPC.
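
The window tiling mentioned above can be illustrated with a minimal sketch. The helper name `tile_windows` and its signature are hypothetical, not taken from layerwise_utils.py, and letting the last window be shorter when `total_layers` is not a multiple of the window size is an assumption.

```python
from typing import List, Tuple


def tile_windows(total_layers: int, window_size: int = 1) -> List[Tuple[int, int]]:
    """Split [0, total_layers) into consecutive half-open (start, end) windows.

    Hypothetical helper for illustration; the last window may be shorter.
    """
    if total_layers < 1 or window_size < 1:
        raise ValueError("total_layers and window_size must be positive")
    return [
        (start, min(start + window_size, total_layers))
        for start in range(0, total_layers, window_size)
    ]


# Each (start, end) pair would drive one load -> export pass over layers[start:end].
for start, end in tile_windows(total_layers=5, window_size=2):
    print(f"export decoder layers [{start}, {end})")
```

With a window size of 1 (the stated default for layerwise_window_size), this yields one window per decoder layer.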

ochougul added 2 commits June 6, 2026 04:01
Replace the env-var (LAYERWISE_EXPORT) + monkeypatch driven layer-wise
export flow with a generic, first-class API:

- Add a `layerwise` flag to from_pretrained for QEFFAutoModelForCausalLM
  and QEFFAutoModelForImageTextToText. When True, the model is built on
  the meta device and weights are streamed one decoder-layer window at a
  time during export.
- Add CustomLoader (QEfficient/utils/custom_loader.py): loads a window of
  decoder layers via HF from_pretrained while restricting sharded
  checkpoints to the window's layers (handles checkpoint->module weight
  conversion such as fused-MoE experts). Supports one or more layer
  prefixes (CausalLM and VLM language paths).
- Move the per-window loop (load -> apply transforms -> export -> split ->
  add prefix -> merge) into export(); compile() only forwards
  layerwise_window_size (default 1) and compiles the merged ONNX unchanged.
- Add layerwise_utils.py (window tiling, meta-model build, text-model
  resolution, window-state setattr; no monkeypatching).
- Support VLM dual-QPC layer-wise: vision encoder exported once, language
  decoder windowed and merged.
- Add total_layers compile kwarg (must be > 1) on both Auto classes to
  override the exported decoder-layer count.
- Auto-force use_onnx_subfunctions=True (with a warning) when layerwise=True.
- Remove all LAYERWISE_EXPORT env usage.
- Update layer-wise example scripts to the new flow.
- Add tests/utils/test_layerwise_utils.py.
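
As a rough illustration of restricting a checkpoint to the active window, the sketch below filters checkpoint tensor names by layer prefix and layer index. It is not the CustomLoader implementation: the function name `keys_in_window`, the `<prefix>.<index>.<rest>` key layout, and the choice to leave non-layer tensors (embeddings, final norm, lm_head) out of the result are all assumptions.

```python
import re
from typing import Iterable, List, Sequence


def keys_in_window(
    keys: Iterable[str],
    layer_prefixes: Sequence[str],
    start: int,
    end: int,
) -> List[str]:
    """Return checkpoint keys whose decoder-layer index lies in [start, end).

    Hypothetical helper; assumes keys look like "<prefix>.<index>.<rest>".
    """
    # One pattern per prefix, e.g. "model.layers" matches "model.layers.<idx>.<rest>".
    patterns = [re.compile(re.escape(prefix) + r"\.(\d+)\.") for prefix in layer_prefixes]
    selected = []
    for key in keys:
        for pattern in patterns:
            match = pattern.match(key)  # anchored at the start of the key
            if match and start <= int(match.group(1)) < end:
                selected.append(key)
                break
    return selected


example_keys = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.1.mlp.gate.weight",
    "model.layers.10.mlp.gate.weight",
    "model.embed_tokens.weight",
]
print(keys_in_window(example_keys, ["model.layers"], start=1, end=2))
```

Passing more than one prefix covers the case where decoder layers live under different module paths, as in the CausalLM and VLM language paths.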

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>
ochougul changed the title from "Rewrite layer-wise ONNX export as a first-class API" to "Rewrite layer-wise ONNX export as an API -> adds CustomLoader and Loop inside export" Jun 5, 2026
ochougul added 4 commits June 6, 2026 10:48
Replace set_window_state class-level attribute mutation (type(text_model)._start,
QEFFBaseModel._start) with instance attributes set on the relevant module instances.

- set_window_state now sets _start/_end/_total_layers as instance attrs on the
  text_model, propagates _start to child attention submodules, and optionally
  mirrors onto the qeff_wrapper instance.
- Model forwards read getattr(self, '_start', 0) instead of ClassName._start.
- Child attention modules read getattr(self, '_start', 0) instead of class lookup.
- QEffQwen3_5MoeDynamicCache.from_legacy_cache accepts start_layer parameter.
- Export loop call sites pass qeff_wrapper=self to set_window_state/reset_window_state.
- Class-level defaults (_start=0, _end=0, _total_layers=None) remain as inert
  fallbacks; never mutated at runtime.
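
A minimal sketch of the instance-attribute pattern described in this commit, using a plain Python class instead of a torch module. The `TextModel` class, the simplified `set_window_state` signature (no attention-submodule propagation, no qeff_wrapper), and the way `reset_window_state` clears state are assumptions made for illustration.

```python
class TextModel:
    """Stand-in for a text model; not the real QEfficient class."""

    # Class-level defaults: inert fallbacks, never mutated at runtime.
    _start = 0
    _end = 0
    _total_layers = None

    def active_window(self):
        # Read through the instance; falls back to the class default.
        return getattr(self, "_start", 0), getattr(self, "_end", 0)


def set_window_state(text_model, start, end, total_layers):
    """Store the window on this instance only (simplified, hypothetical signature)."""
    text_model._start = start
    text_model._end = end
    text_model._total_layers = total_layers


def reset_window_state(text_model):
    """Drop the instance attributes so the class defaults show through again."""
    for name in ("_start", "_end", "_total_layers"):
        vars(text_model).pop(name, None)


first, second = TextModel(), TextModel()
set_window_state(first, start=2, end=4, total_layers=8)
print(first.active_window())   # window set on `first`
print(second.active_window())  # `second` still sees the class defaults
```

Because the state lives on the instance, a second model of the same class is unaffected, which is the hazard the class-level mutation carried.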

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>

Switch qwen3moe and qwen3_vl_moe layerwise example scripts to use
tiny-random models (yujiepan/qwen3-moe-tiny-random, tiny-random/qwen3-vl-moe)
as defaults so they can run out-of-the-box without needing large model weights.

The qwen3_5_moe scripts remain pointed at Qwen/Qwen3.5-397B-A17B (a
pre-existing embedding-dtype bug in the qwen3_5 decoder wrapper prevents
export with any available model).

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>

Update QEffQwen3_5DecoderWrapper.forward to:
- Accept inputs_embeds kwarg (skip embedding for non-first windows)
- Gate vision-merge logic on _start == 0
- Gate lm_head on _end == total_layers

This aligns with the pattern used in qwen3_5_moe and qwen3_vl_moe
decoder wrappers, enabling the generic layerwise export flow.
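
The gating described above can be sketched with toy callables in place of real modules. This is an illustration of the pattern only: `window_forward` and its arguments are hypothetical, the vision-merge gate is omitted, and the real wrapper reads `_start`/`_end` from the module rather than taking them as parameters.

```python
def window_forward(layers, start, end, embed, lm_head, input_ids=None, inputs_embeds=None):
    """Run layers[start:end], embedding only on the first window and
    applying lm_head only on the last one (toy sketch, hypothetical names)."""
    total_layers = len(layers)
    if start == 0:
        hidden = embed(input_ids)      # first window: embed token ids
    else:
        hidden = inputs_embeds         # later windows: reuse previous hidden state
    for layer in layers[start:end]:
        hidden = layer(hidden)
    if end == total_layers:
        return lm_head(hidden)         # last window: produce final output
    return hidden


# Toy "model": numbers stand in for tensors.
toy_layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
toy_embed = lambda ids: ids * 10
toy_lm_head = lambda h: -h

full = window_forward(toy_layers, 0, 3, toy_embed, toy_lm_head, input_ids=1)

# Same computation, one layer per window, chaining hidden states between windows.
h = window_forward(toy_layers, 0, 1, toy_embed, toy_lm_head, input_ids=1)
h = window_forward(toy_layers, 1, 2, toy_embed, toy_lm_head, inputs_embeds=h)
windowed = window_forward(toy_layers, 2, 3, toy_embed, toy_lm_head, inputs_embeds=h)
print(full == windowed)
```

Chaining the windows reproduces the single full pass, which is the property the merged ONNX relies on.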

Note: a separate issue remains where the hybrid linear-attention cache
(conv_state) is not properly initialized for windowed export — this is
a pre-existing gap in qwen3_5's cache construction, not related to this
fix.

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>

Add layerwise export support to QEffQwen3_5TextModel and its decoder
wrapper, mirroring the pattern already used in qwen3_5_moe/qwen3_vl_moe:

- QEffQwen3_5DynamicCache.from_legacy_cache: add start_layer param for
  correct cache offset indexing during windowed export.
- QEffQwen3_5TextModel.forward: add _start/_end layer skipping, gate
  norm on _end==total_layers, return single-layer legacy cache.
- QEffQwen3_5DecoderWrapper.forward: return variable outputs per window
  (vision_embeds/image_idx only on first window; lm_head only on last).
- get_dummy_inputs: populate cache for all layers so _export_layerwise
  can index any window.
- Example scripts default to Qwen/Qwen3.5-0.8B with total_layers=2.

Validated: layerwise export produces merged ONNX successfully. Compile
fails due to a qaic-compile subfunction naming issue (unrelated).

Signed-off-by: ochougul <ochougul@qti.qualcomm.com>