feat(0506): Layerwise export: API-driven, env-var-free, opt-in flag by vbaddi · Pull Request #1047 · quic/efficient-transformers

vbaddi · 2026-06-05T12:26:16Z

Summary

Encapsulate the layerwise export+stitch+compile orchestration loop (previously a 200+ line example with monkey-patches and a LAYERWISE_EXPORT env var) behind a single layerwise=True flag on .compile() / .export().

What's new

- New flag: layerwise: bool = False (+ layerwise_window_size: int = 1) on:
  - QEFFAutoModelForImageTextToText.compile() / .export()
  - QEFFAutoModelForCausalLM.compile() / .export()
- No env vars. LAYERWISE_EXPORT is gone. Control is a process-local class flag toggled by an internal context manager.
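
To illustrate the "class flag toggled by a context manager" idea, here is a minimal sketch. The class name BaseModel and the helper layerwise_mode are hypothetical stand-ins, not the names used in this PR; only the pattern (set a class attribute on entry, restore it on exit) is taken from the description above.

```python
from contextlib import contextmanager


class BaseModel:
    """Hypothetical stand-in for the real base class."""

    _layerwise_active = False  # process-local class flag, off by default


@contextmanager
def layerwise_mode(cls=BaseModel):
    """Turn the class flag on for the duration of the block, then restore it."""
    previous = cls._layerwise_active
    cls._layerwise_active = True
    try:
        yield
    finally:
        # Restore even if export/compile raises, so the flag never leaks.
        cls._layerwise_active = previous
```

Because the state lives on a class rather than in os.environ, nothing is inherited by child processes, which is presumably the point of dropping the env var.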

Backward compatibility

layerwise=False is the default.
Verified: pytest tests/unit_test/models/test_model_quickcheck.py -n auto → 130 passed, 3 skipped (was 121 / 3 before this PR).

Tests added

9 new tests covering windowing helpers, the supported/unsupported guard (parametrized over all 3 supported architectures),
env-var-not-leaked invariant, and the context manager's flag toggle.

Usage: Enable layerwise

 import torch                                                                                                                              
 from transformers import AutoConfig                                                                                                       
 from QEfficient import QEFFAutoModelForImageTextToText               

 config = AutoConfig.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

 model = QEFFAutoModelForImageTextToText.from_pretrained(             
     "Qwen/Qwen3-VL-235B-A22B-Instruct",
     attn_implementation="eager", kv_offload=True,
     config=config, torch_dtype=torch.float16,
 )                                                                    

 qpc = model.compile(                                                 
     batch_size=1, prefill_seq_len=1, ctx_len=4096,
     num_cores=16, num_devices=4,
     mxfp6_matmul=True, aic_enable_depth_first=True,
     skip_vision=True, split_retained_state_io=True,
     use_onnx_subfunctions=True, mos=1,
     layerwise=True,             # opt-in
     layerwise_window_size=1,    # layers per window
 )
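
For intuition about layerwise_window_size, the sketch below splits a layer count into consecutive windows. The function name layer_windows and the half-open (start, end) convention are assumptions made for illustration; the PR's actual windowing helpers may differ.

```python
def layer_windows(num_layers: int, window_size: int = 1) -> list[tuple[int, int]]:
    """Return half-open (start, end) layer ranges that cover [0, num_layers)."""
    if num_layers < 0:
        raise ValueError("num_layers must be >= 0")
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    return [
        # The last window is clipped when num_layers is not a multiple of window_size.
        (start, min(start + window_size, num_layers))
        for start in range(0, num_layers, window_size)
    ]
```

With window_size=1 every decoder layer would be exported on its own; larger windows mean fewer export passes at the cost of holding more layers in memory at once.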

Disable (default)

Just don't pass layerwise. Behavior is identical to before this PR.

  qpc = model.compile(batch_size=1, prefill_seq_len=1, ctx_len=4096, ...)

CausalLM (qwen3_moe)

  from QEfficient import QEFFAutoModelForCausalLM

  model = QEFFAutoModelForCausalLM.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")
  qpc = model.compile(                                                 
      prefill_seq_len=4, ctx_len=128, num_cores=16, num_devices=1,
      mxfp6_matmul=True, mxint8_kv_cache=True,
      layerwise=True, layerwise_window_size=1,
  )

Test Plan

pyenv activate qeff && python -m pytest -q tests/unit_test/models/test_model_quickcheck.py -n auto → 130 passed, 3 skipped

…flag

Move the layerwise export+stitch+compile orchestration loop into a single internal driver gated by a new layerwise=True kwarg on .compile() and .export(). The flag is opt-in; layerwise=False remains the default and the non-layerwise compile path is unchanged byte-for-byte.

The LAYERWISE_EXPORT environment variable is removed entirely; control flows purely through the API via a process-local QEFFBaseModel._layerwise_active flag toggled by an internal context manager.

Supported architectures are allowlisted (qwen3_vl_moe, qwen3_5_moe, qwen3_moe); other model types raise NotImplementedError when layerwise=True. Wired on QEFFAutoModelForImageTextToText (dual-QPC) and QEFFAutoModelForCausalLM.

Five existing layerwise example scripts collapse from 200-330 lines to ~60 lines each. The encapsulation module is documented as provisional and emits a one-shot DeprecationWarning.

test_model_quickcheck.py: 121 -> 127 passed, 3 skipped (unchanged) with five new tests covering the windowing helpers, the supported/unsupported guard, the env-var-not-leaked invariant, and the context manager's class-flag toggle.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
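
The allowlist guard and the one-shot warning described in this commit could look roughly like the sketch below. The three model types are taken from the commit message; the function name check_layerwise_supported, the module-level _warned flag, and emitting the warning from inside the guard are all assumptions for illustration.

```python
import warnings

# Model types listed in the commit message as supported for layerwise export.
SUPPORTED_LAYERWISE_MODEL_TYPES = frozenset({"qwen3_vl_moe", "qwen3_5_moe", "qwen3_moe"})

_warned = False  # hypothetical module-level latch for the one-shot warning


def check_layerwise_supported(model_type: str) -> None:
    """Raise for unsupported architectures; warn once that the API is provisional."""
    global _warned
    if model_type not in SUPPORTED_LAYERWISE_MODEL_TYPES:
        raise NotImplementedError(
            f"layerwise=True is not supported for model_type={model_type!r}"
        )
    if not _warned:
        warnings.warn(
            "layerwise export is provisional and may change",
            DeprecationWarning,
            stacklevel=2,
        )
        _warned = True
```

Note that Python hides DeprecationWarning by default outside of `__main__` and test runners, so callers may need `-W default` to see it.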

- Slim per-window export: truncate sin_cached/cos_cached to ctx_len and null embed_tokens / lm_head when unreached.
- Fix fp16 layerwise export: _export_layerwise synthesized inputs_embeds via torch.rand without a dtype.
- Suppress confusing "An unexpected error occurred while dumping the qconfig" message when compile short-circuits without producing a QPC (e.g. layerwise per-window export). dump_qconfig now skips when qpc_path is None and demotes real failures to logger.debug.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
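
The dump_qconfig change amounts to "skip when there is no QPC, and log real failures quietly". A minimal sketch of that shape follows; the function name dump_config_if_compiled, the qconfig.json file name, and the boolean return value are hypothetical and are not the library's actual implementation.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def dump_config_if_compiled(qpc_path: Optional[str], config: dict) -> bool:
    """Write the config next to the QPC; return True only if a file was written."""
    if qpc_path is None:
        # Compile short-circuited (e.g. a per-window export), so there is nothing to describe.
        return False
    try:
        out_file = Path(qpc_path) / "qconfig.json"  # hypothetical file name
        out_file.write_text(json.dumps(config, indent=2))
        return True
    except (OSError, TypeError, ValueError) as exc:
        # Demoted to debug so a failed dump does not look like a failed compile.
        logger.debug("Skipping qconfig dump: %s", exc)
        return False
```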

abhishek-singh591 · 2026-06-06T03:29:51Z

Backward compatibility is not yet guaranteed for Qwen 3.5 and Qwen3-VL-MoE. PR #1043 addresses this issue for Qwen3-VL, but Qwen 3.5 will still need additional modeling changes. We skipped the Qwen 3.5 unit tests, which is why all tests are currently passing.

- Add layerwise=True to from_pretrained (VLM + CausalLM). When set, the outer model is built on the meta device via from_config, so the caller's load no longer pulls full checkpoint weights into RAM.
- Stop polluting transformers.modeling_utils.PreTrainedModel with class vars. Window state lives in a module-local _LAYERWISE_STATE dict; the patched HF hooks (shard filter, init nuller) close over it and behave as no-ops when layerwise is inactive.
- Cache layerwise ONNX between runs: _export_layerwise short-circuits when final_data/merged_*.onnx already exists, and the stitch step reuses it.
- WIP: Hard-cap RoPE rows at 32K for now (was ctx_len) so changing ctx_len does not invalidate the export hash.
- Respect explicit low_cpu_mem_usage=True in from_pretrained for VLM and CausalLM (was unconditionally forced False); used by the layerwise factory for window-only weight materialization on sharded checkpoints.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

vbaddi assigned vbaddi and quic-rishinr Jun 5, 2026

vbaddi added the enhancement New feature or request label Jun 5, 2026

vbaddi assigned abhishek-singh591 Jun 5, 2026

vbaddi added the 1.22 Release 1.22 candidate label Jun 5, 2026

vbaddi changed the title ~~feat(0605): Layerwise export: API-driven, env-var-free, opt-in flag~~ feat(0506): Layerwise export: API-driven, env-var-free, opt-in flag Jun 5, 2026

vbaddi force-pushed the layerwise/api-encapsulation branch from fc7a3d9 to f842a30 Compare June 5, 2026 18:22

vbaddi wants to merge 3 commits into release/v1.22.0_tmp from layerwise/api-encapsulation


3 participants
