feat(0506): Layerwise export: API-driven, env-var-free, opt-in flag #1047

Open
vbaddi wants to merge 3 commits into release/v1.22.0_tmp from layerwise/api-encapsulation

@vbaddi vbaddi commented Jun 5, 2026

Summary

Encapsulate the layerwise export+stitch+compile orchestration loop (previously a 200+ line example with monkey-patches and a LAYERWISE_EXPORT env var) behind a single layerwise=True flag on .compile() / .export().

What's new

- New flag: layerwise: bool = False (+ layerwise_window_size: int = 1) on:
  - QEFFAutoModelForImageTextToText.compile() / .export()
  - QEFFAutoModelForCausalLM.compile() / .export()
- No env vars. LAYERWISE_EXPORT is gone. Control is a process-local class flag toggled by an internal context manager.

Backward compatibility

  • layerwise=False is the default.
  • Verified: pytest tests/unit_test/models/test_model_quickcheck.py -n auto → 130 passed, 3 skipped (was 121 / 3 before this PR).

Tests added

  • 9 new tests covering the windowing helpers, the supported/unsupported guard (parametrized over all 3 supported architectures), the env-var-not-leaked invariant, and the context manager's flag toggle.
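
To illustrate what a windowing helper of this kind might do, the sketch below splits a layer count into consecutive windows. `layer_windows` is a hypothetical name, and the half-open `(start, end)` representation is an assumption; the PR's actual helpers may differ:

```python
from typing import List, Tuple


def layer_windows(num_layers: int, window_size: int = 1) -> List[Tuple[int, int]]:
    """Split [0, num_layers) into consecutive half-open (start, end) windows."""
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    # The last window is shortened when window_size does not divide num_layers.
    return [
        (start, min(start + window_size, num_layers))
        for start in range(0, num_layers, window_size)
    ]
```

With the default `window_size=1`, each window holds exactly one layer, which matches the `layerwise_window_size: int = 1` default described above.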

Usage: Enable layerwise

  import torch
  from transformers import AutoConfig
  from QEfficient import QEFFAutoModelForImageTextToText

  config = AutoConfig.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

  model = QEFFAutoModelForImageTextToText.from_pretrained(
      "Qwen/Qwen3-VL-235B-A22B-Instruct",
      attn_implementation="eager", kv_offload=True,
      config=config, torch_dtype=torch.float16,
  )

  qpc = model.compile(
      batch_size=1, prefill_seq_len=1, ctx_len=4096,
      num_cores=16, num_devices=4,
      mxfp6_matmul=True, aic_enable_depth_first=True,
      skip_vision=True, split_retained_state_io=True,
      use_onnx_subfunctions=True, mos=1,
      layerwise=True,           # opt-in
      layerwise_window_size=1,  # layers per window
  )

Disable (default)

Just don't pass layerwise. Behavior is identical to before this PR.

  qpc = model.compile(batch_size=1, prefill_seq_len=1, ctx_len=4096, ...)

CausalLM (qwen3_moe)

  from QEfficient import QEFFAutoModelForCausalLM

  model = QEFFAutoModelForCausalLM.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")
  qpc = model.compile(
      prefill_seq_len=4, ctx_len=128, num_cores=16, num_devices=1,
      mxfp6_matmul=True, mxint8_kv_cache=True,
      layerwise=True, layerwise_window_size=1,
  )

Test Plan

  • pyenv activate qeff && python -m pytest -q tests/unit_test/models/test_model_quickcheck.py -n auto → 130 passed, 3 skipped

@vbaddi vbaddi added the enhancement New feature or request label Jun 5, 2026
@vbaddi vbaddi added the 1.22 Release 1.22 candidate label Jun 5, 2026
@vbaddi vbaddi changed the title from "feat(0605): Layerwise export: API-driven, env-var-free, opt-in flag" to "feat(0506): Layerwise export: API-driven, env-var-free, opt-in flag" Jun 5, 2026
…flag

Move the layerwise export+stitch+compile orchestration loop into a
single internal driver gated by a new layerwise=True kwarg on .compile() and
.export(). The flag is opt-in; layerwise=False remains the default and the
non-layerwise compile path is unchanged byte-for-byte.

The LAYERWISE_EXPORT environment variable is removed entirely; control flows
purely through the API via a process-local QEFFBaseModel._layerwise_active
flag toggled by an internal context manager. Supported architectures are
allowlisted (qwen3_vl_moe, qwen3_5_moe, qwen3_moe); other model types raise
NotImplementedError when layerwise=True.
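
A minimal sketch of such an allowlist guard, using the three model types named above. The constant and function names are hypothetical and not taken from the PR:

```python
# Model types listed in this commit message as supported for layerwise export.
SUPPORTED_LAYERWISE_MODEL_TYPES = frozenset(
    {"qwen3_vl_moe", "qwen3_5_moe", "qwen3_moe"}
)


def check_layerwise_supported(model_type: str, layerwise: bool) -> None:
    """Raise NotImplementedError if layerwise is requested for an unlisted type."""
    if layerwise and model_type not in SUPPORTED_LAYERWISE_MODEL_TYPES:
        supported = ", ".join(sorted(SUPPORTED_LAYERWISE_MODEL_TYPES))
        raise NotImplementedError(
            f"layerwise=True is not supported for model_type={model_type!r}; "
            f"supported types: {supported}"
        )
```

The check only fires when `layerwise` is true, so the default path is unaffected for every model type.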

Wired on QEFFAutoModelForImageTextToText (dual-QPC) and
QEFFAutoModelForCausalLM. Five existing layerwise example scripts
collapse from 200-330 lines to ~60 lines each. The encapsulation module is
documented as provisional and emits a one-shot DeprecationWarning.

test_model_quickcheck.py: 121 -> 127 passed, 3 skipped (unchanged) with
five new tests covering the windowing helpers, the supported/unsupported
guard, the env-var-not-leaked invariant, and the context manager's
class-flag toggle.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
@vbaddi vbaddi force-pushed the layerwise/api-encapsulation branch from fc7a3d9 to f842a30 Compare June 5, 2026 18:22
  - Slim per-window export: truncate sin_cached/cos_cached to ctx_len and
    null embed_tokens / lm_head when unreached.

  - Fix fp16 layerwise export: _export_layerwise synthesized
    inputs_embeds via torch.rand without a dtype.

  - Suppress confusing "An unexpected error occurred while dumping the
    qconfig" message when compile short-circuits without producing a QPC
    (e.g. layerwise per-window export). dump_qconfig now skips when
    qpc_path is None and demotes real failures to logger.debug.
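
The dump_qconfig behavior in the last item can be sketched roughly as follows. This is an illustration only: `dump_qconfig_sketch`, the `qconfig.json` file name, and the choice to write inside the QPC directory are assumptions, not the library's actual implementation:

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def dump_qconfig_sketch(config: dict, qpc_path: Optional[str]) -> bool:
    """Write the config into the QPC directory; return True if a file was written."""
    if qpc_path is None:
        # Compile short-circuited without a QPC (e.g. a per-window export),
        # so there is nothing to dump and nothing to report.
        return False
    try:
        target = Path(qpc_path) / "qconfig.json"  # hypothetical file name
        target.write_text(json.dumps(config, indent=2))
        return True
    except (OSError, TypeError, ValueError) as exc:
        # Real failures are logged at debug level instead of alarming the user.
        logger.debug("Skipping qconfig dump: %s", exc)
        return False
```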

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
@abhishek-singh591

Backward compatibility is not yet guaranteed for Qwen 3.5 and Qwen3-VL-MoE. PR #1043 addresses this issue for Qwen3-VL, but Qwen 3.5 will still need additional modeling changes. We skipped the Qwen 3.5 unit tests, which is why all tests are currently passing.

  - Add layerwise=True to from_pretrained (VLM + CausalLM). When set, the
    outer model is built on the meta device via from_config, so the caller's
    load no longer pulls full checkpoint weights into RAM.

  - Stop polluting transformers.modeling_utils.PreTrainedModel with class
    vars. Window state lives in a module-local _LAYERWISE_STATE dict; the
    patched HF hooks (shard filter, init nuller) close over it and behave
    as no-ops when layerwise is inactive.

  - Cache layerwise ONNX between runs: _export_layerwise short-circuits
    when final_data/merged_*.onnx already exists, and the stitch step
    reuses it.

  - WIP: Hard-cap RoPE rows at 32K for now (was ctx_len), so changing
    ctx_len does not invalidate the export hash.

  - Respect explicit low_cpu_mem_usage=True in from_pretrained for VLM and
    CausalLM (was unconditionally forced False); used by the layerwise
    factory for window-only weight materialization on sharded checkpoints.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Labels

1.22 (Release 1.22 candidate), enhancement (New feature or request)


3 participants