Optimize CPU RAM peak memory during quantization#1386
Open
lvliang-intel wants to merge 13 commits into main from
Conversation
Pull request overview
This PR optimizes CPU RAM usage during model quantization by introducing two optional streaming strategies. The changes enable efficient quantization of large models by reducing peak memory consumption through block-wise weight offloading to disk and on-the-fly loss computation.
Changes:
- Added CPU RAM optimization options (`cpu_stream_offload_blocks` and `cpu_stream_loss`) to reduce memory usage during quantization (see the usage sketch after this list)
- Modified export logic to only save quantization config attributes that differ from scheme defaults
- Added a comprehensive test for CPU RAM optimization with memory tracking
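For context, a minimal usage sketch of how these options might be enabled. The `AutoRound(model, tokenizer, ...)` / `quantize()` / `save_quantized()` calls are the library's documented API; passing the two new flags as constructor keyword arguments is an assumption based on this PR's description, not confirmed code:

```python
# Hedged sketch: cpu_stream_offload_blocks and cpu_stream_loss are the new
# options from this PR; passing them as constructor kwargs is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-4B-Instruct-2507"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    low_cpu_mem_usage=True,          # gate for both streaming strategies
    cpu_stream_offload_blocks=True,  # stream block weights through disk
    cpu_stream_loss=True,            # on-the-fly targets; requires nblocks=1
    nblocks=1,
)
autoround.quantize()
autoround.save_quantized("./qwen3-4b-int4", format="auto_round")
```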
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Summary per file:
| File | Description |
|---|---|
| auto_round/compressors/base.py | Core implementation of CPU RAM optimization with block offloading and streaming loss computation |
| auto_round/utils/model.py | Added utility functions for saving/loading/clearing module weights to support offloading (see the illustrative sketch after this table) |
| auto_round/export/export_to_autoround/export.py | Modified to only save non-default config attributes in extra_config |
| auto_round/export/export_to_autoround/export_to_fp8.py | Modified to only save non-default config attributes in extra_config |
| auto_round/export/export_to_autoround/export_to_nvfp_mxfp.py | Modified to only save non-default config attributes in extra_config |
| test/test_cuda/advanced/test_cpu_ram_optimization.py | New test file to validate CPU RAM optimization features |
| test/test_cuda/quantization/test_mix_bits.py | Updated assertions to verify only non-default attributes are saved |
| test/test_cpu/quantization/test_mix_bits.py | Updated assertions to verify only non-default attributes are saved |
| test/test_cuda/integrations/test_sglang.py | Updated test configuration and assertions |
| test/test_cpu/quantization/test_act_quantization.py | Removed assertions for default config values |
| test/test_cuda/export/test_gguf.py | Changed device specification from integer to string format |
| auto_round/auto_scheme/utils.py | Added fallback device handling for string device specifications |
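For illustration, the module-weight helpers added in auto_round/utils/model.py could look roughly like the following. The names, signatures, and the empty-tensor trick are assumptions for exposition, not the PR's actual code:

```python
# Illustrative helpers for block-weight offloading (names/signatures assumed).
import torch


def save_module_weights(module: torch.nn.Module, path: str) -> None:
    """Persist a module's weights to disk so their RAM can be reclaimed."""
    torch.save(module.state_dict(), path)


def clear_module_weights(module: torch.nn.Module) -> None:
    """Drop weight storage by shrinking each parameter to an empty tensor."""
    for param in module.parameters():
        param.data = torch.empty(0, dtype=param.dtype, device=param.device)


def load_module_weights(module: torch.nn.Module, path: str) -> None:
    """Reload offloaded weights; assign=True replaces the emptied tensors."""
    state_dict = torch.load(path, map_location="cpu")
    module.load_state_dict(state_dict, assign=True)
```

Note that `load_state_dict(..., assign=True)` requires PyTorch >= 2.1; it assigns the loaded tensors directly instead of copying into the (now empty) parameters.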
n1ck-guo reviewed on Feb 4, 2026
Description
Optimize CPU RAM peak memory during quantization:
Two optional CPU RAM optimizations, gated by `low_cpu_mem_usage`:
- `cpu_stream_offload_blocks`: offload block weights to disk and load them on demand during block-wise quantization, then re-offload quantized weights; restore at the end (see the offload-loop sketch below).
- `cpu_stream_loss`: avoid caching block outputs by computing targets on-the-fly with a frozen block copy (requires `nblocks=1`; see the streaming-loss sketch after this list).
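A minimal sketch of the streaming-loss idea, assuming `nblocks=1`, an MSE reconstruction objective, and generic `block`/`batches`/`optimizer` objects (all assumptions for illustration, not the compressor's actual internals):

```python
# Streaming-loss sketch: targets come from a frozen block copy on demand,
# so no cache of block outputs ever has to live in CPU RAM.
import copy

import torch
import torch.nn.functional as F


def tune_block_streaming(block, batches, optimizer):
    frozen_block = copy.deepcopy(block).eval()   # frozen reference copy
    for p in frozen_block.parameters():
        p.requires_grad_(False)

    for inputs in batches:
        with torch.no_grad():
            target = frozen_block(inputs)        # computed on the fly
        loss = F.mse_loss(block(inputs), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```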
The quantization flow caches inputs once, then processes blocks sequentially, loading/offloading weights and optionally streaming loss to keep peak CPU RAM low.
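And a sketch of the block-wise flow described above, reusing the hypothetical helpers from the earlier snippet; `quantize_block` stands in for whatever per-block tuning the compressor performs:

```python
# Block-wise offload loop sketch: only one block's weights are resident at a
# time; everything else lives on disk until it is needed.
import os


def quantize_blocks_low_ram(blocks, cached_inputs, offload_dir, quantize_block):
    os.makedirs(offload_dir, exist_ok=True)
    paths = [os.path.join(offload_dir, f"block_{i}.pt") for i in range(len(blocks))]

    # Offload every block up front so peak RAM drops immediately.
    for block, path in zip(blocks, paths):
        save_module_weights(block, path)
        clear_module_weights(block)

    # Quantize one block at a time against the inputs cached earlier.
    for block, path in zip(blocks, paths):
        load_module_weights(block, path)
        quantize_block(block, cached_inputs)
        save_module_weights(block, path)   # re-offload the quantized weights
        clear_module_weights(block)

    # Restore all quantized weights at the end.
    for block, path in zip(blocks, paths):
        load_module_weights(block, path)
```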
Test
Quantize Qwen/Qwen3-4B-Instruct-2507 with AutoRound (4-bit) and compare CPU RAM peak usage with different optimization options.
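One plausible way to capture the peak numbers is to sample the process RSS on a background thread while quantization runs; the PR's actual test may measure differently, so this psutil-based approach is an assumption:

```python
# Track peak CPU RSS (in GiB) while a callable runs (measurement method assumed).
import threading
import time

import psutil


def peak_rss_gib(fn, interval=0.1):
    proc = psutil.Process()
    peak = proc.memory_info().rss
    stop = threading.Event()

    def sampler():
        nonlocal peak
        while not stop.is_set():
            peak = max(peak, proc.memory_info().rss)
            time.sleep(interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    try:
        fn()
    finally:
        stop.set()
        t.join()
    return peak / 1024**3


# e.g. peak = peak_rss_gib(autoround.quantize)
```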
Optimization options:
Summary: Peak RAM Comparison
Type of Change
Related Issues
Fixes or relates to #
Checklist Before Submitting