
Optimize CPU RAM peak memory during quantization #1386

Open

lvliang-intel wants to merge 13 commits into main from lvl/ram_usage_optimization
Conversation

@lvliang-intel
Contributor

Description

Optimize CPU RAM peak memory during quantization:

  1. Two optional CPU RAM optimizations, gated by `low_cpu_mem_usage`:
     • `cpu_stream_offload_blocks`: offload block weights to disk and load them on demand during block-wise quantization, then re-offload the quantized weights; restore everything at the end.
     • `cpu_stream_loss`: avoid caching block outputs by computing targets on-the-fly with a frozen copy of the block (requires `nblocks=1`).

  2. The quantization flow caches inputs once, then processes blocks sequentially, loading/offloading weights and optionally streaming the loss to keep peak CPU RAM low.
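The offload pattern in step 1 can be sketched framework-agnostically. The snippet below is a hypothetical illustration only: each "block" is a plain dict of weight lists serialized with `pickle`, whereas the actual PR operates on torch modules via the save/load/clear helpers added in `auto_round/utils/model.py`.

```python
import os
import pickle
import tempfile


def offload_block(weights: dict, path: str) -> None:
    """Serialize a block's weights to disk and drop them from RAM."""
    with open(path, "wb") as f:
        pickle.dump(weights, f)
    weights.clear()  # free the in-memory copy


def load_block(weights: dict, path: str) -> None:
    """Restore a block's weights from disk into the (emptied) dict."""
    with open(path, "rb") as f:
        weights.update(pickle.load(f))


blocks = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
tmpdir = tempfile.mkdtemp()

# Offload everything up front so at most one block is resident at a time.
paths = []
for i, blk in enumerate(blocks):
    p = os.path.join(tmpdir, f"block_{i}.pkl")
    offload_block(blk, p)
    paths.append(p)

# Block-wise pass: load on demand, "quantize", re-offload the result.
for blk, p in zip(blocks, paths):
    load_block(blk, p)
    blk["w"] = [round(x) for x in blk["w"]]  # stand-in for quantization
    offload_block(blk, p)

# Restore all blocks at the end.
for blk, p in zip(blocks, paths):
    load_block(blk, p)

print(blocks)  # [{'w': [1, 2]}, {'w': [3, 4]}]
```

Peak RAM now scales with the size of one block rather than the whole model, at the cost of the extra disk I/O visible in the timing numbers below.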

Test

Quantize Qwen/Qwen3-4B-Instruct-2507 with AutoRound (4-bit) and compare CPU RAM peak usage with different optimization options.

Optimization options:

  1. cpu_stream_offload_blocks: Offload block weights to disk, load on demand
  2. cpu_stream_loss: Compute loss on-the-fly using frozen block copy
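The `cpu_stream_loss` idea can likewise be sketched with stand-in objects. This is a hypothetical toy (a "block" is a one-weight class, the quantized weight is hard-coded); the real code deep-copies and freezes the full-precision torch block, then computes each sample's target on demand instead of caching every block output.

```python
import copy


class Block:
    """Toy stand-in for a transformer block with a single scalar weight."""

    def __init__(self, w: float):
        self.w = w

    def forward(self, x: float) -> float:
        return self.w * x


block = Block(w=0.5)
frozen = copy.deepcopy(block)  # frozen full-precision reference

samples = [1.0, 2.0, 3.0]
block.w = 0.45  # stand-in for the quantized weight

# The cached variant would precompute [frozen.forward(x) for x in samples],
# holding every target in RAM at once. Streaming computes each target
# on-the-fly, so only one target is alive at any time.
loss = 0.0
for x in samples:
    target = frozen.forward(x)  # computed on demand
    loss += (block.forward(x) - target) ** 2

print(loss)  # ~0.035
```

With real activations the cached targets can dominate peak RAM, which is why streaming them trades recomputation for memory.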

Summary: Peak RAM Comparison

| Configuration     | Peak RAM (GB) | Time (s) | RAM Saved |
|-------------------|---------------|----------|-----------|
| Baseline          | 24.29         | 1582.3   | baseline  |
| + offload_blocks  | 20.26         | 1609.1   | -4.03 GB  |
| + stream_loss     | 21.31         | 1364.0   | -2.98 GB  |
| All optimizations | 15.57         | 1269.3   | -8.72 GB  |
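A quick sanity check of the "RAM Saved" column, using the peak values from the table above (each entry is the baseline peak minus the configuration's peak):

```python
# Numbers taken directly from the Peak RAM comparison table.
baseline = 24.29
peaks = {
    "+ offload_blocks": 20.26,
    "+ stream_loss": 21.31,
    "All optimizations": 15.57,
}

# Saving = baseline peak - configuration peak, rounded to 2 decimals.
saved = {name: round(baseline - peak, 2) for name, peak in peaks.items()}
print(saved)  # {'+ offload_blocks': 4.03, '+ stream_loss': 2.98, 'All optimizations': 8.72}
```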

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Copilot AI review requested due to automatic review settings February 3, 2026 07:06
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes CPU RAM usage during model quantization by introducing two optional streaming strategies. The changes enable efficient quantization of large models by reducing peak memory consumption through block-wise weight offloading to disk and on-the-fly loss computation.

Changes:

  • Added CPU RAM optimization options (cpu_stream_offload_blocks and cpu_stream_loss) to reduce memory usage during quantization
  • Modified export logic to only save quantization config attributes that differ from scheme defaults
  • Added comprehensive test for CPU RAM optimization with memory tracking
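The export change in the second bullet boils down to a dict diff against the scheme defaults. A hypothetical sketch (the default values and helper name here are illustrative, not the PR's actual API):

```python
# Illustrative scheme defaults; the real defaults come from the
# quantization scheme used by the export code.
SCHEME_DEFAULTS = {"bits": 4, "group_size": 128, "sym": True}


def non_default_attrs(layer_cfg: dict, defaults: dict = SCHEME_DEFAULTS) -> dict:
    """Keep only the attributes that differ from the scheme defaults."""
    return {k: v for k, v in layer_cfg.items() if defaults.get(k) != v}


layer_cfg = {"bits": 8, "group_size": 128, "sym": True}
extra = non_default_attrs(layer_cfg)
print(extra)  # {'bits': 8}
```

Only the overriding attribute survives, which keeps the exported `extra_config` small when most layers follow the scheme defaults.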

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| auto_round/compressors/base.py | Core implementation of CPU RAM optimization with block offloading and streaming loss computation |
| auto_round/utils/model.py | Added utility functions for saving/loading/clearing module weights to support offloading |
| auto_round/export/export_to_autoround/export.py | Modified to only save non-default config attributes in extra_config |
| auto_round/export/export_to_autoround/export_to_fp8.py | Modified to only save non-default config attributes in extra_config |
| auto_round/export/export_to_autoround/export_to_nvfp_mxfp.py | Modified to only save non-default config attributes in extra_config |
| test/test_cuda/advanced/test_cpu_ram_optimization.py | New test file to validate CPU RAM optimization features |
| test/test_cuda/quantization/test_mix_bits.py | Updated assertions to verify only non-default attributes are saved |
| test/test_cpu/quantization/test_mix_bits.py | Updated assertions to verify only non-default attributes are saved |
| test/test_cuda/integrations/test_sglang.py | Updated test configuration and assertions |
| test/test_cpu/quantization/test_act_quantization.py | Removed assertions for default config values |
| test/test_cuda/export/test_gguf.py | Changed device specification from integer to string format |
| auto_round/auto_scheme/utils.py | Added fallback device handling for string device specifications |

WeiweiZhang1 and others added 4 commits February 3, 2026 07:17
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
yiliu30 and others added 7 commits February 4, 2026 03:10
Signed-off-by: yiliu30 <yi4.liu@intel.com>
…atible) (#1374)

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Co-authored-by: n1ck-guo <heng.guo@intel.com>
Co-authored-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>


6 participants