feat: add configurable residual processing to reduce peak VRAM usage #239

magiccodingman wants to merge 2 commits into p-e-w:master
Conversation
Summary of Changes (Gemini Code Assist)

This pull request introduces a suite of configurable memory optimizations aimed at reducing peak VRAM usage during the residual-analysis phase of the pipeline. By allowing users to offload intermediate tensors to CPU and compute residual means incrementally, it addresses common out-of-memory issues that occur during refusal-direction calculation, thereby enabling the full pipeline to run on systems with more constrained VRAM without altering the default behavior for existing users.
Code Review
This pull request introduces valuable memory optimization features for residual processing, which will allow users with less VRAM to use the tool. The changes are well-structured, with new behavior being opt-in and preserving the default functionality. The implementation is solid, with new configuration options, corresponding logic in the model, and updates to the main execution flow. I've found a few minor issues related to style guide adherence and code conciseness that I've detailed in my comments.
/gemini review
Code Review
This pull request introduces valuable memory optimizations for the residual analysis phase by adding new configuration options to offload tensors to the CPU and compute residual means incrementally. The implementation is well-structured and correctly integrated into the main execution path. My review includes one suggestion for a minor refactoring to improve code clarity and efficiency by avoiding a redundant calculation.
/gemini review
Code Review
This pull request introduces valuable, configurable memory optimizations for the residual analysis phase, which will help users with limited VRAM to avoid out-of-memory errors. The implementation is solid, with new settings correctly added and used throughout the codebase. The logic for memory-efficient mean calculation and offloading tensors to the CPU is well-designed and clearly implemented. My review focuses on a few minor style guide violations in the configuration file comments and a missing newline at the end of one of the modified files. Addressing these will improve code consistency and maintainability.
I’ve introduced a new set of settings, and I’ve also ensured that when all settings are left at their defaults, the execution path closely matches the original implementation in terms of algorithmic intent, output behavior, and overall structure. Additionally, I made improvements to the new code.

/gemini review
Code Review
This pull request introduces several well-implemented memory optimization features for residual analysis, such as offloading to CPU and incremental mean calculation. The changes are logically sound and improve the tool's flexibility for users with limited VRAM. My review focuses primarily on ensuring the new configuration options and their accompanying comments in config.default.toml adhere to the repository's style guide for consistency and clarity.
p-e-w left a comment
Thanks for this interesting PR!
Reducing VRAM usage is very valuable for the project. Could you quantify the peak memory reduction for an example setup?
I'm not sure we need most (or even all) of those settings though. Backwards compatibility for tensor computations is not a concern (we most certainly don't guarantee completely identical results across versions), and a 3-10% slowdown once (during the initial residual computations) is irrelevant given that the overall processing time is dominated by the trials.
# We don't need the residuals after computing refusal directions.
del good_residuals, bad_residuals, analyzer
empty_cache()
These two lines are supposed to solve the problem you are describing. Why do you think they don't work?
Those lines handle post-analysis cleanup, but they do not reduce peak memory during accumulation. The peak occurs earlier in get_residuals_batched() (and similarly get_logprobs_batched()), where multiple per-batch tensors can still be resident before torch.cat(...) completes. offload_outputs_to_cpu reduces that peak by moving intermediate outputs to CPU during accumulation instead of only cleaning them up afterward.
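The accumulation peak described here can be illustrated with a simplified sketch. Note this is not the actual `get_residuals_batched` from the codebase; `compute_residuals` is a hypothetical helper standing in for the per-batch forward pass:

```python
import torch

def get_residuals_batched(model, prompts, batch_size, offload_outputs_to_cpu=False):
    """Sketch: collect per-batch residuals, optionally moving each batch
    to CPU immediately so that the GPU holds at most one batch of
    intermediates before the final concatenation."""
    chunks = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        residuals = model.compute_residuals(batch)  # hypothetical helper, runs on GPU
        if offload_outputs_to_cpu:
            residuals = residuals.to("cpu")  # free VRAM as we go
        chunks.append(residuals)
    # With offloading enabled, the concatenation happens in system RAM,
    # so VRAM never holds all batches at once.
    return torch.cat(chunks, dim=0)
```

Without offloading, every chunk plus the concatenated result must coexist on the GPU at the moment `torch.cat` runs, which is exactly where the peak occurs.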
Thanks for the feedback; this helps clarify your project philosophy a lot. I added the extra configuration mainly to respect what I thought might be a need for backward compatibility and fine-grained control. Now that I understand your preference for simplicity, I’m happy to adjust.
I’ll gather and share real measurements shortly so we can make decisions based on data. Thanks again, I’m happy to align this with your direction.
I'm amazed by the quality of your analysis. This is super important work!
If the run OOMs because of the VRAM buildup, it can waste many hours of time. While Heretic now keeps snapshots so work won't be lost, the user may leave it running unsupervised, expecting it to proceed in the background when in reality it crashed because of an OOM. Also, many users will try to process models that are at or near the limit of their VRAM capacity, so even small amounts of memory count. Therefore, I think that:
"Don't leak VRAM" should be the basic expectation for a tool like Heretic, not a cool feature that you can flip on if you want it.

Overall, enabling this should actually improve performance in many cases, because batch sizes that used to OOM will now be determined to be feasible, often doubling or even quadrupling the processing speed.

BTW, I don't understand why the slowdown would depend on the dataset size. We only keep the refusal directions, which are the differences-of-means for "good" and "bad" prompts. So after the initial residual computations, the memory footprint (and thus the impact on trial performance) should be independent of whether there are 400 or 5000 prompts. Please clarify this; I don't want to miss something important here.
I think the best solution would be to have a batch size retry logic for the residuals. We start trying to get the residuals with the global batch size. If that OOMs, we retry with half that size, and so on. This should solve the memory problem without burdening the user with another settings switch. In many cases, the global setting will be fine, and if it isn't, we just lower it.

In fact, this problem can occur elsewhere in the system as well (e.g. if another process starts to consume VRAM that was previously available, invalidating the batch size). I wonder if we should simply catch OOMs everywhere.

It might also be a good idea to actually try batch sizes from the high end (i.e. starting with 128 and halving it successively, rather than with 1 and doubling it). Often, a batch size of 128 (the default maximum) already works, which would speed up processing because there is only one test run instead of 8. In fact, combining this idea with the one from the previous paragraph could completely eliminate the current batch size determination loop.
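The halving-on-OOM retry described above might be sketched like this. This is a hypothetical wrapper, not code from the repository; `fn` stands in for one residual-collection pass at a given batch size:

```python
import torch

def with_batch_size_retry(fn, initial_batch_size, min_batch_size=1):
    """Try fn(batch_size); on a CUDA OOM, halve the batch size and
    retry, down to a floor, then re-raise if even that fails."""
    batch_size = initial_batch_size
    while True:
        try:
            return fn(batch_size), batch_size
        except torch.cuda.OutOfMemoryError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further; surface the OOM
            batch_size //= 2
            torch.cuda.empty_cache()  # release cached blocks before retrying
```

Starting from the high end (e.g. 128) means the common case succeeds on the first attempt, and the fallback path doubles as the batch size determination.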
Thanks for the comment. I appreciate you appreciating the analytics lol!
TLDR: You were right, it doesn’t depend on dataset size in the trial phase.

So firstly, I made some mistakes. My bad. I reviewed the benchmarks again and realized I conflated some things, misinterpreted my own benchmarks from the trial phase, and compared some of the wrong numbers. The differences with CPU offload enabled were not as dramatic as I wrote. This is actually a good thing, because once clarified, the picture is better, not worse. Completely disregard my 14% and 28% statements from before. You're right about the datasets.

Also, just a note: I agree with your point that "don't leak VRAM" should be the default expectation, and I'm totally okay with defaulting CPU offload to true. It makes the process less of a headache, less prone to error, and overall easier by default.

Questions

To clarify:

1.) Are you saying that offloading should simply be enabled by default?

For #3 (auto-retry logic), I’m generally on board, and I agree with auto-retry as the default behavior. My only concern is avoiding repeated probing once a stable configuration is known, so I’d suggest allowing an optional override rather than requiring it. In practice, I use auto batch detection, and right now residual collection starts from the resolved main batch size. Given that, I’d suggest simplifying the setting:
This keeps things simple while preserving control when needed. Example:

0 → auto (match main batch)
32 → explicit override

Regarding auto-retry: I like the idea, but I think it’s better handled in a follow-up PR. Proper retry logic here can get messy (state, prompt reuse, memory cleanup), and I'm not familiar with that mechanism at the moment either. Personally, I'd propose keeping this PR simple, focused on stability and batching behavior. I'll then submit a follow-up PR once this is merged that introduces adaptive retry logic.

From a usability standpoint, I do think manual control should remain. Once a user knows their limits, forcing re-probing every run can be wasteful, especially on slower setups (e.g. CPU inference), where startup time matters.

Thanks again for getting back to me and giving feedback. I'll also get back to your other comments on the code and do cleanup/refactoring once you are able to confirm the direction you'd like to go.
Correct. It can be disabled by people who have the VRAM and want the extra few percent of performance. Although disabling it isn't always a performance gain even in those cases, because in multi-GPU setups the residuals will still need to be moved to the tensor's device.
Yes. It should be noted that ARA (#211) requires individual residuals for good prompts, but it is experimental for now and we will have a separate code path anyway. These two settings are unnecessary and their "on" behavior should be the default.
We don't need repeated probing. In hindsight, the batch size determination on startup was a mistake I made during Heretic's initial design. I have just described what I believe would be a better design in #248.

I think it would be best to remove the residual batch size configuration from this PR (just re-use the global batch size for now), then implement the design from #248 separately, which should resolve all related issues without stuffing this PR with extra complexity.
@p-e-w Sounds good to me, I completely agree. I'm going to put this PR back into draft. I'll then make some changes to the PR, get back to your original comments on the code (if they still apply), and mark it ready for review when the changes are done.

Update: Will get a clean version of this PR though.
@p-e-w Until auto-retry is implemented, removing the residual batch size configuration works for me. One note for the future auto-retry: for larger research datasets, batch sizes around 16–32 seem to be a good balance, so auto-retry could potentially bias toward that range after its first failure instead of trying higher values first.

I'm about to submit the cleaned version and will resolve the remaining review comments. Just need a little bit to make sure the refactored version is working correctly and passes my unit tests.

Quick question: would you prefer feature ideas / experiments to be opened as issues? I noticed discussions aren't enabled, so I just wanted to confirm the preferred place before putting effort into PR-ready implementations.
Force-pushed from 0424048 to 1126332 (Compare)
Force-pushed from e9ef079 to 4309c38 (Compare)
For reference, I’ve preserved the original version of this PR in a separate branch (which will remain unchanged), so the removed methods are still available there.

I’ll mark this PR as ready for review after completing unit tests. It should be stable, but I want to verify everything locally first.
No problem, it's an incremental process. Even the baseline from this PR is a big improvement, and the auto-retry, which comes later, will complete it.
Yes, please use issues. I like to prefix the titles of proposals that are invitations for discussion.

This PR introduces configurable memory optimizations for the residual-analysis phase, allowing users to reduce peak VRAM usage while preserving default behavior.
By default, everything works exactly as before. The new behavior is fully opt-in.
Why this exists
The largest avoidable VRAM spikes in this pipeline come from residual handling, not model execution.
Two key issues:

- During residual collection, per-batch tensors accumulate in VRAM before `torch.cat(...)` completes, creating a large peak.
- Intermediate tensors (residuals, logprobs) remain resident in VRAM afterward, adding pressure during trials.

This meant users who could run the model itself would still hit OOM either during residual collection or later during trials due to accumulated intermediate tensors.
This PR makes that phase configurable and more memory-efficient.
What’s added

1. `offload_outputs_to_cpu`

Controls whether intermediate tensors (residuals, logprobs) are moved to CPU across the pipeline (both residual analysis and trial execution):

- `false` – original behavior (keep in VRAM)
- `true` – move to RAM after computation

This prevents intermediate tensors from accumulating in VRAM over time, avoiding OOM during longer trial runs. Offloading happens after residual computation is complete, preserving the original computation path as closely as possible.
2. `residual_collection`

Controls how residuals are aggregated:

- `"full"` – store all residuals (original behavior)
- `"mean"` – compute the mean incrementally without storing full tensors

The `"mean"` mode significantly reduces peak memory usage during the residual pass.
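The incremental mean can be sketched as a running sum. This is an illustrative helper, not the actual implementation:

```python
import torch

def incremental_residual_mean(batches):
    """Sketch of residual_collection = "mean": keep only a running sum
    and a count instead of storing every residual tensor."""
    total = None
    count = 0
    for batch in batches:  # each batch: (batch_size, hidden_dim) residuals
        batch = batch.to(torch.float32)  # accumulate in fp32 for stability
        total = batch.sum(dim=0) if total is None else total + batch.sum(dim=0)
        count += batch.shape[0]
    return total / count  # matches torch.cat(batches).mean(dim=0) up to rounding
```

Only one `(hidden_dim,)` accumulator persists across batches, so peak memory no longer scales with the number of prompts.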
3. `residual_batch_size`

Controls the batch size used during residual collection when `residual_collection = "mean"`:

- `"default"` – use the resolved main batch size
- `"safe"` – use half the batch size (minimum 1)

Using smaller batch sizes can further reduce peak memory usage.
4. `residual_use_cache`

Controls whether KV caching is used during residual extraction:

- `true` – use the model’s default generation behavior (original behavior)
- `false` – disable the KV cache and recompute the full forward pass

Since residual extraction only generates a single token, KV caching is not required for correctness. Disabling it may slightly reduce VRAM usage, but can introduce small numerical differences due to changes in execution paths and kernel selection. For best reproducibility and consistency with previous results, the default (`true`) preserves the original behavior.

Impact
Default users: no change in behavior or output; everything works exactly as before.

With memory optimizations enabled: peak VRAM during residual analysis is substantially reduced. Depending on configuration, small numerical differences may occur due to batching and floating-point reduction order.
Why this matters
This change removes unnecessary VRAM pressure from residual handling across the entire pipeline, not just during the initial analysis phase.
It addresses both:

- the peak during residual collection, and
- the steady accumulation of intermediate tensors during trials.

It enables users who:

- can load and run the model itself, but
- hit OOM during residual analysis or trials

to run the full pipeline reliably without changing default behavior.
Example
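A possible `config.default.toml` fragment using the settings described above (illustrative values; the actual defaults preserve the original behavior):

```toml
offload_outputs_to_cpu = true   # move residuals/logprobs to RAM after computation
residual_collection = "mean"    # incremental mean instead of storing all residuals
residual_batch_size = "safe"    # half the main batch size (minimum 1)
residual_use_cache = true       # keep the KV cache (original behavior)
```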
Final Note
These configurations also allow me to run a 4-bit 27B–35B model on 2×3090s with 1k harmful and 4k harmless prompts, while achieving over 100 TPS at a batch size of ~32.
This was previously not possible without offloading the entire model to CPU.
These changes not only help my setup, but should allow many others to run the Heretic pipeline significantly faster on more limited hardware.