feat: add configurable residual processing to reduce peak VRAM usage #239

magiccodingman wants to merge 2 commits into p-e-w:master
Conversation
Summary of Changes (Gemini Code Assist)

This pull request introduces a suite of configurable memory optimizations aimed at reducing peak VRAM usage during the residual-analysis phase of the pipeline. By allowing users to offload intermediate tensors to CPU and compute residual means incrementally, it addresses common out-of-memory issues that occur during refusal-direction calculation, thereby enabling the full pipeline to run on systems with more constrained VRAM without altering the default behavior for existing users.
Code Review
This pull request introduces valuable memory optimization features for residual processing, which will allow users with less VRAM to use the tool. The changes are well-structured, with new behavior being opt-in and preserving the default functionality. The implementation is solid, with new configuration options, corresponding logic in the model, and updates to the main execution flow. I've found a few minor issues related to style guide adherence and code conciseness that I've detailed in my comments.
/gemini review
Code Review
This pull request introduces valuable memory optimizations for the residual analysis phase by adding new configuration options to offload tensors to the CPU and compute residual means incrementally. The implementation is well-structured and correctly integrated into the main execution path. My review includes one suggestion for a minor refactoring to improve code clarity and efficiency by avoiding a redundant calculation.
/gemini review
Code Review
This pull request introduces valuable, configurable memory optimizations for the residual analysis phase, which will help users with limited VRAM to avoid out-of-memory errors. The implementation is solid, with new settings correctly added and used throughout the codebase. The logic for memory-efficient mean calculation and offloading tensors to the CPU is well-designed and clearly implemented. My review focuses on a few minor style guide violations in the configuration file comments and a missing newline at the end of one of the modified files. Addressing these will improve code consistency and maintainability.
I’ve introduced a new set of settings, and I’ve also ensured that when all settings are left at their defaults, the execution path closely matches the original implementation in terms of algorithmic intent, output behavior, and overall structure. Additionally, I made improvements to the new code.

/gemini review
Code Review
This pull request introduces several well-implemented memory optimization features for residual analysis, such as offloading to CPU and incremental mean calculation. The changes are logically sound and improve the tool's flexibility for users with limited VRAM. My review focuses primarily on ensuring the new configuration options and their accompanying comments in config.default.toml adhere to the repository's style guide for consistency and clarity.
p-e-w left a comment
Thanks for this interesting PR!
Reducing VRAM usage is very valuable for the project. Could you quantify the peak memory reduction for an example setup?
I'm not sure we need most (or even all) of those settings though. Backwards compatibility for tensor computations is not a concern (we most certainly don't guarantee completely identical results across versions), and a 3-10% slowdown once (during the initial residual computations) is irrelevant given that the overall processing time is dominated by the trials.
# We don't need the residuals after computing refusal directions.
del good_residuals, bad_residuals, analyzer
empty_cache()
These two lines are supposed to solve the problem you are describing. Why do you think they don't work?
Those lines handle post-analysis cleanup, but they do not reduce peak memory during accumulation. The peak occurs earlier in get_residuals_batched() (and similarly get_logprobs_batched()), where multiple per-batch tensors can still be resident before torch.cat(...) completes. offload_outputs_to_cpu reduces that peak by moving intermediate outputs to CPU during accumulation instead of only cleaning them up afterward.
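The accumulation peak described here can be illustrated with a simplified sketch. Note this is not the actual `get_residuals_batched` from the codebase; `compute_residuals` is a hypothetical helper standing in for the per-batch forward pass:

```python
import torch

def get_residuals_batched(model, prompts, batch_size, offload_outputs_to_cpu=False):
    """Sketch: collect per-batch residuals, optionally moving each batch
    to CPU immediately so that the GPU holds at most one batch of
    intermediates before the final concatenation."""
    chunks = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        residuals = model.compute_residuals(batch)  # hypothetical helper, runs on GPU
        if offload_outputs_to_cpu:
            residuals = residuals.to("cpu")  # free VRAM as we go
        chunks.append(residuals)
    # With offloading enabled, the concatenation happens in system RAM,
    # so VRAM never holds all batches at once.
    return torch.cat(chunks, dim=0)
```

Without offloading, every chunk plus the concatenated result must coexist on the GPU at the moment `torch.cat` runs, which is exactly where the peak occurs.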
Thanks for the feedback; this helps clarify your project philosophy a lot. I added the extra configuration mainly to respect what I thought might be a need for backward compatibility and fine-grained control. Now that I understand your preference for simplicity, I’m happy to adjust.
I’ll gather and share real measurements shortly so we can make decisions based on data. Thanks again, I’m happy to align this with your direction.
I'm amazed by the quality of your analysis. This is super important work!
If the run OOMs because of the VRAM buildup, it can waste many hours of time. While Heretic now keeps snapshots so work won't be lost, the user may leave it running unsupervised, expecting it to proceed in the background when in reality it crashed because of an OOM. Also, many users will try to process models that are at or near the limit of their VRAM capacity, so even small amounts of memory count. Therefore, I think that:
"Don't leak VRAM" should be the basic expectation for a tool like Heretic, not a cool feature that you can flip on if you want it.

Overall, enabling this should actually improve performance in many cases, because batch sizes that used to OOM will now be determined to be feasible, often doubling or even quadrupling the processing speed.

BTW, I don't understand why the slowdown would depend on the dataset size. We only keep the refusal directions, which are the differences-of-means for "good" and "bad" prompts. So after the initial residual computations, the memory footprint (and thus the impact on trial performance) should be independent of whether there are 400 or 5000 prompts. Please clarify this; I don't want to miss something important here.
I think the best solution would be to have a batch size retry logic for the residuals. We start trying to get the residuals with the global batch size. If that OOMs, we retry with half that size, and so on. This should solve the memory problem without burdening the user with another settings switch. In many cases, the global setting will be fine, and if it isn't, we just lower it.

In fact, this problem can occur elsewhere in the system as well (e.g. if another process starts to consume VRAM that was previously available, invalidating the batch size). I wonder if we should simply catch OOMs everywhere.

It might also be a good idea to actually try batch sizes from the high end (i.e. starting with 128 and halving it successively, rather than with 1 and doubling it). Often, a batch size of 128 (the default maximum) already works, which would speed up processing because there is only one test run instead of 8. In fact, combining this idea with the one from the previous paragraph could completely eliminate the current batch size determination loop.
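The halving-on-OOM retry described above might be sketched like this. This is a hypothetical wrapper, not code from the repository; `fn` stands in for one residual-collection pass at a given batch size:

```python
import torch

def with_batch_size_retry(fn, initial_batch_size, min_batch_size=1):
    """Try fn(batch_size); on a CUDA OOM, halve the batch size and
    retry, down to a floor, then re-raise if even that fails."""
    batch_size = initial_batch_size
    while True:
        try:
            return fn(batch_size), batch_size
        except torch.cuda.OutOfMemoryError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further; surface the OOM
            batch_size //= 2
            torch.cuda.empty_cache()  # release cached blocks before retrying
```

Starting from the high end (e.g. 128) means the common case succeeds on the first attempt, and the fallback path doubles as the batch size determination.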
Thanks for the comment. I appreciate you appreciating the analytics lol!
TLDR: You were right, it doesn’t depend on dataset size in the trial phase.

So firstly, I made some mistakes. My bad. I reviewed the benchmarks again and realized I conflated some things, misinterpreted my own benchmarks from the trial phase, and compared some of the wrong numbers. The differences with CPU offload enabled were not as dramatic as I wrote. This is actually a good thing, because once clarified, the picture is better, not worse. Completely disregard my 14% and 28% statements from before. You're right about the datasets.

Also, just a note: I agree with your point that "don't leak VRAM" should be the default expectation, and I'm totally okay with defaulting CPU offload to true. It makes the process less of a headache, less prone to error, and overall easier by default.

Questions

To clarify:

1.) Are you saying that offloading should simply be enabled by default?

For #3 (auto-retry logic), I’m generally on board, and I agree with auto-retry as the default behavior. My only concern is avoiding repeated probing once a stable configuration is known, so I’d suggest allowing an optional override rather than requiring it. In practice, I use auto batch detection, and right now residual collection starts from the resolved main batch size. Given that, I’d suggest simplifying the setting:
This keeps things simple while preserving control when needed. Example:

0 → auto (match main batch)
32 → explicit override

Regarding auto-retry: I like the idea, but I think it’s better handled in a follow-up PR. Proper retry logic here can get messy (state, prompt reuse, memory cleanup), and I'm not familiar with that mechanism at the moment either. Personally, I'd propose keeping this PR simple, focused on stability and batching behavior. I'll then submit a follow-up PR once this is merged that introduces adaptive retry logic.

From a usability standpoint, I do think manual control should remain. Once a user knows their limits, forcing re-probing every run can be wasteful, especially on slower setups (e.g. CPU inference), where startup time matters.

Thanks again for getting back to me and giving feedback. I'll also get back to your other comments on the code and do cleanup/refactoring once you are able to confirm the direction you'd like to go.
Correct. It can be disabled by people who have the VRAM and want the extra few percent of performance. Although disabling it isn't always a performance gain even in those cases, because in multi-GPU setups the residuals will still need to be moved to the tensor's device.
Yes. It should be noted that ARA (#211) requires individual residuals for good prompts, but it is experimental for now and we will have a separate code path anyway. These two settings are unnecessary and their "on" behavior should be the default.
We don't need repeated probing. In hindsight, the batch size determination on startup was a mistake I made during Heretic's initial design. I have just described what I believe would be a better design in #248.

I think it would be best to remove the residual batch size configuration from this PR (just re-use the global batch size for now), then implement the design from #248 separately, which should resolve all related issues without stuffing this PR with extra complexity.
@p-e-w Sounds good to me, I completely agree. I'm going to put this PR back into draft. I'll then make some changes to the PR, get back to your original comments on the code (if they still apply), and mark it ready for review when the changes are done.

Update: Will get a clean version of this PR though.
@p-e-w Until auto-retry is implemented, removing the residual batch size configuration works for me. One note for the future auto-retry: for larger research datasets, batch sizes around 16–32 seem to be a good balance, so auto-retry could potentially bias toward that range after its first failure instead of trying higher values first.

I'm about to submit the cleaned version and will resolve the remaining review comments. Just need a little bit to make sure the refactored version is working correctly and passes my unit tests.

Quick question: would you prefer feature ideas / experiments to be opened as issues? I noticed discussions aren't enabled, so I just wanted to confirm the preferred place before putting effort into PR-ready implementations.
Force-pushed from 0424048 to 1126332 (Compare)
Force-pushed from e9ef079 to 4309c38 (Compare)
For reference, I’ve preserved the original version of this PR in a separate branch (which will remain unchanged), so the removed methods are still available there.

I’ll mark this PR as ready for review after completing unit tests. It should be stable, but I want to verify everything locally first.
No problem, it's an incremental process. Even the baseline from this PR is a big improvement, and the auto-retry, which comes later, will complete it.
Yes, please use issues. I like to prefix the titles of proposals that are invitations for discussion.

This PR introduces configurable memory optimizations for the residual-analysis phase, allowing users to reduce peak VRAM usage while preserving default behavior.
By default, everything works exactly as before. The new behavior is fully opt-in.
Why this exists
The largest avoidable VRAM spikes in this pipeline come from residual handling, not model execution.
Two key issues:

- During residual collection, per-batch tensors accumulate in VRAM before `torch.cat(...)` completes, creating a large peak.
- Intermediate tensors (residuals, logprobs) remain resident in VRAM afterward, adding pressure during trials.

This meant users who could run the model itself would still hit OOM either during residual collection or later during trials due to accumulated intermediate tensors.
This PR makes that phase configurable and more memory-efficient.
What’s added

1. `offload_outputs_to_cpu`

Controls whether intermediate tensors (residuals, logprobs) are moved to CPU across the pipeline (both residual analysis and trial execution):

- `false` – original behavior (keep in VRAM)
- `true` – move to RAM after computation

This prevents intermediate tensors from accumulating in VRAM over time, avoiding OOM during longer trial runs. Offloading happens after residual computation is complete, preserving the original computation path as closely as possible.
2. `residual_collection`

Controls how residuals are aggregated:

- `"full"` – store all residuals (original behavior)
- `"mean"` – compute the mean incrementally without storing full tensors

The `"mean"` mode significantly reduces peak memory usage during the residual pass.
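The incremental mean can be sketched as a running sum. This is an illustrative helper, not the actual implementation:

```python
import torch

def incremental_residual_mean(batches):
    """Sketch of residual_collection = "mean": keep only a running sum
    and a count instead of storing every residual tensor."""
    total = None
    count = 0
    for batch in batches:  # each batch: (batch_size, hidden_dim) residuals
        batch = batch.to(torch.float32)  # accumulate in fp32 for stability
        total = batch.sum(dim=0) if total is None else total + batch.sum(dim=0)
        count += batch.shape[0]
    return total / count  # matches torch.cat(batches).mean(dim=0) up to rounding
```

Only one `(hidden_dim,)` accumulator persists across batches, so peak memory no longer scales with the number of prompts.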
3. `residual_batch_size`

Controls the batch size used during residual collection when `residual_collection = "mean"`:

- `"default"` – use the resolved main batch size
- `"safe"` – use half the batch size (minimum 1)

Using smaller batch sizes can further reduce peak memory usage.
4. `residual_use_cache`

Controls whether KV caching is used during residual extraction:

- `true` – use the model’s default generation behavior (original behavior)
- `false` – disable the KV cache and recompute the full forward pass

Since residual extraction only generates a single token, KV caching is not required for correctness. Disabling it may slightly reduce VRAM usage, but can introduce small numerical differences due to changes in execution paths and kernel selection. For best reproducibility and consistency with previous results, the default (`true`) preserves the original behavior.

Impact
Default users: no change in behavior or output; everything works exactly as before.

With memory optimizations enabled: peak VRAM during residual analysis is substantially reduced. Depending on configuration, small numerical differences may occur due to batching and floating-point reduction order.
Why this matters
This change removes unnecessary VRAM pressure from residual handling across the entire pipeline, not just during the initial analysis phase.
It addresses both:

- the peak during residual collection, and
- the steady accumulation of intermediate tensors during trials.

It enables users who:

- can load and run the model itself, but
- hit OOM during residual analysis or trials

to run the full pipeline reliably without changing default behavior.
Example
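A possible `config.default.toml` fragment using the settings described above (illustrative values; the actual defaults preserve the original behavior):

```toml
offload_outputs_to_cpu = true   # move residuals/logprobs to RAM after computation
residual_collection = "mean"    # incremental mean instead of storing all residuals
residual_batch_size = "safe"    # half the main batch size (minimum 1)
residual_use_cache = true       # keep the KV cache (original behavior)
```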
Final Note
These configurations also allow me to run a 4-bit 27B–35B model on 2×3090s with 1k harmful and 4k harmless prompts, while achieving over 100 TPS at a batch size of ~32.
This was previously not possible without offloading the entire model to CPU.
These changes not only help my setup, but should allow many others to run the Heretic pipeline significantly faster on more limited hardware.