feat: reproducibility when saving & uploading a heretic model#191

Open
Vinay-Umrethe wants to merge 65 commits into p-e-w:master from Vinay-Umrethe:feature/reproducibility

Conversation

@Vinay-Umrethe
Contributor

Implement Reproducibility Suite

This PR adds a reproducibility suite that lets users capture the exact environment, packages, OS, and random states used during an abliteration optimization run.

It uses a new seed configuration parameter that, when set, enforces deterministic behavior across everything that matters: the Python, NumPy, and PyTorch (CPU & GPU) RNGs, and the Optuna sampler. If no seed is specified, Heretic now generates a random one automatically at startup so it can still be recorded for that run.

A new, optional reproduce/ folder can be created during model export (local save or HF upload), containing the exact TOML configuration, a snapshot of installed package versions, and a summary of the hardware/software environment.

On the technical side, I'm using the .safetensors format for storing RNG states. This avoids the old, insecure pickle-based (.pkl) checkpoints that are still common in other AI/ML training code. It stores torch tensors efficiently and wraps the non-tensor states (like the Python/NumPy RNG tuples) into JSON metadata within the secure file header.
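A minimal stdlib-only sketch of the non-tensor part; `encode_rng_state`/`decode_rng_state` are hypothetical names, and in the real suite the resulting strings would live in the safetensors header metadata rather than a standalone file:

```python
import json
import random


def encode_rng_state(state) -> str:
    # random.getstate() is a nested tuple; JSON turns tuples into lists,
    # which decode_rng_state converts back before restoring.
    return json.dumps(state)


def decode_rng_state(blob: str):
    version, internal, gauss = json.loads(blob)
    return (version, tuple(internal), gauss)


random.seed(1234)
saved = encode_rng_state(random.getstate())
expected = [random.random() for _ in range(3)]

random.seed(999)  # perturb the generator
random.setstate(decode_rng_state(saved))
assert [random.random() for _ in range(3)] == expected
```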

For user control, I've added a confirmation prompt (prompt_confirm) so the reproduce/ folder is only generated and uploaded if you explicitly grant permission at the time of export. This builds on the other prompt_* helpers I added in an earlier PR.

Note: right now, the easiest way to reproduce a model you found on the Hub is to look up the seed value in its reproduce/config.toml and specify that when you run Heretic. It's much simpler than loading a full binary .safetensors state file, though I need confirmation on which approach is better.

I also wonder whether generating a random seed at startup is a good approach, because it can affect normal Heretic run quality a lot. If we drop automatic seed generation, the easy --seed path would be gone and we would need to implement a --reproduce option or something similar to load the suite files. (I need confirmation on this.)

For a demo, take a look at a model made with reproduce/ in the repo: https://huggingface.co/VINAY-UMRETHE/Qwen3-0.6B-heretic

@Vinay-Umrethe
Contributor Author

My concern about generating a random seed at startup seems valid to me; it does affect the quality a lot:

  1. WITH random seed
(screenshot: Screenshot 2026-02-25 204742)
  2. WITHOUT random seed
(screenshot)

Even after running 20 trials (with random seed generation), the result was >30 refusals, while running just 10 trials (without the random seed generation added earlier) reached 16 refusals at trial 9.

So, how should we implement loading a reproduce suite?

@Vinay-Umrethe
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable reproducibility suite, adding seeding capabilities and detailed environment capture. However, a critical issue has been identified regarding NumPy 2.0+ compatibility, specifically with the use of deprecated legacy NumPy random APIs that could lead to application crashes. Additionally, there are minor comment style violations that need to be addressed to align with repository standards. Specific suggestions have been provided to resolve these issues.

Vinay-Umrethe and others added 3 commits February 26, 2026 06:20
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@p-e-w
Owner

p-e-w commented Feb 26, 2026

My concern about generating a random seed at startup seems valid to me; it does affect the quality a lot

That's impossible unless the core libraries have bugs. You are likely seeing random variation between runs.

@p-e-w
Owner

p-e-w commented Feb 26, 2026

This is a good start on #161, but it needs careful consideration. I've done an initial review.

Two ideas:

  1. We could also include the checkpoint file for the run in reproduce. This would allow the user to both verify that they have indeed reproduced the run exactly, and, if they want, save themselves the work by just loading the checkpoint.
  2. The reproduce folder should include a README.md file explaining what steps need to be taken in order to reproduce the model.

@Vinay-Umrethe
Contributor Author

@p-e-w

  1. Dependencies: NumPy is now a core dependency (v2.2+). I also swapped the reproducibility tools over to compatible-release (~=) version specifiers to keep things stable.

  2. Moved the hardware detection logic into a consolidated get_accelerator_info() helper in utils.py.

  3. Determinism problem: for now I've split the logic. Basic seeding (Python, NumPy, PyTorch, Optuna) is always active, but "bit-perfect" mode (torch.use_deterministic_algorithms) is now only active when a new --deterministic flag is passed. It's off by default. (Need confirmation on whether to remove this, based on the results analysis below.)

  4. The reproduce/ folder now includes a README.md and specifically captures the original model.jsonl study checkpoint from the checkpoints/ directory. That was a good suggestion.

  5. The prompt_confirm message is now explicit about what's being shared (OS, hardware specs, and package list), though I think it should be multi-line:
    ". . ."
    ". . ."

Regarding seeding vs. determinism: I ran 3 separate tests on Qwen/Qwen-0.6B. I found that just providing the --seed from a previous run resulted in a mathematically identical model, even without the --deterministic flag.

Since seeding alone seems this effective for these models, do you think we should keep the --deterministic flag as an optional "researcher-ablation" tool, or is it better to just strip it out?
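For reference, the bit-perfect mode described above could look roughly like this sketch (it assumes PyTorch's documented determinism APIs; `enable_bit_perfect_mode` is a hypothetical name, not the PR's actual code):

```python
import os


def enable_bit_perfect_mode() -> None:
    """Opt-in, bit-perfect determinism (hypothetical sketch)."""
    # cuBLAS reads this env var at kernel launch; it must be set before
    # any CUDA matmuls run for them to be deterministic.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    import torch

    # Make PyTorch raise an error whenever an op has no deterministic
    # implementation, instead of silently using a nondeterministic one.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
```

Since this mode can slow things down or error out on unsupported ops, gating it behind an opt-in flag (as the PR does) seems like a reasonable default.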

For Results:

  1. VINAY-UMRETHE/Qwen3-0.6B-heretic

This one is the original run, with a random seed.

  2. VINAY-UMRETHE/Qwen3-0.6B-heretic-reproduce-seed

This one is reproduced from (1) using ONLY the --seed that was generated randomly, taken from its config.toml.

Both gave exactly the same results, judging by the Heretic table in each model's README on Hugging Face.

A third model, VINAY-UMRETHE/Qwen3-0.6B-heretic-pytorch-deterministic, was also tested with --deterministic, though that flag is likely not needed.

@Vinay-Umrethe Vinay-Umrethe requested a review from p-e-w February 26, 2026 14:57
@Vinay-Umrethe Vinay-Umrethe force-pushed the feature/reproducibility branch from 3b3875f to d7da30c Compare February 27, 2026 13:25
@Vinay-Umrethe Vinay-Umrethe requested a review from p-e-w February 27, 2026 13:58
@Vinay-Umrethe Vinay-Umrethe requested a review from p-e-w March 25, 2026 16:11
except (subprocess.CalledProcessError, FileNotFoundError):
    pass

# 3. Try /sys/module/amdgpu/version (Linux kernel driver version)
Owner


Does the kernel driver support ROCm? Because if it doesn't, it's irrelevant, I think.

Contributor Author


AI MODE SAYS:

1. Typical Output (Driver Version)
On Linux with the `amdgpu` driver active, you will see a version string similar to:

* 3.50.0 (a standard kernel driver version)
* 6.16.13-2278356.24.04 (a full version string for Ubuntu 24.04 users)
* 20.10-1048554 (older systems using the "PRO" driver)

2. Fallback Output: "Unknown"
The function will return exactly "Unknown" if:
- Not on Linux
- Driver not installed
- The Linux distro stores it in a different path (unlikely)

I think the 1st and 2nd methods should work; the 3rd is a "just in case" suggestion from Gemini (can be removed).

if "+" in version_str:
    version_str = version_str.split("+")[0]

unique_reqs[normalized_name] = f"{name}=={version_str}"
Owner


Then why do we need normalization to begin with?
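For reference, the PEP 503 normalization rule that this deduplication presumably relies on (hedged, since the PR's actual normalize function isn't shown in this hunk):

```python
import re


def normalize_name(name: str) -> str:
    # PEP 503: package names compare case-insensitively, with runs of
    # "-", "_", and "." collapsed to a single hyphen, so "Jinja2" and
    # "jinja2" (or "ruamel.yaml" and "ruamel-yaml") dedupe to one key.
    return re.sub(r"[-_.]+", "-", name).lower()


assert normalize_name("Jinja2") == "jinja2"
assert normalize_name("ruamel.yaml") == "ruamel-yaml"
assert normalize_name("typing_extensions") == "typing-extensions"
```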


unique_reqs[normalized_name] = f"{name}=={version_str}"

reqs = sorted(unique_reqs.values(), key=lambda x: x.lower())
Owner


Please don't mark review comments as resolved when they are neither answered nor changed in code. I spend a lot of time going through every line and the question is valid.

"> **Heterogeneous GPUs Detected!**\n"
"> This system uses multiple non-identical GPUs. When operations are distributed "
"across different GPUs (e.g. via `device_map='auto'`), non-deterministic "
"behavior can occur. **Reproducibility ***cannot*** be guaranteed in this environment.**\n"
Owner


You didn't fix this! Please don't mark things as resolved when they aren't. It just doubles my work to have to go through everything again.

@Vinay-Umrethe Vinay-Umrethe requested a review from p-e-w March 26, 2026 14:13
@p-e-w
Owner

p-e-w commented Mar 28, 2026

I just realized that we're still missing a very important puzzle piece here, especially with regard to the future:

The reproducibility information must be machine-readable.

That's because we eventually (this can be implemented later) want to be able to do this:

heretic --reproduce /path/to/reproduction/info

This should:

  1. Verify that the hardware, software, drivers, and packages are identical to the ones used to generate the original model (if they aren't, the user gets a warning, and is given the option to cancel or proceed anyway).

  2. Load the model with the original settings, abliterate it with the parameters used, and save it.

  3. Verify that the hashes of the generated model files are identical to the original ones.

I propose that we introduce another file, reproduce.toml. This file should contain all the environment information, the settings, the abliteration parameters, and the checksums of all files from the original repository. This can then later serve as a basis for implementing the above. The file does not need to contain all trials or the Pareto front. The focus is on being able to restore the exact model from the repository, fully automatically, with automatic verification that everything worked correctly.
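One hypothetical shape for such a reproduce.toml (all keys and section names here are illustrative, not a finalized schema):

```toml
[environment]
os = "Linux-6.8.0-x86_64"
python = "3.12.7"
pytorch = "2.10.0+cu126"

[settings]
seed = 1234567890
model = "Qwen/Qwen3-0.6B"

[parameters]
# the abliteration parameters used for the exported model go here

[checksums]
"model.safetensors" = "sha256:..."
"config.json" = "sha256:..."
```

A flat TOML file like this is trivially machine-readable, which is the property the --reproduce workflow above depends on.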

Comment on lines +473 to +486
def get_cpu_info() -> str:
    """Gets the CPU brand name and instruction set capability."""
    brand = platform.processor()
    try:
        if platform.system() == "Windows":
            brand = (
                subprocess.check_output(
                    [
                        "powershell",
                        "-Command",
                        "Get-CimInstance Win32_Processor | Select-Object -ExpandProperty Name",
                    ],
                    text=True,
                )
Contributor Author


@p-e-w
I tested this function; it shows:

Environment Snapshot
====================
OS: Windows-11-10.0.26100-SP0 (AMD64)
CPU: 12th Gen Intel(R) Core(TM) i3-1215U (Capability: AVX2)
Python: 3.12.7

PyTorch & Accelerators
----------------------
PyTorch Version: 2.10.0+cpu
No GPU or other accelerator detected.

The "Operations will be slow" warning is stripped too, since it only makes sense while Heretic is running.


Also, I tested this command:

grep 'model name' /proc/cpuinfo | head -1 | cut -d: -f2

It turns out it gives the same output on Windows PowerShell, the old Windows Command Prompt, and Ubuntu Linux as well:

 12th Gen Intel(R) Core(TM) i3-1215U

So I'm not sure whether we should keep the native PowerShell command or use this instead, because it can look redundant.

"> **Heterogeneous GPUs Detected!**\n"
"> This system uses multiple non-identical GPUs. When operations are distributed "
"across different GPUs (e.g. via `device_map='auto'`), non-deterministic "
"behavior can occur. **Reproducibility ***cannot*** be guaranteed in this environment.**\n"
Contributor Author


I think it works actually.
(screenshot: Screenshot 2026-03-25 184254)


unique_reqs[normalized_name] = f"{name}=={version_str}"

reqs = sorted(unique_reqs.values(), key=lambda x: x.lower())
Contributor Author


I thought the lower-casing one was answered above, so it would be OK. The others were kind of annoying, as I had to scroll a lot; that's why there are so many separate commits for each fix.

"> **Heterogeneous GPUs Detected!**\n"
"> This system uses multiple non-identical GPUs. When operations are distributed "
"across different GPUs (e.g. via `device_map='auto'`), non-deterministic "
"behavior can occur. **Reproducibility ***cannot*** be guaranteed in this environment.**\n"
Contributor Author


I think you reviewed older code, because I see the "outdated" label here; right now it uses a multi-line string. Regarding the Markdown, *** inside **text** does appear to work.

We can remove the *** if needed; shall we?


@p-e-w
Owner

p-e-w commented Mar 30, 2026

A couple of comments on https://huggingface.co/VINAY-UMRETHE/Qwen3-0.6B-heretic-REPRODUCE/blob/main/reproduce/reproduce.json:

  • Why is AVX512 the only listed CPU capability? Don't CPUs have many flags and features?
  • What is the purpose of api_version_label and driver_version_label?
  • The requirements section lists over 700(!) packages. Heretic has far fewer dependencies, even with transitive dependencies included. I think what happened here is that the entire environment was dumped, which is pre-loaded with a bunch of Python packages that weren't installed by Heretic. Many of them look like Jupyter dependencies. Checking for dependencies we don't depend on would almost always show reproducibility warnings.
  • We should only store the abliteration parameters, not the whole trial. This file is only about restoring the exact same model as exported. We also don't need the metrics (refusals/KLD), as those are guaranteed to be identical if the model files are identical.
  • We absolutely need the hashes of all model files in there, otherwise we can't verify that they were reproduced successfully. That's basically the most important part.
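The hash-verification piece could be as simple as this stdlib sketch (`hash_model_files` is a hypothetical helper name, not code from the PR):

```python
import hashlib
from pathlib import Path


def hash_model_files(repo_dir: str) -> dict[str, str]:
    """Map each file's repo-relative path to its SHA-256 hex digest."""
    root = Path(repo_dir)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

On --reproduce, the digests of the freshly generated files would then be compared against the checksums recorded at export time.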

@p-e-w
Owner

p-e-w commented Mar 31, 2026

Surely there is a Python library for getting information like the full CPU ID? Maintaining this kind of logic shouldn't happen in Heretic.

@p-e-w
Owner

p-e-w commented Mar 31, 2026

Check out https://github.com/workhorsy/py-cpuinfo for example. We're not building a system monitor. There's no way we should be writing such code riddled with shell calls and platform switches ourselves.
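For illustration, py-cpuinfo usage might look like this (third-party package, `pip install py-cpuinfo`; the guarded import keeps the sketch runnable even where the package is absent):

```python
# Requires: pip install py-cpuinfo (third-party)
try:
    import cpuinfo  # the py-cpuinfo package

    info = cpuinfo.get_cpu_info()
    print(info.get("brand_raw"))  # full CPU model string
    print(info.get("flags"))      # complete feature-flag list (sse4_2, avx2, ...)
except ImportError:
    print("py-cpuinfo not installed")
```

This would replace the hand-rolled PowerShell/grep/platform switches with a single maintained dependency.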
