feat: reproducibility when saving & uploading a heretic model #191
Vinay-Umrethe wants to merge 65 commits into p-e-w:master
Conversation
/gemini review
Code Review
This pull request introduces a valuable reproducibility suite, adding seeding capabilities and detailed environment capture. However, a critical issue has been identified regarding NumPy 2.0+ compatibility, specifically with the use of deprecated legacy NumPy random APIs that could lead to application crashes. Additionally, there are minor comment style violations that need to be addressed to align with repository standards. Specific suggestions have been provided to resolve these issues.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
That's impossible unless the core libraries have bugs. You are likely seeing random variation between runs.
This is a good start on #161, but it needs careful consideration. I've done an initial review. Two ideas:

Regarding seeding vs. determinism: I ran 3 separate tests. Since seeding alone seems so effective for these models, do you think we should keep the --deterministic flag as an optional "researcher-ablation" tool, or is it better to just strip it out? Results:
Both gave the exact same results looking at the heretic table in their READMEs on Hugging Face. A 3rd was also tested with --deterministic, but that's likely not needed: VINAY-UMRETHE/Qwen3-0.6B-heretic-pytorch-deterministic
Force-pushed from 3b3875f to d7da30c
src/heretic/utils.py (Outdated)

```python
    except (subprocess.CalledProcessError, FileNotFoundError):
        pass

    # 3. Try /sys/module/amdgpu/version (Linux kernel driver version)
```
Does the kernel driver support ROCm? Because if it doesn't it's irrelevant I think.
AI MODE SAYS:
1. Typical Output (Driver Version)
If on Linux and the `amdgpu` driver is active, you will see a version string similar to:
* 3.50.0 (example of a standard kernel driver version)
* 6.16.13-2278356.24.04 (example of a full version string for Ubuntu 24.04 users)
* 20.10-1048554 (example for older systems using the "PRO" driver)
2. Fallback Output: "Unknown"
The function will return exactly "Unknown" if:
- Not on Linux
- The driver is not installed
- The Linux distro stores it in a completely different path (unlikely)
I think the 1st and 2nd methods should work; the 3rd is a "just in case" suggestion from Gemini (it can be removed).
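For reference, the 3rd method discussed above boils down to a single sysfs read. A minimal sketch (the path and the "Unknown" fallback follow the discussion; the function name is illustrative, not the PR's actual code):

```python
from pathlib import Path


def get_amdgpu_driver_version() -> str:
    """Read the amdgpu kernel driver version from sysfs, if present.

    Returns "Unknown" on non-Linux systems or when the amdgpu module
    is not loaded (the sysfs file then simply does not exist).
    """
    try:
        return Path("/sys/module/amdgpu/version").read_text().strip()
    except OSError:
        return "Unknown"
```

Because a missing file just raises `OSError`, no platform switch is needed; the fallback covers every non-Linux / no-driver case in one branch.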
src/heretic/utils.py (Outdated)

```python
    if "+" in version_str:
        version_str = version_str.split("+")[0]

    unique_reqs[normalized_name] = f"{name}=={version_str}"
```
Then why do we need normalization to begin with?
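For context, package-name normalization here presumably means PEP 503 canonicalization, so that spellings like `Typing_Extensions` and `typing-extensions` collapse into a single dict key. A sketch of that rule (assuming this is what the PR's `normalized_name` is computing):

```python
import re


def normalize_name(name: str) -> str:
    """PEP 503 name normalization: case-fold and collapse runs of ., _, -."""
    return re.sub(r"[-_.]+", "-", name).lower()


# All of these map to the same requirements key:
assert normalize_name("Typing_Extensions") == "typing-extensions"
assert normalize_name("typing.extensions") == "typing-extensions"
```

Without it, the same package installed under different name spellings would produce duplicate `name==version` entries in the snapshot.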
src/heretic/utils.py (Outdated)

```python
    unique_reqs[normalized_name] = f"{name}=={version_str}"

    reqs = sorted(unique_reqs.values(), key=lambda x: x.lower())
```
Please don't mark review comments as resolved when they are neither answered nor changed in code. I spend a lot of time going through every line and the question is valid.
src/heretic/utils.py (Outdated)

```python
    "> **Heterogeneous GPUs Detected!**\n"
    "> This system uses multiple non-identical GPUs. When operations are distributed "
    "across different GPUs (e.g. via `device_map='auto'`), non-deterministic "
    "behavior can occur. **Reproducibility ***cannot*** be guaranteed in this environment.**\n"
```
You didn't fix this! Please don't mark things as resolved when they aren't. It just doubles my work to have to go through everything again.
I just realized that we're still missing a very important puzzle piece here, especially with regard to the future: The reproducibility information must be machine-readable. That's because we eventually (which can be implemented later) want to be able to do this: This should:
I propose that we introduce another file,
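As a rough illustration of what a machine-readable companion file could hold (the filename `reproduce.json` matches the one linked later in this thread, but the exact schema here is an assumption, not a settled format):

```python
import json
import platform
import sys


def write_reproduce_json(path: str, seed: int, packages: dict) -> None:
    """Dump the minimum needed to replay a run in machine-readable form."""
    info = {
        "seed": seed,
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "packages": packages,  # e.g. {"torch": "2.10.0", "optuna": "4.0.0"}
    }
    with open(path, "w") as f:
        json.dump(info, f, indent=2, sort_keys=True)


write_reproduce_json("reproduce.json", seed=42, packages={"torch": "2.10.0"})
```

The point is that a future `heretic --reproduce <model>` could parse this file programmatically, instead of scraping values out of the human-readable environment summary.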
src/heretic/utils.py (Outdated)

```python
def get_cpu_info() -> str:
    """Gets the CPU brand name and instruction set capability."""
    brand = platform.processor()
    try:
        if platform.system() == "Windows":
            brand = (
                subprocess.check_output(
                    [
                        "powershell",
                        "-Command",
                        "Get-CimInstance Win32_Processor | Select-Object -ExpandProperty Name",
                    ],
                    text=True,
                )
```
@p-e-w
I tested this function; it shows:

```text
Environment Snapshot
====================
OS: Windows-11-10.0.26100-SP0 (AMD64)
CPU: 12th Gen Intel(R) Core(TM) i3-1215U (Capability: AVX2)
Python: 3.12.7

PyTorch & Accelerators
----------------------
PyTorch Version: 2.10.0+cpu
No GPU or other accelerator detected.
```

The "Operations will be slow" warning is stripped too, since it only makes sense when heretic is "running".

Also, I tested this command:

```shell
grep 'model name' /proc/cpuinfo | head -1 | cut -d: -f2
```

I found out it gives the same output on Windows PowerShell, the old Windows Command Prompt, and Ubuntu Linux as well:

```text
12th Gen Intel(R) Core(TM) i3-1215U
```

So I'm not sure whether we should keep the PowerShell command natively or use this instead, because it can look redundant.
src/heretic/utils.py (Outdated)

```python
    unique_reqs[normalized_name] = f"{name}=={version_str}"

    reqs = sorted(unique_reqs.values(), key=lambda x: x.lower())
```
I thought this lower-casing one was answered above, so it would be OK, while the others were kinda annoying as I had to scroll a lot; that's why there are so many separate commits for each fix.
src/heretic/utils.py (Outdated)

```python
    "> **Heterogeneous GPUs Detected!**\n"
    "> This system uses multiple non-identical GPUs. When operations are distributed "
    "across different GPUs (e.g. via `device_map='auto'`), non-deterministic "
    "behavior can occur. **Reproducibility ***cannot*** be guaranteed in this environment.**\n"
```
I think you reviewed older code, because I see the "Outdated" label here; right now it uses a multi-line string. Regarding Markdown, *** inside **text** does seem to work.
We can remove the *** if needed, shall we?
A couple of comments on https://huggingface.co/VINAY-UMRETHE/Qwen3-0.6B-heretic-REPRODUCE/blob/main/reproduce/reproduce.json:
Surely there is a Python library for getting information like the full CPU ID? Maintaining this kind of logic shouldn't happen in Heretic.
Check out https://github.com/workhorsy/py-cpuinfo for example. We're not building a system monitor. There's no way we should be writing such code riddled with shell calls and platform switches ourselves.
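A sketch of what that replacement could look like: py-cpuinfo's `get_cpu_info()` returns a dict whose `brand_raw` key holds the full CPU name on every platform, with no shell calls. The `platform` fallback for systems without the package installed is my own assumption, not something the library requires:

```python
import platform


def get_cpu_brand() -> str:
    """Get the CPU brand string via py-cpuinfo, falling back to platform."""
    try:
        import cpuinfo  # pip install py-cpuinfo

        # One cross-platform call replaces the powershell / /proc/cpuinfo logic.
        return cpuinfo.get_cpu_info().get("brand_raw", "Unknown")
    except ImportError:
        # Best-effort fallback when py-cpuinfo is not installed.
        return platform.processor() or "Unknown"
```

This collapses the Windows/Linux branches into a single dependency, which is exactly the "don't maintain this in Heretic" argument above.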



Implement Reproducibility Suite
Well, this PR adds a reproducibility suite that lets users capture the exact environment, packages, OS, and random states used during an abliteration optimization run.

It uses a new `seed` configuration parameter that, when set, enforces deterministic behavior across everything that matters: Python, PyTorch (CPU & GPU), NumPy, and the Optuna sampler. If one is not specified, heretic now generates a random seed automatically at startup so it can still be recorded for that run.

A new (optional) `reproduce/` folder can be created during model export (local save or HF upload), containing the exact TOML configuration, a snapshot of installed package versions, and a summary of the hardware/software environment.

On the technical side, I'm using the `.safetensors` format for storing RNG states. This avoids those old, insecure pickle-based (`.pkl`) checkpoints, which I still use in my non-heretic AI/ML trainings. It stores torch tensors efficiently and wraps the non-tensor states (like the Python/NumPy RNG tuples) into JSON metadata within the secure file header.

For user control, I've added a confirmation prompt (`prompt_confirm`), so the `reproduce/` folder is only generated and uploaded if you explicitly grant permission at the time of export. This builds on the other `prompt_*` helpers I added in one of the old PRs.

Note: Right now, the easiest way to reproduce a model you found on the Hub is to just look at the seed value in its `reproduce/config.toml` and specify that when you run Heretic. It's much simpler than trying to load a full binary `.safetensors` state file, though I need confirmation on which approach is better. Also, I wonder whether randomly generating a seed at startup is a good approach, because it can affect normal heretic run quality a lot. If we instead don't generate one, then the `--seed` shortcut would be gone and we would need to implement a `--reproduce` flag or something similar to load the suite files. (SO NEED CONFIRMATION ON THIS...)
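For the seeding part, my mental model of what a `seed` parameter has to cover is roughly the following. This is a hedged sketch, not the PR's actual function: the name `set_global_seed` is illustrative, the torch calls are guarded so it also runs on installs without PyTorch, and the Optuna sampler takes its seed separately (e.g. `optuna.samplers.TPESampler(seed=seed)`):

```python
import os
import random

import numpy as np


def set_global_seed(seed: int) -> None:
    """Seed every RNG source that influences an optimization run."""
    random.seed(seed)  # Python's own RNG
    np.random.seed(seed)  # legacy global RNG; new code may prefer np.random.default_rng(seed)
    # Only affects subprocesses: hash randomization is fixed at interpreter start.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch

        torch.manual_seed(seed)  # CPU generator
        torch.cuda.manual_seed_all(seed)  # all GPU generators, no-op without CUDA
    except ImportError:
        pass  # sketch still works on installs without torch
```

Calling this once with the recorded seed before the run starts is what makes the "just read the seed from `reproduce/config.toml`" shortcut work without loading the binary RNG state file.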