
fix: handle FP8 model weights in LoRA adapters and merge #182

Open

shifusen329 wants to merge 7 commits into p-e-w:master from shifusen329:master

Conversation

@shifusen329

Summary

  • Models distributed in FP8 (e.g. MiniMax-M2.5) crash during inference because torch.addmm has no FP8 (Float8_e4m3fn) kernel, causing NotImplementedError in PEFT LoRA forward passes
  • Cast LoRA adapter weights to bfloat16 after initialization in _apply_lora() so adapter matmuls use a supported dtype
  • Upcast FP8 base weights to bfloat16 before merge_and_unload() in get_merged_model(), then cast back, to avoid unsupported in-place addition during merge
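The two casts described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name, not Heretic's actual `_apply_lora()` code; it assumes PEFT's usual `lora_A`/`lora_B` parameter naming:

```python
import torch

# FP8 dtypes only exist in newer PyTorch builds; guard with getattr.
FP8_DTYPES = tuple(
    dt
    for dt in (
        getattr(torch, "float8_e4m3fn", None),
        getattr(torch, "float8_e5m2", None),
    )
    if dt is not None
)


def cast_lora_adapters_to_bf16(model: torch.nn.Module) -> None:
    """Cast LoRA adapter matrices (lora_A / lora_B) to bfloat16 so the
    adapter matmul (torch.addmm) runs on a supported dtype."""
    for name, param in model.named_parameters():
        if ("lora_A" in name or "lora_B" in name) and param.dtype in FP8_DTYPES:
            param.data = param.data.to(torch.bfloat16)
```

Non-adapter parameters are left untouched, so the FP8 base weights keep their memory footprint; only the small adapter matrices change dtype.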

Test plan

  • Load an FP8 model (e.g. MiniMaxAI/MiniMax-M2.5) and verify inference completes without NotImplementedError
  • Verify saving a merged model works for FP8 models
  • Verify no regression for non-FP8 models (bfloat16, float16, BNB 4-bit)

🤖 Generated with Claude Code

Models distributed in FP8 (e.g. MiniMax-M2.5) cause failures because
torch.addmm has no FP8 kernel. Cast LoRA adapter weights to bfloat16
after initialization, and upcast FP8 base weights before merge to
avoid unsupported in-place addition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @shifusen329, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces crucial compatibility fixes for working with FP8 quantized models, such as MiniMax-M2.5, when using LoRA adapters. It addresses issues where FP8 models would crash during inference due to unsupported torch.addmm operations and fail to merge LoRA adapters due to limitations with in-place addition for FP8 dtypes. The changes ensure that LoRA inference and model merging now function correctly with FP8 models by strategically casting weights to supported dtypes during critical operations.

Highlights

  • FP8 LoRA Inference Fix: Resolved NotImplementedError during LoRA forward passes for FP8 models by casting LoRA adapter weights to bfloat16 after initialization, ensuring compatibility with torch.addmm.
  • FP8 Model Merging Support: Enabled successful merging of LoRA adapters into FP8 base models by temporarily upcasting FP8 base weights to bfloat16 before merge_and_unload() and then casting them back to their original FP8 dtype, circumventing unsupported in-place addition operations.


Changelog
  • src/heretic/model.py
    • Cast LoRA adapter weights to bfloat16 if they are float8_e4m3fn or float8_e5m2 within the _apply_lora method to ensure compatibility with torch.addmm.
    • Implemented logic in get_merged_model to temporarily upcast FP8 base model weights to bfloat16 before calling merge_and_unload(), and then cast them back to their original FP8 dtype after merging, to circumvent unsupported in-place addition operations.
Activity
  • No human activity recorded for this pull request.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix issues with FP8 models by casting weights to bfloat16 during LoRA forward passes and model merging. The change in _apply_lora to handle adapter weights seems correct. However, in get_merged_model, there's a critical issue in the logic for upcasting base model weights before merging. The current implementation incorrectly excludes the LoRA-wrapped layers, which prevents the fix from working. I've provided a specific comment and suggestion to address this.

```python
# so upcast them to bfloat16 first, merge, then cast back.
fp8_params = {}
for name, module in self.model.named_modules():
    if hasattr(module, "weight") and not isinstance(module, Linear):
```


critical

The condition not isinstance(module, Linear) prevents the logic from running on LoRA-wrapped layers. These are precisely the layers that need their base weights upcast because merge_and_unload() performs an in-place addition on them, which fails for FP8 dtypes. The weight property on a peft.tuners.lora.layer.Linear module correctly delegates to the base layer's weight, so these modules should be processed. Removing this part of the condition will fix the issue and allow the merge to succeed with FP8 models.

Suggested change:

```diff
-    if hasattr(module, "weight") and not isinstance(module, Linear):
+    if hasattr(module, "weight"):
```
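The upcast-merge-restore pattern the reviewer describes, with the `isinstance` exclusion removed, can be sketched like this. The structure is assumed, not the PR's exact code; it relies on the fact that a LoRA-wrapped layer's `weight` attribute delegates to the base layer:

```python
import torch


def merge_with_fp8_upcast(peft_model):
    """Temporarily upcast FP8 weights to bfloat16 so merge_and_unload()'s
    in-place addition succeeds, then restore the original dtypes."""
    fp8_dtypes = {
        getattr(torch, n)
        for n in ("float8_e4m3fn", "float8_e5m2")
        if hasattr(torch, n)
    }
    fp8_params = {}
    # Record and upcast every FP8 weight, *including* LoRA-wrapped Linear
    # layers, whose `weight` delegates to the base layer's weight.
    for name, module in peft_model.named_modules():
        w = getattr(module, "weight", None)
        if isinstance(w, torch.Tensor) and w.dtype in fp8_dtypes:
            fp8_params[name] = w.dtype
            module.weight.data = w.data.to(torch.bfloat16)
    merged = peft_model.merge_and_unload()  # in-place add happens in bfloat16
    # Cast merged weights back to their original FP8 dtype.
    for name, module in merged.named_modules():
        if name in fp8_params:
            module.weight.data = module.weight.data.to(fp8_params[name])
    return merged
```

Non-FP8 models pass through unchanged, since `fp8_params` stays empty and no casts are applied.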

shifusen329 and others added 6 commits February 19, 2026 03:03
- _apply_lora: resolve target module names from the model tree by matching
  module identities instead of parsing component labels. Fixes MoE models
  where registered names differ from heretic's labels (e.g. "w2" vs
  "down_proj" in MiniMax-M2.5).
- abliterate: dequantize FP8 block-wise quantized weights by applying
  weight_scale_inv per block, so abliteration computes correct refusal
  direction projections.
- get_merged_model: same FP8 dequantization before merge. The merged model
  is kept in bfloat16 since the original scale factors are invalidated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
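The block-wise dequantization mentioned in the commit above can be illustrated as follows. This is a sketch only: the helper name is hypothetical, and the block size and `weight_scale_inv` layout (one scale per block) follow common FP8 checkpoint conventions rather than Heretic's actual code:

```python
import torch


def dequantize_blockwise(weight, weight_scale_inv, block=128):
    """Dequantize a block-wise FP8-quantized weight by multiplying each
    (block x block) tile by its per-block inverse scale."""
    rows, cols = weight.shape
    w = weight.to(torch.float32)
    for bi in range(0, rows, block):
        for bj in range(0, cols, block):
            scale = weight_scale_inv[bi // block, bj // block]
            w[bi : bi + block, bj : bj + block] *= scale
    return w
```

After this multiplication the original per-block scale factors no longer describe the tensor, which is why (per the commit message) the merged model is kept in bfloat16 rather than re-quantized.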
Models that don't refuse any baseline prompts (e.g. MiniMax-M2.5)
cause a ZeroDivisionError when computing refusals_score. Return 0.0
when base_refusals is zero since there are no refusals to remove.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
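The guard described in this commit is simple enough to show in full; the function name and signature here are hypothetical stand-ins for the actual scoring code:

```python
def refusals_score(refusals: int, base_refusals: int) -> float:
    """Fraction of baseline refusals remaining after abliteration."""
    # A model that never refuses has no refusals to remove; return 0.0
    # instead of raising ZeroDivisionError.
    if base_refusals == 0:
        return 0.0
    return refusals / base_refusals
```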
Aggressive abliteration can destabilize the model, producing NaN logits
that propagate into the KL divergence. NaN silently bypasses the
kl_divergence >= target comparison (always False), producing a
misleadingly finite score. Replace NaN with inf so Optuna correctly
identifies these trials as maximally bad.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
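The NaN-to-inf substitution works because of how NaN behaves in comparisons, which the sketch below demonstrates (hypothetical function name; the real code lives in the trial-scoring path):

```python
import math


def penalize_nan_kl(kl_divergence: float) -> float:
    """NaN compares False against any threshold, so a NaN KL divergence
    would silently bypass a `kl_divergence >= target` check. Map it to
    +inf so the optimizer scores the trial as maximally bad."""
    if math.isnan(kl_divergence):
        return math.inf
    return kl_divergence
```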
All trials were producing inf KL divergence because the 0.8-1.5 range
for max_weight was too aggressive for the model. Lowered to 0.1-0.8.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Low-precision dtypes (bfloat16/float16) cause log_softmax to produce
-inf for low-probability tokens. When abliteration shifts which tokens
underflow, kl_div returns NaN/inf. Upcasting to float32 matches the
existing pattern used for residual vectors.

Also reverts the max_weight range change from 2e6cfc6 since the search
ranges were not the actual problem.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
log_softmax produces -inf for near-zero probability tokens, which
causes kl_div to return inf regardless of actual distribution
similarity. Clamping to -100 keeps values finite while preserving
effectively-zero probabilities (exp(-100) ≈ 3.7e-44).

Ref: pytorch/pytorch#32520

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
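The last two commits (float32 upcast before log_softmax, and clamping log-probabilities at -100) combine into a pattern like the following. This is an illustrative sketch, not Heretic's actual KL computation:

```python
import torch
import torch.nn.functional as F


def stable_kl(logits_p: torch.Tensor, logits_q: torch.Tensor) -> torch.Tensor:
    """KL(p || q) that stays finite even when some tokens have
    effectively-zero probability under one distribution."""
    # Upcast to float32 so log_softmax doesn't underflow to -inf as
    # readily as in bfloat16/float16.
    log_p = F.log_softmax(logits_p.float(), dim=-1)
    log_q = F.log_softmax(logits_q.float(), dim=-1)
    # Clamp at -100: exp(-100) is ~3.7e-44, effectively zero probability,
    # but the log-prob stays finite so kl_div cannot return inf/NaN.
    log_p = log_p.clamp(min=-100.0)
    log_q = log_q.clamp(min=-100.0)
    return F.kl_div(log_q, log_p, log_target=True, reduction="sum")
```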
@p-e-w
Copy link
Copy Markdown
Owner

p-e-w commented Feb 19, 2026

This seems to be quite similar to #151, which is also concerned with FP8 support.
