
fix: handle FP8 model weights in LoRA adapters and merge #182

Open

shifusen329 wants to merge 7 commits into p-e-w:master from shifusen329:master

Conversation

@shifusen329

Summary

  • Models distributed in FP8 (e.g. MiniMax-M2.5) crash during inference because torch.addmm has no FP8 (Float8_e4m3fn) kernel, causing NotImplementedError in PEFT LoRA forward passes
  • Cast LoRA adapter weights to bfloat16 after initialization in _apply_lora() so adapter matmuls use a supported dtype
  • Upcast FP8 base weights to bfloat16 before merge_and_unload() in get_merged_model(), then cast back, to avoid unsupported in-place addition during merge
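The two casts described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name, not Heretic's actual `_apply_lora()` code; it assumes PEFT's usual `lora_A`/`lora_B` parameter naming:

```python
import torch

# FP8 dtypes only exist in newer PyTorch builds; guard with getattr.
FP8_DTYPES = tuple(
    dt
    for dt in (
        getattr(torch, "float8_e4m3fn", None),
        getattr(torch, "float8_e5m2", None),
    )
    if dt is not None
)


def cast_lora_adapters_to_bf16(model: torch.nn.Module) -> None:
    """Cast LoRA adapter matrices (lora_A / lora_B) to bfloat16 so the
    adapter matmul (torch.addmm) runs on a supported dtype."""
    for name, param in model.named_parameters():
        if ("lora_A" in name or "lora_B" in name) and param.dtype in FP8_DTYPES:
            param.data = param.data.to(torch.bfloat16)
```

Non-adapter parameters are left untouched, so the FP8 base weights keep their memory footprint; only the small adapter matrices change dtype.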

Test plan

  • Load an FP8 model (e.g. MiniMaxAI/MiniMax-M2.5) and verify inference completes without NotImplementedError
  • Verify saving a merged model works for FP8 models
  • Verify no regression for non-FP8 models (bfloat16, float16, BNB 4-bit)

🤖 Generated with Claude Code

Models distributed in FP8 (e.g. MiniMax-M2.5) cause failures because
torch.addmm has no FP8 kernel. Cast LoRA adapter weights to bfloat16
after initialization, and upcast FP8 base weights before merge to
avoid unsupported in-place addition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @shifusen329, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces crucial compatibility fixes for working with FP8 quantized models, such as MiniMax-M2.5, when using LoRA adapters. It addresses issues where FP8 models would crash during inference due to unsupported torch.addmm operations and fail to merge LoRA adapters due to limitations with in-place addition for FP8 dtypes. The changes ensure that LoRA inference and model merging now function correctly with FP8 models by strategically casting weights to supported dtypes during critical operations.

Highlights

  • FP8 LoRA Inference Fix: Resolved NotImplementedError during LoRA forward passes for FP8 models by casting LoRA adapter weights to bfloat16 after initialization, ensuring compatibility with torch.addmm.
  • FP8 Model Merging Support: Enabled successful merging of LoRA adapters into FP8 base models by temporarily upcasting FP8 base weights to bfloat16 before merge_and_unload() and then casting them back to their original FP8 dtype, circumventing unsupported in-place addition operations.


Changelog
  • src/heretic/model.py
    • Cast LoRA adapter weights to bfloat16 if they are float8_e4m3fn or float8_e5m2 within the _apply_lora method to ensure compatibility with torch.addmm.
    • Implemented logic in get_merged_model to temporarily upcast FP8 base model weights to bfloat16 before calling merge_and_unload(), and then cast them back to their original FP8 dtype after merging, to circumvent unsupported in-place addition operations.
Activity
  • No human activity recorded for this pull request.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix issues with FP8 models by casting weights to bfloat16 during LoRA forward passes and model merging. The change in _apply_lora to handle adapter weights seems correct. However, in get_merged_model, there's a critical issue in the logic for upcasting base model weights before merging. The current implementation incorrectly excludes the LoRA-wrapped layers, which prevents the fix from working. I've provided a specific comment and suggestion to address this.

```python
# so upcast them to bfloat16 first, merge, then cast back.
fp8_params = {}
for name, module in self.model.named_modules():
    if hasattr(module, "weight") and not isinstance(module, Linear):
```


critical

The condition not isinstance(module, Linear) prevents the logic from running on LoRA-wrapped layers. These are precisely the layers that need their base weights upcast because merge_and_unload() performs an in-place addition on them, which fails for FP8 dtypes. The weight property on a peft.tuners.lora.layer.Linear module correctly delegates to the base layer's weight, so these modules should be processed. Removing this part of the condition will fix the issue and allow the merge to succeed with FP8 models.

Suggested change:

```diff
-    if hasattr(module, "weight") and not isinstance(module, Linear):
+    if hasattr(module, "weight"):
```
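The upcast-merge-restore pattern the reviewer describes, with the `isinstance` exclusion removed, can be sketched like this. The structure is assumed, not the PR's exact code; it relies on the fact that a LoRA-wrapped layer's `weight` attribute delegates to the base layer:

```python
import torch


def merge_with_fp8_upcast(peft_model):
    """Temporarily upcast FP8 weights to bfloat16 so merge_and_unload()'s
    in-place addition succeeds, then restore the original dtypes."""
    fp8_dtypes = {
        getattr(torch, n)
        for n in ("float8_e4m3fn", "float8_e5m2")
        if hasattr(torch, n)
    }
    fp8_params = {}
    # Record and upcast every FP8 weight, *including* LoRA-wrapped Linear
    # layers, whose `weight` delegates to the base layer's weight.
    for name, module in peft_model.named_modules():
        w = getattr(module, "weight", None)
        if isinstance(w, torch.Tensor) and w.dtype in fp8_dtypes:
            fp8_params[name] = w.dtype
            module.weight.data = w.data.to(torch.bfloat16)
    merged = peft_model.merge_and_unload()  # in-place add happens in bfloat16
    # Cast merged weights back to their original FP8 dtype.
    for name, module in merged.named_modules():
        if name in fp8_params:
            module.weight.data = module.weight.data.to(fp8_params[name])
    return merged
```

Non-FP8 models pass through unchanged, since `fp8_params` stays empty and no casts are applied.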

shifusen329 and others added 6 commits February 19, 2026 03:03
- _apply_lora: resolve target module names from the model tree by matching
  module identities instead of parsing component labels. Fixes MoE models
  where registered names differ from heretic's labels (e.g. "w2" vs
  "down_proj" in MiniMax-M2.5).
- abliterate: dequantize FP8 block-wise quantized weights by applying
  weight_scale_inv per block, so abliteration computes correct refusal
  direction projections.
- get_merged_model: same FP8 dequantization before merge. The merged model
  is kept in bfloat16 since the original scale factors are invalidated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
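The block-wise dequantization mentioned in the commit above can be illustrated as follows. This is a sketch only: the helper name is hypothetical, and the block size and `weight_scale_inv` layout (one scale per block) follow common FP8 checkpoint conventions rather than Heretic's actual code:

```python
import torch


def dequantize_blockwise(weight, weight_scale_inv, block=128):
    """Dequantize a block-wise FP8-quantized weight by multiplying each
    (block x block) tile by its per-block inverse scale."""
    rows, cols = weight.shape
    w = weight.to(torch.float32)
    for bi in range(0, rows, block):
        for bj in range(0, cols, block):
            scale = weight_scale_inv[bi // block, bj // block]
            w[bi : bi + block, bj : bj + block] *= scale
    return w
```

After this multiplication the original per-block scale factors no longer describe the tensor, which is why (per the commit message) the merged model is kept in bfloat16 rather than re-quantized.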
Models that don't refuse any baseline prompts (e.g. MiniMax-M2.5)
cause a ZeroDivisionError when computing refusals_score. Return 0.0
when base_refusals is zero since there are no refusals to remove.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
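The guard described in this commit is simple enough to show in full; the function name and signature here are hypothetical stand-ins for the actual scoring code:

```python
def refusals_score(refusals: int, base_refusals: int) -> float:
    """Fraction of baseline refusals remaining after abliteration."""
    # A model that never refuses has no refusals to remove; return 0.0
    # instead of raising ZeroDivisionError.
    if base_refusals == 0:
        return 0.0
    return refusals / base_refusals
```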
Aggressive abliteration can destabilize the model, producing NaN logits
that propagate into the KL divergence. NaN silently bypasses the
kl_divergence >= target comparison (always False), producing a
misleadingly finite score. Replace NaN with inf so Optuna correctly
identifies these trials as maximally bad.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
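The NaN-to-inf substitution works because of how NaN behaves in comparisons, which the sketch below demonstrates (hypothetical function name; the real code lives in the trial-scoring path):

```python
import math


def penalize_nan_kl(kl_divergence: float) -> float:
    """NaN compares False against any threshold, so a NaN KL divergence
    would silently bypass a `kl_divergence >= target` check. Map it to
    +inf so the optimizer scores the trial as maximally bad."""
    if math.isnan(kl_divergence):
        return math.inf
    return kl_divergence
```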
All trials were producing inf KL divergence because the 0.8-1.5 range
for max_weight was too aggressive for the model. Lowered to 0.1-0.8.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Low-precision dtypes (bfloat16/float16) cause log_softmax to produce
-inf for low-probability tokens. When abliteration shifts which tokens
underflow, kl_div returns NaN/inf. Upcasting to float32 matches the
existing pattern used for residual vectors.

Also reverts the max_weight range change from 2e6cfc6 since the search
ranges were not the actual problem.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
log_softmax produces -inf for near-zero probability tokens, which
causes kl_div to return inf regardless of actual distribution
similarity. Clamping to -100 keeps values finite while preserving
effectively-zero probabilities (exp(-100) ≈ 3.7e-44).

Ref: pytorch/pytorch#32520

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
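The last two commits (float32 upcast before log_softmax, and clamping log-probabilities at -100) combine into a pattern like the following. This is an illustrative sketch, not Heretic's actual KL computation:

```python
import torch
import torch.nn.functional as F


def stable_kl(logits_p: torch.Tensor, logits_q: torch.Tensor) -> torch.Tensor:
    """KL(p || q) that stays finite even when some tokens have
    effectively-zero probability under one distribution."""
    # Upcast to float32 so log_softmax doesn't underflow to -inf as
    # readily as in bfloat16/float16.
    log_p = F.log_softmax(logits_p.float(), dim=-1)
    log_q = F.log_softmax(logits_q.float(), dim=-1)
    # Clamp at -100: exp(-100) is ~3.7e-44, effectively zero probability,
    # but the log-prob stays finite so kl_div cannot return inf/NaN.
    log_p = log_p.clamp(min=-100.0)
    log_q = log_q.clamp(min=-100.0)
    return F.kl_div(log_q, log_p, log_target=True, reduction="sum")
```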
@p-e-w
Copy link
Copy Markdown
Owner

p-e-w commented Feb 19, 2026

This seems to be quite similar to #151, which is also concerned with FP8 support.
