Robust FP8 layer detection for ignore_layers (#1283) #1289
scopophobic wants to merge 2 commits into intel:main from
Conversation
Signed-off-by: Adithyan Madhu <adithyanworkmail@gmail.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
The review comments below refer to this diff excerpt:

    logger.trace(f"Auto-detected FP8 layer to ignore : {n}")
    ...
    if ignore_layers:
        ignore_list = ignore_layers.replace(" ", "").split(",")
yiliu30 commented:
Hi @scopophobic, thanks for your interest in fixing this issue! I think there might be a bit of a misunderstanding.
We don’t want to skip all FP8 layers. The idea is that we start with an FP8 model and want to requantize it to another format, like W4A16. However, we don’t want certain layers—such as those inside the attention module—to be quantized to W4A16.
The fix in #1286 is aligned with what we're aiming for.
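(For illustration, a minimal sketch of the matching step described above, assuming `ignore_layers` is a comma-separated string of name fragments such as "self_attn,lm_head"; `collect_ignored_layers` and the plain substring match are stand-ins for this sketch, not auto-round's actual `get_fp_layer_names` implementation.)

```python
import torch.nn as nn

def collect_ignored_layers(model: nn.Module, ignore_layers: str) -> list[str]:
    """Return names of Linear layers whose qualified name contains any of the
    comma-separated patterns in `ignore_layers`."""
    patterns = [p for p in ignore_layers.replace(" ", "").split(",") if p]
    return [
        name
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear) and any(p in name for p in patterns)
    ]

# Usage sketch: layers matched here keep their original precision while the
# rest of the model is requantized to W4A16.
# skipped = collect_ignored_layers(model, "self_attn,lm_head")
```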
yiliu30 added:
Hi @scopophobic, would you be interested in working on the remaining part of this issue? See #1283 (comment).
scopophobic replied:
Hi @yiliu30, thanks a lot for the clarification; that helped resolve a misunderstanding I had 👍
I now understand that the goal is not to skip all FP8 layers, but to start from an FP8 model and re-quantize it (e.g., to W4A16), while keeping specific submodules (like attention) from being quantized.
I’m definitely interested in working on the remaining part of #1283. My current thought is to make FP8 detection more robust by moving away from class-name checks (like "FP8Linear") and instead relying on explicit FP8 characteristics (e.g., presence of FP8 scale metadata used during dequantization). This would allow supporting multiple FP8 layer implementations without brittle heuristics.
Does this approach sound aligned with what you had in mind for this issue?
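(A minimal sketch of that idea, assuming a PyTorch build with float8 dtypes and that FP8 layers expose either an FP8 weight dtype or a dequantization scale stored next to the weight; `is_fp8_layer` and the `weight_scale`/`scale` attribute names are assumptions for illustration, not a fixed convention.)

```python
import torch
import torch.nn as nn

_FP8_DTYPES = {torch.float8_e4m3fn, torch.float8_e5m2}

def is_fp8_layer(module: nn.Module) -> bool:
    """Heuristic FP8 check based on tensor properties instead of class names."""
    weight = getattr(module, "weight", None)
    if not isinstance(weight, torch.Tensor):
        return False
    # Direct case: the weight is stored in an FP8 dtype.
    if weight.dtype in _FP8_DTYPES:
        return True
    # Some implementations keep a packed weight plus a separate dequantization
    # scale; treat the presence of such a scale as FP8 evidence (assumed names).
    return any(hasattr(module, attr) for attr in ("weight_scale", "scale"))

# Usage sketch: names of layers the requantization step could treat as FP8.
# fp8_layers = [name for name, mod in model.named_modules() if is_fp8_layer(mod)]
```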
FP8 layers were not detected by get_fp_layer_names, causing ignore_layers
to be ignored. This PR:
Signed-off-by: Adithyan Madhu <adithyanworkmail@gmail.com>