Description
Hello, I have been working on https://github.com/qskousen/ggufy, a tool to aid in quantization, and I've been using this node pack to test it in ComfyUI. I took some inspiration for the tool from your quantization tools. Thank you for the work you've done here.
While working on the Lumina2 architecture, I've noticed that BF16, although generally supported by this node pack, does not work for two specific layers: `x_pad_token` and `cap_pad_token`. As a workaround, I am currently forcing these to upcast to F32. If I leave them in BF16, I get this error for both layers:
```
While copying the parameter named "x_pad_token", whose dimensions in the model are torch.Size([3840]) and whose dimensions in the checkpoint are torch.Size([7680]), an exception occurred: ('The size of tensor a (3840) must match the size of tensor b (7680) at non-singleton dimension 0',)
```
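For reference, the workaround I'm using amounts to a per-tensor dtype override during conversion. This is a minimal sketch of the idea, not ggufy's actual code; the function name and the override set are mine, and the two layer names come from the error above:

```python
import torch

# Hypothetical override list: tensors that must stay F32 even when the
# rest of the model is converted to BF16 (names taken from the Lumina2
# error messages above).
F32_OVERRIDES = {"x_pad_token", "cap_pad_token"}


def convert_tensor(name: str, tensor: torch.Tensor) -> torch.Tensor:
    """Cast a tensor to BF16 for storage, except for the problem layers,
    which are upcast/kept at F32 as a workaround."""
    if name in F32_OVERRIDES:
        return tensor.to(torch.float32)
    return tensor.to(torch.bfloat16)


# Example: the pad token ends up F32, an ordinary weight ends up BF16.
pad = convert_tensor("x_pad_token", torch.zeros(3840, dtype=torch.bfloat16))
w = convert_tensor("some.other.weight", torch.zeros(3840, dtype=torch.float32))
```

With this in place the converted file loads cleanly; only the two listed tensors pay the 2x storage cost of F32.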
The workaround works, but I am curious why these layers specifically do not support BF16 while other layers do. I don't know a lot about how stable diffusion itself works, and I am not sure how these layers are used during inference. I have noticed that other GGUF node packs don't support BF16 in GGUF at all.