Comfy INT8 Acceleration

This node speeds up Flux2, Chroma, Z-Image, and Ernie Image in ComfyUI using INT8 quantization, delivering roughly 1.5x to 2x faster inference on my 3090, depending on the model. It should work on any NVIDIA GPU with enough INT8 TOPS. It also appears to be faster than FP8 on 40-Series and newer GPUs. Works with lora and torch compile.

Updates:

2026-05-15:

Bringing back stochastic lora. Some loras appear to need it and others don't, so try it if your lora is not working and you don't like Pre-Lora. TL;DR: "sometimes it really helps, sometimes it's a little worse". See our measurements here.

Attempted to reduce RAM usage.

Fixed an issue with Pre-Lora crashing on Windows.
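The README does not spell out what "stochastic" means for the lora path. One common reading, and it is only an assumption here, is stochastic rounding: when a lora update smaller than one quantization step is added to a weight that already sits on the INT8 grid, round-to-nearest discards it, while stochastic rounding keeps it on average. A minimal sketch of that idea (the function names are illustrative, not this node's API):

```python
import math
import random

def round_nearest(x: float) -> int:
    """Deterministic round-to-nearest (ties go up)."""
    return math.floor(x + 0.5)

def round_stochastic(x: float, rng: random.Random) -> int:
    """Round up with probability equal to the fractional part of x."""
    low = math.floor(x)
    return low + (1 if rng.random() < x - low else 0)

rng = random.Random(0)
base = 5       # a weight already on the integer grid (hypothetical value)
delta = 0.1    # a lora update much smaller than one quantization step

# Round-to-nearest: the update is lost completely.
nearest = round_nearest(base + delta) - base

# Stochastic rounding: individual results are 0 or 1, but the mean tracks delta.
samples = [round_stochastic(base + delta, rng) - base for _ in range(10_000)]
stochastic_mean = sum(samples) / len(samples)

print("nearest:", nearest)
print("stochastic mean:", stochastic_mean)
```

The trade-off is noise: each individual weight is off by more than with round-to-nearest, which would be consistent with "sometimes it really helps, sometimes it's a little worse".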

2026-05-10:

Overhauled the entire lora system. Normal lora loader node works now, no need for specialized lora loaders.

Converted QuaRot to ConvRot, which is a small but free quality gain.

Added a Pre-Lora node, which you can connect to the INT8 Model loader to merge loras into the weights before on-the-fly quantization.

For more info on the quality of ConvRot and the different lora approaches, see the Metrics.
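To make the Pre-Lora idea concrete, here is a minimal pure-Python sketch of "merge first, quantize after". It uses the standard lora merge formula W' = W + (alpha / rank) * up @ down followed by symmetric row-wise INT8 quantization. The function names, matrix values, and alpha are made up for illustration and are not taken from this node's code.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(weight, lora_down, lora_up, alpha):
    """Return weight + (alpha / rank) * lora_up @ lora_down."""
    rank = len(lora_down)
    scale = alpha / rank
    delta = matmul(lora_up, lora_down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def quantize_rows(weight):
    """Symmetric INT8 quantization with one scale per output row."""
    quantized, scales = [], []
    for row in weight:
        scale = max(abs(v) for v in row) / 127 or 1.0  # guard all-zero rows
        quantized.append([round(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales

# Hypothetical 2x3 layer with a rank-1 lora.
weight = [[1.0, -2.0, 0.5], [0.25, 0.0, -0.75]]
lora_down = [[1.0, 0.0, -1.0]]   # rank x in_features
lora_up = [[0.5], [-0.5]]        # out_features x rank

merged = merge_lora(weight, lora_down, lora_up, alpha=1.0)
q, scales = quantize_rows(merged)
print(merged)
print(q)
```

Because the scales are computed after the merge, the lora contribution is quantized together with the base weights instead of being added on top of an already-rounded matrix.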

Pre-quantized checkpoints used to be recommended for most architectures, but on-the-fly quantization with ConvRot now gives better quality in all cases. ConvRot is also a little slower, though, so these pre-quantized models are still useful. Avoid using INT8 tensorwise models.
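For intuition on why a rotation can help quality at a small speed cost: rotation-based schemes multiply the weights by an orthogonal matrix (typically a Hadamard matrix) before quantizing, which spreads a single outlier across the whole vector so the shared scale is used more evenly. The sketch below is a toy 4-element illustration of that general idea with made-up values, not this node's ConvRot kernels.

```python
# Normalized 4x4 Hadamard matrix: orthogonal and symmetric, so H @ H == I.
H = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]

def rotate(vec):
    """Multiply a length-4 vector by H."""
    return [sum(h * v for h, v in zip(row, vec)) for row in H]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fake_quant(vec):
    """Quantize to INT8 with one shared scale, then dequantize."""
    scale = max(abs(v) for v in vec) / 127
    return [round(v / scale) * scale for v in vec]

def rms_error(a, b):
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

w = [8.0, 0.05, -0.03, 0.04]   # one outlier dominates the scale
x = [0.3, -1.2, 0.7, 2.0]

# Rotating both operands leaves the exact result unchanged.
exact = dot(w, x)
rotated_exact = dot(rotate(w), rotate(x))

# Quantize directly, or quantize in the rotated basis and rotate back.
direct_err = rms_error(fake_quant(w), w)
rotated_err = rms_error(rotate(fake_quant(rotate(w))), w)

print("direct error: ", direct_err)
print("rotated error:", rotated_err)
```

The extra rotation applied to the activations at inference time is the kind of work that would explain a rotated path being a little slower than plain row-wise INT8.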

Shoutout to vistralis for these:

Model	Link
FLUX.2-klein-base-9b	Download
FLUX.2-klein-base-4b	Download
FLUX.2-klein-9b	Download
FLUX.2-klein-4b	Download

My own:

Model	Link
Chroma1-HD²	~~Download~~
Z-Image-Base¹	~~Download~~
Z-Image-Turbo²	~~Download~~
Anima	Download

¹Z-Image Base weights have been deprecated in favor of ConvRot on-the-fly (OTF) quantization, which is higher quality.

²Tensorwise models are worse than on-the-fly quantization since we switched to row-wise INT8.
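A small sketch of why tensorwise INT8 tends to lose to row-wise INT8: with a single scale for the whole matrix, rows with small weights get crushed by the largest value anywhere in the tensor. The matrix below is made up for illustration.

```python
def fake_quant(values, scale):
    """Round to the INT8 grid defined by `scale`, then map back to floats."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def max_error(approx, exact):
    return max(abs(a - e) for a, e in zip(approx, exact))

weight = [
    [12.0, -7.5, 3.1],     # a row with large weights
    [0.01, -0.04, 0.03],   # a row with tiny weights
]

# Tensor-wise: a single scale shared by every element of the matrix.
tensor_scale = max(abs(v) for row in weight for v in row) / 127
tensorwise = [fake_quant(row, tensor_scale) for row in weight]

# Row-wise: each output row gets its own scale.
rowwise = [fake_quant(row, max(abs(v) for v in row) / 127) for row in weight]

for name, result in (("tensor-wise", tensorwise), ("row-wise", rowwise)):
    print(name, [max_error(q, w) for q, w in zip(result, weight)])
```

With the shared scale, the second row rounds to all zeros; with per-row scales it keeps its full 8 bits of resolution.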

Speed:

Measured on a 3090 at 1024x1024, 26 steps with Flux2 Klein Base 9B.

Format	Speed (s/it) ↓	Relative Speedup
bf16	2.07	1.00×
bf16 compile	2.24	0.92×
fp8	2.06	1.00×
int8	1.64	1.26×
int8 compile ★	1.04	1.99×
gguf8_0 compile	2.03	1.02×
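The "Relative Speedup" column is the bf16 time per iteration divided by each format's time per iteration. A short check using the numbers from the table above:

```python
# Seconds per iteration from the 3090 / Flux2 Klein Base 9B table.
timings = {
    "bf16": 2.07,
    "bf16 compile": 2.24,
    "fp8": 2.06,
    "int8": 1.64,
    "int8 compile": 1.04,
    "gguf8_0 compile": 2.03,
}

baseline = timings["bf16"]
# Lower s/it is faster, so speedup = baseline / time.
speedups = {name: round(baseline / t, 2) for name, t in timings.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```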

3090, Qwen Image 2512.

Format	Speed (s/it) ↓
Nunchaku INT4 Best Quality	1.21
Nunchaku INT4 with R128 Lora	1.36
INT8 ConvRot compile	1.26
INT8 Row compile ★	1.18
INT8 R128 Lora	No slowdown, except if dynamic.

I would also like to point out that we beat Nunchaku INT4 on every quality measurement in the Quality Metrics.

Additionally, the quality of loras applied with the Nunchaku lora node appears to be degraded.

Klein 9B, measured on an 8 GB RTX 5060 with the same settings as the 3090 run:

Format	Speed (s/it) ↓	Relative Speedup
fp8	3.04	1.00×
fp8 fast	3.00	1.01×
fp8 compile	couldn't get to work	??×
int8	2.53	1.20×
int8 compile ★	2.25	1.35×

8gb RTX 5060, Anima, Comfy version from 2026-05-02, Pytorch 2.11+CU13.0, latest kitchen triton and everything else

Format	Speed (it/s) ↑
bf16	0.78
INT8 ConvRot	1.12
INT8 Row	1.24
INT8 ConvRot Compile	1.47
MXFP8	0.89
MXFP8 --fast	0.93
MXFP8 + Compile	Still failing.

Finally got compile with --fast working with MXFP8, on PyTorch 2.13.0.dev20260511+cu132 and the same RTX 5060 as before.

Quality results for this run can be found here: Anima Results

Format	Speed (it/s) ↑
MXFP8 --fast + Compile	1.37
INT8 ConvRot + Compile	1.47
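Note that the Anima tables report iterations per second (higher is better), while the Klein and Qwen tables report seconds per iteration (lower is better). The two are reciprocals, so they can be compared with a quick conversion:

```python
# Iterations per second from the table above (RTX 5060, Anima).
its_per_second = {
    "MXFP8 --fast + Compile": 1.37,
    "INT8 ConvRot + Compile": 1.47,
}

# s/it is simply 1 / (it/s).
seconds_per_it = {name: round(1 / v, 2) for name, v in its_per_second.items()}

# With it/s, the speedup is the faster rate divided by the slower rate.
int8_vs_mxfp8 = round(
    its_per_second["INT8 ConvRot + Compile"]
    / its_per_second["MXFP8 --fast + Compile"],
    2,
)

print(seconds_per_it)
print(f"INT8 ConvRot vs MXFP8: {int8_vs_mxfp8:.2f}x")
```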

Requirements:

Working ComfyKitchen (needs latest comfy and possibly pytorch with cu130)

Triton

Windows untested, but I hear triton-windows exists.
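A quick way to see whether the Python-side requirements are importable in the environment ComfyUI runs in is to probe for the modules without importing them. This is a small stdlib-only sketch; the helper name is made up, and ComfyKitchen is left out because its import name is not stated here.

```python
import importlib.util

def check_requirements(modules=("torch", "triton")):
    """Report which of the given top-level modules can be found."""
    return {name: importlib.util.find_spec(name) is not None
            for name in modules}

status = check_requirements()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it with the same Python interpreter that launches ComfyUI, otherwise the result may reflect a different environment.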

Credits:

dxqb for the entirety of the INT8 code during the very early versions of this node; it would have been impossible without them:

Nerogar/OneTrainer#1034

If you have a 30-Series GPU, OneTrainer is also the fastest current lora trainer thanks to this. Please go check them out!!

newgrit1004 for the base ConvRot code we modified into proper ConvRot

https://github.com/newgrit1004/ComfyUI-ZImage-Triton

silveroxides for providing a base to hack the INT8 conversion code onto.

https://github.com/silveroxides/convert_to_quant

Also silveroxides for showing how to properly register new data types to comfy

https://github.com/silveroxides/ComfyUI-QuantOps

The unholy trinity of AI slopsters I used to glue all this together over the course of multiple months now
