6 changes: 5 additions & 1 deletion ggml/include/ggml.h
@@ -437,7 +437,9 @@ extern "C" {
GGML_TYPE_TQ3_1S = 47, // TurboQuant 3-bit weight: WHT-rotated 8-level Lloyd-Max, block_size=32
GGML_TYPE_TQ4_1S = 48, // TurboQuant 4-bit weight: WHT-rotated 16-level Lloyd-Max, block_size=32
GGML_TYPE_Q6_0 = 49,
-GGML_TYPE_COUNT = 50,
+GGML_TYPE_Q4_0_ROCMFP4 = 50, // ROCmFP4 experimental UE4M3 scales + packed AMD FP4 blocks
+GGML_TYPE_Q4_0_ROCMFP4_FAST = 51, // ROCmFP4 single-scale speed layout
+GGML_TYPE_COUNT = 52,
};

// precision
@@ -481,6 +483,8 @@ extern "C" {
GGML_FTYPE_MOSTLY_MXFP4 = 25, // except 1d tensors
GGML_FTYPE_MOSTLY_NVFP4 = 26, // except 1d tensors
GGML_FTYPE_MOSTLY_Q1_0 = 27, // except 1d tensors
+GGML_FTYPE_MOSTLY_Q4_0_ROCMFP4 = 45, // except 1d tensors
+GGML_FTYPE_MOSTLY_Q4_0_ROCMFP4_FAST = 46, // ROCmFP4 single-scale speed layout
};

// available tensor operations:
48 changes: 48 additions & 0 deletions ggml/rocmfp4/README.md
@@ -0,0 +1,48 @@
# ROCmFP4

ROCmFP4 adds two experimental 4-bit GGUF tensor layouts for compact,
AMD-oriented inference:

- `Q4_0_ROCMFP4`: 32 weights per block, packed 4-bit values, and two finite
unsigned E4M3 scale bytes, one scale per 16 weights. The block size is
18 bytes, or 4.50 bits per weight.
- `Q4_0_ROCMFP4_FAST`: 32 weights per block, packed 4-bit values, and one
finite unsigned E4M3 scale byte for the full block. The block size is
17 bytes, or 4.25 bits per weight.
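
The block sizes above follow from 16 bytes of packed 4-bit values plus the scale bytes. The following is a minimal sketch of that arithmetic, not the actual definitions from this change: the struct names, the helper name, and the field order (scales first) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLOCK_WEIGHTS 32 /* weights per block, as stated above */

/* Hypothetical layout for Q4_0_ROCMFP4: one scale byte per 16 weights. */
typedef struct {
    uint8_t scales[2];                    /* two unsigned E4M3 scale bytes */
    uint8_t qs[SKETCH_BLOCK_WEIGHTS / 2]; /* 32 packed 4-bit values        */
} sketch_block_rocmfp4;

/* Hypothetical layout for Q4_0_ROCMFP4_FAST: one scale byte per block. */
typedef struct {
    uint8_t scale;                        /* one unsigned E4M3 scale byte  */
    uint8_t qs[SKETCH_BLOCK_WEIGHTS / 2]; /* 32 packed 4-bit values        */
} sketch_block_rocmfp4_fast;

/* Both structs hold only bytes, so no padding is expected. */
static_assert(sizeof(sketch_block_rocmfp4) == 18, "2 scale bytes + 16 packed bytes");
static_assert(sizeof(sketch_block_rocmfp4_fast) == 17, "1 scale byte + 16 packed bytes");

/* Storage cost in bits per weight for a block of the given byte size. */
static double sketch_bits_per_weight(size_t block_bytes) {
    return 8.0 * (double) block_bytes / SKETCH_BLOCK_WEIGHTS;
}
```

With these definitions, `sketch_bits_per_weight(18)` gives 4.50 and `sketch_bits_per_weight(17)` gives 4.25, matching the figures in the list.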

The 4-bit values use a small signed codebook with levels up to `5.0` after
the decoded scale is applied. Quantization searches all finite E4M3 scale
candidates and keeps the lowest-error assignment. Invalid scale bytes are
rejected during row validation so malformed tensors fail early.
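
The exhaustive scale search described above can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the codebook values are invented placeholders (only the maximum of `5.0` comes from this README), the scale byte is assumed to use the common E4M3 bit layout (4 exponent bits with bias 7, 3 mantissa bits) with the top bit required to be zero and `0x7F` treated as non-finite, and the error metric is plain unweighted squared error. All `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: placeholder magnitudes; the real codebook is not given here. */
static const float sketch_levels[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 5.0f};

/* ASSUMPTION: a scale byte is finite if the top bit is clear and it is not 0x7F. */
static bool sketch_scale_is_finite(uint8_t b) {
    return (b & 0x80) == 0 && b != 0x7F;
}

/* Decode an unsigned E4M3 byte (assumed bias 7, subnormals when exponent is 0). */
static float sketch_decode_ue4m3(uint8_t b) {
    int exp = (b >> 3) & 0x0F;
    int man = b & 0x07;
    if (exp == 0) {
        return ldexpf((float) man, -9); /* (man / 8) * 2^-6 */
    }
    return ldexpf(1.0f + (float) man / 8.0f, exp - 7);
}

/* Value of a 4-bit code: bit 3 is the sign, bits 0..2 index the magnitude. */
static float sketch_code_value(uint8_t code, float scale) {
    float mag = sketch_levels[code & 7] * scale;
    return (code & 8) ? -mag : mag;
}

/* Assign each weight its nearest code for a fixed scale; return squared error. */
static float sketch_quantize_group(const float *x, int n, float scale, uint8_t *codes) {
    float total = 0.0f;
    for (int i = 0; i < n; ++i) {
        float best = INFINITY;
        codes[i] = 0;
        for (uint8_t c = 0; c < 16; ++c) {
            float d = x[i] - sketch_code_value(c, scale);
            if (d * d < best) {
                best = d * d;
                codes[i] = c;
            }
        }
        total += best;
    }
    return total;
}

/* Try every finite scale byte and keep the lowest-error assignment. */
static uint8_t sketch_search_scale(const float *x, int n, uint8_t *codes, float *err_out) {
    assert(n > 0);
    uint8_t best_byte = 0;
    float best_err = INFINITY;
    for (int b = 0; b < 256; ++b) {
        if (!sketch_scale_is_finite((uint8_t) b)) {
            continue;
        }
        float err = sketch_quantize_group(x, n, sketch_decode_ue4m3((uint8_t) b), codes);
        if (err < best_err) {
            best_err = err;
            best_byte = (uint8_t) b;
        }
    }
    /* Recompute the codes for the winning scale, since the loop overwrote them. */
    *err_out = sketch_quantize_group(x, n, sketch_decode_ue4m3(best_byte), codes);
    return best_byte;
}
```

Under these assumptions the search visits 127 candidate bytes per group, each costing 16 comparisons per weight, which is affordable for an offline quantizer. An importance-matrix-aware variant would presumably weight each squared error term, which this sketch does not do.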

This directory contains the format-specific CPU reference helpers. This initial
change adds the GGUF tensor types together with the CPU reference
quantization, dequantization, and row validation needed to create and inspect
ROCmFP4 files from the normal `llama-quantize` workflow.
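
To illustrate how row validation and dequantization fit together, here is a minimal sketch for the single-scale layout. It is not the reference code from this change. The assumptions are the struct and function names, the placeholder codebook (only the maximum of `5.0` is taken from this README), the E4M3 decoding rule (bias 7, top bit clear, `0x7F` rejected), and the nibble order (low nibble of byte `j` holds weight `j`, high nibble holds weight `j + 16`, as in other ggml 4-bit layouts).

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-scale block: 1 scale byte + 16 packed bytes = 17 bytes. */
typedef struct {
    uint8_t scale;
    uint8_t qs[16];
} sketch_block_fast;

/* ASSUMPTION: placeholder magnitudes; the real codebook is not given here. */
static const float sketch_levels[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 5.0f};

/* ASSUMPTION: valid scale bytes have the top bit clear and are not 0x7F. */
static bool sketch_scale_is_finite(uint8_t b) {
    return (b & 0x80) == 0 && b != 0x7F;
}

/* Decode an unsigned E4M3 byte (assumed bias 7, subnormals when exponent is 0). */
static float sketch_decode_ue4m3(uint8_t b) {
    int exp = (b >> 3) & 0x0F;
    int man = b & 0x07;
    return exp == 0 ? ldexpf((float) man, -9)
                    : ldexpf(1.0f + (float) man / 8.0f, exp - 7);
}

/* Reject a row as soon as any block carries a non-finite scale byte. */
static bool sketch_validate_row(const sketch_block_fast *blocks, size_t nblocks) {
    for (size_t i = 0; i < nblocks; ++i) {
        if (!sketch_scale_is_finite(blocks[i].scale)) {
            return false;
        }
    }
    return true;
}

/* Expand nblocks blocks into 32 * nblocks floats; returns false on a bad row. */
static bool sketch_dequantize_row(const sketch_block_fast *blocks, size_t nblocks, float *out) {
    assert(blocks != NULL && out != NULL);
    if (!sketch_validate_row(blocks, nblocks)) {
        return false; /* fail early, before writing any output */
    }
    for (size_t i = 0; i < nblocks; ++i) {
        float scale = sketch_decode_ue4m3(blocks[i].scale);
        for (int j = 0; j < 16; ++j) {
            uint8_t lo = blocks[i].qs[j] & 0x0F; /* weight j      */
            uint8_t hi = blocks[i].qs[j] >> 4;   /* weight j + 16 */
            float vlo = sketch_levels[lo & 7] * scale;
            float vhi = sketch_levels[hi & 7] * scale;
            out[i * 32 + j]      = (lo & 8) ? -vlo : vlo;
            out[i * 32 + j + 16] = (hi & 8) ? -vhi : vhi;
        }
    }
    return true;
}
```

Validating the whole row before decoding any block keeps the output buffer untouched when a tensor is malformed, which is one way to get the fail-early behavior described above.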

The feature is additive. Existing tensor types, file types, and backend
dispatch paths are unchanged unless a tensor is explicitly stored as
`Q4_0_ROCMFP4` or `Q4_0_ROCMFP4_FAST`.

Accelerated ROCm/HIP and Vulkan execution paths can be added in follow-up
changes after the tensor formats and CPU reference behavior are reviewed.

Example quantization:

```sh
./llama-quantize model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4
./llama-quantize model-f16.gguf model-rocmfp4-fast.gguf Q4_0_ROCMFP4_FAST
```

Importance matrices use the existing quantize interface:

```sh
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4
```

Advanced mixed recipes can be expressed with the existing
`--tensor-type` and `--tensor-type-file` options instead of adding extra
public file-type presets.