290 changes: 290 additions & 0 deletions include/ck/library/utility/README.md
# CK Library Utility

This directory contains utility headers for testing, benchmarking, and validating Composable Kernel (CK) operations. The utilities support both modern GPU-first validation for high-performance testing and legacy CPU-based approaches for backward compatibility.

## Quick Start
> **Reviewer note (Collaborator):** This section seems to summarize what our good practices are, key principles, or validation guidelines rather than initial setup steps.
>
> Suggested change: `## Quick Start` → `## Recommended Practices`


1. **Use GPU validation** for all new tests (10-100x faster than CPU validation)
2. **Let the system compute tolerances** automatically based on data types
3. **Only transfer error statistics**, not full tensors

## File-to-Purpose Quick Reference
> **Reviewer note (Collaborator):** A small thing, but the table is organized from left-to-right as purpose-to-file. The current section name is the reverse.
>
> Suggested change: `## File-to-Purpose Quick Reference` → `## Purpose-to-Utility Quick Reference`


| Need to... | Use this file | Key function/class |
|-------------------------------------|-----------------------------------|---------------------------|
| Validate on GPU (recommended) | `gpu_verification.hpp` | `gpu_verify()` |
| Validate on CPU (legacy/debugging) | `check_err.hpp` | `check_err()` |
| Compute tolerances automatically | `check_err.hpp` | `get_relative_threshold<>()` |
| Allocate GPU memory | `device_memory.hpp` | `DeviceMem` |
| Create CPU tensors | `host_tensor.hpp` | `Tensor<T>` |
| Generate test data on GPU | `device_tensor_generator.hpp` | `FillUniformRandFp()` |
| Generate test data on CPU (legacy) | `host_tensor_generator.hpp` | `GeneratorTensor_*` |
| Set up convolution parameters | `convolution_parameter.hpp` | `ConvParam` |
| Create tensor descriptors | `host_tensor.hpp` | `HostTensorDescriptor` |

## Core Validation Tools

### GPU Validation (Recommended)

**`gpu_verification.hpp`** - Complete on-device verification

- `gpu_verify()`: Compares device tensors entirely on GPU
- Automatic tolerance computation based on data types
- Only transfers error statistics (~12 bytes), not tensors
- Detailed error reporting (count, max error, percentage)
- Supports all CK data types (fp32, fp16, bf16, fp8, int8, etc.)
- `gpu_reduce_max()`: Computes max(abs(tensor)) on GPU for tolerance scaling
- Grid-stride kernels with LDS reduction for optimal performance

> **Reviewer note (Contributor, on the "(~12 bytes)" bullet):** We can remove "(~12 bytes)".

**Performance**: 10-100x faster than CPU validation for large tensors.

> **Reviewer note (Contributor):** Maybe remove the 10-100x number?


**Example usage:**

```cpp
// Explicit tolerance
bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size);

// Automatic tolerance for mixed precision
bool pass = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size);
```

> **Reviewer note (Collaborator, on the example above):** Is it worth mentioning when to use an explicit vs. automatic tolerance?
>
> **Reviewer note (Contributor):** `K_dim` should be changed to `accumulation_count` or similar; it's not necessarily equal to the K dimension. On a slightly separate note, this does not currently support split-k, which requires accounting for accumulation in multiple data types. See issue #3673.

**See:** `test/gpu_verification/test_gpu_verification.cpp`
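
The bullet list above mentions grid-stride kernels with an LDS (shared-memory) reduction. The following is only a schematic sketch of that pattern, not the `gpu_verification.hpp` implementation; the kernel name, block size, and the separate second pass to combine per-block results are all illustrative assumptions.

```cpp
#include <hip/hip_runtime.h>

// Schematic sketch of a grid-stride max(abs(x)) reduction with an LDS tree reduction.
// Each block writes one partial result; a second pass (or atomics) combines them.
__global__ void max_abs_partial(const float* data, size_t n, float* block_results)
{
    __shared__ float lds[256]; // assumes a launch with 256 threads per block

    // Grid-stride loop: each thread visits multiple elements.
    float local_max = 0.0f;
    for(size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x; i < n;
        i += static_cast<size_t>(gridDim.x) * blockDim.x)
    {
        local_max = fmaxf(local_max, fabsf(data[i]));
    }

    // Block-level tree reduction in LDS (shared memory).
    lds[threadIdx.x] = local_max;
    __syncthreads();
    for(unsigned s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if(threadIdx.x < s)
            lds[threadIdx.x] = fmaxf(lds[threadIdx.x], lds[threadIdx.x + s]);
        __syncthreads();
    }

    if(threadIdx.x == 0)
        block_results[blockIdx.x] = lds[0];
}
```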

### Tolerance Computation

**`check_err.hpp`** - Automatic tolerance calculation

- `get_relative_threshold<ComputeType, OutType, AccType>()`: Computes relative tolerance from mantissa bits
- `get_absolute_threshold<ComputeType, OutType, AccType>()`: Computes absolute tolerance scaled by magnitude
- Type-specific overloads for all CK data types
- Accumulation-aware error bounds

> **Reviewer note (Collaborator, on the two threshold functions above):** If it is helpful, a short example for each of these calls could help users see its usage.
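
Following up on the note above, here is a minimal usage sketch. It assumes the threshold helpers take the number of accumulations (and, for the absolute threshold, an estimate of the largest value involved); check `check_err.hpp` for the exact signatures.

```cpp
// Hedged sketch: FP16 compute and output, FP32 accumulator, K accumulations per output element.
// Exact parameter lists may differ; see check_err.hpp.
const int K = 4096; // accumulation count (e.g. the GEMM K dimension)

const double rtol = ck::utils::get_relative_threshold<ck::half_t, ck::half_t, float>(K);
const double atol = ck::utils::get_absolute_threshold<ck::half_t, ck::half_t, float>(
    /*max_possible_value=*/1.0, K);

// out_host / ref_host are host-side Tensor<> objects holding the result and the reference.
bool pass = ck::utils::check_err(out_host, ref_host, "Error: incorrect results!", rtol, atol);
```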

**Theory**: Based on IEEE 754 floating-point arithmetic and error propagation analysis.

### Legacy CPU Validation

**`check_err.hpp`** - CPU-based error checking (legacy)

- Overloaded `check_err()` functions for different data types
- Type-aware default tolerances
- Detailed error reporting (first 5 mismatches, statistics)

**Note**: Requires full tensor transfer to CPU - slow for large tensors. Use `gpu_verification.hpp` for new tests.

**See:** `test/convnd_fwd/convnd_fwd_naive.cpp` for legacy CPU validation patterns

## Numerical Validation Strategy

**TL;DR:** CK computes tolerances from IEEE 754 precision limits, not arbitrary values. FP32 gets ~1e-5 relative tolerance, FP16 gets ~1e-3, etc. The system accounts for accumulation effects in matrix operations.

CK implements a **theoretically grounded approach to numerical validation** that goes beyond simple fixed tolerances. The validation system is designed around three core principles:

### 1. Type-Aware Tolerance Computation

Rather than using arbitrary threshold values, CK computes tolerances based on the data types involved:

- **Relative tolerance**: Derived from mantissa bits as `2^(-mantissa_bits) * 0.5`
- **Absolute tolerance**: Scaled by value magnitude as `2^(exponent - mantissa_bits) * 0.5`
- **Multi-type analysis**: Considers compute type, output type, and accumulator type separately
- **Conservative bounds**: Takes maximum error across all data paths
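
As a concrete illustration of the relative-tolerance formula above (before any accumulation-based widening is applied):

```text
FP16 has 10 mantissa bits:  rtol = 2^(-10) * 0.5 = 1/2048 ≈ 4.9e-4
```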

### 2. Algorithm-Aware Validation

Different algorithms have different error characteristics:

- **Accumulation effects**: Matrix operations (GEMM, convolution) accumulate errors proportional to the number of operations
- **Precision cascades**: Mixed-precision operations require careful tolerance selection based on the weakest link
- **Operation-specific bounds**: Tolerances scale with problem size (e.g., K dimension in GEMM)

The validation system accepts `number_of_accumulations` to adjust tolerances for algorithmic context.

### 3. Data Type Characteristics

Each data type has inherent precision limits that inform validation:

| Data Type | Mantissa Bits | Typical rtol | Typical atol |
|-----------|---------------|--------------|--------------|
| FP32 | 23 | 1e-5 | 3e-6 |
| TF32 | 10 | 5e-4 | 5e-4 |
| FP16 | 10 | 1e-3 | 1e-3 |
| BF16 | 7 | 1e-1 | 1e-3 |
| FP8 | 3-4 | 1e-3 | 1e-3 |
| BF8 | 2-3 | 1e-3 | 1e-3 |
| FP4 | 2 | 0.5 | 0.5 |
| INT8/INT32| N/A | 0 | 0 |

> **Reviewer note (Copilot AI, on the BF16 row):** The relative tolerance for BF16 (1e-1 or 0.1) appears unusually high compared to other data types. This suggests 10% relative error is acceptable, which seems inconsistent with typical numerical validation standards. Verify this value is correct or clarify if this is for specific use cases. Suggested change: BF16 rtol `1e-1` → `1e-2`.
>
> **Reviewer note (Contributor, on the BF8 row):** Rtol for BF8 lower than BF16?

## GPU-First Validation Philosophy

Modern CK testing emphasizes **pure GPU validation** to eliminate performance bottlenecks:

### Traditional CPU-Based Approach (Legacy)

```text
GPU Kernel → Transfer to CPU → CPU Verification
             ↑ BOTTLENECK: PCIe transfer of entire tensor
```

- **Problem**: Transferring multi-GB tensors over PCIe is 10-100x slower than computation
- **Impact**: Test suites become I/O bound rather than compute bound
- **Limitation**: Cannot efficiently test large-scale problems

> **Reviewer note (Contributor, on the PCIe-transfer bullet):** can remove the 10-100x

### Modern GPU-First Approach (Recommended)

```text
GPU Kernel → GPU Reference → GPU Verification → Transfer scalars only
                                                 ↑ Only ~12 bytes transferred
```

- **Advantage**: All data stays on GPU, only error statistics transfer to CPU
- **Performance**: 10-100x faster for large tensors
- **Scalability**: Enables testing of multi-GB tensors efficiently
- **Completeness**: Detailed error reporting (count, max error, percentage) without full transfer

> **Reviewer note (Contributor, on the "10-100x faster" bullet):** Can remove or rephrase.

### When to Use Each Approach

**Use GPU-First Validation When:**

- Testing production kernels (performance matters)
- Working with large tensors (>1MB)
- Running extensive test suites
- Validating at scale

**Use CPU-Based Validation When:**

- Debugging specific values (need to inspect individual elements)
- Working with tiny tensors (<1KB)
- Maintaining backward compatibility
- Implementing CPU reference algorithms

## Testing Workflow Comparison

### Modern GPU-First Workflow (Recommended)

```cpp
// 1. Allocate device memory only
DeviceMem input_dev(size), output_dev(size), reference_dev(size);

// 2. Initialize on GPU (no CPU involvement)
input_dev.FillUniformRandFp<float>(-1.0f, 1.0f);

// 3. Run kernel under test
run_kernel(input_dev, output_dev, params);

// 4. Run reference on GPU
run_reference_kernel(input_dev, reference_dev, params);

// 5. Verify on GPU (only transfers ~12 bytes of error stats)
bool pass = gpu_verify<float>(output_dev, reference_dev, rtol, atol, size);
if (!pass) {
    std::cout << "Validation failed!" << std::endl;
    return false;
}
```

**Key advantage**: Zero tensor transfers - all data stays on GPU.

### Legacy CPU-Based Workflow

```cpp
// 1. Create host tensors (allocates CPU memory)
Tensor<float> input_host(dims), output_host(dims), reference_host(dims);

// 2. Generate on CPU
input_host.GenerateTensorValue(GeneratorTensor_3<float>{-1.0f, 1.0f});

// 3. Allocate device memory
DeviceMem input_dev(size), output_dev(size);

// 4. Transfer to device (slow for large tensors)
input_dev.ToDevice(input_host.data());

// 5. Run kernel
run_kernel(input_dev, output_dev, params);

// 6. Transfer back to CPU (slow for large tensors)
output_dev.FromDevice(output_host.data());

// 7. Compute reference on CPU
compute_reference(input_host, reference_host, params);

// 8. Verify on CPU
bool pass = check_err(output_host, reference_host, "Test failed");
```

**Bottleneck**: Steps 4 and 6 transfer entire tensors over PCIe.

## Supporting Utilities

### Tensor Management

- **`host_tensor.hpp`**: CPU-side tensor container with multi-dimensional support
- `HostTensorDescriptor`: Dimension, stride, and layout management
- `Tensor<T>`: Host tensor with generation and conversion utilities
- **`device_memory.hpp`**: GPU memory management with RAII semantics
- `DeviceMem`: Device allocation, transfer, and initialization
- Device-side random value generation
- `SetZero()`: Zero-initialize device memory (required for backward passes)
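
A minimal sketch of how these pieces fit together, following the pattern of the workflow examples above (member names such as `mDesc` and `GetElementSpaceSize()` are the usual CK ones, but double-check them against `host_tensor.hpp` and `device_memory.hpp`):

```cpp
// Hedged sketch: create a host tensor, then mirror it in device memory.
std::size_t M = 1024, K = 512;

Tensor<float> a_host(HostTensorDescriptor({M, K}));                  // lengths only -> packed strides
a_host.GenerateTensorValue(GeneratorTensor_3<float>{-1.0f, 1.0f});   // legacy CPU-side init

DeviceMem a_dev(sizeof(float) * a_host.mDesc.GetElementSpaceSize()); // RAII GPU allocation
a_dev.ToDevice(a_host.data());                                       // host -> device copy
// a_dev.SetZero();                                                  // alternative: zero-initialize
```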

### Data Generation

- **`device_tensor_generator.hpp`**: GPU-side tensor initialization (recommended)
- `FillUniformRandFp<T>()`: Fill with uniform random floating-point values
- `FillUniformRandInt<T>()`: Fill with uniform random integer values
- **`host_tensor_generator.hpp`**: CPU-side functor-based generators (legacy)
- Various patterns: zero, constant, random, sequential, diagonal, checkerboard
- **`fill.hpp`**: STL-style fill functors for containers

### Convolution Utilities

- **`convolution_parameter.hpp`**: Convolution parameter management
- `ConvParam`: Encapsulates dimensions, strides, padding, dilations
- Output dimension calculation and FLOP estimation
- **`convolution_host_tensor_descriptor_helper.hpp`**: Tensor descriptor creation helpers
- **`conv_common.hpp`**: Common convolution utilities

**See:** `test/convnd_fwd/convnd_fwd_naive.cpp` for convolution parameter usage
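
As a rough sketch only (the constructor argument order below is an assumption; check `convolution_parameter.hpp` for the exact signature), a 2D forward-convolution parameter set might be built like this:

```cpp
// Hedged sketch: 2D convolution, 1 group, N=128, K=256 output channels, C=192 input channels,
// 3x3 filter on a 71x71 input, stride 2, dilation 1, padding 1 on both sides.
ck::utils::conv::ConvParam conv_param(2,        // number of spatial dimensions
                                      1,        // groups (G)
                                      128,      // batch size (N)
                                      256,      // output channels (K)
                                      192,      // input channels (C)
                                      {3, 3},   // filter spatial lengths
                                      {71, 71}, // input spatial lengths
                                      {2, 2},   // convolution strides
                                      {1, 1},   // dilations
                                      {1, 1},   // left padding
                                      {1, 1});  // right padding
```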

### Workspace Management

Some operations require temporary GPU memory for intermediate computations:

```cpp
// Check if workspace is needed
const std::size_t workspace_sz = op_ptr->GetWorkSpaceSize(argument_ptr.get());

// Allocate and set the workspace if needed.
// Note: the DeviceMem allocation must stay alive until the operation has finished running,
// so it is declared in the enclosing scope rather than inside the if-block.
DeviceMem workspace_dev(workspace_sz);
if (workspace_sz > 0) {
    op_ptr->SetWorkSpacePointer(argument_ptr.get(), workspace_dev.GetDeviceBuffer());
}
```

### Algorithmic Utilities

- **`algorithm.hpp`**: Generic algorithms
- **`ranges.hpp`**: Range-based utilities and concepts
- **`iterator.hpp`**: Custom iterator implementations
- **`numeric.hpp`**: Numeric operations

### Miscellaneous

- **`host_common_util.hpp`**: Common host-side utilities
- **`host_gemm.hpp`**: CPU reference GEMM implementation
- **`literals.hpp`**: User-defined literals
- **`thread.hpp`**: Threading utilities

## Best Practices

### Choosing Tolerances

1. **Prefer automatic computation**: Use `gpu_verify()` with automatic tolerance calculation
2. **Consider accumulation**: Pass `number_of_accumulations` for matrix operations
3. **Respect data type limits**: Don't expect FP16 to match FP32 precision
4. **Account for algorithm**: Different operations have different error characteristics

### Performance Optimization

1. **Use GPU-first validation** for all new tests
2. **Avoid CPU transfers** unless debugging specific values
3. **Generate data on GPU** when possible
4. **Batch verification** to amortize kernel launch overhead

> **Reviewer note (Contributor, on "Batch verification"):** I don't think we do or support batch verification. It would be very VRAM-intense since many device-side tensors would have to be kept in memory instead of clearing the non-reference output after each kernel that is tested.