From 69fc05dd55b016c45b45638946136215dc75fd0a Mon Sep 17 00:00:00 2001
From: John Shumway
Date: Tue, 27 Jan 2026 23:13:29 -0500
Subject: [PATCH] Add a readme file to ck/library/util

I'm collecting information about our current testing and added a README to
the directory to emphasize the GPU-first testing strategy and our support
for type-specific tolerances.
---
 include/ck/library/utility/README.md | 290 +++++++++++++++++++++++++++
 1 file changed, 290 insertions(+)
 create mode 100644 include/ck/library/utility/README.md

diff --git a/include/ck/library/utility/README.md b/include/ck/library/utility/README.md
new file mode 100644
index 00000000000..3549b0d217e
--- /dev/null
+++ b/include/ck/library/utility/README.md
@@ -0,0 +1,290 @@
+# CK Library Utility
+
+This directory contains utility headers for testing, benchmarking, and validating Composable Kernel (CK) operations. The utilities support both modern GPU-first validation for high-performance testing and legacy CPU-based approaches for backward compatibility.
+
+## Quick Start
+
+1. **Use GPU validation** for all new tests (10-100x faster than CPU validation)
+2. **Let the system compute tolerances** automatically based on data types
+3. **Only transfer error statistics**, not full tensors
+
+## File-to-Purpose Quick Reference
+
+| Need to...                          | Use this file                     | Key function/class        |
+|-------------------------------------|-----------------------------------|---------------------------|
+| Validate on GPU (recommended)       | `gpu_verification.hpp`            | `gpu_verify()`            |
+| Validate on CPU (legacy/debugging)  | `check_err.hpp`                   | `check_err()`             |
+| Compute tolerances automatically    | `check_err.hpp`                   | `get_relative_threshold<>()` |
+| Allocate GPU memory                 | `device_memory.hpp`               | `DeviceMem`               |
+| Create CPU tensors                  | `host_tensor.hpp`                 | `Tensor`                  |
+| Generate test data on GPU           | `device_tensor_generator.hpp`     | `FillUniformRandFp()`     |
+| Generate test data on CPU (legacy)  | `host_tensor_generator.hpp`       | `GeneratorTensor_*`       |
+| Set up convolution parameters       | `convolution_parameter.hpp`       | `ConvParam`               |
+| Create tensor descriptors           | `host_tensor.hpp`                 | `HostTensorDescriptor`    |
+
+## Core Validation Tools
+
+### GPU Validation (Recommended)
+
+**`gpu_verification.hpp`** - Complete on-device verification
+
+- `gpu_verify()`: Compares device tensors entirely on GPU
+  - Automatic tolerance computation based on data types
+  - Only transfers error statistics (~12 bytes), not tensors
+  - Detailed error reporting (count, max error, percentage)
+  - Supports all CK data types (fp32, fp16, bf16, fp8, int8, etc.)
+- `gpu_reduce_max()`: Computes max(abs(tensor)) on GPU for tolerance scaling
+- Grid-stride kernels with LDS reduction for optimal performance
+
+**Performance**: 10-100x faster than CPU validation for large tensors.
+
+**Example usage:**
+
+```cpp
+// Explicit tolerance
+bool pass_explicit = gpu_verify(output_dev, reference_dev, 1e-5f, 1e-6f, size);
+
+// Automatic tolerance for mixed precision
+bool pass_auto = gpu_verify(output_dev, reference_dev, K_dim, size);
+```
+
+**See:** `test/gpu_verification/test_gpu_verification.cpp`
+
+### Tolerance Computation
+
+**`check_err.hpp`** - Automatic tolerance calculation
+
+- `get_relative_threshold()`: Computes relative tolerance from mantissa bits
+- `get_absolute_threshold()`: Computes absolute tolerance scaled by magnitude
+- Type-specific overloads for all CK data types
+- Accumulation-aware error bounds
+
+**Theory**: Based on IEEE 754 floating-point arithmetic and error propagation analysis.
+
+### Legacy CPU Validation
+
+**`check_err.hpp`** - CPU-based error checking (legacy)
+
+- Overloaded `check_err()` functions for different data types
+- Type-aware default tolerances
+- Detailed error reporting (first 5 mismatches, statistics)
+
+**Note**: Requires full tensor transfer to CPU, which is slow for large tensors. Use `gpu_verification.hpp` for new tests.
+
+**See:** `test/convnd_fwd/convnd_fwd_naive.cpp` for legacy CPU validation patterns
+
+## Numerical Validation Strategy
+
+**TL;DR:** CK computes tolerances from IEEE 754 precision limits, not arbitrary values. FP32 gets ~1e-5 relative tolerance, FP16 gets ~1e-3, etc. The system accounts for accumulation effects in matrix operations.
+
+CK implements a **theoretically-grounded approach to numerical validation** that goes beyond simple fixed tolerances. The validation system is designed around three core principles:
+
+### 1. Type-Aware Tolerance Computation
+
+Rather than using arbitrary threshold values, CK computes tolerances based on the data types involved:
+
+- **Relative tolerance**: Derived from mantissa bits as `2^(-mantissa_bits) * 0.5`
+- **Absolute tolerance**: Scaled by value magnitude as `2^(exponent - mantissa_bits) * 0.5`
+- **Multi-type analysis**: Considers compute type, output type, and accumulator type separately
+- **Conservative bounds**: Takes maximum error across all data paths
+
+### 2. Algorithm-Aware Validation
+
+Different algorithms have different error characteristics:
+
+- **Accumulation effects**: Matrix operations (GEMM, convolution) accumulate errors proportional to the number of operations
+- **Precision cascades**: Mixed-precision operations require careful tolerance selection based on the weakest link
+- **Operation-specific bounds**: Tolerances scale with problem size (e.g., K dimension in GEMM)
+
+The validation system accepts `number_of_accumulations` to adjust tolerances for algorithmic context.
+
+### 3. Data Type Characteristics
+
+Each data type has inherent precision limits that inform validation:
+
+| Data Type | Mantissa Bits | Typical rtol | Typical atol |
+|-----------|---------------|--------------|--------------|
+| FP32      | 23            | 1e-5         | 3e-6         |
+| TF32      | 10            | 5e-4         | 5e-4         |
+| FP16      | 10            | 1e-3         | 1e-3         |
+| BF16      | 7             | 1e-1         | 1e-3         |
+| FP8       | 3-4           | 1e-3         | 1e-3         |
+| BF8       | 2-3           | 1e-3         | 1e-3         |
+| FP4       | 2             | 0.5          | 0.5          |
+| INT8/INT32| N/A           | 0            | 0            |
+
+## GPU-First Validation Philosophy
+
+Modern CK testing emphasizes **pure GPU validation** to eliminate performance bottlenecks:
+
+### Traditional CPU-Based Approach (Legacy)
+
+```text
+GPU Kernel → Transfer to CPU → CPU Verification
+             ↑ BOTTLENECK: PCIe transfer of entire tensor
+```
+
+- **Problem**: Transferring multi-GB tensors over PCIe is 10-100x slower than computation
+- **Impact**: Test suites become I/O bound rather than compute bound
+- **Limitation**: Cannot efficiently test large-scale problems
+
+### Modern GPU-First Approach (Recommended)
+
+```text
+GPU Kernel → GPU Reference → GPU Verification → Transfer scalars only
+                                                ↑ Only ~12 bytes transferred
+```
+
+- **Advantage**: All data stays on GPU, only error statistics transfer to CPU
+- **Performance**: 10-100x faster for large tensors
+- **Scalability**: Enables testing of multi-GB tensors efficiently
+- **Completeness**: Detailed error reporting (count, max error, percentage) without full transfer
+
+### When to Use Each Approach
+
+**Use GPU-First Validation When:**
+
+- Testing production kernels (performance matters)
+- Working with large tensors (>1MB)
+- Running extensive test suites
+- Validating at scale
+
+**Use CPU-Based Validation When:**
+
+- Debugging specific values (need to inspect individual elements)
+- Working with tiny tensors (<1KB)
+- Maintaining backward compatibility
+- Implementing CPU reference algorithms
+
+## Testing Workflow Comparison
+
+### Modern GPU-First Workflow (Recommended)
+
+```cpp
+// 1. Allocate device memory only
+DeviceMem input_dev(size), output_dev(size), reference_dev(size);
+
+// 2. Initialize on GPU (no CPU involvement)
+input_dev.FillUniformRandFp(-1.0f, 1.0f);
+
+// 3. Run kernel under test
+run_kernel(input_dev, output_dev, params);
+
+// 4. Run reference on GPU
+run_reference_kernel(input_dev, reference_dev, params);
+
+// 5. Verify on GPU (only transfers ~12 bytes of error stats)
+bool pass = gpu_verify(output_dev, reference_dev, rtol, atol, size);
+if (!pass) {
+    std::cout << "Validation failed!" << std::endl;
+    return false;
+}
+```
+
+**Key advantage**: Zero tensor transfers - all data stays on GPU.
+
+### Legacy CPU-Based Workflow
+
+```cpp
+// 1. Create host tensors (allocates CPU memory)
+Tensor<float> input_host(dims), output_host(dims), reference_host(dims);
+
+// 2. Generate on CPU
+input_host.GenerateTensorValue(GeneratorTensor_3<float>{-1.0f, 1.0f});
+
+// 3. Allocate device memory
+DeviceMem input_dev(size), output_dev(size);
+
+// 4. Transfer to device (slow for large tensors)
+input_dev.ToDevice(input_host.data());
+
+// 5. Run kernel
+run_kernel(input_dev, output_dev, params);
+
+// 6. Transfer back to CPU (slow for large tensors)
+output_dev.FromDevice(output_host.data());
+
+// 7. Compute reference on CPU
+compute_reference(input_host, reference_host, params);
+
+// 8. Verify on CPU
+bool pass = check_err(output_host, reference_host, "Test failed");
+```
+
+**Bottleneck**: Steps 4 and 6 transfer entire tensors over PCIe.
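Both workflows above end with an element-wise comparison against a reference. The host-side sketch below shows, in plain C++, what such a check might compute: a mismatch count and a maximum absolute error under a combined bound `|out - ref| <= atol + rtol * |ref|`. The `ErrorStats` struct, the `compare` function, and the exact form of the bound are assumptions for illustration only; this README does not spell out the criterion used by `gpu_verify()` or `check_err()`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of a comparison (not a CK type); a few scalars like
// these are cheap to copy back to the host compared with a full tensor.
struct ErrorStats
{
    std::size_t mismatches = 0;    // elements outside the tolerance bound
    float max_abs_error    = 0.0f; // largest |out - ref| seen
};

// Hypothetical host-side check. An element passes when
// |out - ref| <= atol + rtol * |ref| (an assumed, allclose-style criterion).
ErrorStats compare(const std::vector<float>& out,
                   const std::vector<float>& ref,
                   float rtol,
                   float atol)
{
    assert(out.size() == ref.size());
    ErrorStats stats;
    for(std::size_t i = 0; i < out.size(); ++i)
    {
        const float err = std::fabs(out[i] - ref[i]);
        if(err > stats.max_abs_error)
        {
            stats.max_abs_error = err;
        }
        // Written as !(err <= bound) so that a NaN error counts as a mismatch.
        if(!(err <= atol + rtol * std::fabs(ref[i])))
        {
            ++stats.mismatches;
        }
    }
    return stats;
}
```

A test would then treat `stats.mismatches == 0` as a pass and report `max_abs_error` on failure. A GPU version would presumably compute the same quantities with a reduction kernel, so that only the summary leaves the device.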
+
+## Supporting Utilities
+
+### Tensor Management
+
+- **`host_tensor.hpp`**: CPU-side tensor container with multi-dimensional support
+  - `HostTensorDescriptor`: Dimension, stride, and layout management
+  - `Tensor`: Host tensor with generation and conversion utilities
+- **`device_memory.hpp`**: GPU memory management with RAII semantics
+  - `DeviceMem`: Device allocation, transfer, and initialization
+  - Device-side random value generation
+  - `SetZero()`: Zero-initialize device memory (required for backward passes)
+
+### Data Generation
+
+- **`device_tensor_generator.hpp`**: GPU-side tensor initialization (recommended)
+  - `FillUniformRandFp()`: Fill with uniform random floating-point values
+  - `FillUniformRandInt()`: Fill with uniform random integer values
+- **`host_tensor_generator.hpp`**: CPU-side functor-based generators (legacy)
+  - Various patterns: zero, constant, random, sequential, diagonal, checkerboard
+- **`fill.hpp`**: STL-style fill functors for containers
+
+### Convolution Utilities
+
+- **`convolution_parameter.hpp`**: Convolution parameter management
+  - `ConvParam`: Encapsulates dimensions, strides, padding, dilations
+  - Output dimension calculation and FLOP estimation
+- **`convolution_host_tensor_descriptor_helper.hpp`**: Tensor descriptor creation helpers
+- **`conv_common.hpp`**: Common convolution utilities
+
+**See:** `test/convnd_fwd/convnd_fwd_naive.cpp` for convolution parameter usage
+
+### Workspace Management
+
+Some operations require temporary GPU memory for intermediate computations:
+
+```cpp
+// Check if workspace is needed
+const std::size_t workspace_sz = op_ptr->GetWorkSpaceSize(argument_ptr.get());
+
+// Allocate and set workspace if needed
+if (workspace_sz > 0) {
+    DeviceMem workspace_dev(workspace_sz);
+    op_ptr->SetWorkSpacePointer(argument_ptr.get(), workspace_dev.GetDeviceBuffer());
+}
+```
+
+### Algorithmic Utilities
+
+- **`algorithm.hpp`**: Generic algorithms
+- **`ranges.hpp`**: Range-based utilities and concepts
+- **`iterator.hpp`**: Custom iterator implementations
+- **`numeric.hpp`**: Numeric operations
+
+### Miscellaneous
+
+- **`host_common_util.hpp`**: Common host-side utilities
+- **`host_gemm.hpp`**: CPU reference GEMM implementation
+- **`literals.hpp`**: User-defined literals
+- **`thread.hpp`**: Threading utilities
+
+## Best Practices
+
+### Choosing Tolerances
+
+1. **Prefer automatic computation**: Use `gpu_verify()` with automatic tolerance calculation
+2. **Consider accumulation**: Pass `number_of_accumulations` for matrix operations
+3. **Respect data type limits**: Don't expect FP16 to match FP32 precision
+4. **Account for algorithm**: Different operations have different error characteristics
+
+### Performance Optimization
+
+1. **Use GPU-first validation** for all new tests
+2. **Avoid CPU transfers** unless debugging specific values
+3. **Generate data on GPU** when possible
+4. **Batch verification** to amortize kernel launch overhead
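As a rough illustration of the "prefer automatic computation" and "consider accumulation" advice, the standalone sketch below derives a relative tolerance from mantissa bits using the `2^(-mantissa_bits) * 0.5` formula quoted in this README. The helper names `relative_tolerance` and `scaled_tolerance`, and the linear scaling by accumulation count, are assumptions made for this example. They are not the `check_err.hpp` API, which may scale its bounds differently.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of CK): relative tolerance from mantissa bits,
// following the README formula 2^(-mantissa_bits) * 0.5.
double relative_tolerance(int mantissa_bits)
{
    assert(mantissa_bits >= 0);
    return std::ldexp(0.5, -mantissa_bits); // 0.5 * 2^(-mantissa_bits)
}

// Hypothetical helper: widen the bound with the number of accumulations,
// e.g. the K dimension of a GEMM. Linear growth is a conservative
// assumption chosen for this sketch.
double scaled_tolerance(int mantissa_bits, int number_of_accumulations)
{
    assert(number_of_accumulations >= 1);
    return relative_tolerance(mantissa_bits) * number_of_accumulations;
}
```

For a 10-bit mantissa (FP16 or TF32), `relative_tolerance(10)` is `2^-11`, roughly `4.9e-4`, which is in line with the typical tolerances in the data type table. The point of the sketch is that the bound follows from the format rather than from a hand-picked constant, and that it should grow with the amount of accumulation in the operation under test.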