Add a README.md file to ck/library/util #3665
base: develop
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,290 @@ | ||||||
| # CK Library Utility | ||||||
|
|
||||||
| This directory contains utility headers for testing, benchmarking, and validating Composable Kernel (CK) operations. The utilities support both modern GPU-first validation for high-performance testing and legacy CPU-based approaches for backward compatibility. | ||||||
|
|
||||||
| ## Quick Start | ||||||
|
|
||||||
| 1. **Use GPU validation** for all new tests (10-100x faster than CPU validation) | ||||||
| 2. **Let the system compute tolerances** automatically based on data types | ||||||
| 3. **Only transfer error statistics**, not full tensors | ||||||
|
|
||||||
| ## File-to-Purpose Quick Reference | ||||||
|
Collaborator
A small thing, but the table is organized from left to right as purpose-to-file. The current section name is the reverse.
Suggested change
|
||||||
|
|
||||||
| | Need to... | Use this file | Key function/class | | ||||||
| |-------------------------------------|-----------------------------------|---------------------------| | ||||||
| | Validate on GPU (recommended) | `gpu_verification.hpp` | `gpu_verify()` | | ||||||
| | Validate on CPU (legacy/debugging) | `check_err.hpp` | `check_err()` | | ||||||
| | Compute tolerances automatically | `check_err.hpp` | `get_relative_threshold<>()` | | ||||||
| | Allocate GPU memory | `device_memory.hpp` | `DeviceMem` | | ||||||
| | Create CPU tensors | `host_tensor.hpp` | `Tensor<T>` | | ||||||
| | Generate test data on GPU | `device_tensor_generator.hpp` | `FillUniformRandFp()` | | ||||||
| | Generate test data on CPU (legacy) | `host_tensor_generator.hpp` | `GeneratorTensor_*` | | ||||||
| | Set up convolution parameters | `convolution_parameter.hpp` | `ConvParam` | | ||||||
| | Create tensor descriptors | `host_tensor.hpp` | `HostTensorDescriptor` | | ||||||
|
|
||||||
| ## Core Validation Tools | ||||||
|
|
||||||
| ### GPU Validation (Recommended) | ||||||
|
|
||||||
| **`gpu_verification.hpp`** - Complete on-device verification | ||||||
|
|
||||||
| - `gpu_verify()`: Compares device tensors entirely on GPU | ||||||
| - Automatic tolerance computation based on data types | ||||||
| - Only transfers error statistics (~12 bytes), not tensors | ||||||
|
Contributor
We can remove (~12 bytes).
||||||
| - Detailed error reporting (count, max error, percentage) | ||||||
| - Supports all CK data types (fp32, fp16, bf16, fp8, int8, etc.) | ||||||
| - `gpu_reduce_max()`: Computes max(abs(tensor)) on GPU for tolerance scaling | ||||||
| - Grid-stride kernels with LDS reduction for optimal performance | ||||||
|
|
||||||
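To illustrate what "only transfers error statistics" can mean in practice, here is a minimal host-side sketch of the kind of summary a verification pass might produce. It is not the code in `gpu_verification.hpp`: the `ErrorStats` struct, the `collect_error_stats` function, and the acceptance rule `|out - ref| <= atol + rtol * |ref|` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary record; the field names are illustrative, not CK's.
struct ErrorStats
{
    std::size_t mismatch_count = 0;   // elements outside the tolerance
    double max_abs_error       = 0.0; // largest |out - ref| observed
    double mismatch_percent    = 0.0; // mismatches as a share of all elements
};

// Compares two equally sized buffers element by element.
// Assumed acceptance rule: |out - ref| <= atol + rtol * |ref|.
inline ErrorStats collect_error_stats(const std::vector<float>& out,
                                      const std::vector<float>& ref,
                                      double rtol,
                                      double atol)
{
    assert(out.size() == ref.size());

    ErrorStats stats;
    for(std::size_t i = 0; i < out.size(); ++i)
    {
        const double o     = static_cast<double>(out[i]);
        const double r     = static_cast<double>(ref[i]);
        const double err   = std::abs(o - r);
        const double bound = atol + rtol * std::abs(r);

        if(err > stats.max_abs_error)
            stats.max_abs_error = err;

        // Written as !(err <= bound) so that NaN results count as mismatches.
        if(!(err <= bound))
            ++stats.mismatch_count;
    }

    if(!out.empty())
        stats.mismatch_percent =
            100.0 * static_cast<double>(stats.mismatch_count) / static_cast<double>(out.size());

    return stats;
}
```

A device-side version would compute the same three numbers inside a kernel and copy back only the small record, which is the point of the bullet list above.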
| **Performance**: 10-100x faster than CPU validation for large tensors. | ||||||
|
Contributor
Maybe remove the 10-100x number?
||||||
|
|
||||||
| **Example usage:** | ||||||
|
|
||||||
| ```cpp | ||||||
| // Explicit tolerance | ||||||
| bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size); | ||||||
|
|
||||||
| // Automatic tolerance for mixed precision | ||||||
| bool pass_mixed = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size); | ||||||
|
Comment on lines +44 to +48
Collaborator
Is it worth mentioning when to use an explicit vs. automatic tolerance?
Contributor
On a slightly separate note, this does not currently support split-k, which requires accounting for accumulation in multiple data types. See issue #3673.
||||||
| ``` | ||||||
|
|
||||||
| **See:** `test/gpu_verification/test_gpu_verification.cpp` | ||||||
|
|
||||||
| ### Tolerance Computation | ||||||
|
|
||||||
| **`check_err.hpp`** - Automatic tolerance calculation | ||||||
|
|
||||||
| - `get_relative_threshold<ComputeType, OutType, AccType>()`: Computes relative tolerance from mantissa bits | ||||||
| - `get_absolute_threshold<ComputeType, OutType, AccType>()`: Computes absolute tolerance scaled by magnitude | ||||||
|
Comment on lines +57 to +58
Collaborator
If it is helpful, a short example for each of these calls could help users see its usage.
||||||
| - Type-specific overloads for all CK data types | ||||||
| - Accumulation-aware error bounds | ||||||
|
|
||||||
| **Theory**: Based on IEEE 754 floating-point arithmetic and error propagation analysis. | ||||||
|
|
||||||
| ### Legacy CPU Validation | ||||||
|
|
||||||
| **`check_err.hpp`** - CPU-based error checking (legacy) | ||||||
|
|
||||||
| - Overloaded `check_err()` functions for different data types | ||||||
| - Type-aware default tolerances | ||||||
| - Detailed error reporting (first 5 mismatches, statistics) | ||||||
|
|
||||||
| **Note**: Requires full tensor transfer to CPU - slow for large tensors. Use `gpu_verification.hpp` for new tests. | ||||||
|
|
||||||
| **See:** `test/convnd_fwd/convnd_fwd_naive.cpp` for legacy CPU validation patterns | ||||||
|
|
||||||
| ## Numerical Validation Strategy | ||||||
|
|
||||||
| **TL;DR:** CK computes tolerances from IEEE 754 precision limits, not arbitrary values. FP32 gets ~1e-5 relative tolerance, FP16 gets ~1e-3, etc. The system accounts for accumulation effects in matrix operations. | ||||||
|
|
||||||
| CK implements a **theoretically-grounded approach to numerical validation** that goes beyond simple fixed tolerances. The validation system is designed around three core principles: | ||||||
|
|
||||||
| ### 1. Type-Aware Tolerance Computation | ||||||
|
|
||||||
| Rather than using arbitrary threshold values, CK computes tolerances based on the data types: | ||||||
|
|
||||||
| - **Relative tolerance**: Derived from mantissa bits as `2^(-mantissa_bits) * 0.5` | ||||||
| - **Absolute tolerance**: Scaled by value magnitude as `2^(exponent - mantissa_bits) * 0.5` | ||||||
| - **Multi-type analysis**: Considers compute type, output type, and accumulator type separately | ||||||
| - **Conservative bounds**: Takes maximum error across all data paths | ||||||
|
|
||||||
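The two formulas in the list above can be written out as a small standalone sketch. This is not the implementation in `check_err.hpp`: the function names are hypothetical, the mantissa widths are passed in as plain integers, and taking the exponent as `floor(log2(max_abs_value))` is an assumption about how "exponent" is meant.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Relative rounding bound for a format with `mantissa_bits` explicit
// mantissa bits: 2^(-mantissa_bits) * 0.5.
inline double relative_bound(int mantissa_bits)
{
    assert(mantissa_bits >= 0);
    return std::ldexp(0.5, -mantissa_bits);
}

// Absolute bound at a given magnitude: 2^(exponent - mantissa_bits) * 0.5.
// Assumption: exponent = floor(log2(max_abs_value)), obtained via std::ilogb.
inline double absolute_bound(int mantissa_bits, double max_abs_value)
{
    assert(mantissa_bits >= 0);
    if(max_abs_value <= 0.0)
        return 0.0; // no magnitude to scale by
    const int exponent = std::ilogb(max_abs_value);
    return std::ldexp(0.5, exponent - mantissa_bits);
}

// "Conservative bounds": take the largest error over the compute, output,
// and accumulator data paths.
inline double conservative_relative_bound(int compute_bits, int output_bits, int acc_bits)
{
    return std::max({relative_bound(compute_bits),
                     relative_bound(output_bits),
                     relative_bound(acc_bits)});
}
```

With these definitions, an FP16 output (10 mantissa bits) dominates an FP32 compute and accumulator path (23 bits), so the conservative relative bound is the FP16 one. The maximum magnitude passed to `absolute_bound` is the kind of value a max(abs(tensor)) reduction would supply.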
| ### 2. Algorithm-Aware Validation | ||||||
|
|
||||||
| Different algorithms have different error characteristics: | ||||||
|
|
||||||
| - **Accumulation effects**: Matrix operations (GEMM, convolution) accumulate errors proportional to the number of operations | ||||||
| - **Precision cascades**: Mixed-precision operations require careful tolerance selection based on the weakest link | ||||||
| - **Operation-specific bounds**: Tolerances scale with problem size (e.g., K dimension in GEMM) | ||||||
|
|
||||||
| The validation system accepts `number_of_accumulations` to adjust tolerances for algorithmic context. | ||||||
|
|
||||||
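As a rough sketch of how `number_of_accumulations` could enter the tolerance, the snippet below scales the accumulator's rounding bound linearly with the accumulation count and then takes the larger of that and the per-value bound. The linear worst-case scaling, the function names, and the integer mantissa-width parameters are assumptions for illustration, not a description of CK's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Half-unit rounding bound for a format: 2^(-mantissa_bits) * 0.5.
inline double half_ulp(int mantissa_bits)
{
    return std::ldexp(0.5, -mantissa_bits);
}

// Relative tolerance that grows with the number of accumulations
// (for example, the K dimension of a GEMM).
// Assumption: accumulator error adds up linearly in the worst case.
inline double accumulation_aware_rtol(int compute_bits,
                                      int output_bits,
                                      int acc_bits,
                                      int number_of_accumulations)
{
    assert(number_of_accumulations >= 1);

    // Error from rounding a single value in the compute or output type.
    const double per_value_error = std::max(half_ulp(compute_bits), half_ulp(output_bits));

    // Error built up in the accumulator over all additions.
    const double accumulated_error =
        half_ulp(acc_bits) * static_cast<double>(number_of_accumulations);

    return std::max(per_value_error, accumulated_error);
}
```

Under these assumptions an all-FP32 reduction over K = 4096 terms gets a bound of 2^-12, while an FP16 output with an FP32 accumulator and K = 1024 is still limited by the FP16 rounding bound of 2^-11.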
| ### 3. Data Type Characteristics | ||||||
|
|
||||||
| Each data type has inherent precision limits that inform validation: | ||||||
|
|
||||||
| | Data Type | Mantissa Bits | Typical rtol | Typical atol | | ||||||
| |-----------|---------------|--------------|--------------| | ||||||
| | FP32 | 23 | 1e-5 | 3e-6 | | ||||||
| | TF32 | 10 | 5e-4 | 5e-4 | | ||||||
| | FP16 | 10 | 1e-3 | 1e-3 | | ||||||
| | BF16 | 7 | 1e-1 | 1e-3 | | ||||||
|
||||||
Suggested change
- | BF16 | 7 | 1e-1 | 1e-3 |
+ | BF16 | 7 | 1e-2 | 1e-3 |
Rtol for BF8 lower than BF16?

Can remove the 10-100x.

Can remove or rephrase.

I don't think we do or support batch verification. It would be very VRAM-intensive, since many device-side tensors would have to be kept in memory instead of clearing the non-reference output after each kernel that is tested.

This section seems to summarize what our good practices are, key principles, or validation guidelines rather than initial setup steps.