Overlapping compilation time with benchmarking for the autotuner #1400

@ethche

Description

Compilation is often the most time-consuming part of autotuning. Compile times are also highly skewed: a few configs have very large, outlier compilation times. Because configs are evaluated in batches, we must wait for every config in a batch to finish compiling before we can begin benchmarking. Outliers make this especially inefficient, since we are often waiting on just a handful of configs to finish.

To address this, one approach is to overlap compilation with benchmarking, so that we can start benchmarking configs that have already compiled instead of waiting for the outlier configs to finish. However, a key concern is that this could bias the benchmarking results for CPU-bound kernels, since compilation would be running while those kernels are being timed. For now, we should probably expose this as an experimental feature that is off by default (i.e., introduce a HELION_AUTOTUNE_OVERLAP_COMPILATION flag).
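To make the idea concrete, here is a minimal sketch of the two scheduling strategies using only the Python standard library. It is not Helion's autotuner code: `autotune`, `compile_fn`, `benchmark_fn`, and `overlap_enabled` are hypothetical names, and treating the value `"1"` as "enabled" for the proposed HELION_AUTOTUNE_OVERLAP_COMPILATION flag is an assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def overlap_enabled(env=None):
    # Assumption: the proposed flag is read from the environment and "1" means on.
    # Off by default, as suggested in this issue.
    env = os.environ if env is None else env
    return env.get("HELION_AUTOTUNE_OVERLAP_COMPILATION", "0") == "1"


def autotune(configs, compile_fn, benchmark_fn, overlap, max_workers=4):
    """Return a list of (config, result) pairs in the order they were benchmarked.

    compile_fn and benchmark_fn are hypothetical stand-ins for the real
    compile and benchmark steps.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(compile_fn, c): c for c in configs}
        if overlap:
            # Benchmark each config as soon as its compilation finishes,
            # so slow outliers no longer block the fast configs.
            for fut in as_completed(futures):
                results.append((futures[fut], benchmark_fn(fut.result())))
        else:
            # Batched behavior: wait for every compilation to finish first.
            compiled = [(c, fut.result()) for fut, c in futures.items()]
            for config, kernel in compiled:
                results.append((config, benchmark_fn(kernel)))
    return results
```

With `overlap=True`, benchmarking runs on the main thread while the remaining compilations are still in flight in the pool, which is exactly where the bias concern for CPU-bound kernels comes from.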

To verify the effect of this, we should run benchmarks on kernels with small shapes. Let's aim for very small (16x16, larger if necessary) matmul, layernorm, rmsnorm, softmax, and cross-entropy kernels.
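One way to quantify the bias is to time the same kernel with and without concurrent background work and compare the medians. The sketch below is a standard-library illustration only: `median_time`, `relative_bias`, and `measure_bias` are hypothetical helpers, and a Python thread stands in for compilation. In a real experiment `kernel` would launch one of the small kernels listed above and the background work would be actual config compilation.

```python
import statistics
import threading
import time


def median_time(fn, repeats=20):
    # Median wall-clock time of fn over several runs, in seconds.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_bias(baseline, overlapped):
    # Fractional slowdown (positive) or speedup (negative) versus the baseline.
    return (overlapped - baseline) / baseline


def measure_bias(kernel, background_work, repeats=20):
    """Time kernel alone, then again while background_work runs in a loop."""
    baseline = median_time(kernel, repeats)

    stop = threading.Event()

    def worker():
        # Stand-in for compilation running concurrently with benchmarking.
        while not stop.is_set():
            background_work()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    try:
        overlapped = median_time(kernel, repeats)
    finally:
        stop.set()
        thread.join()

    return baseline, overlapped, relative_bias(baseline, overlapped)
```

A relative bias near zero across the small-shape kernels would suggest overlapping is safe for them; a consistently positive value would support keeping the feature off by default.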

@hinriksnaer mentioned that he is interested in this.
