Background
While implementing #1416, it became clear that the current infrastructure is limited in the ways described below.
Summary
Extract the compile-and-benchmark lifecycle out of BaseSearch into a
standalone ConfigPipeline. Search algorithms hand a batch of candidate
configs to the pipeline and get scored results back. The pipeline owns
how configs are compiled and benchmarked; the search algorithm owns
which configs to try.
Problem
BaseSearch is currently responsible for several distinct concerns:
- Search logic: candidate generation, population management, convergence
- Compile/benchmark infrastructure: compilation, subprocess management, and GPU benchmarking
- Accuracy validation: baseline computation, tolerance checking
Consequences:
- No standalone benchmarking. You cannot evaluate a fixed set of configs
without subclassing BaseSearch and replicating its setup ceremony.
- Search is coupled to compilation/benchmarking. Noise-mitigation
strategies (CPU pinning, SIGSTOP, etc.) require modifying BaseSearch,
even though search algorithms don't care about them.
- Extensibility is tricky. Distributed autotuning, alternative
benchmarking strategies, or reproducible A/B comparisons all require
working around the current structure.
Design
Interface
```python
class ConfigPipeline:
    """Compile and benchmark a batch of kernel configs."""

    def __init__(
        self,
        kernel: _AutotunableKernel,
        args: Sequence[object],
        settings: Settings,
        *,
        log: AutotuningLogger | None = None,
        accuracy_check_fn: Callable | None = None,
    ) -> None: ...

    def run(
        self,
        configs: list[Config],
        *,
        desc: str = "Benchmarking",
    ) -> list[BenchmarkResult]: ...
```
run() takes the full batch. The pipeline internally decides how to
schedule compilation and benchmarking based on settings — sequential,
overlapped, or future strategies. Callers just get list[BenchmarkResult]
back.
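As a concrete illustration, a minimal sequential strategy could look like the sketch below. Everything here beyond the run() contract is a hypothetical placeholder: the Config and BenchmarkResult stand-ins, the _compile/_benchmark helpers, and the wall-clock timing are simplifications, not the real implementation.

```python
import time
from dataclasses import dataclass


# Hypothetical stand-ins for the real Config / BenchmarkResult types.
@dataclass
class Config:
    params: dict


@dataclass
class BenchmarkResult:
    config: Config
    time_ms: float


class ConfigPipeline:
    """Sketch: compile and benchmark a batch of configs sequentially."""

    def __init__(self, kernel, args, settings):
        self.kernel = kernel      # _AutotunableKernel in the real code
        self.args = args
        self.settings = settings

    def run(self, configs, *, desc="Benchmarking"):
        # Sequential strategy: compile, then benchmark, one config at a time.
        # A real pipeline could instead overlap compilation with benchmarking.
        results = []
        for cfg in configs:
            fn = self._compile(cfg)
            results.append(BenchmarkResult(cfg, self._benchmark(fn)))
        return results

    def _compile(self, cfg):
        # Placeholder for real kernel compilation.
        return lambda: self.kernel(*self.args, **cfg.params)

    def _benchmark(self, fn, reps=5):
        # Wall-clock average over a few repetitions, in milliseconds.
        start = time.perf_counter()
        for _ in range(reps):
            fn()
        return (time.perf_counter() - start) / reps * 1e3
```

Because the scheduling decision lives entirely inside run(), swapping the sequential loop for an overlapped strategy would not change anything the caller sees.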
What this enables
Custom Pipelines: One search algorithm can support multiple different compilation and benchmarking strategies.
Standalone benchmarking:
```python
with ConfigPipeline(kernel, args, settings) as pipeline:
    results = pipeline.run(fixed_configs)
```
Strategy comparison (same configs, different evaluation):
```python
seq_results = ConfigPipeline(kernel, args, seq_settings).run(configs)
ovl_results = ConfigPipeline(kernel, args, ovl_settings).run(configs)
compare_rankings(seq_results, ovl_results)
```
Future distributed autotuning: A DistributedPipeline farms
compilation out to remote workers or benchmarks configs across multiple
GPUs. Search algorithms call pipeline.run(configs) unchanged.
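A rough sketch of such a pipeline, using local threads as stand-in workers. The evaluate callable (bundling compile + benchmark into one function) is a hypothetical simplification; a real DistributedPipeline would dispatch to remote build hosts or per-GPU worker processes while keeping the same run() contract:

```python
from concurrent.futures import ThreadPoolExecutor


class DistributedPipeline:
    """Hypothetical sketch: same run() contract, work farmed to workers."""

    def __init__(self, evaluate, max_workers=4):
        # evaluate(config) -> score; stands in for compile + benchmark.
        self.evaluate = evaluate
        self.max_workers = max_workers

    def run(self, configs):
        # Fan the batch out to workers; Executor.map preserves input order,
        # so callers can zip configs with scores deterministically.
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            scores = list(pool.map(self.evaluate, configs))
        return list(zip(configs, scores))
```

From the search algorithm's perspective this is indistinguishable from the sequential pipeline: it hands over a batch and gets ordered results back.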
Pluggable noise mitigation: CPU pinning, SIGSTOP scheduling, and CUDA
graphs all become pipeline implementation details, invisible to search.
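One way to sketch "invisible to search": a wrapper pipeline that re-runs the inner pipeline's measurements and keeps the per-config median, exposing the same run() contract. The (config, time) result shape and the wrapper name are hypothetical simplifications:

```python
import statistics


class MedianOfRepeatsPipeline:
    """Hypothetical noise-mitigation wrapper: repeat the inner pipeline's
    measurements and keep the median timing per config. Search algorithms
    call run() as usual and never see the mitigation."""

    def __init__(self, inner, repeats=3):
        self.inner = inner      # any object with run(configs) -> [(cfg, time)]
        self.repeats = repeats

    def run(self, configs):
        samples = [self.inner.run(configs) for _ in range(self.repeats)]
        # samples[i][j] is the (config, time) pair for config j on repeat i.
        return [
            (cfg, statistics.median(s[j][1] for s in samples))
            for j, (cfg, _) in enumerate(samples[0])
        ]
```

CPU pinning or SIGSTOP scheduling would slot in the same way: as wrappers or settings-driven branches inside a pipeline, composed without touching any search algorithm.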
To do
- Create an abstraction layer for ConfigPipeline
- Refactor existing search algorithms to support a modular ConfigPipeline
- Make existing or custom ConfigPipelines easily exchangeable within search algorithms
- Create an interface that supports metrics collection and noise measurement