
[RFC] ConfigPipeline/AutotuneBackend: Separating compilation/benchmarking from search algorithms #1803

@hinriksnaer

Description

Background

While implementing #1416, it became clear that the current infrastructure is limited in several respects, detailed below.

Summary

Extract the compile-and-benchmark lifecycle out of BaseSearch into a
standalone ConfigPipeline. Search algorithms hand a batch of candidate
configs to the pipeline and get scored results back. The pipeline owns
how configs are compiled and benchmarked, the search algorithm owns
which configs to try.

Problem

BaseSearch currently carries three distinct responsibilities:

  1. Search logic: candidate generation, population management,
     convergence.
  2. Compile/benchmark infrastructure: compilation, subprocess
     management, and GPU benchmarking.
  3. Accuracy validation: baseline computation, tolerance checking.

Consequences:

  • No standalone benchmarking. You cannot evaluate a fixed set of
    configs without subclassing BaseSearch and replicating its setup
    ceremony.
  • Search is coupled to compilation/benchmarking. Noise-mitigation
    strategies (CPU pinning, SIGSTOP, etc.) require modifying
    BaseSearch, even though search algorithms don't care about them.
  • Extensibility is tricky. Distributed autotuning, alternative
    benchmarking strategies, and reproducible A/B comparisons all
    require working around the current structure.

Design

Interface

class ConfigPipeline:
    """Compile and benchmark a batch of kernel configs."""
    def __init__(
        self,
        kernel: _AutotunableKernel,
        args: Sequence[object],
        settings: Settings,
        *,
        log: AutotuningLogger | None = None,
        accuracy_check_fn: Callable | None = None,
    ) -> None: ...
    def run(
        self,
        configs: list[Config],
        *,
        desc: str = "Benchmarking",
    ) -> list[BenchmarkResult]: ...

run() takes the full batch. The pipeline internally decides how to
schedule compilation and benchmarking based on settings — sequential,
overlapped, or future strategies. Callers just get list[BenchmarkResult]
back.
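
To make the division of labor concrete, here is a toy, self-contained sketch of the pattern. The Config, BenchmarkResult, and ToyConfigPipeline names below are illustrative stand-ins, not the project's real types, and a trivial timing function replaces real compilation and GPU benchmarking:

```python
from dataclasses import dataclass

# Illustrative stand-ins for the project's Config / BenchmarkResult types.
@dataclass
class Config:
    block_size: int

@dataclass
class BenchmarkResult:
    config: Config
    time_ms: float
    ok: bool

class ToyConfigPipeline:
    """Toy pipeline: compiles and benchmarks each config sequentially."""

    def __init__(self, benchmark_fn):
        # benchmark_fn stands in for real compilation + GPU timing.
        self.benchmark_fn = benchmark_fn

    def run(self, configs, *, desc="Benchmarking"):
        results = []
        for cfg in configs:
            try:
                results.append(BenchmarkResult(cfg, self.benchmark_fn(cfg), ok=True))
            except Exception:
                # A failed compile/benchmark yields a sentinel result
                # instead of aborting the whole batch.
                results.append(BenchmarkResult(cfg, float("inf"), ok=False))
        return results

# The search algorithm only decides WHICH configs to try;
# the pipeline owns HOW they are evaluated.
configs = [Config(16), Config(32), Config(64)]
results = ToyConfigPipeline(lambda cfg: 100.0 / cfg.block_size).run(configs)
best = min(results, key=lambda r: r.time_ms)
```

The key property is that the search side never touches compilation or benchmarking details; swapping the pipeline leaves the search loop unchanged.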

What this enables

Custom pipelines: one search algorithm can be paired with multiple compilation and benchmarking strategies.

Standalone benchmarking:

with ConfigPipeline(kernel, args, settings) as pipeline:
    results = pipeline.run(fixed_configs)

Strategy comparison (same configs, different evaluation):

seq_results = ConfigPipeline(kernel, args, seq_settings).run(configs)
ovl_results = ConfigPipeline(kernel, args, ovl_settings).run(configs)
compare_rankings(seq_results, ovl_results)
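
compare_rankings is not defined above; one plausible sketch is a pairwise-agreement score over the two result sets. The function body and the (config_id, time_ms) pair format are assumptions for illustration:

```python
from itertools import combinations

def compare_rankings(a_results, b_results):
    """Pairwise agreement between two rankings of the same configs.

    Each result set is a list of (config_id, time_ms) pairs.  Returns a
    score in [0, 1]: 1.0 means both strategies order every pair of
    configs the same way.
    """
    a_time, b_time = dict(a_results), dict(b_results)
    pairs = list(combinations(a_time, 2))
    if not pairs:
        return 1.0
    agree = sum(
        (a_time[x] < a_time[y]) == (b_time[x] < b_time[y])
        for x, y in pairs
    )
    return agree / len(pairs)

# Two strategies that disagree only on the ordering of "a" vs "b".
seq = [("a", 1.0), ("b", 2.0), ("c", 3.0)]
ovl = [("a", 2.0), ("b", 1.0), ("c", 3.0)]
score = compare_rankings(seq, ovl)  # 2 of 3 pairs agree
```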

Future distributed autotuning: A DistributedPipeline farms
compilation out to remote workers or benchmarks on multiple GPUs.
Search algorithms call pipeline.run(configs) unchanged.

Pluggable noise mitigation: CPU pinning, SIGSTOP scheduling, CUDA
graphs, etc. all become pipeline implementation details, invisible to
the search algorithm.
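
As one illustration of how a noise-mitigation strategy stays invisible to search code, here is a hypothetical pipeline variant that repeats each benchmark and keeps the median timing. All names and the repeats parameter are invented for this sketch:

```python
import statistics

class MedianOfRunsPipeline:
    """Hypothetical pipeline variant: benchmark each config several
    times and keep the median timing, one simple noise-mitigation
    strategy.  Search code still just calls run(configs)."""

    def __init__(self, benchmark_fn, repeats=5):
        self.benchmark_fn = benchmark_fn
        self.repeats = repeats

    def run(self, configs, *, desc="Benchmarking"):
        out = []
        for cfg in configs:
            samples = [self.benchmark_fn(cfg) for _ in range(self.repeats)]
            out.append((cfg, statistics.median(samples)))
        return out

# Deterministic "noise": one run is a 50 ms outlier; the median ignores it.
noise = iter([0.0, 50.0, 0.0, 0.0, 0.0])
results = MedianOfRunsPipeline(lambda cfg: cfg + next(noise)).run([100.0])
```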

To do

  • Create an abstraction layer for ConfigPipeline.
  • Refactor existing search algorithms to work with a modular ConfigPipeline.
  • Make existing or custom ConfigPipelines easily exchangeable within search algorithms.
  • Create an interface that supports metrics collection and noise measurement.
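
The first two items could be sketched as an abstract base class that search algorithms depend on, making concrete pipelines exchangeable. AbstractConfigPipeline, SequentialPipeline, and the exhaustive_search helper below are hypothetical names, not the project's API:

```python
from abc import ABC, abstractmethod

class AbstractConfigPipeline(ABC):
    """Abstraction layer: search algorithms depend only on run()."""

    @abstractmethod
    def run(self, configs, *, desc="Benchmarking"):
        """Compile and benchmark configs; one result per config."""

class SequentialPipeline(AbstractConfigPipeline):
    def __init__(self, benchmark_fn):
        self.benchmark_fn = benchmark_fn

    def run(self, configs, *, desc="Benchmarking"):
        return [(cfg, self.benchmark_fn(cfg)) for cfg in configs]

def exhaustive_search(pipeline, candidates):
    # Any AbstractConfigPipeline implementation can be swapped in here.
    results = pipeline.run(candidates)
    return min(results, key=lambda r: r[1])[0]

best = exhaustive_search(SequentialPipeline(lambda c: abs(c - 3)), [1, 2, 3, 4])
```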
