feat: add dynamic kernel workload tracing (trace-kernel) #11

Open. irvineoy wants to merge 6 commits into main from feature/dynamic-kernel-tracing

@irvineoy commented May 22, 2026

Summary

This PR adds the initial trace-kernel workflow for dynamic kernel workload tracing.

The new workflow can temporarily patch Python-visible kernel launch sites or wrapper calls, run a tracing workload, collect JSONL workload metadata, and aggregate the observed shapes/flags into workload ranges for later kernel optimization.
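The aggregation step can be pictured with a minimal sketch. The JSONL event schema used here (`kernel`, `args`, `shape`, `dtype`) is an assumption made for illustration and is not taken from this PR; only the kernel name comes from the validation notes.

```python
import json

# Hypothetical event lines; the real trace schema in this PR may differ.
SAMPLE_JSONL = """\
{"kernel": "kernel_unified_attention_2d", "args": {"q": {"shape": [8, 64, 128], "dtype": "bfloat16"}}}
{"kernel": "kernel_unified_attention_2d", "args": {"q": {"shape": [32, 64, 128], "dtype": "bfloat16"}}}
"""


def aggregate_shape_ranges(lines):
    """Group events by (kernel, arg, dtype, rank) and track per-dimension min/max."""
    ranges = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        event = json.loads(line)
        for arg_name, meta in event["args"].items():
            shape = meta["shape"]
            key = (event["kernel"], arg_name, meta["dtype"], len(shape))
            if key not in ranges:
                ranges[key] = {"min": list(shape), "max": list(shape), "count": 1}
                continue
            entry = ranges[key]
            entry["min"] = [min(a, b) for a, b in zip(entry["min"], shape)]
            entry["max"] = [max(a, b) for a, b in zip(entry["max"], shape)]
            entry["count"] += 1
    return ranges


RANGES = aggregate_shape_ranges(SAMPLE_JSONL.splitlines())
```

Keying on rank as well as dtype keeps tensors of different dimensionality from being merged into one range.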

Key changes

  • Add workload_optimizer.py trace-kernel subcommand with full CLI argument parsing
  • Add pipeline/kernel_tracing/ package: runtime, serializer, patchers, overlay runner, postprocessor, mode detection, and agent fallback harness
  • Support Triton launch-site tracing for kernel[grid](...) calls via AST patching
  • Support Python-visible wrapper tracing for custom HIP/op wrapper paths (aiter, vLLM, SGLang)
  • Support local execution and Docker benchmark overlay injection without modifying source repos in place
  • Add workload aggregation output for raw events and grouped workload ranges
  • Add unit coverage plus repo-pattern test cases for aiter, vLLM, and SGLang kernels (30 cases across 3 repos)

New files (15 total)

| File | Purpose |
| --- | --- |
| `.gitignore` | Add `results_*/` pattern |
| `pipeline/kernel_tracing/__init__.py` | Package entry point |
| `pipeline/kernel_tracing/agent_harness.py` | Constrained agent fallback for complex patches |
| `pipeline/kernel_tracing/mode_detection.py` | Auto-detect trace mode from source analysis |
| `pipeline/kernel_tracing/overlay.py` | Module overlay + Docker wrapper for injection |
| `pipeline/kernel_tracing/patch_triton.py` | AST-based Triton launch site patching |
| `pipeline/kernel_tracing/patch_wrapper.py` | AST-based Python wrapper patching |
| `pipeline/kernel_tracing/postprocess.py` | JSONL event aggregation into workload ranges |
| `pipeline/kernel_tracing/runner.py` | Top-level trace-kernel orchestration |
| `pipeline/kernel_tracing/runtime.py` | Runtime event emitter (written into patched tree) |
| `pipeline/kernel_tracing/serializer.py` | Safe tensor/value serialization |
| `pipeline/kernel_tracing/test_cases.py` | 30 required repo pattern test case definitions |
| `tests/test_kernel_tracing.py` | Unit tests for patching, overlay, postprocessing |
| `tests/test_kernel_tracing_cases.py` | Parametrized repo-pattern patchability tests |
| `workload_optimizer.py` | CLI trace-kernel subcommand + arg parser |

Validation

pytest tests/test_kernel_tracing.py tests/test_kernel_tracing_cases.py tests/test_workload_optimizer.py tests/test_backends.py -q
112 passed
  • Local trace-kernel smoke test passed
  • Docker E2E GPT-OSS 20B smoke test passed on MI300
    • Traced kernel_unified_attention_2d
    • Captured 98 target launch events
    • Benchmark completed successfully

Notes

  • HIP tracing in this PR is wrapper-level tracing. It captures Python-visible tensor metadata and flags, but does not yet identify exact bottom-level HIP/CK/ASM kernel variants.
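To make "wrapper-level" concrete, here is a minimal sketch of recording Python-visible metadata around a wrapper call. `describe` and `trace_wrapper` are hypothetical names, and tensors are detected by duck typing (`shape` and `dtype` attributes) so the sketch does not depend on torch; the PR's serializer may work differently.

```python
import functools


def describe(value):
    """Serialize only cheap, Python-visible metadata; never touch tensor data."""
    shape = getattr(value, "shape", None)
    dtype = getattr(value, "dtype", None)
    if shape is not None and dtype is not None:
        return {"type": "tensor", "shape": [int(d) for d in shape], "dtype": str(dtype)}
    if isinstance(value, (bool, int, float, str, type(None))):
        return {"type": type(value).__name__, "value": value}
    # Unknown objects are reduced to their type name to keep serialization safe.
    return {"type": type(value).__name__}


def trace_wrapper(op_name, sink):
    """Decorator that appends one event per call to `sink`, then calls through."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            sink.append(
                {
                    "op": op_name,
                    "args": [describe(a) for a in args],
                    "kwargs": {k: describe(v) for k, v in kwargs.items()},
                }
            )
            return fn(*args, **kwargs)

        return wrapped

    return decorator
```

The event describes what the Python wrapper received, which is why it cannot say which HIP/CK/ASM variant ran underneath.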

Add trace-kernel CLI support for patching Triton launches and Python-visible custom op wrappers, collecting JSONL workload metadata, and aggregating workload ranges.

Implement local/Docker overlay injection, container-source patching for Docker benchmarks, Agent fallback wiring, and unit coverage for the required repo cases.
@sinarafati-amd changed the title from "Add dynamic kernel workload tracing" to "feat: add dynamic kernel workload tracing (trace-kernel)" on May 26, 2026
irvineoy added 2 commits May 27, 2026 03:45
Make module_import trace events unsampled so overlay activation can be reliably diagnosed even with low sample rates.

Route aiter-compile-ops tracing through the central aiter.jit.core.compile_ops hook instead of patching only high-level wrapper files.

Add static patch coverage for both aiter ctypes and pybind wrapper paths, plus trace-all support for discovering real low-level op names.

Extend tracing tests and document trace-kernel usage, Docker overlay behavior, result interpretation, and runnable examples.
Render workload signatures and shape ranges as Markdown tables instead of dense inline dictionaries.

Add postprocess coverage to keep the summary output readable for traced tensor input distributions.
    return line_idx


def patch_triton_launch_file(

There are 2 `patch_triton_launch_file` modules defined. Which one is the correct one?

    try:
        yield
    finally:
        os.environ.clear()

We might lose all environment variables here.
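One possible fix, sketched under the assumption that the surrounding context manager is meant to apply temporary overrides: snapshot the environment first and restore it in `finally`, rather than leaving it cleared. The name `temporary_env` is hypothetical.

```python
import os
from contextlib import contextmanager
from typing import Iterator, Mapping


@contextmanager
def temporary_env(overrides: Mapping[str, str]) -> Iterator[None]:
    """Apply `overrides` to os.environ, then restore the original environment."""
    saved = dict(os.environ)  # snapshot before touching anything
    os.environ.update(overrides)
    try:
        yield
    finally:
        # Clearing alone would drop every variable; put the snapshot back.
        os.environ.clear()
        os.environ.update(saved)
```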

from pathlib import Path


RUNTIME_SOURCE = r'''

Is this the same code as serializer.py? Can we consolidate those?

Comment thread tests/test_kernel_tracing_cases.py Outdated


def test_required_case_matrix_has_30_cases():
    assert len(TRACE_TEST_CASES) == 30

Is it OK to have 30 hard-coded here?
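One alternative to a fixed count, sketched with hypothetical stand-ins: check structural properties of the matrix (unique entries, every required repo covered) so the test does not need editing each time a case is added. The `TraceCase` fields and `REQUIRED_REPOS` are assumptions; the real `TRACE_TEST_CASES` entries may be shaped differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceCase:
    """Hypothetical stand-in for one entry of the case matrix."""

    repo: str
    name: str


# The three repos named in the PR description.
REQUIRED_REPOS = {"aiter", "vllm", "sglang"}


def check_case_matrix(cases):
    """Validate the matrix without pinning its exact size; return per-repo counts."""
    keys = [(case.repo, case.name) for case in cases]
    assert len(keys) == len(set(keys)), "duplicate test case definitions"
    per_repo = Counter(case.repo for case in cases)
    missing = REQUIRED_REPOS - per_repo.keys()
    assert not missing, f"no cases for repos: {sorted(missing)}"
    return per_repo
```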

from pathlib import Path
from typing import Any

import yaml

Let's add PyYAML to requirements.txt.

irvineoy and others added 3 commits May 29, 2026 22:19
Add a generated supported-kernels registry and switch trace-kernel to resolve targets by kernel ID. Include list-trace-kernels for discovery, expand patchability tests across the registry, and document the new flow.

Harden tracing for E2E workloads by preserving container module origins, skipping unsafe torch tracing proxies, supporting trace-all wrapper instrumentation, and separating any-event discovery from exact target hits. Also guard aiter compile_ops overlays against uncheckable annotations.
… refresh

- allow trace-kernel to accept repeated or comma-separated kernel IDs
- patch multiple static trace targets in one workload run
- refresh supported kernel registry from benchmark Docker images
- record source image provenance and relax local checkout validation
- add coverage for multi-target tracing and registry update flows
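Accepting repeated or comma-separated IDs can be done with plain `argparse`, as in the sketch below. The flag name `--kernel-id` and the function `parse_kernel_ids` are assumptions for illustration; the PR's actual option name is not shown in this thread.

```python
import argparse


def parse_kernel_ids(argv):
    """Collect kernel IDs from repeated and/or comma-separated flag values."""
    parser = argparse.ArgumentParser(prog="trace-kernel-sketch")
    parser.add_argument(
        "--kernel-id",  # hypothetical flag name
        action="append",
        default=None,
        metavar="ID[,ID...]",
        help="may be repeated; each value may hold several comma-separated IDs",
    )
    namespace = parser.parse_args(argv)

    ordered = []
    for chunk in namespace.kernel_id or []:
        for kernel_id in chunk.split(","):
            kernel_id = kernel_id.strip()
            # Keep first-seen order and drop duplicates and empty fragments.
            if kernel_id and kernel_id not in ordered:
                ordered.append(kernel_id)
    return ordered
```

For example, `--kernel-id a,b --kernel-id c` and `--kernel-id a --kernel-id b --kernel-id c` resolve to the same ordered list of targets.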
Extend trace-kernel with a --disable-benchmark-cuda-graph option that rewrites the selected InferenceX benchmark script into a no-cudagraph overlay for Docker benchmark runs. SGLang launch scripts now get --disable-cuda-graph and --disable-piecewise-cuda-graph, while vLLM launch scripts get --enforce-eager. The generated script is bind-mounted through the existing docker wrapper so Magpie and InferenceX stay read-only.

Move trace_raw permission setup and benchmark-script override handling into the runner, keep stdout JSON output stable, and add the compact trace summary on stderr.

Always generate target_kernel_tensor_shapes.json during trace postprocessing, preserving the broader workload_ranges.json while adding a target-kernel-oriented view for shape analysis.

Add a checked-in DeepSeek R1 multi-kernel tracing example script and update README coverage for the new outputs and options.

Tests cover SGLang and vLLM script rewrites, docker extra mounts, CLI dry-run propagation, target shape artifact generation, and the new example script syntax.