feat: add dynamic kernel workload tracing (trace-kernel) by irvineoy · Pull Request #11 · AMD-AGI/Apex

irvineoy · 2026-05-22T06:17:51Z

Summary

This PR adds the initial trace-kernel workflow for dynamic kernel workload tracing.

The new workflow can temporarily patch Python-visible kernel launch sites or wrapper calls, run a tracing workload, collect JSONL workload metadata, and aggregate the observed shapes/flags into workload ranges for later kernel optimization.
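The aggregation step can be pictured roughly as follows. This is a minimal sketch, not the PR's `postprocess.py`: the event fields `kernel` and `shapes` and the function name `aggregate_shape_ranges` are assumptions made for illustration.

```python
import json


def aggregate_shape_ranges(jsonl_lines):
    """Group events by (kernel, argument, rank) and track per-dimension min/max."""
    ranges = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        event = json.loads(line)
        kernel = event["kernel"]  # hypothetical field name
        for arg, shape in event["shapes"].items():  # hypothetical field name
            key = (kernel, arg, len(shape))
            if key not in ranges:
                ranges[key] = {"min": list(shape), "max": list(shape), "count": 0}
            entry = ranges[key]
            entry["min"] = [min(a, b) for a, b in zip(entry["min"], shape)]
            entry["max"] = [max(a, b) for a, b in zip(entry["max"], shape)]
            entry["count"] += 1
    return ranges


events = [
    '{"kernel": "attn", "shapes": {"q": [1, 64, 128]}}',
    '{"kernel": "attn", "shapes": {"q": [8, 64, 128]}}',
]
result = aggregate_shape_ranges(events)
```

Keying on the tensor rank keeps shapes of different dimensionality from being merged into a single range.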

Key changes

- Add `workload_optimizer.py` `trace-kernel` subcommand with full CLI argument parsing
- Add `pipeline/kernel_tracing/` package: runtime, serializer, patchers, overlay runner, postprocessor, mode detection, and agent fallback harness
- Support Triton launch-site tracing for `kernel[grid](...)` calls via AST patching
- Support Python-visible wrapper tracing for custom HIP/op wrapper paths (aiter, vLLM, SGLang)
- Support local execution and Docker benchmark overlay injection without modifying source repos in place
- Add workload aggregation output for raw events and grouped workload ranges
- Add unit coverage plus repo-pattern test cases for aiter, vLLM, and SGLang kernels (30 cases across 3 repos)
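To illustrate what locating `kernel[grid](...)` launch sites with the `ast` module might look like, here is a small sketch. It is not the code in `patch_triton.py`; the function name `find_triton_launch_sites` is hypothetical, and the heuristic (a call whose callee is a subscript) is an assumption that would also match unrelated patterns such as dispatch tables.

```python
import ast


def find_triton_launch_sites(source):
    """Return sorted (line, callee) pairs for calls shaped like name[grid](...)."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        # A Triton launch is a Call whose function is a Subscript: kernel[grid](...)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Subscript):
            sites.append((node.lineno, ast.unparse(node.func.value)))
    return sorted(sites)


SAMPLE = """
def run(x, y, n):
    grid = (n // 128,)
    add_kernel[grid](x, y, n, BLOCK=128)
    helper(x)
"""
launch_sites = find_triton_launch_sites(SAMPLE)
```

A real patcher would additionally need to confirm that the subscripted object is a Triton kernel before rewriting the call. `ast.unparse` requires Python 3.9 or newer.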

New files (15 total)

| File | Purpose |
| --- | --- |
| `.gitignore` | Add `results_*/` pattern |
| `pipeline/kernel_tracing/__init__.py` | Package entry point |
| `pipeline/kernel_tracing/agent_harness.py` | Constrained agent fallback for complex patches |
| `pipeline/kernel_tracing/mode_detection.py` | Auto-detect trace mode from source analysis |
| `pipeline/kernel_tracing/overlay.py` | Module overlay + Docker wrapper for injection |
| `pipeline/kernel_tracing/patch_triton.py` | AST-based Triton launch site patching |
| `pipeline/kernel_tracing/patch_wrapper.py` | AST-based Python wrapper patching |
| `pipeline/kernel_tracing/postprocess.py` | JSONL event aggregation into workload ranges |
| `pipeline/kernel_tracing/runner.py` | Top-level trace-kernel orchestration |
| `pipeline/kernel_tracing/runtime.py` | Runtime event emitter (written into patched tree) |
| `pipeline/kernel_tracing/serializer.py` | Safe tensor/value serialization |
| `pipeline/kernel_tracing/test_cases.py` | 30 required repo pattern test case definitions |
| `tests/test_kernel_tracing.py` | Unit tests for patching, overlay, postprocessing |
| `tests/test_kernel_tracing_cases.py` | Parametrized repo-pattern patchability tests |
| `workload_optimizer.py` | CLI `trace-kernel` subcommand + arg parser |

Validation

pytest tests/test_kernel_tracing.py tests/test_kernel_tracing_cases.py tests/test_workload_optimizer.py tests/test_backends.py -q
112 passed

Local trace-kernel smoke test passed
Docker E2E GPT-OSS 20B smoke test passed on MI300
- Traced kernel_unified_attention_2d
- Captured 98 target launch events
- Benchmark completed successfully

Notes

HIP tracing in this PR is wrapper-level tracing. It captures Python-visible tensor metadata and flags, but does not yet identify exact bottom-level HIP/CK/ASM kernel variants.
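As a rough picture of wrapper-level tracing, the sketch below wraps a Python-visible op and records only what Python can see: tensor shape and dtype plus scalar flags. The names `trace_wrapper`, `describe`, and `FakeTensor` are hypothetical and do not come from the PR; a real run would pass actual tensors rather than a stand-in class.

```python
import functools
import json


def describe(value):
    """Summarize a value; tensor-like objects are reduced to shape and dtype."""
    if hasattr(value, "shape") and hasattr(value, "dtype"):
        return {"shape": list(value.shape), "dtype": str(value.dtype)}
    if isinstance(value, (bool, int, float, str, type(None))):
        return value
    return {"type": type(value).__name__}


def trace_wrapper(name, sink):
    """Decorator that appends one JSON event per call to `sink`, then calls through."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            event = {
                "op": name,
                "args": [describe(a) for a in args],
                "kwargs": {k: describe(v) for k, v in kwargs.items()},
            }
            sink.append(json.dumps(event))
            return fn(*args, **kwargs)
        return wrapped
    return decorate


class FakeTensor:  # stand-in for a real tensor, for illustration only
    shape = (4, 128)
    dtype = "float16"


captured = []


@trace_wrapper("rms_norm", captured)
def rms_norm(x, eps=1e-6):
    return x


rms_norm(FakeTensor(), eps=1e-5)
```

Nothing in this approach observes which HIP/CK/ASM variant eventually runs, which matches the limitation stated above.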

Add trace-kernel CLI support for patching Triton launches and Python-visible custom op wrappers, collecting JSONL workload metadata, and aggregating workload ranges. Implement local/Docker overlay injection, container-source patching for Docker benchmarks, Agent fallback wiring, and unit coverage for the required repo cases.

Make module_import trace events unsampled so overlay activation can be reliably diagnosed even with low sample rates. Route aiter-compile-ops tracing through the central aiter.jit.core.compile_ops hook instead of patching only high-level wrapper files. Add static patch coverage for both aiter ctypes and pybind wrapper paths, plus trace-all support for discovering real low-level op names. Extend tracing tests and document trace-kernel usage, Docker overlay behavior, result interpretation, and runnable examples.
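The sampling rule for `module_import` events could look something like the sketch below. The function name `should_emit` and the `"launch"` event type are assumptions for illustration; only the `module_import` event name comes from the commit message.

```python
import random


def should_emit(event_type, sample_rate, rng=random.random):
    """Always emit module_import events; sample everything else at sample_rate."""
    if event_type == "module_import":
        return True  # never dropped, so overlay activation stays diagnosable
    return rng() < sample_rate
```

With this shape, a sample rate of 0.0 still yields the import events, which is enough to tell whether the overlay was activated at all.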

Render workload signatures and shape ranges as Markdown tables instead of dense inline dictionaries. Add postprocess coverage to keep the summary output readable for traced tensor input distributions.
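A minimal sketch of rendering shape ranges as a Markdown table is shown here. The helper name `render_markdown_table` and the column names are hypothetical, not taken from the PR's postprocessor.

```python
def render_markdown_table(headers, rows):
    """Render headers and rows as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


table = render_markdown_table(
    ["tensor", "dim", "min", "max"],
    [("q", 0, 1, 8), ("q", 1, 64, 64)],
)
```

Cells containing a literal `|` would need escaping, which this sketch does not attempt.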

sinarafati-amd · 2026-05-26T23:17:29Z

+    return line_idx
+
+
+def patch_triton_launch_file(


There are two `patch_triton_launch_file` definitions. Which one is the correct one?

sinarafati-amd · 2026-05-26T23:18:45Z

+    try:
+        yield
+    finally:
+        os.environ.clear()


We might lose all environment variables here.
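One way to address this concern, assuming the surrounding context manager is meant to apply temporary overrides, is to snapshot `os.environ` on entry and restore that snapshot after clearing. The sketch below is illustrative only: the name `temporary_env` and the `APEX_DEMO_*` variables are hypothetical, and the actual code around the quoted `finally` block is not shown in this review.

```python
import contextlib
import os


@contextlib.contextmanager
def temporary_env(overrides):
    """Apply overrides for the duration of the block, then restore the original env."""
    saved = dict(os.environ)  # snapshot before any mutation
    os.environ.update(overrides)
    try:
        yield
    finally:
        os.environ.clear()
        os.environ.update(saved)  # without this line every variable would be lost


os.environ["APEX_DEMO_KEEP"] = "1"
with temporary_env({"APEX_DEMO_TMP": "x"}):
    inside = os.environ.get("APEX_DEMO_TMP")
after = os.environ.get("APEX_DEMO_TMP")
```

The standard library's `unittest.mock.patch.dict(os.environ, overrides)` offers similar restore-on-exit behavior if a dependency on `unittest.mock` is acceptable outside tests.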

sinarafati-amd · 2026-05-26T23:21:49Z

+from pathlib import Path
+
+
+RUNTIME_SOURCE = r'''


Is this the same code as serializer.py? Can we consolidate them?

sinarafati-amd · 2026-05-29T02:46:20Z

+
+
+def test_required_case_matrix_has_30_cases():
+    assert len(TRACE_TEST_CASES) == 30


Is it OK to have 30 hard-coded here?

sinarafati-amd · 2026-05-29T02:48:05Z

+from pathlib import Path
+from typing import Any
+
+import yaml


Let's add PyYAML to requirements.txt.

Add a generated supported-kernels registry and switch trace-kernel to resolve targets by kernel ID. Include list-trace-kernels for discovery, expand patchability tests across the registry, and document the new flow. Harden tracing for E2E workloads by preserving container module origins, skipping unsafe torch tracing proxies, supporting trace-all wrapper instrumentation, and separating any-event discovery from exact target hits. Also guard aiter compile_ops overlays against uncheckable annotations.

… refresh

- allow trace-kernel to accept repeated or comma-separated kernel IDs
- patch multiple static trace targets in one workload run
- refresh supported kernel registry from benchmark Docker images
- record source image provenance and relax local checkout validation
- add coverage for multi-target tracing and registry update flows
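Accepting both repeated and comma-separated kernel IDs can be sketched with `argparse` as below. The flag name `--kernel-id` and the function `parse_kernel_ids` are assumptions for illustration; the PR's actual option name is not shown in this commit message.

```python
import argparse


def parse_kernel_ids(argv):
    """Collect kernel IDs from repeated flags and comma-separated values, de-duplicated."""
    parser = argparse.ArgumentParser(prog="trace-kernel-demo")
    # action="append" gathers every occurrence of the (hypothetical) flag into a list
    parser.add_argument("--kernel-id", action="append", default=[])
    namespace = parser.parse_args(argv)
    ids = []
    for chunk in namespace.kernel_id:
        for item in chunk.split(","):
            item = item.strip()
            if item and item not in ids:
                ids.append(item)  # preserve first-seen order
    return ids


ids = parse_kernel_ids(["--kernel-id", "a,b", "--kernel-id", "c", "--kernel-id", "a"])
```

Preserving first-seen order keeps the resulting target list stable across runs with the same arguments.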

Extend trace-kernel with a --disable-benchmark-cuda-graph option that rewrites the selected InferenceX benchmark script into a no-cudagraph overlay for Docker benchmark runs. SGLang launch scripts now get --disable-cuda-graph and --disable-piecewise-cuda-graph, while vLLM launch scripts get --enforce-eager. The generated script is bind-mounted through the existing docker wrapper so Magpie and InferenceX stay read-only. Move trace_raw permission setup and benchmark-script override handling into the runner, keep stdout JSON output stable, and add the compact trace summary on stderr. Always generate target_kernel_tensor_shapes.json during trace postprocessing, preserving the broader workload_ranges.json while adding a target-kernel-oriented view for shape analysis. Add a checked-in DeepSeek R1 multi-kernel tracing example script and update README coverage for the new outputs and options. Tests cover SGLang and vLLM script rewrites, docker extra mounts, CLI dry-run propagation, target shape artifact generation, and the new example script syntax.
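The script rewrite described above might be approximated as follows. This is a sketch under stated assumptions: the flags come from the commit message, but the detection markers (`sglang.launch_server`, `vllm serve`) and the function name `add_no_cudagraph_flags` are guesses at how a launch line could be recognized, not the PR's implementation.

```python
# Flags are from the commit message; the marker strings are assumptions.
FLAGS_BY_MARKER = {
    "sglang.launch_server": ["--disable-cuda-graph", "--disable-piecewise-cuda-graph"],
    "vllm serve": ["--enforce-eager"],
}


def add_no_cudagraph_flags(script_text):
    """Insert no-cudagraph flags right after the launch command, skipping flags already present."""
    rewritten = []
    for line in script_text.splitlines():
        for marker, flags in FLAGS_BY_MARKER.items():
            if marker in line:
                # Check the whole script so flags on continuation lines are not duplicated
                missing = [flag for flag in flags if flag not in script_text]
                if missing:
                    line = line.replace(marker, marker + " " + " ".join(missing), 1)
                break
        rewritten.append(line)
    return "\n".join(rewritten)


sglang_script = "python3 -m sglang.launch_server --model-path m --port 30000"
vllm_script = "vllm serve m --port 8000"
sglang_out = add_no_cudagraph_flags(sglang_script)
vllm_out = add_no_cudagraph_flags(vllm_script)
```

Inserting the flags directly after the command, rather than at the end of the line, leaves any trailing `\` line continuation intact. Writing the result to a separate file and bind-mounting it is what keeps the original script read-only.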

irvineoy requested review from mycpuorg and sinarafati-amd as code owners May 22, 2026 06:17

sinarafati-amd changed the title ~~Add dynamic kernel workload tracing~~ feat: add dynamic kernel workload tracing (trace-kernel) May 26, 2026

irvineoy added 2 commits May 27, 2026 03:45

Format kernel tracing workload summaries

fbbb9d2


sinarafati-amd reviewed May 29, 2026


irvineoy and others added 3 commits May 29, 2026 22:19

irvineoy wants to merge 6 commits into main from feature/dynamic-kernel-tracing

irvineoy commented May 22, 2026 • edited by sinarafati-amd
