Three axioms govern hardware classification across all test categories:
- sim = no hardware: a2a3sim, a5sim, and no --platform are all equivalent; no hardware is required.
- a2a3 / a5 are distinct platforms: tests may support one, both, or neither.
- requires_hardware has two levels:
  - requires_hardware (no argument): needs any hardware (a2a3 or a5)
  - requires_hardware("a2a3"): needs specifically a2a3
These principles apply uniformly to ut-py (pytest markers), ut-cpp (ctest labels), and st (@scene_test(platforms=[...])).
# Python unit tests, no hardware (sim or github-hosted)
pytest tests/ut
# Python unit tests, a2a3 hardware
pytest tests/ut --platform a2a3
# C++ unit tests, no hardware
cmake -B tests/ut/cpp/build -S tests/ut/cpp && cmake --build tests/ut/cpp/build
ctest --test-dir tests/ut/cpp/build -LE requires_hardware --output-on-failure
# C++ unit tests, a2a3 hardware (only hw + a2a3-specific tests)
ctest --test-dir tests/ut/cpp/build -L "^requires_hardware(_a2a3)?$" --output-on-failure
# Scene tests (pytest, @scene_test classes)
pytest examples tests/st # all sim platforms (auto-parametrized)
pytest examples tests/st --platform a2a3sim # specific sim
pytest examples tests/st --platform a2a3 # hardware
# Scene tests (legacy ci.py, golden.py directory scanning)
python ci.py -p a2a3sim
python ci.py -p a2a3 -d 4-7
# Single scene test (standalone)
python examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py -p a2a3sim
# Standalone with build-from-source
python examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py -p a2a3sim --build
# Benchmark mode (100 rounds, skip golden comparison)
python examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py \
-p a2a3 -d 0 -n 100 --skip-golden
# Profiling (first round only)
python examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py \
-p a2a3 --enable-profiling
# Single example via run_example.py (deprecated — prefer test_*.py standalone)
python examples/scripts/run_example.py \
-k examples/a2a3/host_build_graph/vector_example/kernels \
-g examples/a2a3/host_build_graph/vector_example/golden.py \
-p a2a3sim

Three test categories:

| Category | Abbrev | Location | Runner | Description |
|---|---|---|---|---|
| System tests | st | examples/, tests/st/ | pytest + ci.py (legacy) | Full end-to-end cases (compile + run + validate) |
| Python unit tests | ut-py | tests/ut/ | pytest | Unit tests for nanobind-exposed and Python modules |
| C++ unit tests | ut-cpp | tests/ut/cpp/ | ctest (GoogleTest) | Unit tests for pure C++ modules |
ST migration: Scene tests are migrating from ci.py (golden.py directory scanning) to pytest (@scene_test class decorator). New tests should use @scene_test. Existing golden.py-based tests continue to work via ci.py during the transition.
If a module is exposed via nanobind (used by both C++ and Python), test in ut-py (tests/ut/).
If a module is pure C++ with no Python binding, test in ut-cpp (tests/ut/cpp/).
Scene tests support advanced CLI options for benchmarking, profiling, and runtime control. These work identically in both pytest and standalone mode.
pytest --platform a2a3sim # default: 1 round + golden
pytest --platform a2a3 --rounds 100 --skip-golden # benchmark mode
pytest --platform a2a3 --enable-profiling # profiling (first round)
pytest --platform a2a3sim --build # compile runtime from source
pytest --platform a2a3sim --log-level debug # verbose C++ logging

python test_xxx.py -p a2a3sim # default: 1 round + golden
python test_xxx.py -p a2a3 -d 0 -n 100 --skip-golden # benchmark mode
python test_xxx.py -p a2a3 --enable-profiling # profiling (first round)
python test_xxx.py -p a2a3sim --build # compile runtime from source
python test_xxx.py -p a2a3sim --log-level debug # verbose C++ logging

| Option | Short | Default | Description |
|---|---|---|---|
| --rounds N | -n | 1 | Run each case N times |
| --skip-golden | | false | Skip golden comparison (for benchmarking) |
| --enable-profiling | | false | Enable profiling on first round only |
| --build | | false | Compile runtime from source (not pre-built) |
| --log-level LEVEL | | (none) | Set PTO_LOG_LEVEL env var (error/warn/info/debug) |
Profiling is enabled only on the first round to avoid overhead on subsequent iterations. Output tensors are reset to their initial values between rounds.
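The round semantics described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: `run_rounds`, `run_case`, and the dict-of-lists tensor representation are hypothetical names chosen for the example.

```python
import copy


def run_rounds(run_case, tensors, output_names, rounds=1, enable_profiling=False):
    """Run a case `rounds` times (hypothetical helper).

    Profiling is requested only for round 0, and output tensors are restored
    to their initial values before every later round.
    `run_case(tensors, profiling)` stands in for the real launch and is
    expected to mutate the outputs in `tensors`.
    """
    # Snapshot initial output values so every round starts from the same state.
    initial = {name: copy.deepcopy(tensors[name]) for name in output_names}
    profiled = []
    for i in range(rounds):
        if i > 0:
            for name in output_names:
                tensors[name] = copy.deepcopy(initial[name])
        profiling = enable_profiling and i == 0
        profiled.append(profiling)
        run_case(tensors, profiling)
    return tensors, profiled


def accumulate(tensors, profiling):
    # Toy "kernel": y += x + 1. It accumulates, so a missing reset would show up.
    tensors["y"] = [y + x + 1 for x, y in zip(tensors["x"], tensors["y"])]


result, profiled_rounds = run_rounds(
    accumulate, {"x": [1, 2], "y": [0, 0]}, ["y"], rounds=3, enable_profiling=True
)
```

Because the toy kernel accumulates into `y`, the final value equals the single-round result only if the reset between rounds actually happens.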
The --platform flag is the single source of truth for hardware availability. No separate -m flags are needed.
def is_device(platform: str | None) -> bool:
    """sim and None are both no-hardware."""
    return platform is not None and not platform.endswith("sim")

| Declaration | no-hw runner | a2a3 runner | a5 runner |
|---|---|---|---|
| (no marker) | run | skip | skip |
| @pytest.mark.requires_hardware | skip | run | run |
| @pytest.mark.requires_hardware("a2a3") | skip | run | skip |
| @pytest.mark.requires_hardware("a5") | skip | skip | run |
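The run/skip table can be checked with a small pure function. The sketch below is illustrative only: `skip_reason` and its `marker_args` encoding (`None` for no marker, `()` for the bare marker, `("a2a3",)` for a platform-specific marker) are assumptions made for this example, not names from conftest.py.

```python
def is_device(platform):
    """sim and None are both no-hardware."""
    return platform is not None and not platform.endswith("sim")


def skip_reason(marker_args, platform):
    """Return a skip reason, or None if the test should run (hypothetical helper).

    marker_args: None      -> no requires_hardware marker
                 ()        -> bare requires_hardware
                 ("a2a3",) -> requires_hardware("a2a3")
    """
    on_device = is_device(platform)
    if marker_args is None:
        return "no-hardware test, runs in no-hw job" if on_device else None
    if marker_args:
        wanted = marker_args[0]
        return None if platform == wanted else f"requires --platform {wanted}"
    return None if on_device else "requires hardware"


# Rebuild the table: columns are the no-hw, a2a3, and a5 runners.
runners = [None, "a2a3", "a5"]
declarations = {
    "(no marker)": None,
    "requires_hardware": (),
    'requires_hardware("a2a3")': ("a2a3",),
    'requires_hardware("a5")': ("a5",),
}
table = {
    name: ["skip" if skip_reason(args, p) else "run" for p in runners]
    for name, args in declarations.items()
}
```

Keeping the decision in a function that takes plain values makes it testable without a pytest session.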
Skip logic (conftest.py):
marker = item.get_closest_marker("requires_hardware")
on_device = is_device(platform)
if marker is None:
    if on_device:
        skip("no-hardware test, runs in no-hw job")
elif marker.args:
    if platform != marker.args[0]:
        skip(f"requires --platform {marker.args[0]}")
else:
    if not on_device:
        skip("requires hardware")

| Declaration (CMakeLists.txt) | no-hw runner | a2a3 runner | a5 runner |
|---|---|---|---|
| (no label) | run | skip | skip |
| LABELS "requires_hardware" | skip | run | run |
| LABELS "requires_hardware_a2a3" | skip | run | skip |
| LABELS "requires_hardware_a5" | skip | skip | run |
Selection:
| Runner | Command |
|---|---|
| No hardware | ctest -LE requires_hardware |
| a2a3 | ctest -L "^requires_hardware(_a2a3)?$" |
| a5 | ctest -L "^requires_hardware(_a5)?$" |
-LE (exclude regex) on no-hw runner: requires_hardware matches all three label variants, so only unlabeled tests run.
-L (include regex) on device runners: only labeled tests run, unlabeled ones are excluded.
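The effect of the three selection regexes can be checked offline. The sketch below uses Python's `re` module as a stand-in for ctest's regex matching, which is an assumption; for patterns this simple the two should agree. `selected` is a hypothetical helper that models a test carrying at most one label.

```python
import re

# None models a test with no LABELS property.
LABELS = [None, "requires_hardware", "requires_hardware_a2a3", "requires_hardware_a5"]


def selected(label, include=None, exclude=None):
    """Approximate `ctest -L include` / `ctest -LE exclude` for one test."""
    if include is not None and (label is None or not re.search(include, label)):
        return False  # -L keeps only tests whose label matches
    if exclude is not None and label is not None and re.search(exclude, label):
        return False  # -LE drops tests whose label matches
    return True


no_hw = [l for l in LABELS if selected(l, exclude="requires_hardware")]
a2a3 = [l for l in LABELS if selected(l, include=r"^requires_hardware(_a2a3)?$")]
a5 = [l for l in LABELS if selected(l, include=r"^requires_hardware(_a5)?$")]
```

The unanchored exclude pattern matches every label variant as a substring, while the anchored include patterns match only the generic label and one platform-specific label.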
platforms lists all supported platform names (both sim and device).
| --platform | Behavior |
|---|---|
| a2a3sim | Run if "a2a3sim" in platforms, else skip |
| a2a3 | Run if "a2a3" in platforms, else skip |
| (none) | Auto-parametrize over all *sim entries in platforms. Skip if no sim platform declared |
Auto-parametrization logic (conftest.py pytest_generate_tests):
def pytest_generate_tests(metafunc):
    cls = metafunc.cls
    if not (cls and hasattr(cls, "_scene_platforms")):
        return
    platform = metafunc.config.getoption("--platform")
    if platform is None:
        sims = [p for p in cls._scene_platforms if p.endswith("sim")]
        if sims:
            metafunc.parametrize("st_platform", sims, indirect=True)

Examples:

@scene_test(level=2, platforms=["a2a3sim", "a2a3", "a5sim", "a5"], ...)
class TestFoo(SceneTestCase): ...
# --platform a2a3sim → TestFoo[a2a3sim]
# --platform a2a3 → TestFoo[a2a3]
# (none) → TestFoo[a2a3sim] + TestFoo[a5sim]

@scene_test(level=2, platforms=["a2a3"], ...)
class TestHwOnly(SceneTestCase): ...
# --platform a2a3 → TestHwOnly[a2a3]
# (none) → skip (no sim in platforms)

No separate st or requires_hardware marker is needed: platforms is the sole declaration.
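The scene-test selection rules in this section reduce to a short function. The following is a sketch rather than the conftest.py code: `resolve_platforms` is a hypothetical name, and an empty result stands for "skip".

```python
def resolve_platforms(declared, platform):
    """Return the platforms a scene test class runs on; empty list means skip.

    declared: the `platforms=[...]` list from @scene_test
    platform: the value of --platform, or None when the flag is omitted
    """
    if platform is None:
        # No flag: auto-parametrize over the declared sim platforms only.
        return [p for p in declared if p.endswith("sim")]
    return [platform] if platform in declared else []


foo = ["a2a3sim", "a2a3", "a5sim", "a5"]
hw_only = ["a2a3"]

foo_default = resolve_platforms(foo, None)
foo_hw = resolve_platforms(foo, "a2a3")
hw_only_default = resolve_platforms(hw_only, None)
hw_only_hw = resolve_platforms(hw_only, "a2a3")
```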
tests/
    conftest.py                # pytest configuration (markers, fixtures, parametrization)
    ut/                        # Python unit tests (ut-py)
        test_task_interface.py
        test_runtime_builder.py
        cpp/                   # C++ unit tests (ut-cpp, GoogleTest)
    st/                        # Scene tests
        setup/                 # Compilation toolchain (KernelCompiler, RuntimeBuilder, etc.)
        conftest.py            # ST-specific sys.path setup
        test_worker_api.py     # L3 distributed worker tests
        a2a3/                  # Legacy golden.py-based tests (ci.py)
        a5/                    # Legacy golden.py-based tests (ci.py)
examples/                      # Small examples (sim + onboard)
    a2a3/
        tensormap_and_ringbuffer/
            vector_example/
                test_vector_example.py    # @scene_test, new style
                golden.py                 # legacy (ci.py)
                kernels/kernel_config.py  # legacy (ci.py)
    a5/...
conftest.py                    # Root: --platform/--device options, ST fixtures
GoogleTest-based tests for shared components (src/common/task_interface/ and src/{arch}/runtime/common/):
- test_data_type.cpp: DataType enum, get_element_size(), get_dtype_name()

cmake -B tests/ut/cpp/build -S tests/ut/cpp
cmake --build tests/ut/cpp/build
ctest --test-dir tests/ut/cpp/build --output-on-failure

Tests for the nanobind extension and the Python build pipeline:
- test_task_interface.py: DataType, ContinuousTensor, ChipStorageTaskArgs, torch integration
- test_runtime_builder.py: RuntimeBuilder discovery, error handling, build logic (mocked), and real compilation integration tests
# No-hardware runner (hw tests auto-skip, no-hw tests run)
pytest tests/ut
# a2a3 hardware runner (no-hw tests skip, hw + a2a3-specific tests run)
pytest tests/ut --platform a2a3

Small, fast examples that run on both simulation and real hardware. Organized by runtime:
- host_build_graph/: HBG examples
- aicpu_build_graph/: ABG examples
- tensormap_and_ringbuffer/: TMR examples
Each example has a golden.py with generate_inputs() and compute_golden() for result validation.
Hardware-only scene tests for large-scale and feature-rich scenarios that are too slow or unsupported on simulation. Organized by runtime. Same structure as examples but focused on testing specific runtime behaviors and edge cases.
| Attribute | examples/ | tests/st/ |
|---|---|---|
| Runs on sim | Yes | No |
| Runs on device | Yes | Yes |
| Scale | Small, fast | Large, thorough |
| Purpose | Examples + basic regression | Deep functionality/performance |
Add a new test file to tests/ut/cpp/ and register it in tests/ut/cpp/CMakeLists.txt:
add_executable(test_my_component
    test_my_component.cpp
    test_stubs.cpp
)
target_include_directories(test_my_component PRIVATE ${COMMON_DIR} ${TMR_RUNTIME_DIR} ${PLATFORM_INCLUDE_DIR})
target_link_libraries(test_my_component gtest_main)
add_test(NAME test_my_component COMMAND test_my_component)
# If hardware required:
# set_tests_properties(test_my_component PROPERTIES LABELS "requires_hardware")
# If specific platform required:
# set_tests_properties(test_my_component PROPERTIES LABELS "requires_hardware_a2a3")

Create a test_*.py file using the @scene_test decorator:
import torch

from setup import SceneTestCase, scene_test
from simpler.task_interface import ArgDirection as D


@scene_test(level=2, platforms=["a2a3sim", "a2a3"], runtime="tensormap_and_ringbuffer")
class TestMyKernel(SceneTestCase):
    # Orchestration entry point and argument directions
    ORCHESTRATION = {
        "source": "kernels/orchestration/my_orch.cpp",
        "function_name": "aicpu_orchestration_entry",
        "signature": [D.IN, D.OUT],
    }
    KERNELS = [{"func_id": 0, "source": "kernels/aiv/my_kernel.cpp", "core_type": "aiv"}]
    RUNTIME_CONFIG = {"aicpu_thread_num": 4, "block_dim": 3}
    # Tensors compared against the golden result
    __outputs__ = ["y"]

    def generate_inputs(self, params):
        return [("x", torch.ones(1024)), ("y", torch.zeros(1024))]

    def compute_golden(self, tensors, params):
        tensors["y"][:] = tensors["x"] + 1


if __name__ == "__main__":
    SceneTestCase.run_module(__name__)

Run it:
# Via pytest (batch, ChipWorker reuse across tests)
pytest examples tests/st --platform a2a3sim
# Standalone (single case)
python test_my_kernel.py -p a2a3sim
# On hardware
pytest examples tests/st --platform a2a3

Key fields:
- level: 2 = single ChipWorker, 3 = distributed Worker (future)
- platforms: which platforms this test supports (sim names end in "sim")
- runtime: which runtime to use
- ORCHESTRATION.source / KERNELS[].source: paths relative to the test file
The golden.py + kernel_config.py directory format is still supported via ci.py:
Create a directory under tests/st/{arch}/{runtime}/my_test/ with:
- golden.py: input generation and golden output computation
- kernels/kernel_config.py: kernel and runtime configuration
The test will be automatically picked up by ci.py. New tests should prefer the @scene_test format above.
See ci.md for the full CI pipeline documentation, including the job matrix, runner constraints, marker scheme, and ci.sh internals.
One device can only run one runtime per process. Switching runtimes on the same device within a single process causes AICPU kernel hangs.
CANN's AICPU dispatch uses a framework SO (libaicpu_extend_kernels.so) with a global singleton BackendServerHandleManager that:
- SaveSoFile: writes the user AICPU .so to disk on the first call, then sets firstCreatSo_ = true to skip all subsequent writes.
- SetTileFwkKernelMap: dlopens the .so and caches function pointers on the first call, then sets firstLoadSo_ = true to skip all subsequent loads.
When a second runtime launches on the same device (same CANN process context), the Init kernel call hits the cached flags — the new AICPU .so is never written or loaded. The Exec kernel then calls function pointers from the first runtime's .so, which operates on incompatible data structures and hangs.
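The failure mode can be illustrated with a toy model of a load-once singleton. This is not CANN code: `BackendManagerModel`, `init`, and `execute` are hypothetical names, and the model only mirrors the first-call caching behavior described above.

```python
class BackendManagerModel:
    """Toy stand-in for a process-wide manager that loads a .so only once."""

    def __init__(self):
        self.first_load_done = False  # plays the role of the first-load flag
        self.loaded_so = None         # plays the role of the cached function pointers

    def init(self, so_name):
        # Only the first call loads anything; later calls hit the cached flag.
        if not self.first_load_done:
            self.loaded_so = so_name
            self.first_load_done = True

    def execute(self):
        # Always dispatches into whatever was loaded first.
        return self.loaded_so


manager = BackendManagerModel()  # one instance per process and device context
manager.init("runtime_a.so")
first = manager.execute()
manager.init("runtime_b.so")     # silently ignored: the flag is already set
second = manager.execute()
```

In the model, the second runtime still dispatches into the first runtime's library, which corresponds to the stale function pointers that cause the hang.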
| Scenario | Result |
|---|---|
| Same runtime, same device, sequential | Works (same .so, cached pointers valid) |
| Different runtime, same device, sequential | Hangs (stale .so, wrong function pointers) |
| Different runtime, different device | Works (separate CANN context per device) |
| Different runtime, different process, same device | Works (rtDeviceReset between processes clears context) |
The conftest.py device allocator groups tests by runtime and assigns each runtime group to exclusive devices. See "Device Allocation Algorithm" below.
When running pytest --platform a2a3 --device 8-11, the fixture must allocate devices to tests such that:
- Runtime isolation: A device used by runtime A must not be reused by runtime B in the same process.
- L3 multi-device: L3 tests may need 2+ contiguous devices.
- Efficiency: Devices freed by one test of the same runtime can be reused by the next.
Phase 1: Group tests by runtime

    tensormap_and_ringbuffer: [TestVectorExample, TestScalarData, TestL3Dependency, ...]
    aicpu_build_graph: [TestPagedAttentionAicpuBuildGraph]
    host_build_graph: [TestPagedAttentionHostBuildGraph]

Phase 2: Partition devices across runtime groups

    Available: [8, 9, 10, 11]
    tensormap_and_ringbuffer (6 tests, needs max 2 for L3 group): devices [8, 9]
    aicpu_build_graph (1 test, needs 1): devices [10]
    host_build_graph (1 test, needs 1): devices [11]

Phase 3: Within each group, allocate from group's device pool

    TestVectorExample: dev 8 → run → release → dev 8 available again
    TestScalarData: dev 8 → run → release → OK (same runtime)
    TestL3Dependency: dev 8 → run → release
    TestL3Group: dev [8, 9] → run → release
    TestPagedAttentionAicpuBuildGraph: dev 10 → run → release
    TestPagedAttentionHostBuildGraph: dev 11 → run → release
The DevicePool in conftest.py is extended with runtime-aware partitioning. The st_worker fixture checks the test class's _st_runtime and allocates from the corresponding partition.
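Phase 2 of the algorithm can be sketched as a simple greedy partition. This is an illustration under assumptions, not the DevicePool implementation: `partition_devices` is a hypothetical name, and each group's requirement is taken as the largest device count any of its tests needs.

```python
def partition_devices(needs, devices):
    """Give each runtime group an exclusive, consecutive slice of `devices`.

    needs:   dict mapping runtime name -> number of devices the group requires
    devices: ordered list of available device IDs
    """
    total = sum(needs.values())
    if total > len(devices):
        raise ValueError(f"need {total} devices, only {len(devices)} available")
    partitions, start = {}, 0
    for runtime, count in needs.items():  # dicts keep insertion order
        partitions[runtime] = devices[start:start + count]
        start += count
    return partitions


partitions = partition_devices(
    {"tensormap_and_ringbuffer": 2, "aicpu_build_graph": 1, "host_build_graph": 1},
    [8, 9, 10, 11],
)
```

Because no device ID appears in two partitions, a device is never handed to a second runtime within the same process. Any devices left over after the last group stay unassigned in this sketch.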
On sim (a2a3sim, a5sim), device IDs are virtual — no hardware state, no isolation constraint. All tests share a single virtual pool with auto-incrementing IDs.
The @scene_test(platforms=[...]) decorator provides per-case platform filtering. A single test class declares which platforms it supports:
@scene_test(level=2, platforms=["a2a3sim", "a2a3"], runtime="tensormap_and_ringbuffer")
class TestSmallCase(SceneTestCase):
    ...  # runs on sim and a2a3 hardware

@scene_test(level=2, platforms=["a2a3"], runtime="tensormap_and_ringbuffer")
class TestLargeCase(SceneTestCase):
    ...  # hardware only (too slow for sim)

This eliminates the need for separate examples/ (sim) and tests/st/ (device) directories when only scale differs. Both cases can live in the same file.
When kernels themselves differ (e.g., templated tile sizes tuned for device), separate test files remain the correct approach.