[Bug] Known Issue: AICPU Task Timeout with Small Ring Buffers Due to Scheduler Hot-Path Overhead #409
Description
Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
Summary
When the scheduler hot path carries non-trivial overhead (~10μs per iteration), AICPU stream synchronization fails with error code 507018 on test cases that use small ring buffer configurations (e.g., paged_attention_ringbuffer with window=128, heap=256KB). Two known triggers:
- CANN device log level 0 (DEBUG): Two DEV_DEBUG calls in the scheduler hot path each take ~10μs.
- --enable-profiling: Per-task profiling operations (perf_aicpu_complete_record() with fanout list traversal, perf_aicpu_record_phase()) add comparable overhead.
Both pass with default ring sizes or when the overhead is removed.
Recommendation:
- Do not use CANN device log level 0 for testing with small ring buffer configurations. Use level 1 (INFO) or above instead.
- Profiling (--enable-profiling) is not supported with small ring buffer configurations. Use default ring sizes when profiling.
Root Cause
The scheduler hot path in aicpu_executor.cpp (check_running_cores_for_completion and the dispatch loop) must process task completions fast enough to keep the ring drained. Any per-iteration overhead at the ~10μs level slows the scheduler loop. When ring buffer resources are tight, the slow scheduler causes the orchestrator to block repeatedly in alloc(), which extends total AICPU execution time from milliseconds to seconds. That exceeds CANN's internal AICPU task timeout threshold, and the task is terminated (error 507018).
With large ring buffers, the orchestrator never blocks and execution completes in tens of milliseconds, well within the timeout.
Trigger 1 — DEV_DEBUG (~10μs each):
Controlled experiments confirmed this is purely an execution time issue, not related to dlog internals or CANN DEBUG log accumulation. Replacing DEV_DEBUG with a busy-wait of equal duration (no dlog calls, log level 1) produces the same failure.
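The busy-wait substitution described above can be illustrated with a minimal sketch. It is written in Python only to show the technique; the actual experiment replaced DEV_DEBUG inside the C++ scheduler loop, and the function name busy_wait_us is hypothetical rather than taken from the project.

```python
import time


def busy_wait_us(duration_us: float) -> int:
    """Spin (no sleep, no logging, no I/O) until duration_us has elapsed.

    Returns the measured elapsed time in nanoseconds.
    """
    start = time.perf_counter_ns()
    target_ns = int(duration_us * 1000)
    while True:
        elapsed = time.perf_counter_ns() - start
        if elapsed >= target_ns:
            return elapsed


# Stand-in for one DEV_DEBUG call (~10 us according to this report).
elapsed_ns = busy_wait_us(10)
print(f"spun for {elapsed_ns / 1000:.1f} us")
```

Because the spin does nothing except consume time, a failure that reproduces with it points at execution time alone, independent of what the replaced call did internally.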
Trigger 2 — Profiling:
Keeping profiling_enabled = true but commenting out the actual operations (perf_aicpu_complete_record, fanout traversal, perf_aicpu_record_phase) makes the test pass, confirming the same overhead-induced timeout pattern.
Affected Configurations
| Configuration | Log Level 0 | Log Level 1+ | --enable-profiling |
|---|---|---|---|
| Default ring size (window=16384, heap=256MB) | Works | Works | Works |
| Small ring size (window=128, heap=256KB) | Fails (507018) | Works | Fails (507018) |
Workaround
- Use CANN device log level 1 (INFO) or above when running tests with small ring buffer configurations.
- Do not use --enable-profiling with small ring buffer configurations. Use default ring sizes for profiling.
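The workaround could be enforced with a pre-flight check before launching a test. The following is a minimal sketch, not part of the project: the function name, the parameters, and the thresholds used to decide what counts as "small" are assumptions. The environment variable names are the ones used in the reproduction steps of this issue.

```python
import os

# Environment variables taken from the reproduction steps in this issue.
LOG_LEVEL_VARS = (
    "ASCEND_GLOBAL_LOG_LEVEL",
    "ASCEND_DEVICE_LOG_LEVEL",
    "GLOBAL_LOG_LEVEL",
)

# Hypothetical cut-offs: the issue only reports window=128 / heap=256KB as
# failing and window=16384 / heap=256MB as working. Values in between are
# untested, so these thresholds are an assumption.
SMALL_WINDOW = 128
SMALL_HEAP_BYTES = 256 * 1024


def preflight_problems(window, heap_bytes, enable_profiling, env=None):
    """Return reasons why this run may hit error 507018 (empty list if none)."""
    env = os.environ if env is None else env
    is_small = window <= SMALL_WINDOW or heap_bytes <= SMALL_HEAP_BYTES
    if not is_small:
        return []  # not a known-small ring buffer configuration

    problems = []
    for name in LOG_LEVEL_VARS:
        if env.get(name) == "0":
            problems.append(f"{name}=0 (DEBUG) with a small ring buffer")
    if enable_profiling:
        problems.append("--enable-profiling with a small ring buffer")
    return problems


# Example: small ring, DEBUG device logging, profiling enabled.
for reason in preflight_problems(
    128, 256 * 1024, True, env={"ASCEND_DEVICE_LOG_LEVEL": "0"}
):
    print("warning:", reason)
```

Passing env explicitly keeps the check testable without touching the real process environment.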
Notes
- The exact mechanism by which dlog blocks a single thread internally is a CANN implementation detail and has not been determined.
- A future fix could move profiling operations off the scheduler hot path (deferred write or conditional compilation), similar to the PTO2_HOT_PATH_LOGGING fix for DEV_DEBUG.
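The deferred-write idea mentioned in the notes can be sketched as follows. This is a conceptual illustration in Python, not the project's implementation: the class DeferredRecorder, its methods, and the record layout are all hypothetical. The hot path stores a small tuple, and formatting or traversal is postponed until the loop has finished.

```python
from collections import deque


class DeferredRecorder:
    """Collect cheap records in the hot loop and format them afterwards."""

    def __init__(self, capacity=4096):
        # Bounded buffer: once full, the oldest records are dropped.
        self._records = deque(maxlen=capacity)

    def record(self, task_id, phase, timestamp_ns):
        # Hot path: a single append, with no formatting and no I/O.
        self._records.append((task_id, phase, timestamp_ns))

    def flush(self):
        # Cold path: intended to run after the scheduler loop has finished.
        lines = [f"task={t} phase={p} ts={ts}" for t, p, ts in self._records]
        self._records.clear()
        return lines


rec = DeferredRecorder(capacity=2)
rec.record(1, "dispatch", 100)
rec.record(1, "complete", 200)
rec.record(2, "dispatch", 300)  # evicts the oldest record (capacity is 2)
for line in rec.flush():
    print(line)
```

The bounded buffer trades completeness for predictable hot-path cost; whether dropping old records is acceptable for profiling data is a design decision the sketch does not settle.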
Steps to Reproduce
# Trigger 1: CANN log level 0
export ASCEND_GLOBAL_LOG_LEVEL=0
export ASCEND_DEVICE_LOG_LEVEL=0
export GLOBAL_LOG_LEVEL=0
python examples/scripts/run_example.py \
-k tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/kernels \
-g tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/golden.py \
-p a2a3
# Trigger 2: Profiling
python examples/scripts/run_example.py \
-k tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/kernels \
-g tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/golden.py \
-p a2a3 --enable-profiling
Expected Behavior
None
Actual Behavior
None
Git Commit ID
CANN Version
No response
Driver Version
No response
Host Platform
Linux (aarch64)
Additional Context
No response