[Bug] Known Issue: AICPU Task Timeout with Small Ring Buffers Due to Scheduler Hot-Path Overhead #409

@ChaoZheng109

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

Summary

When the scheduler hot path carries non-trivial overhead (~10μs per iteration), AICPU stream synchronization fails with error code 507018 on test cases that use small ring buffer configurations (e.g., paged_attention_ringbuffer with window=128, heap=256KB). Two known triggers:

  1. CANN device log level 0 (DEBUG): Two DEV_DEBUG calls in the scheduler hot path each take ~10μs.
  2. --enable-profiling: Per-task profiling operations (perf_aicpu_complete_record() with fanout list traversal, perf_aicpu_record_phase()) add comparable overhead.

In both cases the test passes with default ring sizes or when the overhead is removed.

Recommendation:

  • Do not use CANN device log level 0 for testing with small ring buffer configurations. Use level 1 (INFO) or above instead.
  • Profiling (--enable-profiling) is not supported with small ring buffer configurations. Use default ring sizes when profiling.

Root Cause

The scheduler hot path in aicpu_executor.cpp (check_running_cores_for_completion and the dispatch loop) must process task completions fast enough to keep the ring drained. Per-iteration overhead on the order of ~10μs slows the scheduler loop. When ring buffer resources are tight, the slower scheduler causes the orchestrator to block repeatedly in alloc(), which extends total AICPU execution time from milliseconds to seconds. That exceeds CANN's internal AICPU task timeout threshold, and the task is terminated (error 507018).

With large ring buffers, the orchestrator never blocks and execution completes in tens of milliseconds, well within the timeout.

Trigger 1 — DEV_DEBUG (~10μs each):
Controlled experiments confirmed this is purely an execution time issue, not related to dlog internals or CANN DEBUG log accumulation. Replacing DEV_DEBUG with a busy-wait of equal duration (no dlog calls, log level 1) produces the same failure.
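
The busy-wait substitution can be illustrated with a small sketch. This is not the code used in the experiment, which would sit in the C++ scheduler loop in aicpu_executor.cpp. `busy_wait_us` is a hypothetical helper, written in Python only to show the technique: spin on a monotonic clock for a fixed duration with no logging and no I/O, so that any resulting failure can be attributed to elapsed time alone.

```python
import time


def busy_wait_us(duration_us: float) -> int:
    """Spin for at least `duration_us` microseconds and return the elapsed ns.

    Hypothetical helper: no sleep, no logging, no I/O, so the only effect on
    the caller is the time consumed.
    """
    start = time.perf_counter_ns()
    deadline = start + int(duration_us * 1_000)
    while True:
        now = time.perf_counter_ns()
        if now >= deadline:
            return now - start


if __name__ == "__main__":
    # Stand-in for one DEV_DEBUG call of roughly 10 microseconds.
    elapsed_ns = busy_wait_us(10.0)
    print(f"spun for {elapsed_ns} ns")
```

The helper returns the measured duration rather than the requested one, which makes it easy to confirm that the injected delay actually matches the cost of the call it replaces.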

Trigger 2 — Profiling:
Keeping profiling_enabled = true but commenting out the actual operations (perf_aicpu_complete_record, fanout traversal, perf_aicpu_record_phase) makes the test pass, confirming the same overhead-induced timeout pattern.

Affected Configurations

| Configuration | Log Level 0 | Log Level 1+ | --enable-profiling |
|---|---|---|---|
| Default ring size (window=16384, heap=256MB) | Works | Works | Works |
| Small ring size (window=128, heap=256KB) | Fails (507018) | Works | Fails (507018) |

Workaround

  • Use CANN device log level 1 (INFO) or above when running tests with small ring buffer configurations.
  • Do not use --enable-profiling with small ring buffer configurations. Use default ring sizes for profiling.

Notes

  • The exact mechanism by which dlog blocks a single thread internally is a CANN implementation detail and has not been determined.
  • A future fix could move profiling operations off the scheduler hot path (deferred write or conditional compilation), similar to the PTO2_HOT_PATH_LOGGING fix for DEV_DEBUG.
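
One possible shape for the "deferred write" idea is sketched below. Everything here is an assumption for illustration: `DeferredPerfLog`, its fields, and the record layout are hypothetical and are not taken from the runtime, and a real fix would be written in C++ inside the scheduler. The sketch only shows the pattern: the hot path does a constant-time store into a preallocated buffer, and all formatting and output happen later, outside the loop.

```python
from typing import List


class DeferredPerfLog:
    """Hypothetical fixed-capacity record buffer for hot-path profiling.

    record() is O(1) and never allocates or blocks; flush() does the
    expensive formatting and is meant to run after the scheduler loop.
    """

    def __init__(self, capacity: int) -> None:
        # Preallocate storage so the hot path never grows a container.
        self._task_ids = [0] * capacity
        self._timestamps_ns = [0] * capacity
        self._count = 0
        self.dropped = 0

    def record(self, task_id: int, timestamp_ns: int) -> None:
        if self._count == len(self._task_ids):
            # Buffer full: drop the record rather than stall the scheduler.
            self.dropped += 1
            return
        self._task_ids[self._count] = task_id
        self._timestamps_ns[self._count] = timestamp_ns
        self._count += 1

    def flush(self) -> List[str]:
        # Deferred work: format all stored records, then reset the buffer.
        lines = [
            f"task={self._task_ids[i]} t_ns={self._timestamps_ns[i]}"
            for i in range(self._count)
        ]
        self._count = 0
        return lines
```

Dropping records when the buffer is full is a deliberate choice in this sketch: it trades completeness of the profile for a bounded hot-path cost. Whether that trade-off is acceptable for the actual profiler is an open question.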

Steps to Reproduce

# Trigger 1: CANN log level 0
export ASCEND_GLOBAL_LOG_LEVEL=0
export ASCEND_DEVICE_LOG_LEVEL=0
export GLOBAL_LOG_LEVEL=0
python examples/scripts/run_example.py \
    -k tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/kernels \
    -g tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/golden.py \
    -p a2a3

# Trigger 2: Profiling
python examples/scripts/run_example.py \
    -k tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/kernels \
    -g tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/golden.py \
    -p a2a3 --enable-profiling

Expected Behavior

None

Actual Behavior

None

Git Commit ID

fe63325

CANN Version

No response

Driver Version

No response

Host Platform

Linux (aarch64)

Additional Context

No response
