[Bug] Known Issue: AICPU Task Timeout with Small Ring Buffers Due to Scheduler Hot-Path Overhead #409

### Platform

a2a3 (Ascend 910B/C hardware)

### Runtime Variant

tensormap_and_ringbuffer

### Description

## Summary

When the scheduler hot path carries non-trivial overhead (~10μs per iteration), AICPU stream synchronization fails with error code **507018** on test cases that use small ring buffer configurations (e.g., `paged_attention_ringbuffer` with window=128, heap=256KB). Two known triggers:

1. **CANN device log level 0 (DEBUG)**: Two `DEV_DEBUG` calls in the scheduler hot path each take ~10μs.
2. **`--enable-profiling`**: Per-task profiling operations (`perf_aicpu_complete_record()` with fanout list traversal, `perf_aicpu_record_phase()`) add comparable overhead.

In both cases the test passes with default ring sizes, or with small ring sizes once the overhead is removed.

**Recommendation:**
- Do not use CANN device log level 0 for testing with small ring buffer configurations. Use level 1 (INFO) or above instead.
- Profiling (`--enable-profiling`) is not supported with small ring buffer configurations. Use default ring sizes when profiling.

## Root Cause

The scheduler hot path in `aicpu_executor.cpp` (`check_running_cores_for_completion` and the dispatch loop) must process task completions fast enough to keep the ring drained. Any per-iteration overhead on the order of 10μs slows the scheduler loop. When ring buffer resources are tight, the slow scheduler causes the orchestrator to block repeatedly in `alloc()`, which extends total AICPU execution time from milliseconds to seconds. That **exceeds CANN's internal AICPU task timeout threshold**, and the task is terminated (error 507018).

With large ring buffers, the orchestrator never blocks and execution completes in tens of milliseconds, well within the timeout.
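The blocking behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the runtime's actual logic: the function name, the per-task costs, and the one-retirement-per-iteration scheduler are all hypothetical, core execution time is ignored, and the model does not capture what each blocked `alloc()` costs on real hardware. It only shows when the orchestrator starts to block.

```python
def simulate_ring(num_tasks, window, submit_us, retire_us):
    """Toy model: count how often the orchestrator blocks waiting for a slot.

    Assumptions (hypothetical, for illustration only):
      - the orchestrator needs `submit_us` to allocate and submit one task;
      - the scheduler retires one task per loop iteration, costing `retire_us`;
      - a ring slot is reusable only after the task that held it is retired.
    Returns (number of blocking allocs, total blocked time in microseconds).
    """
    submit_done = [0.0] * num_tasks
    retire_done = [0.0] * num_tasks
    blocked_count = 0
    blocked_us = 0.0
    for i in range(num_tasks):
        ready = submit_done[i - 1] if i > 0 else 0.0
        start = ready
        if i >= window:
            # Slot i reuses the slot of task i - window.
            slot_free = retire_done[i - window]
            if slot_free > ready:
                blocked_count += 1
                blocked_us += slot_free - ready
                start = slot_free
        submit_done[i] = start + submit_us
        prev_retire = retire_done[i - 1] if i > 0 else 0.0
        retire_done[i] = max(submit_done[i], prev_retire) + retire_us
    return blocked_count, blocked_us


if __name__ == "__main__":
    # Hypothetical costs: 2 us per submit, 1 us per scheduler iteration,
    # plus ~10 us of hot-path overhead in the slow case.
    for window in (128, 16384):
        for retire_us in (1.0, 11.0):
            count, total = simulate_ring(1000, window, 2.0, retire_us)
            print(f"window={window} retire_us={retire_us}: "
                  f"blocked {count} times, {total:.0f} us total")
```

In this model the orchestrator blocks only when the window is small and the scheduler iteration is slower than the submit rate, which matches the pattern in the issue; the absolute numbers carry no meaning.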

**Trigger 1 — DEV_DEBUG (~10μs each):**
Controlled experiments confirmed that this is purely an execution-time issue, unrelated to dlog internals or CANN DEBUG log accumulation: replacing `DEV_DEBUG` with a busy-wait of equal duration (no dlog calls, log level 1) produces the same failure.
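The busy-wait substitution can be sketched as follows. This is a Python stand-in for the idea, not the code used in the experiment (which would live in the C++ scheduler); `busy_wait_us` is a hypothetical helper name.

```python
import time


def busy_wait_us(duration_us):
    """Spin on the monotonic clock for at least `duration_us` microseconds.

    No logging or system calls are made in the loop, so any effect on the
    caller comes from the elapsed time alone. Returns the measured wait in ns.
    """
    start = time.perf_counter_ns()
    deadline = start + int(duration_us * 1000)
    while time.perf_counter_ns() < deadline:
        pass
    return time.perf_counter_ns() - start


if __name__ == "__main__":
    # Stand-in for one ~10 us DEV_DEBUG call.
    print(f"waited {busy_wait_us(10)} ns")
```

If a delay like this reproduces the failure with logging disabled, the cause is the added time in the loop rather than anything the logging call does internally.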

**Trigger 2 — Profiling:**
Keeping `profiling_enabled = true` but commenting out the actual operations (`perf_aicpu_complete_record`, fanout traversal, `perf_aicpu_record_phase`) makes the test pass, confirming the same overhead-induced timeout pattern.

## Affected Configurations

| Configuration | Log Level 0 | Log Level 1+ | --enable-profiling |
|---|---|---|---|
| Default ring size (window=16384, heap=256MB) | Works | Works | Works |
| Small ring size (window=128, heap=256KB) | **Fails (507018)** | Works | **Fails (507018)** |

## Workaround

- Use CANN device log level 1 (INFO) or above when running tests with small ring buffer configurations.
- Do not use `--enable-profiling` with small ring buffer configurations. Use default ring sizes for profiling.
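A test harness could enforce the two workarounds above before launching a run. The sketch below is hypothetical: `check_run_config` is not part of the repository, the caller decides what counts as a small ring, and it assumes that any of the three log-level variables from the reproduction steps being set to `0` selects DEBUG.

```python
import os
from typing import List, Mapping

# Variables taken from the reproduction steps; treating each of them as
# able to enable DEBUG device logging is an assumption.
LOG_LEVEL_VARS = (
    "ASCEND_GLOBAL_LOG_LEVEL",
    "ASCEND_DEVICE_LOG_LEVEL",
    "GLOBAL_LOG_LEVEL",
)


def check_run_config(small_ring: bool, enable_profiling: bool,
                     env: Mapping[str, str]) -> List[str]:
    """Return a list of known-bad settings for a small ring configuration."""
    problems: List[str] = []
    if not small_ring:
        return problems  # default ring sizes work in all listed cases
    for name in LOG_LEVEL_VARS:
        if env.get(name, "").strip() == "0":
            problems.append(f"{name}=0 (DEBUG) with a small ring buffer; "
                            "use level 1 (INFO) or above")
    if enable_profiling:
        problems.append("--enable-profiling with a small ring buffer; "
                        "use default ring sizes for profiling")
    return problems


if __name__ == "__main__":
    for problem in check_run_config(True, False, os.environ):
        print("warning:", problem)
```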

## Notes

- The exact mechanism by which dlog blocks a single thread internally is a CANN implementation detail and has not been determined.
- A future fix could move profiling operations off the scheduler hot path (deferred write or conditional compilation), similar to the `PTO2_HOT_PATH_LOGGING` fix for DEV_DEBUG.
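The deferred-write idea mentioned above might look roughly like this. It is a conceptual sketch only: `DeferredPerfLog` and its methods are hypothetical and do not correspond to the existing `perf_aicpu_*` functions, and a real fix would be written in the C++ runtime.

```python
class DeferredPerfLog:
    """Fixed-capacity record buffer: cheap appends now, expensive work later."""

    def __init__(self, capacity):
        # Preallocate so the hot path never grows a container.
        self._task_ids = [0] * capacity
        self._timestamps = [0] * capacity
        self._capacity = capacity
        self._count = 0
        self.dropped = 0

    def record(self, task_id, timestamp_ns):
        """Hot path: two stores and a counter bump; drop the record when full."""
        if self._count == self._capacity:
            self.dropped += 1
            return
        self._task_ids[self._count] = task_id
        self._timestamps[self._count] = timestamp_ns
        self._count += 1

    def flush(self, sink):
        """Off the hot path: hand each buffered record to `sink`, then reset."""
        flushed = self._count
        for i in range(flushed):
            sink(self._task_ids[i], self._timestamps[i])
        self._count = 0
        return flushed


if __name__ == "__main__":
    log = DeferredPerfLog(capacity=4)
    for task_id in range(6):
        log.record(task_id, task_id * 1000)
    written = log.flush(lambda tid, ts: print(f"task {tid} completed at {ts} ns"))
    print(f"flushed {written}, dropped {log.dropped}")
```

The trade-off is that records can be dropped when the buffer fills between flushes, so the capacity has to be sized for the longest interval the scheduler goes without flushing.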

### Steps to Reproduce

```bash
# Trigger 1: CANN log level 0
export ASCEND_GLOBAL_LOG_LEVEL=0
export ASCEND_DEVICE_LOG_LEVEL=0
export GLOBAL_LOG_LEVEL=0
python examples/scripts/run_example.py \
    -k tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/kernels \
    -g tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/golden.py \
    -p a2a3

# Trigger 2: Profiling
python examples/scripts/run_example.py \
    -k tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/kernels \
    -g tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/golden.py \
    -p a2a3 --enable-profiling
```

### Expected Behavior

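The `paged_attention_ringbuffer` test passes with the small ring configuration (window=128, heap=256KB), as it does with default ring sizes.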

### Actual Behavior

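AICPU stream synchronization fails with error code 507018 when the small ring configuration is combined with CANN device log level 0 or `--enable-profiling`.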

### Git Commit ID

fe63325094dabed918eafa63edb1a2fc40c3be6f

### CANN Version

_No response_

### Driver Version

_No response_

### Host Platform

Linux (aarch64)

### Additional Context

_No response_
