[Bug] Fan-in >=16 causes silent dependency truncation and tensor arg overflow #412
Description
Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
The Graph-fanin_N test case produces incorrect results when fanin_width > 16 (e.g., Fanin24, Fanin32), reporting:
[ERROR] TEST FAILED: Output 'result' does not match golden.
Mismatched elements: 1/1
rtol=1e-05, atol=1e-05
The root cause is a dual overflow:
1. Silent dependency truncation (PTO2_MAX_INPUTS=16)
In pto_orchestrator.cpp, the fanin_states[] array used to collect fan-in dependencies is sized to PTO2_MAX_INPUTS (16). When a barrier task has more than 16 INPUT dependencies (producer tasks), the excess dependencies are silently discarded with no error or log message:
// pto_orchestrator.cpp:471-475
if (!already_added) {
    if (fanin_count < PTO2_MAX_INPUTS) { // hard limit of 16
        fanin_states[fanin_count++] = prod_state;
    }
    // exceeds 16 → silently dropped, no error reported
}
This causes the barrier task to wait for only the first 16 producers instead of all N.
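For contrast, here is a minimal sketch of the same collection step with the overflow reported instead of dropped. The types are stand-ins and `add_fanin_checked` is a hypothetical helper, not a function in the runtime. `PTO2_MAX_INPUTS` is written as a `constexpr` here, whereas the issue shows it as a `#define`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Stand-ins for the runtime's definitions; only the value 16 comes from the issue.
constexpr std::size_t PTO2_MAX_INPUTS = 16;
struct PTO2TaskSlotState {};

// Hypothetical helper: records prod_state as a fan-in dependency.
// Returns true if the producer is tracked (newly added or already present),
// false if the fixed-size array is full and the dependency cannot be recorded.
inline bool add_fanin_checked(PTO2TaskSlotState* fanin_states[],
                              std::size_t& fanin_count,
                              PTO2TaskSlotState* prod_state) {
    assert(prod_state != nullptr);
    for (std::size_t i = 0; i < fanin_count; ++i) {
        if (fanin_states[i] == prod_state) {
            return true;  // duplicate producer, nothing to add
        }
    }
    if (fanin_count >= PTO2_MAX_INPUTS) {
        // Report the overflow so the caller can fail the submission.
        std::fprintf(stderr,
                     "[ERROR] fan-in dependency count exceeds PTO2_MAX_INPUTS=%zu\n",
                     PTO2_MAX_INPUTS);
        return false;
    }
    fanin_states[fanin_count++] = prod_state;
    return true;
}
```

A caller would treat a `false` return as a submission failure rather than continuing with a partial dependency set.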
2. Tensor argument array out-of-bounds write (MAX_TENSOR_ARGS=16)
The barrier task's arguments consist of 1 INOUT (result) + N INPUTs (producer outputs). When N=24, the total is 25 tensor args, exceeding MAX_TENSOR_ARGS=16. When payload->init() writes into PTO2TaskPayload::tensors[MAX_TENSOR_ARGS], it causes an out-of-bounds write that corrupts the subsequent dispatch_args memory region, resulting in the barrier kernel receiving incorrect tensor pointers.
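The argument count above can be checked with a small sketch. `PayloadSketch`, its `init` signature, and the `Tensor` stand-in are assumptions made for illustration. They do not reproduce the real `PTO2TaskPayload` or `payload->init()`. Only the limit of 16 and the 1 + N argument layout come from the issue.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t MAX_TENSOR_ARGS = 16;  // value quoted in the issue

// A barrier task carries 1 INOUT (result) plus one INPUT per producer.
constexpr std::size_t barrier_tensor_args(std::size_t fanin_width) {
    return 1 + fanin_width;
}

static_assert(barrier_tensor_args(24) == 25, "N=24 needs 25 tensor args");
static_assert(barrier_tensor_args(15) <= MAX_TENSOR_ARGS, "N=15 fits");
static_assert(barrier_tensor_args(16) > MAX_TENSOR_ARGS, "N=16 already exceeds 16");

struct Tensor { void* data = nullptr; };  // stand-in type

// Hypothetical payload with a bounds-checked init.
struct PayloadSketch {
    Tensor tensors[MAX_TENSOR_ARGS];
    std::size_t tensor_count = 0;

    // Copies n tensor args, or refuses without writing anything
    // when n would run past the end of tensors[].
    bool init(const Tensor* args, std::size_t n) {
        assert(args != nullptr);
        if (n > MAX_TENSOR_ARGS) {
            return false;
        }
        for (std::size_t i = 0; i < n; ++i) {
            tensors[i] = args[i];
        }
        tensor_count = n;
        return true;
    }
};
```

By this count, 1 + N exceeds `MAX_TENSOR_ARGS` as soon as N reaches 16, which matches the "overflows when N>15" comment in `pto_types.h`.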
Relevant hardcoded constants (pto_types.h):
#define MAX_TENSOR_ARGS 16 // Barrier needs 1+N args; overflows when N>15
#define PTO2_MAX_INPUTS 16 // Dependency tracking limit
Fixed-size arrays in PTO2TaskPayload (pto_runtime2_types.h:378-380):
PTO2TaskSlotState* fanin_slot_states[PTO2_MAX_INPUTS]; // [16]
Tensor tensors[MAX_TENSOR_ARGS]; // [16]
Steps to Reproduce
# Fanin4 — passes
python examples/scripts/run_example.py \
-k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
-g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
-p onboard --case Fanin4
# Fanin24 — fails
python examples/scripts/run_example.py \
-k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
-g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
-p onboard --case Fanin24
# Fanin32 — fails
python examples/scripts/run_example.py \
-k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
-g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
-p onboard --case Fanin32
Trigger condition: any single task whose fan-in dependency count (number of INPUT tensor args) exceeds 16.
| Case | Producers | Actually tracked deps | Result |
|---|---|---|---|
| Fanin4 | 4 | 4 | PASS |
| Fanin16 | 16 | 16 | PASS |
| Fanin24 | 24 | 16 (truncated) | FAIL |
| Fanin32 | 32 | 16 (truncated) | FAIL |
Expected Behavior
All fan-in cases (including Fanin24 and Fanin32) should pass correctly with output result=1.0 matching the golden value. Alternatively, when the runtime's capacity limit is exceeded, a clear error message should be reported instead of silently truncating dependencies.
Actual Behavior
[ERROR] TEST FAILED: Output 'result' does not match golden.
Mismatched elements: 1/1
rtol=1e-05, atol=1e-05
Silent dependency truncation, combined with the out-of-bounds write to the tensor arg array, causes the barrier kernel to produce an incorrect result.
Git Commit ID
Host Platform
Linux (aarch64)
Additional Context
Affected files:
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h:43: PTO2_MAX_INPUTS definition
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp:471-475: dependency truncation logic
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h:378-380: payload fixed-size arrays
- tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/: triggering test case
Possible fix directions:
- Raise the limits: increase PTO2_MAX_INPUTS, MAX_TENSOR_ARGS, etc. (increases per-task memory footprint)
- Multi-stage fan-in at orchestration layer: split N-way fan-in into a multi-level tree (e.g., 24 → 6 groups × 4-way → 1 × 6-way), ensuring each task stays within the 16-input limit
- Add bounds checking: emit an error in Arg::add_input() or during orchestrator submission when tensor arg count exceeds the limit, instead of silently truncating
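To get a feel for the multi-stage option, the sketch below computes how many intermediate barrier tasks each tree level would need for a given per-task width. `fanin_tree_levels` is a hypothetical planning helper, not part of the orchestrator. It assumes a uniform maximum width per level, so its plans can differ from the mixed-width example in the list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the number of barrier tasks at each level of a
// fan-in tree, where every task waits on at most max_width inputs. The last
// entry is always 1 (the final barrier).
inline std::vector<std::size_t> fanin_tree_levels(std::size_t producers,
                                                  std::size_t max_width) {
    assert(producers >= 1 && max_width >= 2);
    std::vector<std::size_t> levels;
    std::size_t remaining = producers;
    do {
        // Ceiling division: tasks needed to cover the remaining inputs.
        remaining = (remaining + max_width - 1) / max_width;
        levels.push_back(remaining);
    } while (remaining > 1);
    return levels;
}
```

Because each barrier also carries one INOUT tensor, a width of 15 keeps 1 + N within `MAX_TENSOR_ARGS=16`. With that width, `fanin_tree_levels(24, 15)` yields two levels (2 tasks, then 1), and `fanin_tree_levels(32, 15)` yields 3 tasks, then 1.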