[Bug] Fan-in >=16 causes silent dependency truncation and tensor arg overflow #412

@chenshengxin2026

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

The Graph-fanin_N test case produces incorrect results when fanin_width > 16 (e.g., Fanin24, Fanin32), reporting:

[ERROR] TEST FAILED: Output 'result' does not match golden.
Mismatched elements: 1/1
rtol=1e-05, atol=1e-05

The root cause is two separate overflows:

1. Silent dependency truncation (PTO2_MAX_INPUTS=16)

In pto_orchestrator.cpp, the fanin_states[] array used to collect fan-in dependencies is sized to PTO2_MAX_INPUTS (16). When a barrier task has more than 16 INPUT dependencies (producer tasks), the excess dependencies are silently discarded with no error or log message:

// pto_orchestrator.cpp:471-475
if (!already_added) {
    if (fanin_count < PTO2_MAX_INPUTS) {   // hard limit of 16
        fanin_states[fanin_count++] = prod_state;
    }
    // exceeds 16 → silently dropped, no error reported
}

As a result, the barrier task waits only for the first 16 producers instead of all N, so it can run before the remaining producers have finished.
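The truncation can be modeled in isolation. The following is a minimal sketch, not the runtime's code: `collect_fanin`, `FaninResult`, and `kMaxInputs` are hypothetical names, and producers are represented by plain indices rather than `PTO2TaskSlotState` pointers. It replays the guarded append quoted above and counts what falls off the end.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Mirrors the limit quoted in the issue; everything else here is a
// hypothetical stand-in for the runtime's types.
constexpr std::size_t kMaxInputs = 16;  // PTO2_MAX_INPUTS

struct FaninResult {
    std::size_t tracked;  // dependencies the barrier will actually wait on
    std::size_t dropped;  // dependencies lost to the capacity check
};

// Replays the guarded append for num_producers distinct producers.
FaninResult collect_fanin(std::size_t num_producers) {
    std::array<std::size_t, kMaxInputs> fanin_states{};
    std::size_t fanin_count = 0;
    std::size_t dropped = 0;
    for (std::size_t prod = 0; prod < num_producers; ++prod) {
        if (fanin_count < kMaxInputs) {
            fanin_states[fanin_count++] = prod;
        } else {
            ++dropped;  // the quoted code has no else branch, so this loss is silent
        }
    }
    (void)fanin_states;  // only the counts matter for this model
    return {fanin_count, dropped};
}
```

Under this model, `collect_fanin(24)` tracks 16 producers and drops 8, and `collect_fanin(32)` drops 16. A non-zero `dropped` count is the condition the orchestrator would need to report.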

2. Tensor argument array out-of-bounds write (MAX_TENSOR_ARGS=16)

The barrier task's arguments consist of 1 INOUT (result) + N INPUTs (producer outputs). With N=24 the total is 25 tensor args, which exceeds MAX_TENSOR_ARGS=16. payload->init() then writes past the end of PTO2TaskPayload::tensors[MAX_TENSOR_ARGS], corrupting the subsequent dispatch_args memory region, so the barrier kernel receives incorrect tensor pointers.

Relevant hardcoded constants (pto_types.h):

#define MAX_TENSOR_ARGS 16   // Barrier needs 1+N args; overflows when N>15
#define PTO2_MAX_INPUTS 16   // Dependency tracking limit

Fixed-size arrays in PTO2TaskPayload (pto_runtime2_types.h:378-380):

PTO2TaskSlotState* fanin_slot_states[PTO2_MAX_INPUTS];  // [16]
Tensor tensors[MAX_TENSOR_ARGS];                         // [16]
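The arg-count arithmetic and a bounds-checked alternative to the raw array write can be sketched as follows. This is an illustration under stated assumptions, not the actual `PTO2TaskPayload`: `CheckedTensorArgs`, `barrier_tensor_args`, and `rejected_args` are hypothetical names, and `int` stands in for `Tensor`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxTensorArgs = 16;  // MAX_TENSOR_ARGS in the issue

// A barrier takes 1 INOUT (result) plus one INPUT per producer.
constexpr std::size_t barrier_tensor_args(std::size_t num_producers) {
    return 1 + num_producers;
}

// Hypothetical stand-in for the payload's tensor array: push() refuses to
// write past the end instead of spilling into whatever memory follows it.
struct CheckedTensorArgs {
    std::array<int, kMaxTensorArgs> tensors{};  // int stands in for Tensor
    std::size_t count = 0;

    bool push(int tensor) {
        if (count >= kMaxTensorArgs) {
            return false;  // caller is expected to report the overflow
        }
        tensors[count++] = tensor;
        return true;
    }
};

// Returns how many of a barrier's tensor args would not fit.
std::size_t rejected_args(std::size_t num_producers) {
    CheckedTensorArgs args;
    std::size_t rejected = 0;
    const std::size_t total = barrier_tensor_args(num_producers);
    for (std::size_t i = 0; i < total; ++i) {
        if (!args.push(static_cast<int>(i))) {
            ++rejected;
        }
    }
    return rejected;
}
```

With this count, `rejected_args(24)` is 9 (25 args against 16 slots), and `rejected_args(16)` is already 1, which agrees with the `N>15` note on `MAX_TENSOR_ARGS` above.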

Steps to Reproduce

# Fanin4 — passes
python examples/scripts/run_example.py \
  -k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
  -g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
  -p onboard --case Fanin4

# Fanin24 — fails
python examples/scripts/run_example.py \
  -k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
  -g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
  -p onboard --case Fanin24

# Fanin32 — fails
python examples/scripts/run_example.py \
  -k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
  -g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
  -p onboard --case Fanin32

Trigger condition: any single task whose fan-in dependency count (number of INPUT tensor args) exceeds 16.

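| Case    | Producers | Actually tracked deps | Result |
|---------|-----------|-----------------------|--------|
| Fanin4  | 4         | 4                     | PASS   |
| Fanin16 | 16        | 16                    | PASS   |
| Fanin24 | 24        | 16 (truncated)        | FAIL   |
| Fanin32 | 32        | 16 (truncated)        | FAIL   |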

Expected Behavior

All fan-in cases (including Fanin24 and Fanin32) should pass correctly with output result=1.0 matching the golden value. Alternatively, when the runtime's capacity limit is exceeded, a clear error message should be reported instead of silently truncating dependencies.

Actual Behavior

[ERROR] TEST FAILED: Output 'result' does not match golden.
Mismatched elements: 1/1
rtol=1e-05, atol=1e-05

The silent dependency truncation, combined with the out-of-bounds write to the tensor arg array, causes the barrier kernel to produce an incorrect result.

Git Commit ID

1d97ac5

Host Platform

Linux (aarch64)

Additional Context

Affected files:

  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h:43 — PTO2_MAX_INPUTS definition
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp:471-475 — dependency truncation logic
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h:378-380 — payload fixed-size arrays
  • tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/ — triggering test case

Possible fix directions:

  1. Raise the limits: increase PTO2_MAX_INPUTS, MAX_TENSOR_ARGS, etc. (increases per-task memory footprint)
  2. Multi-stage fan-in at orchestration layer: split N-way fan-in into a multi-level tree (e.g., 24 → 6 groups × 4-way → 1 × 6-way), ensuring each task stays within the 16-input limit
  3. Add bounds checking: emit an error in Arg::add_input() or during orchestrator submission when tensor arg count exceeds the limit, instead of silently truncating
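Fix direction 2 can be sketched as a planning step that runs before any task is submitted. The helper below is hypothetical (`plan_fanin_tree` is not an existing orchestrator API). It greedily fills groups up to a caller-supplied limit, so its grouping differs from the 6 × 4-way example above. Because each task also needs one tensor slot for its own output, a caller would likely pass `MAX_TENSOR_ARGS - 1` (15) rather than 16. That choice is an assumption based on the 1+N arg count described in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the runtime: splits an n-way fan-in into
// levels so that no task consumes more than max_fanin inputs.
// Each inner vector lists the input counts of the tasks on that level;
// the last level always holds the single final barrier.
std::vector<std::vector<std::size_t>> plan_fanin_tree(std::size_t n,
                                                      std::size_t max_fanin) {
    assert(max_fanin >= 2 && "need at least a 2-way reduction to make progress");
    std::vector<std::vector<std::size_t>> levels;
    while (n > max_fanin) {
        std::vector<std::size_t> level;
        std::size_t remaining = n;
        while (remaining > 0) {
            // Fill each intermediate barrier up to the limit.
            const std::size_t take = remaining < max_fanin ? remaining : max_fanin;
            level.push_back(take);
            remaining -= take;
        }
        n = level.size();  // the next level consumes one output per group
        levels.push_back(level);
    }
    levels.push_back(std::vector<std::size_t>{n});  // final barrier
    return levels;
}
```

For the Fanin24 case, `plan_fanin_tree(24, 15)` yields one level of two partial barriers (15 and 9 inputs) followed by a 2-way final barrier. Any case with 15 or fewer producers yields a single level, so graphs that already fit are left unchanged.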

Status

Done