[Bug] Fan-in >=16 causes silent dependency truncation and tensor arg overflow #412

### Platform

a2a3 (Ascend 910B/C hardware)

### Runtime Variant

tensormap_and_ringbuffer

### Description

The `Graph-fanin_N` test case produces incorrect results when `fanin_width > 16` (e.g., Fanin24, Fanin32), reporting:

```
[ERROR] TEST FAILED: Output 'result' does not match golden.
Mismatched elements: 1/1
rtol=1e-05, atol=1e-05
```

Root cause is a **dual overflow**:

**1. Silent dependency truncation (`PTO2_MAX_INPUTS=16`)**

In `pto_orchestrator.cpp`, the `fanin_states[]` array used to collect fan-in dependencies is sized to `PTO2_MAX_INPUTS (16)`. When a barrier task has more than 16 INPUT dependencies (producer tasks), the excess dependencies are **silently discarded with no error or log message**:

```cpp
// pto_orchestrator.cpp:471-475
if (!already_added) {
    if (fanin_count < PTO2_MAX_INPUTS) {   // hard limit of 16
        fanin_states[fanin_count++] = prod_state;
    }
    // exceeds 16 → silently dropped, no error reported
}
```

As a result, the barrier task waits only for the first 16 producers instead of all N, so it can be scheduled before the remaining producers have finished.
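For comparison, here is a minimal sketch of the same collection step with the overflow reported instead of dropped. It is not the runtime's actual code: `SlotState`, `FaninList`, and `add_fanin` are hypothetical stand-ins, and only the `PTO2_MAX_INPUTS` limit of 16 is taken from the source.

```cpp
#include <cstddef>
#include <cstdio>

// Limit taken from the issue; every other name here is a hypothetical stand-in.
constexpr std::size_t PTO2_MAX_INPUTS = 16;

struct SlotState { int id; };  // placeholder for the producer slot state type

struct FaninList {
    SlotState* states[PTO2_MAX_INPUTS] = {};
    std::size_t count = 0;
};

// Adds a producer to the fan-in list.
// Returns true if the producer is tracked (newly added or already present),
// false if the list is full; the overflow is logged rather than ignored.
bool add_fanin(FaninList& list, SlotState* prod_state) {
    for (std::size_t i = 0; i < list.count; ++i) {
        if (list.states[i] == prod_state) {
            return true;  // duplicate producer, nothing to do
        }
    }
    if (list.count >= PTO2_MAX_INPUTS) {
        std::fprintf(stderr,
                     "[ERROR] fan-in exceeds PTO2_MAX_INPUTS (%zu)\n",
                     PTO2_MAX_INPUTS);
        return false;
    }
    list.states[list.count++] = prod_state;
    return true;
}
```

With this shape the caller can abort task submission when `add_fanin` returns false, so a 17th producer surfaces as an error rather than as a wrong result.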

**2. Tensor argument array out-of-bounds write (`MAX_TENSOR_ARGS=16`)**

The barrier task's arguments consist of 1 INOUT (result) + N INPUTs (producer outputs). When N=24, the total is 25 tensor args, exceeding `MAX_TENSOR_ARGS=16`. When `payload->init()` writes into `PTO2TaskPayload::tensors[MAX_TENSOR_ARGS]`, it causes an **out-of-bounds write** that corrupts the subsequent `dispatch_args` memory region, resulting in the barrier kernel receiving incorrect tensor pointers.

Relevant hardcoded constants (`pto_types.h`):
```c
#define MAX_TENSOR_ARGS 16   // Barrier needs 1+N args; overflows when N>15
#define PTO2_MAX_INPUTS 16   // Dependency tracking limit
```

Fixed-size arrays in `PTO2TaskPayload` (`pto_runtime2_types.h:378-380`):
```cpp
PTO2TaskSlotState* fanin_slot_states[PTO2_MAX_INPUTS];  // [16]
Tensor tensors[MAX_TENSOR_ARGS];                         // [16]
```
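The slot arithmetic behind the second overflow can be written as a compile-time check. This is only a sketch: `barrier_tensor_args` and `fits_payload` are hypothetical helpers, and the `1 + N` count is taken from the description above.

```cpp
#include <cstddef>

// Constant taken from the issue; the helper names below are hypothetical.
constexpr std::size_t MAX_TENSOR_ARGS = 16;

// A barrier task carries 1 INOUT (result) plus one INPUT per producer.
constexpr std::size_t barrier_tensor_args(std::size_t producers) {
    return 1 + producers;
}

// True when all tensor args fit into tensors[MAX_TENSOR_ARGS].
constexpr bool fits_payload(std::size_t producers) {
    return barrier_tensor_args(producers) <= MAX_TENSOR_ARGS;
}

static_assert(barrier_tensor_args(24) == 25, "Fanin24 needs 25 tensor slots");
static_assert(fits_payload(15), "15 producers use exactly 16 slots");
static_assert(!fits_payload(16), "16 producers would need 17 slots");
static_assert(!fits_payload(24), "Fanin24 exceeds the payload array");
```

By this count a 16-producer barrier would already need 17 slots, which suggests the tensor-arg limit may be reached one producer earlier than the dependency limit.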

### Steps to Reproduce

```bash
# Fanin4 — passes
python examples/scripts/run_example.py \
  -k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
  -g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
  -p onboard --case Fanin4

# Fanin24 — fails
python examples/scripts/run_example.py \
  -k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
  -g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
  -p onboard --case Fanin24

# Fanin32 — fails
python examples/scripts/run_example.py \
  -k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
  -g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
  -p onboard --case Fanin32
```

Trigger condition: any single task whose fan-in dependency count (number of INPUT tensor args) exceeds 16.

| Case | Producers | Actually tracked deps | Result |
|------|-----------|----------------------|--------|
| Fanin4 | 4 | 4 | PASS |
| Fanin16 | 16 | 16 | PASS |
| Fanin24 | 24 | 16 (truncated) | **FAIL** |
| Fanin32 | 32 | 16 (truncated) | **FAIL** |

### Expected Behavior

All fan-in cases (including Fanin24 and Fanin32) should pass correctly with output `result=1.0` matching the golden value. Alternatively, when the runtime's capacity limit is exceeded, a clear error message should be reported instead of silently truncating dependencies.

### Actual Behavior

```
[ERROR] TEST FAILED: Output 'result' does not match golden.
Mismatched elements: 1/1
rtol=1e-05, atol=1e-05
```

Silent dependency truncation, combined with the out-of-bounds write to the tensor arg array, causes the barrier kernel to produce an incorrect result.

### Git Commit ID

1d97ac5f3ae59b51f1b1c6563a06c95eabeb4d62

### Host Platform

Linux (aarch64)

### Additional Context

**Affected files:**
- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h:43` — `PTO2_MAX_INPUTS` definition
- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp:471-475` — dependency truncation logic
- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h:378-380` — payload fixed-size arrays
- `tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/` — triggering test case

**Possible fix directions:**
1. **Raise the limits**: increase `PTO2_MAX_INPUTS`, `MAX_TENSOR_ARGS`, etc. (increases per-task memory footprint)
2. **Multi-stage fan-in at orchestration layer**: split N-way fan-in into a multi-level tree (e.g., 24 → 6 groups × 4-way → 1 × 6-way), ensuring each task stays within the 16-input limit
3. **Add bounds checking**: emit an error in `Arg::add_input()` or during orchestrator submission when tensor arg count exceeds the limit, instead of silently truncating
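Fix direction 2 can be sketched as a small planner that computes how many barrier tasks each level of the tree needs. This is an illustrative assumption rather than an existing orchestrator API: `plan_fanin_levels` is a hypothetical helper that packs inputs greedily, so its grouping can differ from the 6 × 4-way example above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the number of barrier tasks at each level of a
// reduction tree in which no task takes more than max_width inputs.
// The last entry is always 1 (the final barrier). Returns an empty plan when
// there is nothing to reduce or when max_width cannot shrink the input count.
std::vector<std::size_t> plan_fanin_levels(std::size_t producers,
                                           std::size_t max_width) {
    std::vector<std::size_t> levels;
    if (producers == 0 || max_width < 2) {
        return levels;
    }
    std::size_t remaining = producers;
    do {
        // Ceiling division: tasks needed to consume `remaining` inputs.
        remaining = (remaining + max_width - 1) / max_width;
        levels.push_back(remaining);
    } while (remaining > 1);
    return levels;
}
```

For example, `plan_fanin_levels(24, 15)` yields two levels: 2 intermediate barriers, then 1 final barrier. A width of 15 rather than 16 is used here on the assumption that each barrier also needs one slot for its INOUT result.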
