Output Tensor Computation Order within a Subgraph #75

@sohnryang

Description

The output format specifies traversal_orders to control tile iteration order, but does not specify the order in which output tensors of a fused subgraph are computed. For fused subgraphs with multiple output tensors that share an input tensor loaded in different block shapes (e.g., as RHS in one op vs LHS in another), the computation order affects implicit reuse of the shared tensor, and therefore total memory traffic.

Consider this graph:

```mermaid
graph LR
    TA(("T_A<br>256x256")) -- "LHS" --> OpA["Op_A<br>MatMul"]
    T1(("T1<br>256x128")) -- "RHS" --> OpA
    T1 -- "RHS" --> OpB["Op_B<br>MatMul"]
    TB(("T_B<br>256x256")) -- "LHS" --> OpB
    T1 -- "LHS" --> OpC["Op_C<br>MatMul"]
    TC(("T_C<br>128x256")) -- "RHS" --> OpC

    OpA --> OutA(("Out_A<br>256x128"))
    OpB --> OutB(("Out_B<br>256x128"))
    OpC --> OutC(("Out_C<br>256x256"))
```

All three ops are fused into one subgraph [Op_A, Op_B, Op_C] with granularity [128, 128, 256]. T1 (256×128) is shared by all ops, but loaded in different block shapes:

  • Op_A, Op_B: T1 as RHS. K=256, output is 256×128 (2×1 grid). RHS strip = full T1 (256×128), same block at every tile.
  • Op_C: T1 as LHS. K=128, output is 256×256 (2×2 grid). LHS strip = 128×128 row strip, differs by tile row.
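
The two strip shapes above can be sketched in a few lines (a minimal sketch assuming 128×128 output tiles, as implied by the stated tile grids, and standard MatMul strip mining; variable names are illustrative, not from the spec):

```python
# Sketch of the T1 strip each op needs per output tile, assuming 128x128
# output tiles and standard MatMul strip mining (illustrative, not the spec).

T1 = (256, 128)          # rows x cols
TILE_M = TILE_N = 128

# Op_A / Op_B: T1 is the RHS and K = 256, so each tile needs a K x TILE_N strip.
rhs_strip = (256, TILE_N)        # (256, 128): all of T1, identical at every tile
assert rhs_strip == T1

# Op_C: T1 is the LHS and K = 128, so each tile needs a TILE_M x K strip.
lhs_strip = (TILE_M, 128)        # (128, 128): one row strip, varies by tile row

print(rhs_strip, lhs_strip)      # (256, 128) (128, 128)
```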

Since these are independent ops in the same fused subgraph, they execute one output tensor at a time. The computation order determines which T1 block is resident at each phase boundary:

  • After an RHS phase: T1 full (256×128) is resident.
  • After the LHS phase: T1 row-1 strip (128×128) is resident.

Only an RHS→RHS boundary gives implicit reuse (same full block). RHS→LHS and LHS→RHS involve different block shapes, so no reuse per the partial-reuse ruling in #59.
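
The boundary rule being assumed here can be stated as a one-line predicate (a sketch of the identical-block interpretation of #59; the block labels are illustrative):

```python
# Minimal sketch of the assumed reuse rule at a phase boundary: the resident
# T1 block carries over only if the next phase needs the identical block
# (no partial reuse across different shapes, per the ruling in #59).

FULL = ("full", 256, 128)   # resident after an RHS phase (Op_A / Op_B)
ROW0 = ("row0", 128, 128)   # first strip needed by the LHS phase (Op_C)
ROW1 = ("row1", 128, 128)   # resident after the LHS phase ends on tile row 1

def implicit_reuse(resident, needed):
    return resident == needed   # identity of blocks, not mere overlap

print(implicit_reuse(FULL, FULL))   # RHS -> RHS boundary
print(implicit_reuse(FULL, ROW0))   # RHS -> LHS boundary
print(implicit_reuse(ROW1, FULL))   # LHS -> RHS boundary
```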

Order A→B→C (RHS→RHS→LHS): 11 loads

| Step | Phase | Tile | T1 block | T1 action | Other loads |
|------|-------|------|----------|-----------|-------------|
| 1 | Op_A | (0,0) | full 256×128 | load | T_A row0 |
| 2 | Op_A | (1,0) | full 256×128 | reuse | T_A row1 |
| 3 | Op_B | (0,0) | full 256×128 | reuse (cross-phase) | T_B row0 |
| 4 | Op_B | (1,0) | full 256×128 | reuse | T_B row1 |
| 5 | Op_C | (0,0) | row0 128×128 | load | T_C col0 |
| 6 | Op_C | (0,1) | row0 128×128 | reuse | T_C col1 |
| 7 | Op_C | (1,0) | row1 128×128 | load | T_C col0 |
| 8 | Op_C | (1,1) | row1 128×128 | reuse | T_C col1 |

Total T1 loads: 3. Total other loads: 8. Grand total: 11.

Order A→C→B (RHS→LHS→RHS): 12 loads

| Step | Phase | Tile | T1 block | T1 action | Other loads |
|------|-------|------|----------|-----------|-------------|
| 1 | Op_A | (0,0) | full 256×128 | load | T_A row0 |
| 2 | Op_A | (1,0) | full 256×128 | reuse | T_A row1 |
| 3 | Op_C | (0,0) | row0 128×128 | load (256×128 ≠ 128×128) | T_C col0 |
| 4 | Op_C | (0,1) | row0 128×128 | reuse | T_C col1 |
| 5 | Op_C | (1,0) | row1 128×128 | load | T_C col0 |
| 6 | Op_C | (1,1) | row1 128×128 | reuse | T_C col1 |
| 7 | Op_B | (0,0) | full 256×128 | load (128×128 ≠ 256×128) | T_B row0 |
| 8 | Op_B | (1,0) | full 256×128 | reuse | T_B row1 |

Total T1 loads: 4. Total other loads: 8. Grand total: 12.
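
Both tallies can be reproduced by a short load-counting sketch (hypothetical helper, assuming reuse only when the incoming T1 block is identical to the resident one, per #59; `TILE_SEQUENCES` and `total_loads` are illustrative names, not part of the output format):

```python
# For each op, the per-tile sequence of (T1 block needed, other-operand loads).
TILE_SEQUENCES = {
    "Op_A": [("full", 1), ("full", 1)],              # 2x1 grid, T1 as RHS (full strip)
    "Op_B": [("full", 1), ("full", 1)],              # 2x1 grid, T1 as RHS (full strip)
    "Op_C": [("row0", 1), ("row0", 1),               # 2x2 grid, T1 as LHS,
             ("row1", 1), ("row1", 1)],              # row strip differs by tile row
}

def total_loads(order):
    resident = None          # T1 block currently on-chip
    t1_loads = other = 0
    for op in order:
        for block, other_load in TILE_SEQUENCES[op]:
            if block != resident:                    # no partial reuse across shapes
                t1_loads += 1
                resident = block
            other += other_load
    return t1_loads + other

print(total_loads(["Op_A", "Op_B", "Op_C"]))  # 3 T1 + 8 other = 11
print(total_loads(["Op_A", "Op_C", "Op_B"]))  # 4 T1 + 8 other = 12
```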

Question

The difference is one full T1 load (256×128 = 32,768 elements). What output tensor computation order should participants assume when computing subgraph_latencies for a fused subgraph with multiple output tensors?
