Output Tensor Computation Order within a Subgraph #75

@sohnryang

Description

The output format specifies traversal_orders to control tile iteration order, but does not specify the order in which output tensors of a fused subgraph are computed. For fused subgraphs with multiple output tensors that share an input tensor loaded in different block shapes (e.g., as RHS in one op vs LHS in another), the computation order affects implicit reuse of the shared tensor, and therefore total memory traffic.

Consider this graph:

```mermaid
graph LR
    TA(("T_A<br>256x256")) -- "LHS" --> OpA["Op_A<br>MatMul"]
    T1(("T1<br>256x128")) -- "RHS" --> OpA
    T1 -- "RHS" --> OpB["Op_B<br>MatMul"]
    TB(("T_B<br>256x256")) -- "LHS" --> OpB
    T1 -- "LHS" --> OpC["Op_C<br>MatMul"]
    TC(("T_C<br>128x256")) -- "RHS" --> OpC

    OpA --> OutA(("Out_A<br>256x128"))
    OpB --> OutB(("Out_B<br>256x128"))
    OpC --> OutC(("Out_C<br>256x256"))
```

All three ops are fused into one subgraph [Op_A, Op_B, Op_C] with granularity [128, 128, 256]. T1 (256×128) is shared by all ops, but loaded in different block shapes:

  • Op_A, Op_B: T1 as RHS. K=256, output is 256×128 (2×1 grid). RHS strip = full T1 (256×128), same block at every tile.
  • Op_C: T1 as LHS. K=128, output is 256×256 (2×2 grid). LHS strip = 128×128 row strip, differs by tile row.
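
The two strip shapes above can be sketched in a few lines (a minimal sketch assuming 128×128 output tiles, as implied by the stated tile grids, and standard MatMul strip mining; variable names are illustrative, not from the spec):

```python
# Sketch of the T1 strip each op needs per output tile, assuming 128x128
# output tiles and standard MatMul strip mining (illustrative, not the spec).

T1 = (256, 128)          # rows x cols
TILE_M = TILE_N = 128

# Op_A / Op_B: T1 is the RHS and K = 256, so each tile needs a K x TILE_N strip.
rhs_strip = (256, TILE_N)        # (256, 128): all of T1, identical at every tile
assert rhs_strip == T1

# Op_C: T1 is the LHS and K = 128, so each tile needs a TILE_M x K strip.
lhs_strip = (TILE_M, 128)        # (128, 128): one row strip, varies by tile row

print(rhs_strip, lhs_strip)      # (256, 128) (128, 128)
```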

Since these are independent ops in the same fused subgraph, they execute one output tensor at a time. The computation order determines which T1 block is resident at each phase boundary:

  • After an RHS phase: T1 full (256×128) is resident.
  • After the LHS phase: T1 row-1 strip (128×128) is resident.

Only an RHS→RHS boundary gives implicit reuse (same full block). RHS→LHS and LHS→RHS involve different block shapes, so no reuse per the partial-reuse ruling in #59.
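
The boundary rule being assumed here can be stated as a one-line predicate (a sketch of the identical-block interpretation of #59; the block labels are illustrative):

```python
# Minimal sketch of the assumed reuse rule at a phase boundary: the resident
# T1 block carries over only if the next phase needs the identical block
# (no partial reuse across different shapes, per the ruling in #59).

FULL = ("full", 256, 128)   # resident after an RHS phase (Op_A / Op_B)
ROW0 = ("row0", 128, 128)   # first strip needed by the LHS phase (Op_C)
ROW1 = ("row1", 128, 128)   # resident after the LHS phase ends on tile row 1

def implicit_reuse(resident, needed):
    return resident == needed   # identity of blocks, not mere overlap

print(implicit_reuse(FULL, FULL))   # RHS -> RHS boundary
print(implicit_reuse(FULL, ROW0))   # RHS -> LHS boundary
print(implicit_reuse(ROW1, FULL))   # LHS -> RHS boundary
```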

Order A→B→C (RHS→RHS→LHS): 11 loads

| Step | Phase | Tile | T1 block | T1 action | Other loads |
|------|-------|------|----------|-----------|-------------|
| 1 | Op_A | (0,0) | full 256×128 | load | T_A row0 |
| 2 | Op_A | (1,0) | full 256×128 | reuse | T_A row1 |
| 3 | Op_B | (0,0) | full 256×128 | reuse (cross-phase) | T_B row0 |
| 4 | Op_B | (1,0) | full 256×128 | reuse | T_B row1 |
| 5 | Op_C | (0,0) | row0 128×128 | load | T_C col0 |
| 6 | Op_C | (0,1) | row0 128×128 | reuse | T_C col1 |
| 7 | Op_C | (1,0) | row1 128×128 | load | T_C col0 |
| 8 | Op_C | (1,1) | row1 128×128 | reuse | T_C col1 |

Total T1 loads: 3. Total other loads: 8. Grand total: 11.

Order A→C→B (RHS→LHS→RHS): 12 loads

| Step | Phase | Tile | T1 block | T1 action | Other loads |
|------|-------|------|----------|-----------|-------------|
| 1 | Op_A | (0,0) | full 256×128 | load | T_A row0 |
| 2 | Op_A | (1,0) | full 256×128 | reuse | T_A row1 |
| 3 | Op_C | (0,0) | row0 128×128 | load (256×128 ≠ 128×128) | T_C col0 |
| 4 | Op_C | (0,1) | row0 128×128 | reuse | T_C col1 |
| 5 | Op_C | (1,0) | row1 128×128 | load | T_C col0 |
| 6 | Op_C | (1,1) | row1 128×128 | reuse | T_C col1 |
| 7 | Op_B | (0,0) | full 256×128 | load (128×128 ≠ 256×128) | T_B row0 |
| 8 | Op_B | (1,0) | full 256×128 | reuse | T_B row1 |

Total T1 loads: 4. Total other loads: 8. Grand total: 12.
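
Both tallies can be reproduced by a short load-counting sketch (hypothetical helper, assuming reuse only when the incoming T1 block is identical to the resident one, per #59; `TILE_SEQUENCES` and `total_loads` are illustrative names, not part of the output format):

```python
# For each op, the per-tile sequence of (T1 block needed, other-operand loads).
TILE_SEQUENCES = {
    "Op_A": [("full", 1), ("full", 1)],              # 2x1 grid, T1 as RHS (full strip)
    "Op_B": [("full", 1), ("full", 1)],              # 2x1 grid, T1 as RHS (full strip)
    "Op_C": [("row0", 1), ("row0", 1),               # 2x2 grid, T1 as LHS,
             ("row1", 1), ("row1", 1)],              # row strip differs by tile row
}

def total_loads(order):
    resident = None          # T1 block currently on-chip
    t1_loads = other = 0
    for op in order:
        for block, other_load in TILE_SEQUENCES[op]:
            if block != resident:                    # no partial reuse across shapes
                t1_loads += 1
                resident = block
            other += other_load
    return t1_loads + other

print(total_loads(["Op_A", "Op_B", "Op_C"]))  # 3 T1 + 8 other = 11
print(total_loads(["Op_A", "Op_C", "Op_B"]))  # 4 T1 + 8 other = 12
```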

Question

The difference is one full T1 load (256×128 = 32,768 elements). What output tensor computation order should participants assume when computing subgraph_latencies for a fused subgraph with multiple output tensors?
