This document specifies how LOOM lowers nested-SCF memory operations into
handshake.load, handshake.store, and memory-interface operators, while
preserving the intended read/write order by explicit ctrl-done token wiring.
The core design rule is:
- data values travel on data ports
- ordering travels on a separate ctrl-done network
- every SCF region is abstracted as an
entry -> doneblock - nested regions are connected by a serial-recursive algorithm
- ctrl-done ordering is built per alias-free memory region, not as one mandatory global chain
The normative intent follows the older
dsa-stack/main/lib/dsa/Transforms/SCFToHandshakeDSA implementation,
especially:
- memory-op replacement and connection
- serial-recursive memory control construction
Related documents:
This document covers:
memref.load/memref.storereplacement- grouping accesses by root memref
- creation of memory/extmemory interface ops
- recursive ctrl-done wiring across nested
scf.if,scf.for, andscf.while
This document does not define hardware memory timing. Hardware memory module behavior is defined elsewhere.
LOOM lowers memory in three conceptual layers.
Each source memref.load / memref.store becomes a handshake.load /
handshake.store.
At this stage:
- address and data paths become explicit SSA edges
- each access is recorded with SCF-path metadata
- the memory op itself still lacks its final done-token ordering input
Accesses that belong to the same root memref are grouped and connected to one memory or extmemory interface node.
At this stage:
- load address results connect to memory load-address inputs
- store data/address results connect to memory store inputs
- memory outputs connect back to the load data input
- per-access done tokens are materialized from the memory interface outputs
The compiler then builds a second graph that carries ordering only.
At this stage:
- each memory access receives an execution-control token within its own ordering domain
- each memory access returns one done token within that same domain
- nested SCF structure is recursively lowered into ctrl-done composition per root memref group
- independent memory groups stay independent unless a later synchronization point explicitly joins them
- the final function return control may join group-level done tokens when the enclosing function as a whole must wait for all memory side effects
Memory accesses must be grouped by root memref, not by the last casted or subview value.
The grouping must strip view-like transforms such as:
memref.castmemref.subviewmemref.reinterpret_cast- shape-collapse or shape-expand style view rewrites
Reason:
- a single logical memory object may appear through multiple derived memref values
- those accesses must still share one memory ordering domain
The default ctrl-done ordering domain is one root memref group.
Normative rule:
- each alias-free root memref group gets one independent ctrl-done network
- accesses inside one such group are ordered against one another by the serial-recursive algorithm
- accesses in different groups must not be ordered against one another unless a separate synchronization rule explicitly requires it
In hardware terms, this means:
- each
handshake.memoryorhandshake.extmemoryinstance normally has its own done-token subnet - those done-token subnets are independent by default
- they are only joined when the program requires cross-group synchronization
This independence is not an optional optimization. It is the default semantic shape under the alias-free memref model.
LOOM relies on the frontend to preserve alias structure as much as possible.
Normative assumption:
- distinct memrefs represent distinct logical address spaces whenever that separation is semantically valid
Compilation implication:
- memory alias analysis should be used to avoid collapsing unrelated memory regions into one overly conservative memref domain
- if two references may alias, they must remain in the same ordering domain unless a stronger proof separates them
- if two references are proven non-overlapping, LOOM should prefer keeping them in separate root-memref groups so their ctrl-done networks remain independent
The objective is not merely better performance. It is to preserve the correct parallelism structure of memory accesses through lowering.
A memref.load is lowered to handshake.load.
Conceptually:
load(addr..., data_from_mem, ctrl) -> (data_to_comp, addr_to_mem...)
During initial replacement:
- the address operands are connected immediately
data_from_memis temporarily left as a memory-supplied inputctrlis temporarily a placeholder
A memref.store is lowered to handshake.store.
Conceptually:
store(addr..., data_from_comp, ctrl) -> (data_to_mem, addr_to_mem...)
During initial replacement:
- address operands are connected immediately
- data operand is connected immediately
ctrlis temporarily a placeholder
Each access record must retain:
- original operation
- root memref
- access kind (
loadorstore) - one global ordering index
- one SCF nesting path
The SCF nesting path is required by the serial-recursive control algorithm.
After access collection, accesses are grouped by root memref and connected to one interface operator.
The compiler may choose:
- memory-like internal memory interface
- extmemory-like external memory interface
The ordering algorithm is the same either way.
Within one memory interface group:
- store operands are packed first
- load operands are packed after stores
This preserves a deterministic port layout.
The memory interface returns:
- load data results
- store done tokens
- load done tokens
The compiler must route:
- load data back to the matching
handshake.load - each done token back to the matching access record
No memory access is considered fully connected until its done token is known.
The ctrl-done graph models execution order independently from payload data.
The ctrl-done graph is built per memory ordering domain, not as one required global graph over all memory operations in the function.
Every SCF region is abstracted as:
entry_token -> region_body -> done_token
This is the key simplification used by the serial-recursive algorithm.
Given a sorted list of memory accesses plus their SCF nesting paths:
- process the current SCF level from left to right
- when a child SCF region is encountered, recurse into that child
- the recursion returns one done token for the whole child block
- continue processing the parent level using that returned done token
The algorithm never reasons about raw operations alone. It reasons about
composable entry -> done blocks.
This recursive construction is applied independently inside each root-memref ordering domain.
At the same SCF level, accesses are ordered serially unless an explicitly allowed parallel case applies.
The default rule is:
current_token -> access -> access_done -> next_token
This serial order is the correctness baseline.
Consecutive loads at the exact same SCF level may be parallelized as an optimization when the algorithm proves there is no intervening write at that level that must order them.
Conceptually:
- fork the entry token to the consecutive loads
- let each load execute independently
- join their done tokens before continuing
This is an optimization, not the semantic base case.
The semantic base case remains serial-recursive ordering.
RAR parallelism is only considered inside one ordering domain. It does not justify merging two unrelated memref groups into one shared ctrl-done chain.
For scf.if, the control token is explicitly branched.
Given parent control %ctrl and branch condition %cond:
- split
%ctrlbycond_br(cond, ctrl) - recursively process then-region with the true token
- recursively process else-region with the false token
- merge the two branch done tokens into one parent-level continuation token
The branch condition must be the branch's actual body-visible condition.
No synthetic extra loop-control event may be introduced here.
This is the most delicate case.
For memory ordering inside the loop body, LOOM must use the loop body's gated control stream, not the raw stream.
That means:
- use
after_cond - do not use
raw_will_continue
Reason:
- memory accesses in the body execute
Ntimes - the raw stream has
N + 1events - wiring memory control to the raw stream introduces one extra control token with no matching memory payload event
The loop body is abstracted as:
loop_entry_ctrl -> body_memory_subgraph -> body_done
The control loop for body-local memory ordering is modeled as a control carry:
loop_ctrl = carry(after_cond, entry_ctrl, body_done_feedback)
Conceptually:
- the initial carry output launches the first iteration
- each true iteration feeds back its body done token
- the final false on
after_condexits the loop
Then:
loop_ctrldrives recursive processing of the loop bodybody_doneis split byafter_cond- true branch feeds the carry feedback
- false branch is the loop exit done token
For body-local memory control in scf.for, using raw_will_continue is
incorrect.
The raw stream is reserved for loop-boundary structures such as iter-arg bridging described in spec-dataflow-compilation.md.
scf.while has a before-region and an after-region.
- before-region memory control uses the before-region condition regime
- after-region memory control uses the gated after-region condition regime
- the recursive algorithm must still treat each region as an
entry -> doneblock
When the while condition becomes false:
- the false branch exits the loop
- there must not be one extra body-local memory-control token after exit
If the surrounding SCF path encodes multiple independent spatial siblings, their control subgraphs may be driven in parallel by:
- forking the parent token
- recursively processing each sibling independently
- joining sibling done tokens at the parent level
This is orthogonal to the body-local serial-recursive rule inside each sibling.
After all memory groups are processed:
- each root memref group yields one group done token
- all group done tokens are joined
- the joined token becomes the function return control token
If there are no memory accesses:
- the function return control falls back to the entry token
Cross-group synchronization is exceptional, not the default.
LOOM should only combine done tokens from different memory ordering domains when the program semantics explicitly require a global synchronization point.
Typical examples are:
- function-level completion, when the function result is not observable until all memory side effects across all groups are complete
- an explicit fence-like synchronization construct
- any later construct whose semantics require multiple memory regions to have jointly completed before progress may continue
Outside such cases:
- each
handshake.memoryorhandshake.extmemorydone-token network should remain independent - the compiler must not add artificial cross-domain dependencies
The following invariants are mandatory.
- every
handshake.loadandhandshake.storemust receive exactly one control input in the final DFG - every access must receive exactly one memory-interface done token
- every load must receive exactly one data-from-memory connection
- every access belongs to exactly one root-memref group
- every access carries one SCF nesting path
- recursive ordering must be computed from those paths, not from textual locality alone
- different root-memref groups are independent ordering domains unless an explicit synchronization construct joins them
- body-local memory ordering in
scf.forusesafter_cond - loop-boundary iter-arg bridging may use
raw_will_continue - these two uses must never be conflated
- the return control token must postdominate every memory access done token
- no access may remain without its done-token consumer path
If a program finishes with:
- correct data results
- one extra latched
ctrlinhandshake.loadorhandshake.store - one extra latched
condin downstreamhandshake.cond_br
the first thing to check is whether the memory-control chain inside a
scf.for body was wired to raw_will_continue instead of after_cond.