This document is the normative specification for what may appear inside one
fabric.function_unit body.
It defines:
- the operation allowlist for FU bodies
- the structural and typing rules for FU bodies
- the body-level behavior classes that determine whether latency and interval are meaningful
- the FU-local completion and temporal_pe output-drain model
This document is the single source of truth for the future
fabric.function_unit body validator.
The simulator and RTL generator must stay aligned with the operation coverage
described here and in the backend-specific specs, while validation always
enforces the body allowlist and timing rules in this document.
Related documents:
fabric.function_unit is LOOM's hardware abstraction for one software-visible
operation or one software-visible subgraph.
Compared with the legacy design's older fabric.pe body rules, LOOM keeps an explicit body
allowlist but intentionally relaxes several blanket exclusivity rules:
- handshake.load and handshake.store are not forced to be singleton or body-exclusive
- handshake.constant is not forced to be singleton or body-exclusive
- LOOM does not use the legacy design's homogeneous-consumption grouping rule
LOOM instead judges legality by three layers of rules:
- explicit body operation allowlist
- explicit structural and typing constraints
- explicit body-level timing class
This means that a compound FU body is legal when:
- every operation is individually allowed
- the body obeys the structural and typing rules below
- the whole body still has one coherent externally visible firing behavior
One fabric.function_unit body is a single-block SSA graph over native
software semantic values.
Normative consequences:
- FU block arguments are the FU inputs
- internal SSA values are FU-internal software values
- fabric.yield defines the FU outputs
- PE-side !fabric.bits<...> or !fabric.tagged<...> transport types must not appear inside the FU body
- width adaptation and tag handling belong to the enclosing PE and switch network, not to the FU body itself
The body is not a place to build routing structure, memory hierarchy structure, or nested PE hierarchy. It is only a place to describe the software behavior implemented by one hardware computation resource.
Only the operations listed in this section may appear as non-terminator
operations inside one fabric.function_unit body.
Allowed:
- fabric.mux
- fabric.yield as the terminator
Rules:
- fabric.mux is LOOM-specific and may appear only inside fabric.function_unit
- fabric.yield must be the only terminator of the body
- no other fabric.* operation is allowed inside an FU body
Allowed operations:
- arith.addf
- arith.addi
- arith.andi
- arith.cmpf
- arith.cmpi
- arith.divf
- arith.divsi
- arith.divui
- arith.extsi
- arith.extui
- arith.fptosi
- arith.fptoui
- arith.index_cast
- arith.index_castui
- arith.mulf
- arith.muli
- arith.minimumf
- arith.negf
- arith.ori
- arith.remsi
- arith.remui
- arith.select
- arith.shli
- arith.shrsi
- arith.shrui
- arith.sitofp
- arith.subf
- arith.subi
- arith.trunci
- arith.uitofp
- arith.xori
Notes:
- arith.cmpi and arith.cmpf have runtime-configurable predicates as described in spec-fabric-function_unit.md
- arith.constant is not currently in the FU-body allowlist; constant injection inside one FU body is modeled by handshake.constant
Allowed operations:
- math.absf
- math.cos
- math.exp
- math.floor
- math.fma
- math.log2
- math.rsqrt
- math.sin
- math.sqrt
Notes:
- arith.minimumf and math.rsqrt are first-class supported compute ops in current simulation and RTL generation paths, and they are treated as normal fabric.function_unit body operations rather than special control helpers.
- math.floor is accepted by the current body validator and simulator. The RTL generator documents its own support matrix separately.
Allowed operations:
llvm.intr.bitreverse
No other llvm.* operation is currently allowed inside one FU body.
Allowed operations:
- dataflow.carry
- dataflow.gate
- dataflow.invariant
- dataflow.stream
Notes:
- all four current dataflow operations are dedicated fixed-behavior state-machine FUs in LOOM
- therefore each of them must occupy one exclusive fabric.function_unit
- a function_unit that contains any dataflow.* operation must use latency = -1 and interval = -1
Allowed operations:
- handshake.cond_br
- handshake.constant
- handshake.join
- handshake.load
- handshake.mux
- handshake.store
Notes:
- LOOM does not currently allow handshake.sink inside one FU body
- unused FU outputs are handled by PE-side discard or disconnect behavior, not by inserting handshake.sink
- LOOM does not require handshake.load, handshake.store, or handshake.constant to be body-exclusive
All non-terminator operations outside the allowlist above are illegal inside
one fabric.function_unit body unless this document is amended.
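The allowlist check described above can be sketched as a simple set-membership test. This is a minimal illustration, not the actual validator: it assumes a body is represented as a plain list of operation-name strings, and the names `ALLOWED_NON_TERMINATORS` and `illegal_ops` are hypothetical.

```python
# Allowlist transcribed from this document. Representing a body as a list of
# op-name strings is an assumption made only for this sketch.
_ARITH = (
    "addf addi andi cmpf cmpi divf divsi divui extsi extui fptosi fptoui "
    "index_cast index_castui mulf muli minimumf negf ori remsi remui select "
    "shli shrsi shrui sitofp subf subi trunci uitofp xori"
).split()
_MATH = "absf cos exp floor fma log2 rsqrt sin sqrt".split()
_DATAFLOW = "carry gate invariant stream".split()
_HANDSHAKE = "cond_br constant join load mux store".split()

ALLOWED_NON_TERMINATORS = (
    {"fabric.mux", "llvm.intr.bitreverse"}
    | {f"arith.{n}" for n in _ARITH}
    | {f"math.{n}" for n in _MATH}
    | {f"dataflow.{n}" for n in _DATAFLOW}
    | {f"handshake.{n}" for n in _HANDSHAKE}
)


def illegal_ops(body_ops):
    """Return the non-terminator op names that are outside the allowlist, in body order."""
    return [op for op in body_ops if op not in ALLOWED_NON_TERMINATORS]
```

For example, `illegal_ops(["arith.muli", "arith.addi"])` returns an empty list, while a body containing `arith.constant` or `handshake.sink` reports those names.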
Some allowed FU-body operations carry runtime-configurable fields.
Current configurable FU-body operations are:
- fabric.mux
- handshake.constant
- handshake.join
- arith.cmpi
- arith.cmpf
- dataflow.stream
No other FU-body operation currently contributes runtime configuration bits.
The current configurable fields are:
- fabric.mux: sel, discard, disconnect
- handshake.constant: output literal value
- handshake.join: join_mask
- arith.cmpi: integer comparison predicate
- arith.cmpf: floating-point comparison predicate
- dataflow.stream: cont_cond
Normative notes:
- these fields belong to the FU-internal runtime configuration payload, not to the enclosing PE mux or demux state
- body textual attributes provide the structural default or initial value, but the final mapped configuration may overwrite them
- configurable-field serialization order is defined in spec-fabric-function_unit.md
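As a rough illustration of the table above, the configurable fields of a body can be enumerated with a lookup. The field names `value` and `predicate` are hypothetical labels for "output literal value" and "comparison predicate"; the body-order traversal is only for illustration and is not the serialization order, which is defined in spec-fabric-function_unit.md.

```python
# Field names "value" and "predicate" are hypothetical placeholders; the other
# names (sel, discard, disconnect, join_mask, cont_cond) come from this document.
CONFIG_FIELDS = {
    "fabric.mux": ["sel", "discard", "disconnect"],
    "handshake.constant": ["value"],
    "handshake.join": ["join_mask"],
    "arith.cmpi": ["predicate"],
    "arith.cmpf": ["predicate"],
    "dataflow.stream": ["cont_cond"],
}


def config_fields(body_ops):
    """List (op index, op name, field) triples for each configurable field, in body order."""
    return [
        (index, op, field)
        for index, op in enumerate(body_ops)
        for field in CONFIG_FIELDS.get(op, [])
    ]
```

Operations that are not in the table, such as `arith.addi`, contribute no entries.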
The FU signature must use native semantic types.
Currently allowed FU input and output types are:
- signless integers such as i1, i8, i16, i32, i64
- floating-point types such as f16, f32, f64
- index
- none
The following are not allowed on FU ports:
- !fabric.bits<...>
- !fabric.tagged<...>
- memref<...>
- PE, switch, or storage container types
Internal SSA values inside one FU body follow the same rule:
- use native semantic value types only
- do not use Fabric transport types
- do not use memory-reference types
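A type filter for these rules might look like the sketch below. It works on textual type names, which is an assumption of the sketch. Because the text says "such as", the exact set of accepted integer widths and float types is not fully specified here; the sketch accepts any `iN` width and only the three float types named above.

```python
import re

# Only the float types named in the text are accepted here; the real allowed
# set may be wider ("such as" in the source), so treat this as an assumption.
_FLOAT_TYPES = {"f16", "f32", "f64"}
_OTHER_NATIVE = {"index", "none"}


def is_native_type(type_str):
    """Return True if the textual type is a native semantic type under this sketch."""
    t = type_str.strip()
    if t in _FLOAT_TYPES or t in _OTHER_NATIVE:
        return True
    # Signless integer of any positive width, e.g. i1, i8, i32.
    return re.fullmatch(r"i[1-9][0-9]*", t) is not None
```

Transport and memory types such as `!fabric.bits<32>` or `memref<4xi32>` are rejected because they match none of the accepted forms.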
Practical consequence:
- tag stripping or tag insertion must happen outside the FU body
- address transport into memory-capable structures happens through PE and switch wiring, then appears inside the FU body as native index
- handshake.load and handshake.store operate on native typed values in the FU body view
A fabric.function_unit body must contain exactly one block and must end in
fabric.yield.
The body must not contain nested control-flow regions or nested symbol definitions.
Normative rules:
- fabric.yield must be the terminator
- the number of fabric.yield operands must equal the declared result count
- yield operand types must match the declared result types
- a yield operand must not be a direct block argument of the same FU
The last rule forbids trivial passthrough FUs. If a design wants a pure forwarding path, it must be modeled outside the FU body by PE or switch routing structure.
Every FU block argument must be consumed by at least one non-terminator body operation.
LOOM treats unused FU inputs as illegal dead interface.
An FU body must contain at least one non-terminator operation.
An empty body or a body that only forwards block arguments to fabric.yield
is illegal.
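The yield, dead-input, and non-empty-body rules above can be sketched as one pass over a small in-memory body model. The `Op` and `FuBody` classes are hypothetical and exist only for this illustration; type matching between yield operands and declared results is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str       # e.g. "arith.muli"
    operands: list  # SSA value names consumed
    results: list   # SSA value names produced


@dataclass
class FuBody:
    args: list            # block argument names (FU inputs)
    ops: list             # non-terminator operations
    yield_operands: list  # operands of the fabric.yield terminator
    num_results: int      # declared FU result count


def structural_errors(body):
    """Collect violations of the yield, dead-input, and non-empty-body rules."""
    errors = []
    if not body.ops:
        errors.append("body has no non-terminator operation")
    if len(body.yield_operands) != body.num_results:
        errors.append("yield arity does not match declared result count")
    for value in body.yield_operands:
        if value in body.args:
            errors.append(f"yield operand {value} is a direct block argument")
    # Only non-terminator operations count as consumers of block arguments.
    used = {value for op in body.ops for value in op.operands}
    for arg in body.args:
        if arg not in used:
            errors.append(f"block argument {arg} is unused")
    return errors
```

A multiply-then-add body with three inputs and one yielded result passes with no errors, whereas a body that only forwards its argument to the terminator trips three rules at once.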
The body must be representable as one single-block SSA graph.
Normative consequences:
- no func.*, cf.*, scf.*, or affine.* control-flow structure is allowed inside an FU body
- no nested fabric.function_unit, fabric.spatial_pe, fabric.temporal_pe, fabric.spatial_sw, or fabric.temporal_sw is allowed
- stateful behavior must be expressed through allowed body operations such as dataflow.carry, dataflow.invariant, or dataflow.stream, not through region control flow
The following operations are never allowed inside one FU body:
- fabric.module
- fabric.instance
- fabric.spatial_pe
- fabric.temporal_pe
- fabric.spatial_sw
- fabric.temporal_sw
- fabric.memory
- fabric.extmemory
- fabric.fifo
- fabric.add_tag
- fabric.map_tag
- fabric.del_tag
These operations belong to ADG hierarchy, routing, memory topology, or tag boundary modeling. They are not FU-body operations.
LOOM allows one fabric.function_unit body to contain multiple internal
operations and to represent a compound software subgraph rather than one
single software op.
Typical legal patterns include:
- arithmetic pipelines such as multiply-then-add
- predicate formation followed by handshake.cond_br or handshake.mux
- memory address generation plus one or more handshake.load or handshake.store
- internal configurable alternatives expressed by one or more fabric.mux
LOOM does not require a compound body to be homogeneous by operation family. For example, the following mixtures are legal in principle:
- arith.* together with handshake.*
- handshake.load or handshake.store together with arithmetic helpers
- handshake.constant together with other allowed operations
The dataflow family is the current exception:
- dataflow.stream
- dataflow.gate
- dataflow.carry
- dataflow.invariant
Each of these must occupy an exclusive FU body and must not be mixed with other non-terminator operations.
Apart from this dataflow exception, the only requirement on a mixed body is that it still has one coherent externally visible FU behavior as described by the timing classes below.
An FU body belongs to the single-fire single-result-set class when one firing produces exactly one externally visible result tuple.
The tuple may contain:
- zero results
- one result
- multiple results
The key property is that one firing has one FU-local completion event and one result tuple.
For this class:
- latency is meaningful
- interval is meaningful
- latency must be >= 0
- interval must be >= 1
Current operations that belong to this class when used as standalone FU behaviors:
- all allowed arith.*
- all allowed math.*
- llvm.intr.bitreverse
- handshake.cond_br
- handshake.constant
- handshake.join
- handshake.load
- handshake.mux
- handshake.store
The dataflow family is not in this class in LOOM's current hardware model.
LOOM treats all four current dataflow operations as dedicated fixed state
machines rather than ordinary scalar-latency datapath FUs.
For the dedicated dataflow state-machine class:
- latency is not modeled as one scalar fire-to-completion delay
- interval is not modeled as one scalar refire spacing
- latency must be -1
- interval must be -1
- the body must contain exactly one non-terminator dataflow operation
Current members of this class:
- dataflow.carry
- dataflow.gate
- dataflow.invariant
- dataflow.stream
For one compound FU body, the timing class is a property of the whole body, not of any single internal operation.
Normative rule:
- if the body contains any dataflow.* operation, it enters the dedicated dataflow state-machine class and must contain no other non-terminator operation
- if the whole body behaves as one firing to one result tuple, then it is single-fire single-result-set
Current conservative classifier for LOOM:
- any body containing dataflow.stream, dataflow.gate, dataflow.carry, or dataflow.invariant is treated as a dedicated dataflow state-machine FU
- any other currently allowed body is expected to satisfy the single-fire single-result-set contract
This is the intended first validator contract. A future LOOM revision may add more refined whole-body behavior inference.
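The conservative classifier and its latency and interval constraints can be sketched as follows. The function names and the class-name strings are hypothetical; the body is again assumed to be a list of non-terminator op-name strings.

```python
DATAFLOW_OPS = {
    "dataflow.carry",
    "dataflow.gate",
    "dataflow.invariant",
    "dataflow.stream",
}


def classify(body_ops):
    """Classify a body by the conservative rule: any dataflow op selects the state-machine class."""
    if any(op in DATAFLOW_OPS for op in body_ops):
        if len(body_ops) != 1:
            raise ValueError("a dataflow op must be the only non-terminator operation")
        return "dataflow-state-machine"
    return "single-fire-single-result-set"


def timing_ok(body_ops, latency, interval):
    """Check that latency/interval are consistent with the body's timing class."""
    if classify(body_ops) == "dataflow-state-machine":
        return latency == -1 and interval == -1
    return latency >= 0 and interval >= 1
```

Under this sketch, a multiply-then-add body with `latency = 2` and `interval = 1` is consistent, while the same settings on a `dataflow.stream` body are rejected.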
fabric.mux is the only FU-internal structural routing primitive.
Normative rules:
- it may appear only inside one fabric.function_unit
- it may be used to select among alternative internal producers or consumers
- it does not change the body-level timing class by itself
- its runtime-config fields are part of the FU internal configuration payload
See spec-fabric-function_unit.md for the
full sel, discard, and disconnect semantics.
fabric.mux, handshake.mux, and arith.select all express some form of selection, but they belong to three different semantic layers and must not be conflated.
fabric.mux is a configuration-time structural selector.
Normative interpretation:
- its sel comes from FU-internal runtime configuration, not from one runtime token operand
- once configured, it behaves as one fixed hard path inside the selected FU shape
- it is best understood as one FU-internal static routing primitive, similar in spirit to a tiny spatial_sw inside the function_unit
- tech-mapping and config generation treat it as part of the hardware shape choice of the FU
In other words, fabric.mux chooses which hardware subgraph is active, not
which runtime input token is selected on one dynamic firing.
handshake.mux is a runtime dataflow operator.
Normative interpretation:
- its selector is a runtime operand
- one firing consumes the selector and the selected data input
- non-selected data inputs are not consumed by that firing
- therefore non-selected inputs remain blocked with respect to that firing
handshake.mux is not a hard configured route. It is an executing software
operator with firing-time consume behavior.
arith.select is a runtime datapath operator.
Normative interpretation:
- its selector is a runtime operand
- one firing consumes the selector and all data operands
- the result value is whichever data operand the selector chooses
- even the non-chosen data operand is still consumed as part of the firing
arith.select is therefore different from handshake.mux even though both
are runtime-controlled selectors.
The three-way distinction is:
- fabric.mux: configuration-selected hard route inside the FU body
- handshake.mux: runtime-controlled handshake selection that consumes only the selected input and blocks the others
- arith.select: runtime-controlled datapath selection that consumes all inputs and only chooses which consumed value becomes the output
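The difference in consume behavior between the two runtime selectors can be illustrated with token queues. This is a behavioral sketch only: modeling each input as a `deque` of pending tokens is an assumption, and fabric.mux is not modeled because its choice is fixed by configuration rather than made per firing.

```python
from collections import deque


def fire_handshake_mux(sel_q, data_qs):
    """Consume the selector and only the selected data token; return None if not ready."""
    if not sel_q:
        return None
    chosen = data_qs[sel_q[0]]
    if not chosen:
        return None
    sel_q.popleft()
    # Non-selected queues are left untouched, so their tokens stay pending.
    return chosen.popleft()


def fire_arith_select(cond_q, true_q, false_q):
    """Consume the selector and both data tokens; return the chosen value or None if not ready."""
    if not (cond_q and true_q and false_q):
        return None
    cond = cond_q.popleft()
    on_true = true_q.popleft()
    on_false = false_q.popleft()
    # Both data tokens are gone after the firing, even the one not chosen.
    return on_true if cond else on_false
```

After one handshake.mux firing that selects input 1, the token on input 0 is still queued; after one arith.select firing, both data queues have advanced.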
Normative rules for handshake.constant:
- it is allowed to coexist with other allowed operations
- it contributes runtime-config bits for its literal value
- it is not body-exclusive in LOOM
LOOM treats one handshake.join inside an FU body as a fixed maximum hardware
fan-in synchronizer with runtime-selectable participating inputs.
Normative rules:
- textual operand count defines hardware fan-in
- current supported hardware fan-in range is 1..64
- the join contributes one runtime-config bit per textual input
- multiple joins may appear inside one FU body
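One way to picture the masked join is the readiness check below. The bit polarity (bit i set means input i participates) is an assumption of this sketch, as is the behavior for an all-zero mask, which this document does not define.

```python
def join_can_fire(valid, join_mask):
    """Return True when every participating input has a valid token.

    valid: list of bools, one per textual join input (hardware fan-in).
    join_mask: int with one bit per input; bit i set is ASSUMED to mean
    that input i participates in the synchronization.
    """
    if not 1 <= len(valid) <= 64:
        raise ValueError("join fan-in must lie in 1..64")
    # With an all-zero mask this returns True vacuously; the source does not
    # specify that case, so callers should not rely on it.
    return all(
        is_valid for index, is_valid in enumerate(valid) if (join_mask >> index) & 1
    )
```

With three textual inputs and mask `0b101`, the join waits only on inputs 0 and 2, so a missing token on input 1 does not block it.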
Normative rules for handshake.load and handshake.store:
- they may coexist with other allowed operations
- they are not singleton-only and not body-exclusive in LOOM
- multiple loads and stores may appear in one compound body if the whole body still represents one coherent FU behavior
This is a deliberate LOOM relaxation compared with the legacy design's earlier PE-body rules.
Normative rules for dataflow.carry, dataflow.gate, and dataflow.invariant:
- each must occupy an exclusive FU body
- the containing FU must use latency = -1 and interval = -1
- they are not mixed with arithmetic, handshake, or fabric.mux operations in the current LOOM hardware model
Normative rules for dataflow.stream:
- it must occupy an exclusive FU body
- therefore the containing FU must use latency = -1 and interval = -1
- its continuation condition field contributes runtime configuration bits
The following patterns are illegal even if each individual operation were otherwise allowed:
- direct passthrough from FU input block argument to fabric.yield
- unused FU block arguments
- nested container or routing operations inside the body
- tag-boundary manipulation inside the body
- control-flow regions inside the body
- memory hierarchy construction inside the body
- any body that mixes one dataflow.* operation with any other non-terminator operation
- body-local operation sets that cannot be classified into one coherent body-level timing class
In particular, the following old legacy-style assumptions are not part of LOOM:
- no load/store exclusivity rule
- no constant exclusivity rule
- no homogeneous-consumption grouping rule
The future fabric.function_unit body validator should enforce at least the
following rules:
- every non-terminator body operation is in the allowlist above
- the body has exactly one block and ends in fabric.yield
- fabric.yield arity and types match the FU signature
- no yield operand is a direct block argument
- every block argument is consumed
- the body contains at least one non-terminator operation
- no prohibited Fabric hierarchy, routing, memory, or tag operation appears inside the body
- no control-flow or region-bearing program structure appears inside the body
- handshake.join operand count lies in the supported 1..64 range
- body timing class and latency or interval settings are consistent
- any body containing one dataflow.* operation is checked under the dedicated dataflow state-machine rules
This document defines the intended validator contract even if the current code does not yet enforce all of it.
This section defines how FU-local completion interacts with
fabric.temporal_pe.
A function_unit fires when it consumes one complete input tuple for its
currently selected behavior.
For a direct single-op FU, this means the underlying operation has accepted all operands required by that operation.
For a compound FU body that implements one software subgraph, this means the FU has accepted one externally visible input tuple for that subgraph.
A function_unit completes when it has produced one externally visible result
tuple for one firing.
The completion event is FU-local. In a temporal_pe, FU-local completion is
not identical to immediate visibility on the PE egress, because the result may
wait in one FU-local output register before arbitration grants an egress slot.
latency is the FU-local delay from firing to FU-local completion.
Normative rules:
- valid range is 0.. (any integer >= 0)
- the value 0 means combinational completion in the same cycle as the firing
- the value -1 means not applicable
- latency is defined only for single-fire single-result-set behavior
latency does not include extra waiting introduced by temporal-PE output
arbitration after the FU has already completed.
interval is the minimum number of cycles from one firing to the next firing
of the same FU, assuming no additional blocking condition exists.
Normative rules:
- valid range is 1.. (any integer >= 1)
- the value 1 means fully pipelined
- the value -1 means not applicable
- interval is defined only for single-fire single-result-set behavior
In a temporal_pe, the refire condition is the conjunction of:
- the FU's intrinsic interval constraint
- the temporal scheduler selecting that FU
- all required operands being ready
- the FU not being busy because one or more FU-local output registers still hold undrained results
Normative rule:
- each temporal_pe may fire at most one FU per cycle
The selected FU is determined by the active instruction slot and the temporal PE scheduler.
Every FU output port in one temporal_pe has a dedicated FU-local output
register.
Normative rules:
- every FU completion writes its produced result values into these FU-local output registers
- there is no direct bypass from FU completion to one temporal-PE egress port
- the arbitration stage always observes FU-local output registers, never raw FU combinational outputs
This rule applies even when static analysis suggests that no conflict is possible in one particular program.
An FU is busy if any of its FU-local output registers still contains one undrained valid result.
Normative rules:
- a busy FU must not fire again
- this prohibition applies even if an instruction selects that FU and all operand inputs are ready
- the FU becomes idle only after all valid output-register contents produced by the previous firing have been drained to their intended PE egresses
Therefore the temporal-PE refire rule is stricter than raw interval alone.
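The refire conjunction, including the busy rule, can be written as a single predicate. The parameter names are hypothetical, and the sketch assumes the caller tracks the cycle of the FU's previous firing.

```python
def can_fire(cycle, last_fire_cycle, interval, selected, operands_ready, output_regs_valid):
    """Return True if the FU may fire in this cycle under the temporal-PE refire rule.

    last_fire_cycle: cycle of the previous firing, or None if the FU has never fired.
    output_regs_valid: list of bools, one per FU-local output register.
    """
    interval_ok = last_fire_cycle is None or cycle - last_fire_cycle >= interval
    # The FU is busy while any FU-local output register holds an undrained result.
    busy = any(output_regs_valid)
    return interval_ok and selected and operands_ready and not busy
```

Even a fully pipelined FU (`interval = 1`) that is selected with ready operands cannot fire while one of its output registers is still occupied.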
Two different FUs may complete in the same cycle even though only one FU fires per cycle, because their latencies may differ.
Example:
- fuA fires at cycle 0
- fuB fires at cycle 1
- fuA.latency = 4
- fuB.latency = 3
Both FU-local completions occur at cycle 4.
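The arithmetic of this example can be checked directly, taking FU-local completion as fire cycle plus latency. The helper name `completion_cycle` is hypothetical.

```python
def completion_cycle(fire_cycle, latency):
    """FU-local completion cycle for a single-fire single-result-set FU."""
    return fire_cycle + latency


# (fire cycle, latency) pairs from the example above.
firings = {"fuA": (0, 4), "fuB": (1, 3)}
completions = {
    name: completion_cycle(fire, latency) for name, (fire, latency) in firings.items()
}
```

Both entries of `completions` evaluate to 4, so the two FU-local output registers become valid in the same cycle and may contend for egress.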
When multiple valid FU-local output registers request the same temporal-PE egress opportunity, the PE uses round-robin arbitration.
Normative rules:
- arbitration order is by FU definition order inside the temporal_pe
- FU definition order is the same order used for opcode numbering
- after reset, the initial highest priority is the lowest opcode
- after a successful grant, the round-robin pointer advances so later grants continue from the next FU in cyclic order
If one or more requesting FUs are not granted:
- their results remain stored in their own FU-local output registers
- those FUs remain busy
- they become eligible for output again in later arbitration cycles
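The round-robin rules above can be sketched as a small arbiter. The class name is hypothetical, and the sketch assumes one grant per call, which models a single contended egress opportunity per cycle.

```python
class RoundRobinArbiter:
    """Round-robin arbiter over FUs indexed by opcode (FU definition order)."""

    def __init__(self, num_fus):
        self.num_fus = num_fus
        # After reset, the lowest opcode has the highest priority.
        self.pointer = 0

    def grant(self, requests):
        """Grant one requester, or return None if nobody requests.

        requests: list of bools indexed by opcode; True means that FU has a
        valid FU-local output register waiting for this egress.
        """
        for offset in range(self.num_fus):
            index = (self.pointer + offset) % self.num_fus
            if requests[index]:
                # Later grants continue from the next FU in cyclic order.
                self.pointer = (index + 1) % self.num_fus
                return index
        return None
```

If opcodes 0 and 1 request in the same cycle, opcode 0 is granted first and opcode 1 on the next call; the FU that was not granted keeps its result in its output register and stays busy in the meantime.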
In a temporal_pe, the observable egress timing of one FU result is:
- FU fire time
- plus FU-local latency
- plus any additional waiting time in the FU-local output register until the arbiter grants egress
Therefore:
- latency models FU-local compute delay
- arbitration delay is a separate temporal-PE effect
LOOM deliberately differs from the legacy design's older PE-body rules in the following ways:
- fabric.mux is an allowed LOOM-specific FU-body operation
- handshake.sink is not an allowed FU-body operation
- handshake.load and handshake.store are not body-exclusive
- handshake.constant is not body-exclusive
- current dataflow operations are exclusive dedicated FUs with latency = -1 and interval = -1
- no homogeneous-consumption grouping rule is used
- FU bodies are structurally stricter about hierarchy: no nested fabric.instance and no nested PE definitions inside the body