LOOM memory-facing hardware uses explicit external-memory resources together with switch routing, optional tag transforms, and a backend-independent memory backing abstraction.
Placement rules:
- fabric.memory and fabric.extmemory definitions may appear directly in the top-level module or in fabric.module
- inline fabric.memory and fabric.extmemory instantiations may appear directly only in fabric.module
- fabric.instance targeting one fabric.memory or fabric.extmemory definition may appear directly only in fabric.module
LOOM does not require a one-to-one relationship between software memrefs and hardware memory interfaces.
Instead:
- each software handshake.extmemory or handshake.memory is placed onto a compatible hardware fabric.extmemory or fabric.memory
- one hardware memory interface may host multiple software memory regions up to its numRegion capacity
- region selection is part of mapping, not a fixed syntactic binding
This means the mapper may choose either:
- separate hardware memory interfaces for separate software regions
- or a shared hardware memory interface with multiple regions
When load and store counts are both one, a memory may be used without tagged memory routing.
When ldCount > 1 or stCount > 1, LOOM uses tagged memory routing so multiple
logical streams can share one memory-facing endpoint.
LOOM hardware memory interfaces are organized by signal family, not by software operand order.
fabric.memory uses these physical families:
- inputs: load_addr, store_addr, store_data
- outputs: load_data, load_done, store_done
Family omission is structural:
- if ldCount = 0, there is no load_addr, load_data, or load_done
- if stCount = 0, there is no store_addr, store_data, or store_done
When ldCount > 1 or stCount > 1, the family still appears once
physically, but its payload becomes tagged.
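The family rules above can be sketched as a small helper. This is only an illustration: the function name `memory_families` is hypothetical, and the single interface-wide `tagged` flag is an assumption, since the text does not say whether tagging is decided per family or for the whole interface.

```python
def memory_families(ld_count, st_count):
    """Return (inputs, outputs, tagged) for a fabric.memory-style interface.

    Hypothetical helper: families are omitted structurally when the
    corresponding count is zero, and each family appears at most once.
    """
    inputs, outputs = [], []
    if ld_count > 0:
        inputs.append("load_addr")
        outputs += ["load_data", "load_done"]
    if st_count > 0:
        inputs += ["store_addr", "store_data"]
        outputs.append("store_done")
    # Assumption: one flag for the whole interface.
    tagged = ld_count > 1 or st_count > 1
    return inputs, outputs, tagged
```

For example, a store-only interface with two store lanes keeps only the store families and becomes tagged.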
For a tagged hardware memory-family port:
- tagged versus non-tagged is a hardware parameter
- tagWidth is a hardware parameter
- the concrete tag carried by one software stream is a runtime value
For fabric.memory and fabric.extmemory, the tagged family width must be
large enough to encode the maximum logical lane count:
tagWidth >= log2Ceil(max(ldCount, stCount))
The mapper may assign or transform runtime tag values, but it does not infer the hardware tag width.
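A minimal sketch of the width constraint, assuming `log2Ceil(n)` means the smallest `k` with `2**k >= n`. The function names are hypothetical, and clamping the lane count to at least 1 is an assumption made only to keep the helper total.

```python
def log2_ceil(n):
    """Smallest k such that 2**k >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


def min_tag_width(ld_count, st_count):
    """Lower bound on tagWidth for the given logical lane counts."""
    # Assumption: clamp to 1 so an interface with no lanes yields width 0.
    return log2_ceil(max(ld_count, st_count, 1))


def tag_width_is_sufficient(tag_width, ld_count, st_count):
    """Check a declared hardware tagWidth; the mapper validates, never infers."""
    return tag_width >= min_tag_width(ld_count, st_count)
```

With `ldCount = 5` and `stCount = 1` the bound is 3 bits, so a declared `tagWidth` of 2 would be rejected rather than widened.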
fabric.memory may additionally expose a memref-style externally visible view
when is_private is not true. That memref represents a slave-style,
memory-mapped access path into the scratchpad. In the visualization model,
that public memref is the opposite-facing counterpart of fabric.extmemory:
- fabric.extmemory shows one memref input on the ingress side
- non-private fabric.memory shows one memref output on the egress side
fabric.extmemory uses the same request and response family order, but always
consumes one incoming module memref operand first. That memref represents the
master-style backing memory interface that the accelerator actively accesses.
Software memory ops use CIRCT Handshake ordering, which is intentionally different from hardware family order.
handshake.memory uses:
- operands: all stores first as (stdata1, staddr1, stdata2, staddr2, ...), then all loads as (ldaddr1, ldaddr2, ...)
- results: all load data as (lddata1, lddata2, ...), then all completion tokens ordered like the request operands: (stnone1, stnone2, ..., ldnone1, ldnone2, ...)
handshake.extmemory uses the same load and store ordering, with one leading
memref operand naming the backing memory object.
The mapper must therefore bridge between:
- software request order: stores first (data, then address, per store), then load addresses
- hardware request family order: load address first, then store address and store data
- software result order: load data, then store done, then load done
- hardware response family order: load data, then load done, then store done
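The reordering can be sketched as a mapping from software operand or result positions to hardware families. All names here are hypothetical, and the zero-based lane index is an assumption (the software names above count from 1).

```python
HW_INPUT_ORDER = ["load_addr", "store_addr", "store_data"]
HW_OUTPUT_ORDER = ["load_data", "load_done", "store_done"]


def software_operands(ld_count, st_count):
    """(family, lane) per handshake.memory operand, in software order."""
    ops = []
    for lane in range(st_count):
        ops.append(("store_data", lane))
        ops.append(("store_addr", lane))
    for lane in range(ld_count):
        ops.append(("load_addr", lane))
    return ops


def software_results(ld_count, st_count):
    """(family, lane) per handshake.memory result, in software order."""
    res = [("load_data", lane) for lane in range(ld_count)]
    res += [("store_done", lane) for lane in range(st_count)]
    res += [("load_done", lane) for lane in range(ld_count)]
    return res


def group_by_family(slots, family_order):
    """Group software slots under hardware families, in hardware order.

    Each entry is (lane, software_index); empty families are omitted.
    """
    grouped = {family: [] for family in family_order}
    for sw_index, (family, lane) in enumerate(slots):
        grouped[family].append((lane, sw_index))
    return {family: v for family, v in grouped.items() if v}
```

For `ldCount = 2` and `stCount = 1`, the two load addresses sit at software operand positions 2 and 3 but land first in the hardware request order, both on the single `load_addr` family.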
For multi-port memory access:
- each load or store stream is tagged with its logical port index
- tagged streams are merged through routing fabric into the memory endpoint
- the memory demultiplexes by tag internally
- return data or completion tokens preserve tag identity on the way back
This mechanism does not require one fixed micro-topology.
The essential requirement is:
- software streams that share one hardware tagged memory-family port must carry distinct runtime tag values along the shared portion of the path
Those runtime tag values may be introduced or transformed in multiple ways:
- by fabric.add_tag
- by fabric.map_tag
- by a hierarchy of tagged routing stages
They may be merged:
- through one centralized tagged switch
- or through multiple staged tagged switches
They may be stripped on egress by fabric.del_tag, but stripping is only
required when the destination side expects a non-tagged value.
In other words, LOOM does not require a canonical chain such as:
- add_tag -> one switch -> memory
- or memory -> one switch -> del_tag
Bridge extraction for mapper and visualization purposes may therefore stop at
the nearest tagged route-stage port that bounds the shared memory path, even
when no explicit fabric.add_tag or fabric.del_tag exists at the compute
container side. This is valid when the compute-facing side already remains
tagged.
For conflict checking and route validation, LOOM must compare software streams on the full bridge-expanded shared path, not only on the truncated place-and-route boundary path stored in mapper state. Shared tagged resources may appear entirely inside the recovered bridge suffix or prefix.
What matters is the semantic contract:
- a shared tagged family port must see distinct runtime tag values for the logically different software streams that share it
- any fabric.map_tag on that path changes the runtime tag value seen by later tag-dependent routing or execution stages
- operations other than fabric.add_tag, fabric.map_tag, and fabric.del_tag do not change tagged shape; they only transport it
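The contract can be illustrated with a toy path model. This is a sketch under assumptions, not LOOM's checker: a path is a list of `(op, arg)` pairs, `map_tag` is modeled as a dictionary remap in which unlisted tags pass through unchanged, and the function names are hypothetical.

```python
def apply_path(path):
    """Return the runtime tag a stream carries at the end of `path`.

    None means the value is untagged. Only the three tag operations
    change the tag; every other op just transports it.
    """
    tag = None
    for op, arg in path:
        if op == "add_tag":
            tag = arg
        elif op == "map_tag":
            tag = arg.get(tag, tag)  # assumption: unlisted tags pass through
        elif op == "del_tag":
            tag = None
    return tag


def find_tag_conflicts(streams):
    """Map each colliding tag to the streams that reach the shared port with it.

    `streams` maps a stream name to its full bridge-expanded path up to
    the shared tagged port.
    """
    seen = {}
    for name, path in streams.items():
        seen.setdefault(apply_path(path), []).append(name)
    return {tag: names for tag, names in seen.items() if len(names) > 1}
```

Two streams that start with tags 0 and 1 are fine until a later `map_tag` rewrites 1 to 0 on one of them, at which point both arrive at the shared port as tag 0 and the route is rejected.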
Current SciComp collateral uses the same bridge contract for multi-lane
fabric.extmemory families: the generated bridge expands the ingress and
egress memory families per logical lane, and the bridge may terminate at route
stage ports that are already tagged without requiring an explicit
fabric.add_tag or fabric.del_tag at that boundary.
Tagged memory traffic still obeys the switch rules of the enclosing routing resources. The only difference is that payload identity includes a tag field.
LOOM may carry tagged memory traffic through:
- fabric.temporal_sw, when route choice is tag-dependent
- fabric.spatial_sw, when route choice is tag-agnostic and the tag is only part of the payload
In current LOOM memory-bridge design intent:
- ingress request mixing may use any tagged, tag-compatible routing path
- tagged fabric.spatial_sw is legal for tag-agnostic merging
- tagged fabric.temporal_sw is legal for tag-dependent merging or splitting
- ingress may use one stage or multiple hierarchical stages of tagged routing
- if an ingress path already carries a tag before it reaches the memory-family port, that path-derived tag remains authoritative
- the software lane id is only a fallback runtime-tag source when the routed path has not attached any tag yet
- egress may apply fabric.map_tag before the tag-dependent split if the response tag namespace must be rewritten
- egress separation may keep the tag when the destination remains tagged
- egress separation uses fabric.del_tag only when the destination expects a non-tagged value
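The tag-source precedence and the egress stripping rule can be written as two small predicates. The names are hypothetical, and `None` stands for "no tag attached yet".

```python
def ingress_runtime_tag(path_tag, lane_id):
    """Pick the runtime tag seen at the memory-family port.

    A tag already attached by the routed path wins; the software lane id
    is used only when the path has attached nothing.
    """
    # Compare against None explicitly: a path-derived tag of 0 is still a tag.
    return path_tag if path_tag is not None else lane_id


def egress_needs_del_tag(destination_is_tagged):
    """fabric.del_tag is needed only when the destination is non-tagged."""
    return not destination_is_tagged
```

The explicit `None` comparison matters: a truthiness test would wrongly discard a path-derived tag of 0 and fall back to the lane id.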
The memory tagging scheme does not authorize illegal structural merging at one spatial-switch output, whether the spatial switch payload is tagged or non-tagged.
This remains true after tag-width adaptation. Two source-side tag values that start out different may still be illegal if their routed path would require one of them to become unrepresentable on a tagged hardware port, or if an explicit later tag transform makes them equal on a shared tagged resource.
LOOM must not treat implicit width adaptation itself as a legal tag-rewrite
mechanism. Only explicit fabric.add_tag, fabric.map_tag, or
fabric.del_tag boundaries may change the runtime tag meaning seen by later
shared tagged resources.
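A sketch of the representability check, assuming tags are unsigned integers and a port of width `tag_width` can carry values `0 .. 2**tag_width - 1`. The function name is hypothetical.

```python
def unrepresentable_tags(tags, tag_width):
    """List streams whose runtime tag does not fit on a tagged port.

    `tags` maps stream name to runtime tag. Out-of-range tags are
    reported as errors; they are never truncated or wrapped, because
    implicit width adaptation is not a legal tag rewrite.
    """
    errors = []
    for name, tag in sorted(tags.items()):
        if not 0 <= tag < (1 << tag_width):
            errors.append(f"{name}: tag {tag} does not fit in {tag_width} bits")
    return errors
```

If tag 5 were silently truncated to 2 bits it would become 1 and could collide with a stream that legitimately carries tag 1, which is why the sketch reports the route instead of adapting it.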
LOOM treats external memory as a set of numbered regions. Each region has:
- a region id
- a tag range
- a base address
- an element-size code
- a backing implementation
The accelerator runtime binds these regions before launch.
addr_offset_table entries use the LOOM tuple:
- valid
- start_tag
- end_tag
- addr_offset
- elem_size_log2
elem_size_log2 follows AXI AxSIZE style encoding:
- 0 = 1 byte
- 1 = 2 bytes
- 2 = 4 bytes
- 3 = 8 bytes
The element-size code is per region because different software memory regions may share one wider hardware memory interface.
This region model works together with tagged family ports:
- addr_offset_table identifies which tag range belongs to which software region, where that region starts, and what element size it uses
- tagged request and response ports carry the stream identity for that region or logical access lane
- different tags may progress independently
- requests with the same tag must preserve order
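A region lookup over addr_offset_table might look like the sketch below. Several details are assumptions the text does not settle: the tag range is treated as half-open (`start_tag <= tag < end_tag`), the region id is taken to be the table index, and the incoming address is treated as an element index scaled by the element size. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RegionEntry:
    valid: bool
    start_tag: int
    end_tag: int        # assumption: exclusive upper bound
    addr_offset: int
    elem_size_log2: int  # 0 = 1 byte, 1 = 2 bytes, 2 = 4 bytes, 3 = 8 bytes


def find_region(table: Sequence[RegionEntry], tag: int) -> Optional[int]:
    """Return the index of the first valid entry whose tag range holds `tag`."""
    for region_id, entry in enumerate(table):
        if entry.valid and entry.start_tag <= tag < entry.end_tag:
            return region_id
    return None


def byte_address(entry: RegionEntry, elem_index: int) -> int:
    """Assumed address rule: region base plus element index scaled to bytes."""
    return entry.addr_offset + (elem_index << entry.elem_size_log2)
```

Under these assumptions, a request with tag 1 and element index 3 into a 4-byte region based at `0x1000` resolves to byte address `0x100C`.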
In LOOM syntax:
- hardware-structure fields such as ldCount, stCount, lsqDepth, memrefType, and numRegion belong to []
- runtime region programming such as addr_offset_table belongs to attributes {}
One important LOOM distinction is ownership of the region base address:
- for fabric.memory, the mapper may compute on-chip base offsets directly
- for fabric.extmemory, the mapper does not know the host virtual or physical base address of the backing memory object
Therefore:
- fabric.extmemory configuration emitted by the mapper always serializes addr_offset as 0
- the host runtime is responsible for patching or programming the actual base address before launch
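The host-side patch step could be sketched as follows. The dictionary layout of a table entry, the function name, and keying `host_bases` by table index are all assumptions for illustration; the real serialized format is not described here.

```python
def patch_extmemory_offsets(table, host_bases):
    """Return a copy of `table` with host base addresses filled in.

    `table` is a list of dict entries as emitted by the mapper, where
    every fabric.extmemory entry has addr_offset == 0. `host_bases`
    maps a table index to the base address chosen by the host runtime.
    """
    patched = []
    for region_id, entry in enumerate(table):
        new_entry = dict(entry)  # leave the mapper output untouched
        if entry["valid"]:
            if entry["addr_offset"] != 0:
                raise ValueError(f"region {region_id}: mapper offset must be 0")
            if region_id not in host_bases:
                raise KeyError(f"region {region_id}: no host base address bound")
            new_entry["addr_offset"] = host_bases[region_id]
        patched.append(new_entry)
    return patched
```

Rejecting a nonzero mapper offset makes a violated ownership rule visible at launch time instead of producing a silently wrong address.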
For software-to-hardware memory mapping, LOOM matches memrefs by element width, not by the original scalar kind.
Examples:
- handshake.extmemory(memref<?xi32>) may map to fabric.extmemory(memref<?xi32>)
- handshake.extmemory(memref<?xf32>) may also map to fabric.extmemory(memref<?xi32>)
- handshake.extmemory(memref<?xi16>) may map to fabric.extmemory(memref<?xi64>)
The rule is:
- software memref element width must be less than or equal to hardware memory interface element width
- exact integer-vs-float element kind is not relevant at mapping time
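The width rule can be sketched as a string-level check. The parser below is deliberately narrow and hypothetical: it accepts only one-dimensional dynamic memrefs with `iN` or `fN` element types, as in the examples above, and is not a general MLIR type parser.

```python
import re

_MEMREF_RE = re.compile(r"memref<\?x(?:i|f)(\d+)>")


def element_width(memref_type):
    """Element bit width of a type such as 'memref<?xi32>' or 'memref<?xf32>'."""
    match = _MEMREF_RE.fullmatch(memref_type)
    if match is None:
        raise ValueError(f"unsupported memref type: {memref_type}")
    return int(match.group(1))


def can_map(software_memref, hardware_memref):
    """Software element width must not exceed the hardware element width.

    Integer versus float kind is ignored, matching the rule above.
    """
    return element_width(software_memref) <= element_width(hardware_memref)
```

A 64-bit software element on a 32-bit hardware interface is the one case in this sketch that is rejected.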
The simulation core and gem5 device should share a backend-neutral memory backing abstraction with equivalent responsibilities to:
- read(regionId, byteOffset, numBytes)
- write(regionId, byteOffset, data, numBytes)
Standalone mode may implement this over local arrays. gem5 mode may implement this over DMA requests into simulated physical memory.
Memory side effects are architecturally observable outputs. A run is not fully validated unless memory regions that are supposed to change can be compared against their expected post-execution contents.
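A minimal sketch of such an abstraction with a local-array backend, plus a helper that compares final region contents against expectations. The class and function names are hypothetical, and a gem5-backed implementation would replace `LocalArrayBacking` while keeping the same two methods.

```python
from abc import ABC, abstractmethod


class MemoryBacking(ABC):
    """Backend-neutral interface shared by standalone and gem5 modes."""

    @abstractmethod
    def read(self, region_id, byte_offset, num_bytes):
        """Return num_bytes bytes from the region."""

    @abstractmethod
    def write(self, region_id, byte_offset, data, num_bytes):
        """Write the first num_bytes bytes of data into the region."""


class LocalArrayBacking(MemoryBacking):
    """Standalone-mode backing: one zero-initialized bytearray per region."""

    def __init__(self, region_sizes):
        self._regions = {rid: bytearray(n) for rid, n in region_sizes.items()}

    def _region(self, region_id, byte_offset, num_bytes):
        region = self._regions[region_id]
        if byte_offset < 0 or num_bytes < 0 or byte_offset + num_bytes > len(region):
            raise IndexError("access outside region bounds")
        return region

    def read(self, region_id, byte_offset, num_bytes):
        region = self._region(region_id, byte_offset, num_bytes)
        return bytes(region[byte_offset:byte_offset + num_bytes])

    def write(self, region_id, byte_offset, data, num_bytes):
        if len(data) < num_bytes:
            raise ValueError("data shorter than num_bytes")
        region = self._region(region_id, byte_offset, num_bytes)
        # Equal-length slice assignment keeps the region size fixed.
        region[byte_offset:byte_offset + num_bytes] = data[:num_bytes]


def mismatched_regions(backing, expected):
    """Return sorted ids of regions whose contents differ from `expected`."""
    return sorted(
        rid for rid, data in expected.items()
        if backing.read(rid, 0, len(data)) != data
    )
```

A validation harness would run the workload against the backing and then treat any non-empty result from `mismatched_regions` as a failed run.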