@@ -109,19 +109,25 @@ For canonical shader-kernel execution (`BSTART.MPAR` / `BSTART.MSEQ`), v0.4 lock
v0.4 defines two predicate domains and keeps them distinct:

* **Block-control predicate** (`BARG.CARG`): produced by `SETC.*` / `C.SETC.*`, consumed by block-control forms (`BSTART COND`,
`BWE`, `BWI`, etc.) at block commit.
`BWE`, `BWI`, etc.) at block commit. In the current v0.4 kernel-body contract, `SETC.*` / `C.SETC.*` are not used inside
canonical kernel bodies.
* **Kernel EXEC mask** (`p`): a 64-bit architectural source/destination used inside `MPAR`/`MSEQ` kernel bodies.
Vector instructions are implicitly predicated by `p`, and in-kernel `B.Z` / `B.NZ` test `p==0` / `p!=0`.
Vector instructions are implicitly predicated by `p`. In-kernel `B.Z` / `B.NZ` are explicit EXEC-mask tests (`p==0` /
`p!=0`), while scalar conditional control flow uses the normal scalar branch forms with explicit operands (`B.EQ` /
`B.NE` / `B.LT` / `B.GE` / ...). Scalar-uniform instructions may read and write `p` directly.
`p` predicates `v.*` execution only; scalar-uniform `l.*` execution is not implicitly masked by `p`.
`p==0` does **not** by itself terminate the kernel; the scalar-uniform lane remains architecturally active.

Kernel mask generation contract:

* `V.CMP.* ->p` is the normative mask-producing rule.
* `V.CMP.* ->p` is the normative vector-lane mask-producing rule, but not the only architectural way to update `p`.
* When `V.CMP.* ->p` executes under an existing EXEC mask, inactive lanes clear their destination bit to `0`.
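
Reviewer sketch of the masked `V.CMP.* ->p` rule quoted above. This is a minimal model, not the normative definition: the function name `vcmp_to_p` and the representation of per-lane compare results as a list of booleans are assumptions made for illustration. It also assumes that an active lane simply receives its compare result, which the hunk implies but does not state.

```python
from typing import Sequence

LANES = 64  # `p` is described as a 64-bit EXEC mask


def vcmp_to_p(p_old: int, lane_results: Sequence[bool]) -> int:
    """Model `V.CMP.* ->p` executing under an existing EXEC mask.

    Bit i of the result is set only if lane i was active in `p_old`
    and its compare result is true. Inactive lanes clear their bit to 0.
    """
    p_new = 0
    for lane in range(min(LANES, len(lane_results))):
        active = (p_old >> lane) & 1
        if active and lane_results[lane]:
            p_new |= 1 << lane
    return p_new


# Lanes 0, 1 and 3 are active; lane 2 is inactive and lane 3 compares false.
new_mask = vcmp_to_p(0b1011, [True, True, True, False])
```

Under this model the mask can only shrink across successive masked compares, which is why the diff adds that `V.CMP.* ->p` is not the only architectural way to update `p`.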

Mixed scalar/vector kernel-body execution rules (for `BSTART.MSEQ` / `BSTART.MPAR`):

* Each group executes a structured kernel body using one scalar-uniform control-flow context plus a lane mask `p`.
* Scalar-uniform body instructions may read/write `GPR`, `t/u`, `p`, and other architecturally defined scalar block-local state.
* Even when `p==0`, scalar-uniform body execution may continue; only `v.*` vector-lane execution is fully masked off.
* The unified `lx64` any-operand rule applies: if any operand names `vt`/`vu`/`vm`/`vn`, the instruction executes as `v.*`;
otherwise it executes as `l.*`.
* Scalar/group-domain inputs used by `v.*` broadcast to all active lanes.
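
Reviewer sketch of the any-operand rule and of the `p` masking scope described in this hunk. The operand spellings (`vt`, `ri0`, `t`) and the helper names `operand_class`, `execution_domain` and `executes` are hypothetical; the real operand syntax is defined elsewhere in the manual and may differ.

```python
import re

# Operand classes that force `v.*` execution under the any-operand rule.
VECTOR_CLASSES = {"vt", "vu", "vm", "vn"}


def operand_class(operand: str) -> str:
    """Return the leading alphabetic prefix of an operand name (assumed syntax)."""
    match = re.match(r"[a-z]+", operand)
    return match.group(0) if match else ""


def execution_domain(operands) -> str:
    """Return "v" if any operand names vt/vu/vm/vn, otherwise "l"."""
    if any(operand_class(op) in VECTOR_CLASSES for op in operands):
        return "v"
    return "l"


def executes(domain: str, p: int, lane: int = 0) -> bool:
    """`p` masks `v.*` lanes only; scalar-uniform `l.*` is never implicitly masked."""
    if domain == "l":
        return True
    return bool((p >> lane) & 1)
```

With `p == 0`, `executes("l", 0)` still returns `True`, which mirrors the statement that the scalar-uniform lane remains architecturally active when every vector lane is masked off.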
58 changes: 35 additions & 23 deletions docs/architecture/isa-manual/src/chapters/04_block_isa.adoc
@@ -73,24 +73,27 @@ actual instruction body is placed elsewhere.
*Body form (required structure):*

* The body begins at `BodyTPC` (from the header's `B.TEXT`).
* For block types other than canonical shader kernels (`BSTART.MPAR` / `BSTART.MSEQ`), the body MUST be a **linear**
snippet unless that block type defines a different contract.
* For block types other than canonical SIMT kernel forms (`BSTART.MPAR` / `BSTART.MSEQ` / `BSTART.VPAR` / `BSTART.VSEQ`),
the body MUST be a **linear** snippet unless that block type defines a different contract.
* Generic linear bodies MUST NOT contain:
** any `BSTART.*` / `C.BSTART.*` / `HL.BSTART.*`,
** any template block (`FENTRY`/`FEXIT`/`FRET.*`/`MCOPY`/`MSET`),
** any branch/jump/call/return instruction, or any instruction that performs an architectural control transfer,
** any `B.*` header descriptor.
* A generic linear body MUST terminate at `BSTOP`/`C.BSTOP`. A body containing only `BSTOP`/`C.BSTOP` is a valid
empty body.
* For `BSTART.MPAR` / `BSTART.MSEQ`, the body is instead a **structured instruction stream**:
** branches, jumps, and other in-body architectural control transfers are legal and execute at group granularity through
the scalar-uniform lane context,
** the kernel ends at the **first terminator marker** encountered in the body stream,
* For `BSTART.MPAR` / `BSTART.MSEQ` / `BSTART.VPAR` / `BSTART.VSEQ`, the body is instead a **structured instruction stream**:
** direct branches and direct jumps are legal and execute at group granularity through the scalar-uniform lane context,
** in-body indirect jump (`JR`) and in-body call/return are reserved and not part of the current bring-up contract,
** `BodyTPC` is the kernel entrypoint; the kernel is not modeled as a separately declared static region,
** valid terminators are `BSTOP`, `C.BSTOP`, `BSTART.*`, `C.BSTART.*`, and `HL.BSTART.*`,
** dynamic kernel execution ends when execution reaches the first such terminator marker,
** `p` is the architectural 64-bit EXEC mask for the body, and `V.CMP.* ->p` is the normative mask-generation rule,
** the unified `lx64` any-operand rule applies: if any operand names `vt`/`vu`/`vm`/`vn`, the instruction executes as
`v.*`; otherwise it executes as `l.*`,
** canonical kernel global-memory access uses bridged forms only: `l.*.brg` and `v.*.brg`.
** `M*` kernels (`MPAR`/`MSEQ`) are memory-capable and may access global memory only through bridged forms (`l.*.brg`
and `v.*.brg`) while also accessing tile-register/local-tile state,
** `V*` kernels (`VPAR`/`VSEQ`) are memory-free tile-only kernels and MUST NOT perform architectural memory access.

*Header→body→return execution:*

@@ -99,8 +102,8 @@ actual instruction body is placed elsewhere.
* On body termination, execution resumes at the header continuation address:
** if the header ended with `BSTOP`/`C.BSTOP`: continuation is the instruction address immediately after that stop,
** if the header ended implicitly: continuation is the address of the next block start marker in the linear stream.
* For `MPAR`/`MSEQ`, reaching the first terminator marker in the body is the kernel completion event that returns to the
header continuation point.
* For `MPAR`/`MSEQ`/`VPAR`/`VSEQ`, reaching the first terminator marker in the body is the kernel completion event that
returns to the header continuation point.

*Safety rule interaction (`B.TEXT`):*

@@ -111,8 +114,8 @@ actual instruction body is placed elsewhere.
Precision: `E_BLOCK(EC_CFI)` is **precise** — no architectural state changes from the offending control-flow instruction are
committed. Targets that are invalid encodings but are not *interior-entry* CFI violations remain `E_INST(EC_ILLEGAL)` unless
otherwise specified.)
* In-body control-flow targets that escape the fetchable kernel/body region MUST fault as `E_BLOCK(EC_BFETCH)` rather
than as an external CFI violation.
* `B.TEXT` supplies the entrypoint for entering body execution; the current bring-up contract does not define an additional
kernel-specific static-region containment rule beyond normal instruction fetch/decode legality.

[[blockisa-forms-template]]
==== Template blocks (standalone blocks)
@@ -186,7 +189,7 @@ Additional bring-up fields used by tooling and trap recovery include:
The bring-up profile constrains certain combinations (examples):

* `.SYS` blocks fall through only (`FALL`), optionally with a fixup label.
* `.MPAR` / `.MSEQ` are the canonical v0.4 shader-kernel launch forms and select memory-parallel versus
* `.MPAR` / `.MSEQ` are the canonical v0.4 memory-capable kernel launch forms and select memory-parallel versus
memory-sequential execution modes.
* `.FP` blocks mirror `.STD` transitions but mark a floating-point execution context.

@@ -505,8 +508,8 @@ an out-of-line SIMT body selected by `B.TEXT`.

Bring-up execution model (normative for v0.4):

* `LB0..LB2` are written by `B.DIM` / `C.B.DIM*` in the block header.
* Hardware exposes lane counters `lc0..lc2` to the block body; on block entry they are initialized to `0`.
* `LB0..LB2` are written by `B.DIM` / `C.B.DIM*` in the block header; in canonical v0.4 bring-up they are 16-bit architectural loop/lane boundary registers.
* Hardware exposes lane counters `lc0..lc2` to the block body; in canonical v0.4 bring-up they are 16-bit architectural loop/lane counters and are initialized to `0` on block entry.
* The body is executed for each lane tuple `(lc0, lc1, lc2)` with:
** `lc0` iterating fastest: `0 .. (LB0-1)`
** `lc1`: `0 .. (max(LB1,1)-1)`
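
Reviewer sketch of the lane-tuple replay order for the two dimensions visible in this hunk. Only `lc0` and `lc1` are modelled, because the range for `lc2` is outside the quoted context; the function name `lane_tuples` and the 16-bit range check are illustrative assumptions.

```python
def lane_tuples(lb0: int, lb1: int):
    """Yield (lc0, lc1) in replay order, with lc0 iterating fastest.

    lc0 covers 0 .. LB0-1 and lc1 covers 0 .. max(LB1, 1)-1, as quoted above.
    """
    for value in (lb0, lb1):
        # LB registers are described as 16-bit in canonical v0.4 bring-up.
        if not 0 <= value < (1 << 16):
            raise ValueError("LB values must fit in 16 bits")
    for lc1 in range(max(lb1, 1)):
        for lc0 in range(lb0):
            yield (lc0, lc1)


order = list(lane_tuples(2, 2))
```

Note that `LB1 = 0` still produces one pass over `lc0` because of the `max(LB1, 1)` clamp, while `LB0 = 0` yields no tuples under the range as written.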
@@ -525,11 +528,16 @@ Decoupled body form (v0.4 canonical constraint):
* The header MUST contain exactly one `B.TEXT <tpc>` selecting an out-of-line body.
* The body MUST terminate at a block terminator marker. Valid terminators are `BSTOP` / `C.BSTOP` and `BSTART.*` / `C.BSTART.*`.
- First terminator wins (anything after is unreachable).
- Any `BSTART` encountered inside the body stream acts as both the terminator trigger and the next architectural block start
after the kernel returns to the header continuation point.
- Any `BSTART` encountered during body execution acts only as an implicit terminator for the current kernel; it is not
executed as a nested block start within the body.
- Explicit direct branch/jump to `BSTOP` / `C.BSTOP` is legal and has normal kernel-termination semantics.
- Explicit direct branch/jump to such a `BSTART.*` / `C.BSTART.*` target is legal and has the same early-exit effect.
- If execution falls through or fetches past valid body code without reaching a terminator marker, the body is malformed and MUST fault as `E_BLOCK(EC_BFETCH)` rather than terminating implicitly.
* The body MUST NOT contain any `B.*` header descriptors.
* For canonical `MPAR` / `MSEQ` shader kernels, the body is a structured instruction stream; in-body branches and jumps are legal,
and the scalar-uniform lane uses `p` as the architectural EXEC mask.
* For canonical `MPAR` / `MSEQ` / `VPAR` / `VSEQ` SIMT kernels, the body is a structured instruction stream; in-body direct
branches and direct jumps are legal, while in-body indirect jump (`JR`) and call/return remain reserved in the current
bring-up contract. Self-loops and other non-terminating loops are architecturally legal; the ISA does not guarantee that
body execution eventually reaches a terminator. The scalar-uniform lane uses `p` as the architectural EXEC mask.
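
Reviewer sketch of the termination rules in this hunk: the first terminator reached ends the kernel, a direct jump to a terminator is a legal early exit, fetching past the body faults, and a loop that never reaches a terminator is legal. The body encoding as `(mnemonic, target)` pairs, the `J` and `l.add` mnemonics, and the `max_steps` cutoff are all hypothetical modelling choices. `HL.BSTART.*` is treated as a terminator here because the body-form hunk lists it.

```python
class BodyFetchFault(Exception):
    """Stands in for E_BLOCK(EC_BFETCH)."""


def is_terminator(mnemonic: str) -> bool:
    """BSTOP / C.BSTOP and any BSTART form end the current kernel."""
    if mnemonic in ("BSTOP", "C.BSTOP"):
        return True
    return mnemonic.startswith(("BSTART", "C.BSTART", "HL.BSTART"))


def run_body(body, max_steps=1000):
    """Return the index of the terminator that ends the kernel.

    Returns None if no terminator is reached within `max_steps`
    (non-terminating loops are legal). Raises BodyFetchFault if
    fetch leaves the body without reaching a terminator.
    """
    pc = 0
    for _ in range(max_steps):
        if not 0 <= pc < len(body):
            raise BodyFetchFault("E_BLOCK(EC_BFETCH)")
        mnemonic, target = body[pc]
        if is_terminator(mnemonic):
            return pc  # a BSTART here is only a terminator, not a nested block
        pc = target if mnemonic == "J" else pc + 1
    return None
```

For example, `run_body([("l.add", None), ("J", 3), ("BSTART.STD", None), ("BSTOP", None)])` returns `3`: the jump skips the `BSTART.STD` marker, so the explicit `BSTOP` is the first terminator actually reached.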

Example (2-D nested loop using `MSEQ`):

@@ -577,7 +585,7 @@ Execution/ordering rules:
[[blockisa-v04-bridge]]
=== Kernel memory-bridge naming (canonical profile)

For canonical shader kernels, bridge operands and mnemonic families are:
For canonical memory-capable `M*` kernels, bridge operands and mnemonic families are:

* `load.local` / `store.local`: tile/local direction accesses.
* `load.brg` / `store.brg`: bridged global-memory accesses.
@@ -591,7 +599,7 @@ Encoding/mnemonic mapping (normative):

* `load.local` / `store.local` correspond to tile-local access forms.
* `load.brg` / `store.brg` correspond to bridged global-memory access forms.
* In the unified `lx64` kernel space, both scalar-uniform `l.*` forms and per-lane `v.*` forms may use `.brg`.
* In the unified `lx64` kernel space, memory-space selection remains orthogonal to `l.*` versus `v.*`: both scalar-uniform `l.*` forms and per-lane `v.*` forms may use either `.brg` (global) or the corresponding non-`.brg` local form, subject to the selected form's operand rules.

Bring-up operand constraints (strict profile):

@@ -600,7 +608,11 @@ Bring-up operand constraints (strict profile):
* `.local` accesses MUST remain within the tracked byte-size of the referenced bound tile; out-of-range tile-local access is illegal.
* `.local` stores MUST target output/scratch bases only (`TO` or `TS`). `TA/TB/TC/TD` are read-only input bases in
canonical v0.4.
* `VSEQ`/`VPAR` remain tile-only vector execution forms; canonical shader-kernel global-memory issue uses `MPAR`/`MSEQ`.
* `VSEQ`/`VPAR` remain tile-only, memory-free vector execution forms.
* `MPAR`/`MSEQ` are the memory-capable kernel forms and are the only SIMT families that may issue canonical bridged
global-memory accesses.
* Because in-body call/return is currently reserved, all currently supported in-body control flow is branch/jump-based;
kernel execution returns to the header continuation point only when a terminator marker is reached.

Bring-up tile-base binding for `.local` ops (normative for v0.4):

@@ -617,8 +629,8 @@ Vector block bodies may mix scalar-uniform and vector instructions; scalar-unifo

Predicate and replay semantics for mixed bodies:

* `SETC.*` in the body updates the scalar block-control predicate domain (`BARG.CARG`); it does not directly mask vector lanes.
* `p` is the architectural EXEC mask for canonical `MPAR` / `MSEQ` kernels, and `V.CMP.* ->p` is the normative mask producer.
* `SETC.*` / `C.SETC.*` are not supported inside canonical kernel bodies; body control flow instead uses explicit scalar branch forms and explicit `p` tests.
* `p` is the architectural EXEC mask for canonical `MPAR` / `MSEQ` / `VPAR` / `VSEQ` kernels; scalar-uniform body instructions may read/write `p` directly, and `V.CMP.* ->p` is the normative vector-lane mask producer.
* The body is replayed once per `LC` tuple. Scalar instructions execute on every replay iteration in program order together
with vector instructions.
* Vector instructions MUST import scalar inputs through `B.IOR`/`ri*` (except `zero` where encoding permits). Direct scalar
@@ -48,6 +48,7 @@ Vector body termination:

* The out-of-line SIMT body stream terminates on the first terminator marker encountered.
* Valid terminators are `BSTOP`/`C.BSTOP` and `BSTART.*`/`C.BSTART.*`.
* If a `BSTART.*` is reached during body execution, it acts only as an implicit terminator for the current kernel; it is not executed as a nested block start.

If a mandatory descriptor is missing, hardware MUST trap with `E_BLOCK(EC_BLOCKFMT)` and report the missing descriptor family in `TRAPARG0` (see System/Privilege trap reporting).

@@ -156,15 +156,24 @@ Canonical shader-kernel behavior is expressed through `.MPAR` / `.MSEQ` executio
* Prior side-effecting tile stores must be globally ordered before kernel-side memory issue starts (TSO-preserving barrier
requirement).
* Canonical kernel global-memory accesses use bridged forms only: `l.*.brg` and `v.*.brg`.
* Memory space selection is orthogonal to the `l.*` versus `v.*` execution-domain split: `.brg` selects global memory, and the corresponding non-`.brg` form selects local memory.
* Accordingly, in memory-capable `M*` kernels, both scalar-uniform `l.*` forms and per-lane `v.*` forms may use either the bridged global-memory space (`*.brg`) or the corresponding local-memory space (non-`.brg`), subject to the operand/addressing rules of the selected form.
* Archived VREG-only raw local-register fragments are not part of canonical v0.4 behavior.

Bridged address-formation contract (canonical v0.4):

* `l.*.brg` and `v.*.brg` use the same architectural addressing grammar.
* The effective address is formed only from the explicitly encoded base, index, immediate, and shift terms of the
selected instruction form.
* There is no implicit extra lane-index term added merely because the operation is `v.*.brg`.
* For `v.*` memory forms, lane-counter terms such as `lc0/lc1/lc2` are part of the vector SIMT source-side addressing
grammar itself, exactly as written by the selected instruction form.
* `.brg` does not inject any additional hidden lane term beyond that explicit `v.*` grammar; it only selects the
global-memory space instead of the corresponding local-memory space.
* `ri*` remains the required base namespace for `.brg` kernel memory ops.
* For vector memory forms, the effective address is computed independently for each active lane (or packed lane-pair for
64-bit vector data forms) using that lane's explicit address terms.
* Hardware may coalesce those per-lane addresses as an implementation optimization, but coalescing does not change the
architectural per-lane memory semantics.
* When a `v.*.brg` form names lane-varying operands, those operands are evaluated per active lane under the current
EXEC mask; when an `l.*.brg` form names only scalar/group-domain operands, the address is evaluated once per group.
* Scalar/group-domain source operands used by `v.*.brg` continue to broadcast under the unified `lx64` kernel rule.
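
Reviewer sketch of per-lane bridged address formation. The hunk says the address comes only from the explicitly encoded base, index, immediate and shift terms, but it does not give the combining formula, so `base + (index << shift) + imm` is an assumption here, as are the function names. The point of the sketch is that no hidden lane term is added and that only lanes active in `p` produce an address.

```python
def effective_address(base: int, index: int, imm: int, shift: int) -> int:
    """Combine only the explicit terms (assumed formula, not from the manual)."""
    return base + (index << shift) + imm


def vector_addresses(p: int, base: int, index_per_lane, imm: int, shift: int):
    """Return {lane: address} for lanes active in the EXEC mask `p`.

    `base` is scalar/group-domain and is broadcast to every active lane.
    `index_per_lane` holds the lane-varying explicit index operand.
    """
    return {
        lane: effective_address(base, index, imm, shift)
        for lane, index in enumerate(index_per_lane)
        if (p >> lane) & 1
    }


# Lanes 0 and 2 are active; lane 1 is masked off and issues no access.
addrs = vector_addresses(0b101, 0x1000, [0, 1, 2], imm=4, shift=2)
```

An `l.*.brg` access corresponds to a single call to `effective_address`, evaluated once per group. Any coalescing of the per-lane results would be an implementation detail layered on top of this mapping.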
@@ -26,12 +26,12 @@ SSR access instructions:
|`0x0023` |`VERSION` |RO |Core version.
|`0x0024` |`LCFR` |RO |Core feature description (includes LaneNum).
|`0x0025` |`LCFR_EN` |RW |Core feature enable.
|`0x0050` |`LB0` |RW |Loop/lane boundary register 0 (bring-up).
|`0x0051` |`LB1` |RW |Loop/lane boundary register 1 (bring-up).
|`0x0052` |`LB2` |RW |Loop/lane boundary register 2 (bring-up).
|`0x0053` |`LC0` |RW |Loop/lane counter register 0 (bring-up).
|`0x0054` |`LC1` |RW |Loop/lane counter register 1 (bring-up).
|`0x0055` |`LC2` |RW |Loop/lane counter register 2 (bring-up).
|`0x0050` |`LB0` |RW |16-bit loop/lane boundary register 0 (bring-up).
|`0x0051` |`LB1` |RW |16-bit loop/lane boundary register 1 (bring-up).
|`0x0052` |`LB2` |RW |16-bit loop/lane boundary register 2 (bring-up).
|`0x0053` |`LC0` |RW |16-bit loop/lane counter register 0 (bring-up).
|`0x0054` |`LC1` |RW |16-bit loop/lane counter register 1 (bring-up).
|`0x0055` |`LC2` |RW |16-bit loop/lane counter register 2 (bring-up).
|`0x0C00` |`CYCLE` |RO |Cycle counter (bring-up may model as instruction count).
|===
