This repository represents the TTSKY26a silicon submission — a GF16 dot4 mesh kernel that was submitted before the 2026-05-11 deadline. Tape-out scheduled 2026-10-28.
The full Trinity SUPER-CROWN SoC (18 modules: BLAKE3, BPB, ALU-9, RING27, BitNet, VSA, Lucas POST) has been forked into a new dedicated repo for the TTSKY26b submission as part of the Trinity TRI-NET sacred-constant triplet:
- 🔥 e-engine → tt-trinity-euler — 8×2 SUPER-CROWN SoC @ TTSKY26b
- φ-anchor → tt-trinity-phi — 1×1 Lucas POST @ TTSKY26b (was tt-trinity-nano)
- γ-surface → tt-trinity-gamma — 8×4 mesh @ TTSKY26b (was tt-trinity-max-true)
This repo continues to track the TTSKY26a silicon. New SUPER-CROWN development happens in tt-trinity-euler.
Bare-RTL processorless prototype of a GF16 dot4 mesh computer. There is no Linux, no soft-CPU, no AXI. The host (board pins / future UART / USB-JTAG) talks to the mesh through a small packet protocol; an on-die FSM walks the protocol so that nothing on-chip needs an instruction stream.
This is v0 of the Trinity Silicon roadmap (R-SI-* compliance, see info.yaml). It is
NOT a decentralised internet mesh — it is the on-chip packet fabric foundation that
future radio / Ethernet / mesh adapters will plug into.
┌────────────────────────────────────────────────────┐
│ tt_um_ghtag_trinity_gf16 (TT top) │
│ │
│ ┌──────────────┐ 32-bit pkt ┌──────────────┐ │
│ │ master_fsm │─────────────▶│ │ │
│ │ (no CPU) │ │ router_2x2 │ │
│ │ canned LOAD/ │◀─────────────│ (v0 xbar) │ │
│ │ COMPUTE/READ │ └─────┬────────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ ┌───────────┬────────────┬────┴────┐ │
│ ▼ ▼ ▼ ▼ │
│ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │tile 0 │ │tile 1 │ │tile 2 │ │tile 3 │ │
│ │gf16_ │ │gf16_ │ │gf16_ │ │gf16_ │ │
│ │dot4 │ │dot4 │ │dot4 │ │dot4 │ │
│ └───────┘ └───────┘ └───────┘ └───────┘ │
│ uo_out / uio_out ◀── final_result │
└────────────────────────────────────────────────────┘
- gf16_mul.v, gf16_add.v, gf16_dot4.v: existing combinational GF16 demo
- gf16_dot8.v: L-S20 8-lane dot product: two gf16_dot4 units in parallel + one gf16_add accumulator; delivers 2× TOPS/tile at ~2× MAC area with no impact on the canonical dot4 primitive
- trinity_packet.vh: 32-bit packet format constants (op, dst, src, lane, payload)
- trinity_gf16_tile.v: wraps gf16_dot4 (or gf16_dot8 when DOT_WIDTH=8) as a packet-addressable tile (LOAD_A / LOAD_B / LOAD_JOB / LOAD_NONCE / COMPUTE / READ_RES → RESULT + paired RECEIPT). On-die receipt emission is the G4 silicon-anchored DePIN attestation. Lanes 4–7 are available for DOT_WIDTH=8.
- trinity_router_2x2.v: single-hop crossbar with 4 tile ports + host port (round-robin return). Honest name: minimal mesh fabric v0, not a full XY-routed mesh yet
- trinity_mesh_2x2.v: 4 tiles + 1 router wired as the fabric
- trinity_master_fsm.v: CPU-less host FSM, canned [1,2,3,4]·[1,2,3,4] boot sequence
- tt_um_ghtag_trinity_gf16.v: TT top, preserves the legacy combinational output AND exposes the mesh result on the same pins after boot
standard packet (op != RECEIPT):
[31:28] op 4'h1 LOAD_A | 4'h2 LOAD_B | 4'h3 COMPUTE | 4'h4 RESULT |
4'h5 READ_RES | 4'h7 LOAD_JOB | 4'h8 LOAD_NONCE
[27:26] dst flat tile id 0..3
[25:24] src flat tile id of sender (host uses 0)
[23:20] lane 0..3 for operand lanes
[19:16] rsv
[15:0] payload GF16 operand or result (LOAD_JOB/NONCE take low 8 bits)
receipt packet (op == 4'h6 TRN_OP_RECEIPT, emitted on-die after every RESULT):
[31:28] op 4'h6 RECEIPT
[27:26] dst host id (always 0 in v0)
[25:24] tile_id the producing tile (signed-by silicon attribution)
[23:20] op_code echoes the settled op (4'h3 COMPUTE for v0)
[19:16] rsv
[15:8] checksum (job_id_q ^ result_q[7:0]) & 0xFF -- pure XOR-fold
[7:0] job_lo persisted job_id_q (low 8 bits)
The checksum field matches
tools/receipt_verifier/tri_receipt_verifier.compute_checksum(job_id, observed)
byte-for-byte — silicon ↔ host contract closed by
tools/receipt_verifier/test_g4_verifier.py::T8 chip_emitted_packet.
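The standard packet layout above can be exercised from host software. What follows is a minimal Python sketch that assumes only the field positions listed in the format description; the helper names pack_packet and unpack_packet are hypothetical and do not come from this repository.

```python
# Opcode values copied from the packet format description; names are illustrative.
OPS = {
    "LOAD_A": 0x1, "LOAD_B": 0x2, "COMPUTE": 0x3, "RESULT": 0x4,
    "READ_RES": 0x5, "RECEIPT": 0x6, "LOAD_JOB": 0x7, "LOAD_NONCE": 0x8,
}

def pack_packet(op, dst, src, lane, payload):
    """Assemble a 32-bit standard packet; reserved bits [19:16] stay zero."""
    if not (0 <= op < 16 and 0 <= dst < 4 and 0 <= src < 4
            and 0 <= lane < 16 and 0 <= payload < (1 << 16)):
        raise ValueError("field out of range")
    return (op << 28) | (dst << 26) | (src << 24) | (lane << 20) | payload

def unpack_packet(word):
    """Split a 32-bit word back into its named fields."""
    return {
        "op": (word >> 28) & 0xF,
        "dst": (word >> 26) & 0x3,
        "src": (word >> 24) & 0x3,
        "lane": (word >> 20) & 0xF,
        "payload": word & 0xFFFF,
    }

# LOAD_A to tile 2, sent by the host (src 0), lane 3, arbitrary 16-bit payload.
pkt = pack_packet(OPS["LOAD_A"], dst=2, src=0, lane=3, payload=0x1234)
print(hex(pkt))  # 0x18301234
```

The payload value 0x1234 is an arbitrary bit pattern chosen for illustration, not a meaningful GF16 operand.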
The pinout is unchanged from the previous submission (see info.yaml). ui_in[0] doubles as
load_mode (reserved for future host operand override).
- Legacy: dot4([1,2,3,4], [1,2,3,4]) = 30.0 = 0x47C0, visible immediately on {uio_out, uo_out}.
- Mesh: the same value reached via the packet protocol after ~20 cycles. tb.v covers both paths.
- L-S20 dot8: sim/tb_gf16_dot8.v runs 16 diverse vectors + canonical 0x47C0 re-check; all 17 tests PASS.
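To see why 0x47C0 reads as 30.0, the sketch below decodes a 16-bit word under an assumed layout of 1 sign bit, 6 exponent bits (bias 31) and 9 mantissa bits with an implicit leading one. This README does not state the GF16 bit layout; the assumption is chosen because it reproduces 0x47C0 = 30.0 and fits the 10×10 mantissa multiply mentioned under R-SI-1. The function name gf16_decode is hypothetical, and subnormal or special encodings are not modelled.

```python
def gf16_decode(word):
    """Decode a 16-bit word under the ASSUMED 1/6/9 layout with exponent bias 31."""
    sign = (word >> 15) & 0x1
    exp = (word >> 9) & 0x3F
    mant = word & 0x1FF
    if exp == 0 and mant == 0:
        return 0.0  # bits [14:0] all zero: treated as zero whatever the sign bit
    value = (1 + mant / 512) * 2.0 ** (exp - 31)
    return -value if sign else value

a = b = [1, 2, 3, 4]
expected = sum(x * y for x, y in zip(a, b))  # 1 + 4 + 9 + 16
print(expected, gf16_decode(0x47C0))  # 30 30.0
```

Under this assumption 0x47C0 splits into exponent 35 and mantissa 448, giving 1.875 × 2^4 = 30.0.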
cd test
make # cocotb + iverilog
# Standalone dot8 sim (iverilog):
iverilog -o /tmp/sim_dot8 src/gf16_mul.v src/gf16_add.v src/gf16_dot4.v \
    src/gf16_dot8.v sim/tb_gf16_dot8.v && /tmp/sim_dot8

L-S20 dot8 expansion (EPIC gHashTag/trinity-fpga#51)
| Metric | dot4 (before) | dot8 (L-S20) |
|---|---|---|
| MAC lanes per tile | 4 | 8 |
| TOPS/tile (relative) | N | 2N |
| Extra modules | — | gf16_dot8.v |
| Area (MAC only) | 1× | ~2× MAC area |
| Canonical 0x47C0 | PASS | PASS (unchanged) |
| Build-time opt-in | (n/a) | DOT_WIDTH=8 |
| Backwards compat | yes | DOT_WIDTH=4 reverts |
gf16_dot8 = gf16_dot4(a[0..3], b[0..3]) + gf16_dot4(a[4..7], b[4..7]), accumulated
through a single gf16_add. The dot4 primitive is not modified — the 0x47C0 canonical
test vector is preserved bit-exact. trinity_gf16_tile gains a DOT_WIDTH parameter
(default 4) that selects the MAC unit at synthesis time, giving tape-out flexibility.
Lanes 4–7 are addressed via the existing LOAD_A/LOAD_B packet ops using lane[2:0]
(values 4–7), fully backwards-compatible with existing lane[1:0] traffic.
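The composition described above can be modelled in a few lines. This is a behavioural sketch on exact Python integers, so it ignores GF16 rounding in the real gf16_add and gf16_mul; the Python function names mirror the module names only for readability.

```python
def dot4(a, b):
    """Four-lane dot product, the unchanged primitive."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("dot4 takes exactly 4 lanes per operand")
    return sum(x * y for x, y in zip(a, b))

def dot8(a, b):
    """Two dot4 halves (lanes 0-3 and lanes 4-7) joined by one final add."""
    if len(a) != 8 or len(b) != 8:
        raise ValueError("dot8 takes exactly 8 lanes per operand")
    return dot4(a[:4], b[:4]) + dot4(a[4:], b[4:])

canonical = [1, 2, 3, 4]
# With the upper lanes loaded with zero, dot8 reproduces the dot4 result.
print(dot8(canonical + [0] * 4, canonical + [0] * 4))  # 30
print(dot8([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))          # 36
```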
- R-SI-1 Zero NEW multipliers: no * introduced in new RTL. gf16_mul.v:30 keeps its pre-existing 10×10 mantissa multiply (legacy, deliberately not touched in v0).
- R-SI-2 Ternary/GF16 path preserved; the tile interface is operand-agnostic so a ternary matmul tile can drop in later by swapping gf16_dot4 inside trinity_gf16_tile.v.
- R-SI-4 50 MHz clock, no PLL, synchronous design with async-low reset (negedge rst_n).
- Apache-2.0 only.
This v0 keeps the existing TT pinout and synthesises stand-alone. To target a board now:
- Wrap tt_um_ghtag_trinity_gf16 in a board-specific top (clock, reset, LEDs), e.g. QMTECH XC7A100T via openXC7: clk ← 50 MHz on-board oscillator, rst_n ← active-low button, expose uo_out/uio_out on LEDs/PMOD, ui_in on DIP switches.
- Host I/O later: replace the canned master FSM with a UART / USB-UART RX→packet parser (RX byte stream → 32-bit packet) and TX driver (RESULT packet → bytes). FSM module stays; only the operand source changes.
- Future Trinity CPU integration: replace trinity_master_fsm.v with the Trinity CPU's instruction-fetch unit and let it issue the same 32-bit packets directly.
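The byte stream ↔ packet conversion mentioned for the host I/O step could look like the sketch below on the host side. The framing is an assumption: 4 bytes per packet, most significant byte first, no start or stop marker. The README does not specify the byte order, and both function names are hypothetical.

```python
def bytes_to_packets(stream):
    """Group a byte stream into 32-bit packets: 4 bytes each, MSB first (assumed)."""
    whole = len(stream) - len(stream) % 4  # ignore a trailing partial packet
    return [int.from_bytes(stream[i:i + 4], "big") for i in range(0, whole, 4)]

def packet_to_bytes(word):
    """Serialise one 32-bit packet for the TX direction, MSB first (assumed)."""
    return word.to_bytes(4, "big")

# Two complete packets followed by one stray byte that is not yet a full packet.
rx = bytes([0x18, 0x30, 0x12, 0x34, 0x30, 0x00, 0x00, 0x00, 0xAA])
print([hex(p) for p in bytes_to_packets(rx)])      # ['0x18301234', '0x30000000']
print(packet_to_bytes(0x40000000 | 0x47C0).hex())  # 400047c0
```

A real parser would also need resynchronisation after a dropped byte, which this sketch leaves out.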
The end-goal product is a small FPGA-centric node that runs ternary / GF16 compute jobs, emits deterministic receipts, and is paid in TRI tokens by peers in a mesh — a Helium-style DePIN, but for compute. This repo is step 0 of that path: only the on-die packet fabric is real. USB-3 host I/O, external radios, multi-hop mesh routing, and TRI settlement are not implemented yet. The boundary contracts are documented in docs/TRINITY_DEPIN_NODE.md and exposed by two synthesizable boundary stubs:
- src/trinity_usb3_fifo_bridge.v: FT60x (FT600/FT601) synchronous-FIFO shim to the Trinity 32-bit packet handshake. Skeleton-only: real FT601 timing, byte-enables, and CDC are marked TODO. Not wired into the TT top (TinyTapeout has no FT60x pins).
- src/trinity_mesh_adapter_stub.v: pass-through boundary to an external radio / backhaul module (LoRa / ESP32 / etc.). No LoRa/Wi-Fi PHY in fabric. Not wired into the TT top.
G4 silicon-anchored receipts (new): every tile now emits a paired
TRN_OP_RECEIPT = 4'h6 packet immediately after its RESULT handshake,
carrying (tile_id, op_code, checksum, job_id_lo). The checksum is the same
(job_id ^ result_lo) & 0xFF XOR-fold that
tri_receipt_verifier.compute_checksum()
uses on the host, so a host verifier can attribute work to this node
byte-for-byte. R-SI-1 is preserved — zero new multipliers; the checksum
is pure XOR. TRI token settlement itself remains off-chip per
docs/TRINITY_DEPIN_NODE.md §5–§6.
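As an illustration of the receipt check, here is a minimal host-side sketch built from the receipt field layout and the XOR-fold formula given in this README. It is not the repository's tri_receipt_verifier; the function names are hypothetical, and the job id 0x01 is an assumed value (it happens to reproduce the checksum=0xc1 shown in the test log for a 0x47C0 result).

```python
TRN_OP_RECEIPT = 0x6
TRN_OP_COMPUTE = 0x3

def xor_fold_checksum(job_id, result):
    """(job_id ^ result[7:0]) & 0xFF, the XOR-fold described in the text."""
    return (job_id ^ (result & 0xFF)) & 0xFF

def build_receipt(tile_id, op_code, job_id, result):
    """Model of the receipt word; dst is the host id, always 0 in v0."""
    return ((TRN_OP_RECEIPT << 28) | (0 << 26) | ((tile_id & 0x3) << 24)
            | ((op_code & 0xF) << 20)
            | (xor_fold_checksum(job_id, result) << 8)
            | (job_id & 0xFF))

def verify_receipt(word, observed_result):
    """True when the op is RECEIPT and the checksum matches the observed result."""
    if (word >> 28) & 0xF != TRN_OP_RECEIPT:
        return False
    job_lo = word & 0xFF
    checksum = (word >> 8) & 0xFF
    return checksum == xor_fold_checksum(job_lo, observed_result)

receipt = build_receipt(tile_id=0, op_code=TRN_OP_COMPUTE, job_id=0x01, result=0x47C0)
print(hex(receipt))                     # 0x6030c101
print(verify_receipt(receipt, 0x47C0))  # True
print(verify_receipt(receipt, 0x47C1))  # False
```

Because the fold covers only the low result byte, a change confined to result bits [15:8] would not be detected by this checksum.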
| Gate | Deliverable | Status |
|---|---|---|
| G0 | On-die 32-bit packet fabric + 4 GF16 tiles + CPU-less FSM | done (PR #2) |
| G1 | USB-3 FIFO loopback on dev FPGA + FT601 breakout | GREEN in sim |
| G2 | UART/USB packet parser (byte stream ↔ 32-bit Trinity packet) | GREEN in sim |
| G3 | 2× node mesh demo over the external radio adapter | spec frozen |
| G4 | TRI receipt verifier (host SW + on-die receipt emission) | done — silicon-anchored |
| G5 | Custom Trinity DePIN carrier board (FPGA + FT601 + radio) | spec frozen |
| L-S20 | dot8 expansion: 2× dot4 lanes → 2× TOPS/tile (gf16_dot8.v, DOT_WIDTH=8) | merged (partial EPIC #51) |
| L-S21 | φ-prior skip-zero sparsity gating → 2× effective TOPS | merged (partial EPIC #51) |
Until G3 is demonstrated on real hardware, this project will NOT claim a full external mesh implementation, and the term "ternary internet" stays a design-doc concept, not a product claim.
The φ-prior weight distribution (arising naturally from Lucas/Fibonacci initialisation) produces ~60% zero weights in ternary and low-bit-width GF16 models (ANCHOR: φ² + φ⁻² = 3 · DOI 10.5281/zenodo.19227877). If ~60% of MAC lanes can be skipped per dot4, the compute budget halves — doubling effective TOPS with no functional change.
src/gf16_dot4_sparse.v — a drop-in wrapper around gf16_dot4 that adds:
- Per-lane zero detection on the b (weight) operand busses: wire bN_zero = (bN[14:0] == 15'd0); This detects GF16 zero (exp=0, mant=0, regardless of sign bit).
- lane_active[3:0] mask: assign lane_active[i] = !sparsity_enable || !bN_zero; When sparsity_enable=0, all lanes are always active → bit-identical to original.
- Operand clock-gate / bypass: when lane_active[i]=0, the input to the multiplier is forced to 16'h0000. The combinational GF16 multiplier already returns zero for zero inputs, so this eliminates spurious input toggling and prevents dynamic power consumption on the multiply tree for that lane.
- sparsity_enable config bit (1 bit, default 0):
  - 0 → bit-identical to gf16_dot4 (constitutional safety / golden compare)
  - 1 → skip-zero gating ON
sparsity_enable=0 passes every bit through gf16_dot4 unchanged —
confirmed by T1 (canonical GF16 dot4 0x47C0 PASS) and T1b (dense == sparse check).
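The zero-detect and lane_active rules can be checked with a small behavioural model. This sketch follows the two expressions quoted above; zeroing both operands of an inactive lane is an assumption, since the text only says the multiplier input is forced to 16'h0000. Function names and the operand words are illustrative.

```python
def lane_active_mask(b_words, sparsity_enable):
    """Bit i is set when lane i must run: gating off, or b[i] is not GF16 zero."""
    mask = 0
    for i, b in enumerate(b_words):
        b_zero = (b & 0x7FFF) == 0  # compare bits [14:0] only, ignoring the sign
        if (not sparsity_enable) or (not b_zero):
            mask |= 1 << i
    return mask

def gate_operands(a_words, b_words, mask):
    """Force the operands of inactive lanes to 0x0000 (both sides, by assumption)."""
    def keep(word, lane):
        return word if (mask >> lane) & 1 else 0x0000
    return ([keep(a, i) for i, a in enumerate(a_words)],
            [keep(b, i) for i, b in enumerate(b_words)])

# Lanes 1 and 2 carry +0 and -0 weights; lanes 0 and 3 carry nonzero words.
b = [0x3E00, 0x0000, 0x8000, 0x4000]
print(format(lane_active_mask(b, sparsity_enable=0), "04b"))  # 1111
print(format(lane_active_mask(b, sparsity_enable=1), "04b"))  # 1001
```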
Replace any gf16_dot4 instance with gf16_dot4_sparse and tie sparsity_enable
to a config register bit:
gf16_dot4_sparse u_dot (
.sparsity_enable(cfg_sparsity_en),
.a0(a0), .a1(a1), .a2(a2), .a3(a3),
.b0(b0), .b1(b1), .b2(b2), .b3(b3),
.result(dot_out),
.lane_active(dbg_lane_active)
);

Run the dedicated testbench:
cd src
iverilog -o /tmp/sim_sparsity tb_sparsity_gate.v gf16_dot4_sparse.v gf16_dot4.v gf16_mul.v gf16_add.v
/tmp/sim_sparsity

Expected output:
PASS T1: canonical 0x47C0, lane_active=1111 (sparsity OFF)
PASS T1b: dense==sparse with sparsity_enable=0
PASS T2: canonical with sparsity ON, result=0x47C0
PASS T2b-T2f: sparse output matches dense on 5 mixed-sparsity vectors
PASS T3: active fraction 0.350 in [0.35, 0.45]
ALL PASS (9/9)
With 60% zero weights and sparsity_enable=1:
- ~40% of lanes toggle their multiply tree per dot4 evaluation
- Dynamic power on the MAC array: roughly −40% vs dense
- End-to-end throughput: same wall-clock cycles, but only 40% of MAC operations are real → 2× effective TOPS at iso-power
- Static / leakage power: unchanged
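The relation between the zero-weight fraction and the work that actually executes is simple arithmetic. The sketch below computes an idealised upper bound that ignores gating overhead and any correlation between lanes; it is a back-of-envelope aid, not a measurement from this design. The function name is hypothetical.

```python
def sparsity_budget(zero_percent):
    """Return (active lane percent, ideal reduction factor in executed MACs)."""
    active_percent = 100 - zero_percent
    return active_percent, 100 / active_percent

print(sparsity_budget(60))  # (40, 2.5)
print(sparsity_budget(50))  # (50, 2.0)
```

Under these idealised assumptions the 2× figure quoted above sits below the bound implied by a 60% zero fraction, so it can be read as the more cautious number.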
Apache-2.0
EPIC: gHashTag/trinity-fpga#51 ANCHOR: φ²+φ⁻²=3 · DOI 10.5281/zenodo.19227877 · Apache-2.0
PR feat/L-S19-pipeline-popcount introduces 3-stage pipelining into the
XOR-popcount inner-product path used by vsa_matmul_8x8 and vsa_matmul_16x16.
New modules:
- src/gf16_popcount.v: 3-stage pipelined inner product for 8-element ternary vectors (LATENCY=3)
- src/gf16_popcount16.v: same for 16-element vectors (used by vsa_matmul_16x16)
Updated modules:
- src/vsa_matmul_8x8.v: replaces inner_product() function with 64 parallel gf16_popcount instances
- src/vsa_matmul_16x16.v: replaces ip16() function with 256 parallel gf16_popcount16 instances
New testbench:
- sim/ls19/tb_ls19_pipeline.v: standalone iverilog testbench for the pipeline modules
The old design computed all 64 (or 256) inner products in a single combinational cloud, limiting Fmax to ~50 MHz (17-LUT critical path through 8-stage adder tree plus sign logic). Splitting the path across 3 registered stages shortens the combinational depth of each stage:
| Stage | Logic | Registered output |
|---|---|---|
| 1 | Decode (AND/XOR per element pair) | same[7:0], diff[7:0] + valid |
| 2 | Popcount tree (8→4 bits via 3:2 compressors) | cnt_pos[3:0], cnt_neg[3:0] + valid |
| 3 | Final subtraction, sign-extend | result[7:0] + valid_out |
Target Fmax: 150 MHz (3× vs. 50 MHz baseline) → 3× TOPS at the same gate budget. LATENCY: 3 cycles (was 1 combinational pass-through).
R-SI-1 compliant: zero * operators; all arithmetic is + on single-bit values.
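The three stages in the table amount to a ternary inner product computed with two popcounts and one subtraction. The following functional sketch works on Python integers in {-1, 0, +1}; the on-chip bit encoding of ternary elements is not described here, so the model stays at the value level, and the function name is hypothetical.

```python
def popcount_inner_product(a, b):
    """Ternary inner product for elements in {-1, 0, +1} via two popcounts."""
    same = diff = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x != 0 and y != 0:      # a zero on either side contributes nothing
            if x == y:
                same |= 1 << i     # stage 1: element product is +1
            else:
                diff |= 1 << i     # stage 1: element product is -1
    cnt_pos = bin(same).count("1")  # stage 2: popcount of matching lanes
    cnt_neg = bin(diff).count("1")  # stage 2: popcount of opposing lanes
    return cnt_pos - cnt_neg        # stage 3: final subtraction

pos, neg = [1] * 8, [-1] * 8
print(popcount_inner_product(pos, pos))                 # 8
print(popcount_inner_product(pos, neg))                 # -8
print(popcount_inner_product([0] * 8, pos))             # 0
print(popcount_inner_product(pos, [1] * 6 + [-1] * 2))  # 4
```

The input vectors are chosen for illustration; they are not taken from the testbench, although the resulting values line up with the pc8 results reported in the test log.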
The matmul FSM absorbs the 3-cycle pipeline: start latches inputs, the next
clock fires valid_in into all popcount units, and done asserts when
valid_out returns (5 clocks after start instead of 2). For the TT top-level
testbench the 64-cycle watchdog budget is unchanged and all 18 tests pass.
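The latency figures above can be reproduced with a toy register-chain model. This is a sketch of a generic 3-stage pipeline, not of the actual RTL: each call to step() stands for one rising clock edge, and None stands for valid=0. The class name and the way the 5-clock total is decomposed (one edge to latch, one to issue valid_in, three through the pipeline) are one reading of the paragraph above, stated here as an assumption.

```python
class Pipeline3:
    """Three register stages; each step() call models one rising clock edge."""

    def __init__(self):
        self.regs = [None, None, None]  # None means valid=0 in that stage

    def step(self, data_in=None):
        # Shift towards the output: stage 3 <- stage 2 <- stage 1 <- input.
        self.regs = [data_in, self.regs[0], self.regs[1]]
        return self.regs[2]  # None until a valid item reaches stage 3

pipe = Pipeline3()
outputs = [pipe.step("job0")] + [pipe.step() for _ in range(4)]
print(outputs)  # [None, None, 'job0', None, None]

LATCH, ISSUE, LATENCY = 1, 1, 3
print(LATCH + ISSUE + LATENCY)  # 5
```

The item presented before the first edge appears at the output after the third edge, which corresponds to the T+3 result in the latency test.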
=== L-S19 Pipeline Popcount Tests ===
PASS legacy_dot4_0x47C0: 0x47C0 = 30.0 UNCHANGED
PASS pc8_all_pos: result=8 valid_out=1
PASS pc8_pos_vs_neg: result=-8 valid_out=1
PASS pc8_all_zeros: result=0 valid_out=1
PASS pc8_mixed_zero: result=0 valid_out=1
PASS pc8_6p2n: result=4 valid_out=1
PASS pc16_all_pos: result=16 valid_out=1
PASS pc16_pos_vs_neg: result=-16 valid_out=1
PASS latency_3: valid_out at T+3 (LATENCY=3 cycles confirmed)
PASS mm8_results: c[0][0]=8 c[0][1]=8 (all=8)
PASS mm8_ok: matmul_ok=1
=== Results: 11 pass, 0 fail ===
ALL L-S19 PIPELINE TESTS PASSED
=== TT Trinity GF16 Tests (full tb.v) ===
PASS legacy_dot4_result: 0x47C0 = 30.0 ← canonical vector UNCHANGED
PASS uio_oe
PASS mesh_result: 0x47C0 from tile 0
PASS final_outputs_post_mesh: 0x47C0
PASS dot4_with_receipt: checksum=0xc1
... [18/18 tests PASS] ...
GF16 canonical test vector 0x47C0 verified PASS — non-negotiable.