Releases: virtualsecureplatform/HOGE

BatchBootstrapping: TFHE NAND on Alveo U280 @ 300MHz

13 Mar 14:29

BatchBootstrapping

First working implementation of batched TFHE NAND gate bootstrapping on Alveo U280 FPGA at 300 MHz.

Architecture

Three-kernel pipeline across SLRs:

  • HomGate (SLR0): Identity Key Switching (IKS) + SEI output
  • BRBack (SLR1): ExternalProduct (MULandACC + NTT) via AXISBRLater
  • BRFront (SLR2): BlindRotate (PMBX feedback loop) + SEI

numbatch=2 parallel TFHE ciphertexts per gate invocation.

Performance

| Metric | Value |
| --- | --- |
| Gate latency (gate 0) | 1.32 ms |
| Gate latency (gate 1) | 1.25 ms |
| Clock frequency | 300 MHz |
| Timing WNS | +0.016 ns |
| Timing violations | 0 |
| Target device | Xilinx Alveo U280 (xcu280-fsvh2892-2L-e) |
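
To relate the latencies above to the clock, the following back-of-the-envelope sketch converts them into cycle counts. It assumes the reported latency is spent entirely at the 300 MHz kernel clock, which these notes do not state; any host or PCIe overhead included in the measurement would lower the real cycle count. The helper names are illustrative only.

```python
# Rough conversion between the reported times and 300 MHz clock cycles.
# Assumption: the whole latency is spent at the kernel clock.
CLOCK_HZ = 300e6


def period_ns(clock_hz):
    """Clock period in nanoseconds."""
    return 1e9 / clock_hz


def cycles(latency_ms, clock_hz=CLOCK_HZ):
    """Number of clock cycles that fit in the given latency."""
    return round(latency_ms * 1e-3 * clock_hz)


print(f"clock period: {period_ns(CLOCK_HZ):.3f} ns")
print(f"gate 0: {cycles(1.32)} cycles")
print(f"gate 1: {cycles(1.25)} cycles")
```

Against a period of roughly 3.333 ns, a WNS of +0.016 ns means timing is met with well under 1% of the period to spare.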

Changelog

Fix: BK bus desync — atomic TREADY in AXISBRLater (06b3f12)

Root cause of real-FPGA hang (hw_emu unaffected):

AXISBRLater buffers bootstrapping key (BK) data for ExternalProductMiddle via 4 AXI4-Stream register slices per polynomial half. Each slice previously received TREADY := trgswinready(k) independently, while the write-enable to TRGSWBatchMemory requires all 4 buses valid simultaneously (trgswinvalid = Cat(tvalidvec).andR). When one HBM bank momentarily stalled, the other 3 slices consumed their data without writing anything — silently corrupting the BK accumulation. This caused MULandACC to miss outflag events, finreg to stall short of its target, and BlindRotate to hang in FINWAIT forever.

Fix: slices(i).io.manager.TREADY := trgswinready(k) && allValid — all 4 buses must be simultaneously valid before any slice advances.

Why hw_emu was unaffected: simulated HBM delivers all buses in lock-step; the AND is always 1.
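
The desync can be reproduced in a few lines with a cycle-level toy model. This is only a sketch, not the Chisel RTL: it assumes the downstream side is always ready, models each bus as a counter of consumed beats, and uses a hypothetical `run` function and stall pattern chosen for illustration.

```python
def run(beats_per_bus, stall_pattern, atomic):
    """Toy model of 4 BK buses feeding a single memory write port.

    stall_pattern[cycle][k] is True when bus k has no valid beat that cycle.
    Returns the rows written, each row holding the beat index taken per bus.
    """
    num_bus = 4
    next_beat = [0] * num_bus  # next unconsumed beat on each bus
    written = []
    for stalls in stall_pattern:
        valid = [
            (not stalls[k]) and next_beat[k] < beats_per_bus
            for k in range(num_bus)
        ]
        all_valid = all(valid)
        if all_valid:
            # Write-enable requires every bus to be valid (the andR).
            written.append(tuple(next_beat))
        for k in range(num_bus):
            # Old behaviour: each slice is ready on its own.
            # Fixed behaviour: ready is gated by all_valid.
            ready = all_valid if atomic else True
            if valid[k] and ready:
                next_beat[k] += 1
    return written


# Bus 0 stalls for one cycle (cycle 1); every other cycle is stall-free.
pattern = [[False] * 4 for _ in range(6)]
pattern[1][0] = True

print("independent TREADY:", run(4, pattern, atomic=False))
print("atomic TREADY:     ", run(4, pattern, atomic=True))
```

With independent TREADY, the three unstalled buses drop a beat during the stall, so every later row mixes beats from different positions and one row is never written. With the gated TREADY, all rows stay aligned.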

Fix: ap_done via TLAST instead of brvalid falling-edge counting (77a4026)

HomGateTop previously counted numbatch TVALID falling edges on the brvalid stream to detect completion. On real FPGA, HBM output stalls absorb the 1-cycle inter-batch gap from SEI, producing only 1 falling edge instead of 2 → hung at gate 0. Fixed by using TLAST from the SEI output stream (fires after all numbatch*(N+1) beats).
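
The difference between the two completion checks can be seen on per-cycle traces. The sketch below is a toy model, not the RTL: `BEATS` stands in for N+1 with a small made-up value, and the traces and helper names are hypothetical.

```python
NUMBATCH = 2
BEATS = 4  # stands in for N+1; toy value for illustration


def count_falling_edges(tvalid):
    """Count 1 -> 0 transitions in a per-cycle TVALID trace."""
    return sum(1 for prev, cur in zip(tvalid, tvalid[1:]) if prev and not cur)


def tlast_for(tvalid, total_beats):
    """TLAST trace: asserted only on the final valid beat."""
    seen = 0
    out = []
    for v in tvalid:
        seen += v
        out.append(1 if v and seen == total_beats else 0)
    return out


def done_by_edges(tvalid, numbatch):
    return count_falling_edges(tvalid) >= numbatch


def done_by_tlast(tvalid, tlast):
    return any(v and l for v, l in zip(tvalid, tlast))


# One idle cycle between batches, as seen in hw_emu.
gapped = ([1] * BEATS + [0]) * NUMBATCH
# Same beats, but a stall has absorbed the inter-batch gap.
merged = [1] * (BEATS * NUMBATCH) + [0]

for name, trace in (("gapped", gapped), ("merged", merged)):
    tlast = tlast_for(trace, NUMBATCH * BEATS)
    print(name, done_by_edges(trace, NUMBATCH), done_by_tlast(trace, tlast))
```

Edge counting depends on the shape of the stream, whereas TLAST depends only on the number of beats transferred, so it is unaffected by how stalls rearrange the gaps.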

Fix: finreg deadlock — count burst completions instead of TVALID edges (5db359b)

pmbxgap=13 < feedback_burst_length=64, so consecutive feedback bursts overlap with no TVALID gap. Changed finreg to increment on initcnt == 2*numcycle-1 (burst boundary) rather than TVALID falling edge.
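
A minimal sketch of the new counting rule, assuming that one feedback burst is 2*numcycle valid beats (so numcycle = 32 for the 64-beat burst above; this relation is an assumption, not stated in these notes). The function name and trace are hypothetical, and the real counter lives in the Chisel RTL.

```python
def finreg_by_burst_boundary(tvalid, numcycle):
    """Count completed bursts by wrapping a beat counter at 2*numcycle."""
    burst_len = 2 * numcycle
    initcnt = 0  # beats seen inside the current burst
    finreg = 0   # completed bursts
    for v in tvalid:
        if not v:
            continue  # idle cycles do not advance the beat counter
        if initcnt == burst_len - 1:
            finreg += 1  # burst boundary reached
            initcnt = 0
        else:
            initcnt += 1
    return finreg


# Three back-to-back 64-beat bursts with no TVALID gap between them.
trace = [1] * (3 * 64) + [0]
falling_edges = sum(1 for a, b in zip(trace, trace[1:]) if a and not b)

print("falling edges:", falling_edges)
print("burst boundaries:", finreg_by_burst_boundary(trace, numcycle=32))
```

On this trace the edge-based count sees a single burst while the boundary-based count sees all three, which matches the deadlock described above.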

Add: bk2numslice=4 to reduce BK2Formerslice reset fanout (cd19975)

Separate depth parameter for BK2Formerslice (4) vs axi4snumslice (8) improves timing by reducing reset fanout on the SLR-crossing register slices.

Artifact

HomGate_hw.xclbin — bitstream for Xilinx Alveo U280, built with Vitis 2023.2.

Usage:

ulimit -s unlimited
./nand HomGate_hw.xclbin

IKS16bit

07 Mar 05:25

HomGate 300 MHz FPGA Build

Successfully builds and runs at 300 MHz on Xilinx Alveo U280 with Vitis 2023.2.

Changelog (since switchtrans)

RTL Fixes

  • Split IKS accumulator into 10 independent 512-bit sub-accumulators (was single 5120-bit), eliminating ~4000 unroutable nets from BRAM fanout congestion
  • Pipeline IKS addr comparison in AXISIKS (addrPipe + wrapBubble) to break critical path through address decomposition output queue
  • Increase AXI4 register slices from 6 to 8 stages for better SLR crossing timing
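
The accumulator split in the first item is functionally safe because LWE accumulation is element-wise modular addition, so no carry crosses an element boundary. The sketch below checks that equivalence in software. It assumes 16-bit elements (suggested by the uint16_t lvl0param::T mentioned in these notes) packed into the 5120-bit word; the packing and the function names are illustrative assumptions, not taken from the RTL.

```python
import random

ELEM_BITS = 16     # assumed element width
TOTAL_BITS = 5120  # original single accumulator
SUB_BITS = 512     # each of the 10 sub-accumulators
MASK = (1 << ELEM_BITS) - 1

NUM_ELEMS = TOTAL_BITS // ELEM_BITS
NUM_SUBS = TOTAL_BITS // SUB_BITS
ELEMS_PER_SUB = SUB_BITS // ELEM_BITS


def accumulate_wide(acc, row):
    """One wide accumulator: add every element modulo 2**ELEM_BITS."""
    return [(a + r) & MASK for a, r in zip(acc, row)]


def accumulate_split(acc, row):
    """Ten independent sub-accumulators, each owning a slice of elements."""
    out = []
    for s in range(NUM_SUBS):
        lo = s * ELEMS_PER_SUB
        hi = lo + ELEMS_PER_SUB
        out.extend((a + r) & MASK for a, r in zip(acc[lo:hi], row[lo:hi]))
    return out


random.seed(0)
wide = [0] * NUM_ELEMS
split = [0] * NUM_ELEMS
for _ in range(100):
    row = [random.randrange(1 << ELEM_BITS) for _ in range(NUM_ELEMS)]
    wide = accumulate_wide(wide, row)
    split = accumulate_split(split, row)

print("identical results:", wide == split)
```

Since the lanes never interact, the split changes only the physical fanout and placement, not the arithmetic.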

Parameter & Build Fixes

  • Derive totaliksknumbus and iksknumsegments from qbit parameter instead of hardcoding, supporting both 16-bit and 32-bit LWE configurations
  • Use Congestion_SpreadLogic_high Vivado implementation strategy for reliable routing
  • Bump TFHEpp and fix RTL for uint16_t lvl0param::T

Test

  • NAND gate test passes on real FPGA: ~1.3 ms per gate

Timing

  • WNS: -0.037 ns (residual violation in Xilinx HBM platform IP, not user RTL)
  • 0 unroutable nets

Artifact

  • HomGate_hw.xclbin: Pre-built bitstream for Xilinx U280 (xilinx_u280_gen3x16_xdma_1_202211_1)

SwitchTrans

03 Mar 03:51

Possibly because of the move to Vitis 2023.2, routing the design became quite difficult and I ended up with a malfunctioning binary. I therefore replaced Lbuffer with SwitchTrans, which embeds the transpose operation into the butterfly operations, to make the design easier to route. Although this diverges from the paper's architecture, I believe that a reproducible working implementation is better for everyone.

Paper Version

16 Feb 15:48

Because the Alveo U280 has been discontinued and the EOL of Ubuntu 22.04 is approaching, I decided to make this repository more reproducible on Alma Linux 9 with Vitis 2023.2, the last supported version. This may not preserve the reproducible benchmark results, but it should be better for anyone who hopes to develop a successor to HOGE.
This release marks the original RTL; I will modify some of the RTL code so that it compiles with Vitis 2023.2.