Releases: virtualsecureplatform/HOGE
BatchBootstrapping: TFHE NAND on Alveo U280 @ 300MHz
BatchBootstrapping
First working implementation of batched TFHE NAND gate bootstrapping on Alveo U280 FPGA at 300 MHz.
Architecture
Three-kernel pipeline across SLRs:
- HomGate (SLR0): Identity Key Switching (IKS) + SEI output
- BRBack (SLR1): ExternalProduct (MULandACC + NTT) via AXISBRLater
- BRFront (SLR2): BlindRotate (PMBX feedback loop) + SEI
numbatch=2 parallel TFHE ciphertexts per gate invocation.
Performance
| Metric | Value |
|---|---|
| Gate latency (gate 0) | 1.32 ms |
| Gate latency (gate 1) | 1.25 ms |
| Clock frequency | 300 MHz |
| Timing WNS | +0.016 ns |
| Timing violations | 0 |
| Target device | Xilinx Alveo U280 (xcu280-fsvh2892-2L-e) |
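As a quick sanity check on the table, the reported latencies can be converted to clock cycles at 300 MHz, and the WNS compared against the clock period. This is plain arithmetic on the numbers above; the helper names are illustrative and not part of the repository.

```python
CLOCK_HZ = 300_000_000  # clock frequency from the table


def latency_cycles(latency_ms, clock_hz=CLOCK_HZ):
    """Convert a latency in milliseconds to whole clock cycles."""
    return round(latency_ms * clock_hz / 1000)


def clock_period_ns(clock_hz=CLOCK_HZ):
    """Clock period in nanoseconds."""
    return 1e9 / clock_hz


if __name__ == "__main__":
    for name, ms in (("gate 0", 1.32), ("gate 1", 1.25)):
        print(f"{name}: {latency_cycles(ms)} cycles")
    slack_fraction = 0.016 / clock_period_ns()  # WNS from the table
    print(f"period: {clock_period_ns():.3f} ns, WNS is {slack_fraction:.2%} of it")
```

The two gates come out at 396,000 and 375,000 cycles, and the positive slack is roughly half a percent of the 3.333 ns period.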
Changelog
Fix: BK bus desync — atomic TREADY in AXISBRLater (06b3f12)
Root cause of real-FPGA hang (hw_emu unaffected):
AXISBRLater buffers bootstrapping key (BK) data for ExternalProductMiddle via 4 AXI4-Stream register slices per polynomial half. Each slice previously received TREADY := trgswinready(k) independently, while the write-enable to TRGSWBatchMemory requires all 4 buses valid simultaneously (trgswinvalid = Cat(tvalidvec).andR). When one HBM bank momentarily stalled, the other 3 slices consumed their data without writing anything — silently corrupting the BK accumulation. This caused MULandACC to miss outflag events, finreg to stall short of its target, and BlindRotate to hang in FINWAIT forever.
Fix: slices(i).io.manager.TREADY := trgswinready(k) && allValid — all 4 buses must be simultaneously valid before any slice advances.
Why hw_emu was unaffected: simulated HBM delivers all buses in lock-step; the AND is always 1.
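The desync can be reproduced in a small cycle-level software model. The sketch below is not the RTL: it assumes the downstream trgswinready is always 1, models each register slice as a word queue, and uses a hypothetical `stall` set to mark cycles where one bus has no valid data. It compares independent TREADY with the atomic version.

```python
def run(streams, stall, atomic):
    """Simulate buses feeding one write port.

    streams: one list of words per bus.
    stall:   set of (cycle, bus) pairs where that bus is not valid.
    atomic:  if True, a bus advances only when all buses are valid.
    Returns the rows written (one word per bus per write).
    """
    idx = [0] * len(streams)
    written = []
    cycle = 0
    while all(i < len(s) for i, s in zip(idx, streams)):
        valid = [(cycle, k) not in stall for k in range(len(streams))]
        all_valid = all(valid)
        # Write-enable requires every bus to be valid (the andR in the text).
        if all_valid:
            written.append(tuple(s[i] for s, i in zip(streams, idx)))
        for k in range(len(streams)):
            # Assumption: downstream ready is constantly asserted.
            ready = all_valid if atomic else True
            if valid[k] and ready:
                idx[k] += 1  # this slice consumes its word
        cycle += 1
    return written


if __name__ == "__main__":
    streams = [list(range(4)) for _ in range(4)]
    stall = {(1, 3)}  # bus 3 stalls for one cycle
    print("independent:", run(streams, stall, atomic=False))
    print("atomic:     ", run(streams, stall, atomic=True))
```

With independent TREADY, buses 0 to 2 drop word 1 during the stall and every later row is misaligned; with the atomic condition all four rows arrive intact, one cycle later. With an empty `stall` set the two variants behave identically, which matches the hw_emu observation.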
Fix: ap_done via TLAST instead of brvalid falling-edge counting (77a4026)
HomGateTop previously counted numbatch TVALID falling edges on the brvalid stream to detect completion. On real FPGA, HBM output stalls absorb the 1-cycle inter-batch gap from SEI, producing only 1 falling edge instead of 2 → hung at gate 0. Fixed by using TLAST from the SEI output stream (fires after all numbatch*(N+1) beats).
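The difference between the two completion criteria can be illustrated on hand-written stream traces. This is a simplified sketch: each trace entry is a hypothetical (TVALID, TLAST) pair per cycle, and 4 beats per batch stand in for the real N+1.

```python
def falling_edges(trace):
    """Count 1 -> 0 transitions of TVALID."""
    count, prev = 0, 0
    for tvalid, _ in trace:
        if prev and not tvalid:
            count += 1
        prev = tvalid
    return count


def done_by_edges(trace, numbatch):
    """Old criterion: one TVALID falling edge per batch."""
    return falling_edges(trace) >= numbatch


def done_by_tlast(trace):
    """New criterion: a valid beat carrying TLAST."""
    return any(tvalid and tlast for tvalid, tlast in trace)


# Emulation-like trace: the 1-cycle gap between batches is visible.
with_gap = [(1, 0)] * 4 + [(0, 0)] + [(1, 0)] * 3 + [(1, 1)] + [(0, 0)]
# Hardware-like trace: a stall has absorbed the gap, so TVALID stays high.
gap_absorbed = [(1, 0)] * 7 + [(1, 1)] + [(0, 0)]

if __name__ == "__main__":
    for name, trace in (("with gap", with_gap), ("gap absorbed", gap_absorbed)):
        print(name, done_by_edges(trace, numbatch=2), done_by_tlast(trace))
```

Edge counting reports completion only for the trace with the gap, while the TLAST check succeeds on both, since it does not depend on the timing between batches.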
Fix: finreg deadlock — count burst completions instead of TVALID edges (5db359b)
pmbxgap=13 < feedback_burst_length=64, so consecutive feedback bursts overlap with no TVALID gap. Changed finreg to increment on initcnt == 2*numcycle-1 (burst boundary) rather than TVALID falling edge.
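A minimal software sketch of the two counting schemes follows. It assumes a burst is 2*numcycle valid beats long, as the initcnt condition suggests, and uses a small hypothetical numcycle=4 rather than the real parameter; the function names are illustrative.

```python
def bursts_by_falling_edge(tvalid):
    """Old scheme: one burst per TVALID falling edge."""
    count, prev = 0, 0
    for v in tvalid:
        if prev and not v:
            count += 1
        prev = v
    return count


def bursts_by_beat_count(tvalid, numcycle):
    """New scheme: count valid beats and increment at the burst boundary."""
    initcnt, finreg = 0, 0
    for v in tvalid:
        if not v:
            continue  # the beat counter only advances on valid beats
        if initcnt == 2 * numcycle - 1:
            finreg += 1  # last beat of a burst
            initcnt = 0
        else:
            initcnt += 1
    return finreg


if __name__ == "__main__":
    numcycle = 4
    # Three back-to-back bursts with no TVALID gap between them.
    trace = [1] * (3 * 2 * numcycle) + [0]
    print("falling edges:", bursts_by_falling_edge(trace))
    print("beat count:   ", bursts_by_beat_count(trace, numcycle))
```

On the gapless trace the edge-based count sees a single burst while the beat-based count sees all three, which is why the edge-based finreg could never reach its target.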
Add: bk2numslice=4 to reduce BK2Formerslice reset fanout (cd19975)
Separate depth parameter for BK2Formerslice (4) vs axi4snumslice (8) improves timing by reducing reset fanout on the SLR-crossing register slices.
Artifact
HomGate_hw.xclbin — bitstream for Xilinx Alveo U280, built with Vitis 2023.2.
Usage:
ulimit -s unlimited
./nand HomGate_hw.xclbin
IKS16bit
HomGate 300 MHz FPGA Build
Successfully builds and runs at 300 MHz on Xilinx Alveo U280 with Vitis 2023.2.
Changelog (since switchtrans)
RTL Fixes
- Split IKS accumulator into 10 independent 512-bit sub-accumulators (was single 5120-bit), eliminating ~4000 unroutable nets from BRAM fanout congestion
- Pipeline IKS addr comparison in AXISIKS (addrPipe + wrapBubble) to break critical path through address decomposition output queue
- Increase AXI4 register slices from 6 to 8 stages for better SLR crossing timing
Parameter & Build Fixes
- Derive totaliksknumbus and iksknumsegments from qbit parameter instead of hardcoding, supporting both 16-bit and 32-bit LWE configurations
- Use Congestion_SpreadLogic_high Vivado implementation strategy for reliable routing
- Bump TFHEpp and fix RTL for uint16_t lvl0param::T
Test
- NAND gate test passes on real FPGA: ~1.3 ms per gate
Timing
- WNS: -0.037 ns (residual violation in Xilinx HBM platform IP, not user RTL)
- 0 unroutable nets
Artifact
HomGate_hw.xclbin: Pre-built bitstream for Xilinx U280 (xilinx_u280_gen3x16_xdma_1_202211_1)
SwitchTrans
Possibly because of the move to Vitis 2023.2, routing the design became quite difficult and produced a malfunctioning binary. I therefore changed Lbuffer to SwitchTrans, which embeds the transpose operation into the butterfly operations, to make the design easier to route. Although this diverges from the paper's architecture, I believe that a reproducible working implementation is better for everyone.
Paper Version
Because the Alveo U280 has been discontinued and the EOL of Ubuntu 22.04 is approaching, I decided to make this repository more reproducible on Alma Linux 9 with Vitis 2023.2, the last supported version. This may not preserve the reproducible benchmark results, but it should be better for anyone who hopes to develop a successor to HOGE.
This release marks the original RTL; I will modify some of the RTL code afterwards so that it compiles with Vitis 2023.2.