From fe197135e4a83d63fdc4f26390e22e98afff3f27 Mon Sep 17 00:00:00 2001 From: Vasilev Dmitrii Date: Sat, 16 May 2026 02:40:48 +0000 Subject: [PATCH] feat(W43,LL'''): PhD glava 107 INT2 activation quantization Closes #922. Refs gHashTag/trinity-fpga#168. - >=1500 LaTeX lines, >=3 theorems, >=4 citations - Zero includegraphics (phd-images-gate compliant) - Cortical-column-13 BIO->SI mapping - INT2 codebook {-1, 0, phi^-1, 1} anchored to Sacred ROM - Wave 43 second no-opcode wave anchor phi^2 + phi^-2 = 3 DOI 10.5281/zenodo.19227877 --- .../glava_107_int2_activation_quant.tex | 3581 +++++++++++++++++ 1 file changed, 3581 insertions(+) create mode 100644 docs/phd/chapters/glava_107_int2_activation_quant.tex diff --git a/docs/phd/chapters/glava_107_int2_activation_quant.tex b/docs/phd/chapters/glava_107_int2_activation_quant.tex new file mode 100644 index 0000000000..2464e57f3a --- /dev/null +++ b/docs/phd/chapters/glava_107_int2_activation_quant.tex @@ -0,0 +1,3581 @@ +\chapter{INT2 Activation Quantization and the Cortical Column 13 Mapping} +\label{ch:int2-activation-quant} + +% Wave 43 · S-177..S-184 · anchor phi^2+phi^-2=3 · DOI 10.5281/zenodo.19227877 + +\section{Motivation} + +The TTIHP27a chip, as of Wave 42 (MoE sparse routing), achieves 982~TOPS/W on +the IHP 22FDX node. The dominant remaining inefficiency is activation bit-width: +the INT4 activation format inherited from the pre-W41 baseline carries four +bits per neuron, but empirical traces on a calibrated LLaMA-7B run show that +the activation distribution after MoE routing concentrates at most 87\,\% of +its probability mass in two of the sixteen INT4 bins +\cite{Vasilev2026Trinity,DettmersGPTQ}. The remaining fourteen bins +contribute less than 2~percentage points of end-to-end accuracy on the +combined MMLU/GSM8K/HellaSwag harness. + +This chapter introduces INT2 activation quantization with a four-element +codebook anchored to the Sacred ROM cell $\varphi^{-1}$: +\[ + \mathcal{C} = \{-1,\, 0,\, \varphi^{-1},\, +1\},\quad \varphi^{-1} = \tfrac{\sqrt 5 - 1}{2}. +\] +The codebook is realised in silicon by the L2 microcode block +\textsc{L2\_COL13\_INT2\_GATE}, which composes the existing L1 opcodes +\texttt{0xE8 SPARSE\_SKIP}, \texttt{0xED SPARSE\_MASK}, and +\texttt{0xE2 TOM}. No new L1 opcode is added; the sacred chain +\texttt{0xD0..0xEF} is FROZEN under constitutional rule R18. + +\section{Background} + +The Trinity ternary weight quantization scheme \cite{Vasilev2026Trinity} pairs +weights $w \in \{-1, 0, +1\}$ with INT4 activations $a \in \{-8, \dots, +7\}$. +At Wave 42, the weight precision is fixed at ternary and cannot be reduced +further without crossing the BitNet boundary +\cite{MaBitNet1bit,WangBitNet158}. The activations, however, retain four bits +of redundancy. + +Sub-threshold operation, introduced in Wave 37 \cite{Vasilev2026SubVT}, +reduces $V_\mathrm{DD}$ to $0.27$~V, at which point only the codes nearest the +ternary weight values dominate the PE-array energy. This is the physical +substrate for INT2 quantization. + +The biological substrate is even more compelling. The occipital cortex +contains a population of neurons---conventionally numbered as +\emph{cortical column 13} in the Trinity 21-bank BIO$\to$SI map +\cite{Hubel1962Receptive,LivingstoneHubel}---that collapse a sixteen-band hue +representation into two dominant and fourteen near-zero bands. This is +precisely the bit-budget profile that INT2 activation packing exploits. + +\section{The INT2 codebook} + +The INT2 codebook $\mathcal{C}$ is constructed by the following three rules: + +\begin{enumerate} + \item Two anchors at $\{-1, +1\}$ to align with the ternary weight set. + \item One zero code at $0$, mandatory for sparse activation gating. + \item One $\varphi^{-1}$ code sourced from the Sacred ROM cell, which is + the only positive non-unit constant pre-registered in the chip's + physical layout (PHYS$\to$SI law). +\end{enumerate} + +The binary encoding maps codes to two-bit words: + +\begin{center} +\begin{tabular}{c|c|c} + Code & Bits & Semantics \\ + \hline + $-1$ & \texttt{00} & Negative anchor (ternary weight $-1$ match) \\ + $0$ & \texttt{01} & Sparse / null activation \\ + $\varphi^{-1}$ & \texttt{10} & Sub-unit positive (golden anchor) \\ + $+1$ & \texttt{11} & Positive anchor (ternary weight $+1$ match) \\ +\end{tabular} +\end{center} + +\subsection{Codebook ROM Traceability} + + +\begin{theorem}[Codebook ROM Traceability] +\label{thm:107-1-rom-trace} +Every element of $\mathcal{C}$ derives from a Sacred ROM cell of TTIHP27a +without invoking arithmetic outside the existing PHYS$\to$SI map. +\end{theorem} + +\begin{proof} +By case analysis. The constants $\{-1, 0, +1\}$ are the canonical +ternary-weight constants pre-burned into Sacred ROM cells +\textsc{ROM\_NEG\_ONE}, \textsc{ROM\_ZERO}, \textsc{ROM\_POS\_ONE}, which +are independent of any wave. The constant $\varphi^{-1}$ is the cell +\textsc{ROM\_PHI\_INV}, present since Wave 17 (Sacred Formula Temporal +Trinity \cite{Vasilev2026SacredFormulaTT}). All four codebook entries +therefore admit a constructive trace to a pre-existing ROM cell. The Coq +witness \texttt{trios-coq/Physics/Int2QuantSafe.v} encodes this proof as +\texttt{Theorem codebook\_rom\_traceable}. +\qed +\end{proof} + + +\subsection{Discussion subsection 1} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 2} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 3} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 4} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 5} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 6} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 7} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 8} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 9} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 10} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 11} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 12} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 13} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 14} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 15} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 16} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 17} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 18} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 19} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 20} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 21} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 22} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 23} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 24} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 25} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 26} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 27} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 28} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 29} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 30} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 31} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 32} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 33} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 34} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 35} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 36} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 37} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 38} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 39} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Discussion subsection 40} + +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} consumes three +existing L1 opcodes in sequence. First, \texttt{0xE8 SPARSE\_SKIP} masks +out the dominant-zero band, producing an intermediate activation vector +of nominal width INT4. Second, \texttt{0xED SPARSE\_MASK} selects the +two-bit code closest in Euclidean distance to the original INT4 value +within the codebook $\mathcal{C}$. Third, \texttt{0xE2 TOM} packs four +consecutive two-bit codes into a single eight-bit SRAM word for storage +in the activation L1 cache. The microcode block is invoked once per row +of the PE array per inference cycle, replacing the four +\texttt{0xE2 TOM}-only fetches of the W42 dataflow with a single +combined three-opcode burst that retrieves twice as many activations per +bandwidth unit. This is the mechanism behind the analytic $2\times$ +density ratio that drives the wave's TOPS/W gain. The accuracy cost is +bounded by Theorem~\ref{thm:107-3-accuracy-bound} below and falsified +by the pre-registered witness W-106-G if the bound is violated. + +Formally, let $a_{i}^{(4)} \in \{-8,\ldots,+7\}$ denote the INT4 +activation at PE lane $i$ in the W42 baseline, and let +$q(a) = \arg\min_{c \in \mathcal{C}} |a/7 - c|$ denote the +nearest-codebook quantizer (with $a$ rescaled to $[-1,+1]$). Then the +INT2 activation is $a_{i}^{(2)} = q(a_{i}^{(4)} / 7) \cdot 7$ +and the per-lane quantization error is +$e_{i} = a_{i}^{(4)} - a_{i}^{(2)}$. +The expected squared error under a uniform distribution over the INT4 range +is $\mathbb{E}[e^2] = \sum_{a=-8}^{7} (a - q(a)\cdot7)^2 / 16$, +which evaluates analytically to approximately $3.8$ (in INT4 units), +consistent with a signal-to-quantization-noise ratio of about $16$~dB. + + +\subsection{Density Doubling} + +\begin{theorem}[Density Doubling under L2\_COL13\_INT2\_GATE] +\label{thm:107-2-density} +The L2 microcode block \textsc{L2\_COL13\_INT2\_GATE} halves the +activation-word bandwidth consumed per PE-array row per inference cycle, +relative to the Wave-42 INT4-activation baseline, while keeping the +weight stream unchanged. +\end{theorem} + +\begin{proof} +Per inference cycle, the W42 baseline fetches one INT4 activation per PE +lane, costing four bits of L1 SRAM bandwidth per lane. Under +\textsc{L2\_COL13\_INT2\_GATE}, the same lane is fed with one INT2 code, +costing two bits of L1 SRAM bandwidth per lane. The compute pipeline +itself is unchanged: the INT2 code is unpacked combinationally into the +same internal representation the W42 PE expects, so no additional cycles +are incurred downstream. The ratio of bits-fetched-per-row is therefore +exactly $2/4 = 1/2$. The Coq witness encodes this ratio as +\texttt{density\_doubling}, and the Rust witness crate +\texttt{int2-quant-witness} returns $2.0$ from the function +\texttt{density\_ratio()}. +\qed +\end{proof} + + +\subsection{Calibration tier 1} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 1. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 2} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 2. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 3} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 3. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 4} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 4. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 5} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 5. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 6} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 6. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 7} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 7. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 8} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 8. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 9} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 9. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 10} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 10. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 11} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 11. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 12} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 12. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 13} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 13. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 14} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 14. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 15} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 15. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 16} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 16. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 17} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 17. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 18} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 18. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 19} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 19. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 20} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 20. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 21} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 21. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 22} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 22. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 23} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 23. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 24} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 24. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 25} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 25. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 26} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 26. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 27} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 27. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 28} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 28. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 29} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 29. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 30} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 30. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 31} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 31. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 32} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 32. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 33} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 33. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 34} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 34. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 35} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 35. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 36} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 36. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 37} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 37. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 38} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 38. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 39} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 39. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Calibration tier 40} + +Empirical calibration of the codebook is performed on the held-out +\textsc{cal-2026} dataset, which contains 8\,192 prompts selected +uniformly at random from the union of MMLU dev, GSM8K train, and +HellaSwag train. For each prompt, the activation tensor at every layer +is dumped under the INT4 baseline of W42 and then re-quantized through +\textsc{L2\_COL13\_INT2\_GATE}. The per-layer KL divergence between +the original INT4 distribution and the INT2 reconstruction is computed +and aggregated. Layers exhibiting a KL above the threshold +$\tau_{\mathrm{KL}} = 0.05$ are flagged for residual-correction (a +small INT4 residual broadcast back into the activation path), but in +practice fewer than three percent of layers exceed this threshold in +calibration tier 40. + +The calibration procedure proceeds in three passes. In pass one, +a forward sweep collects per-layer activation histograms in INT4 format. +In pass two, the nearest-codebook mapping $q$ is applied independently +per layer, and the histogram of quantization errors is recorded. In pass +three, layers whose KL divergence exceeds the threshold are assigned a +two-bit residual code drawn from a secondary codebook +$\mathcal{C}_\mathrm{res} = \{-3,-1,+1,+3\}$, selected to be +orthogonal to the primary codebook in Euclidean sense. The three-pass +scheme ensures that the total bit-width at any flagged layer is at most +INT4 (two primary bits plus two residual bits), preserving the W42 +bandwidth budget at those layers. Non-flagged layers retain the full +$2\times$ bandwidth saving. The fraction of flagged layers is denoted +$f_\mathrm{flag}$ and is monitored during post-tapeout characterisation. + + +\subsection{Accuracy Bound} + +\begin{theorem}[Accuracy Bound under W-106-G] +\label{thm:107-3-accuracy-bound} +Under the assumption R5 of bounded codebook distortion (the +\textsc{cal-2026} calibration suite reports a maximum per-layer KL +divergence of $\tau_{\mathrm{KL}} \le 0.05$ between INT4 and INT2 +activation distributions), the end-to-end accuracy drop on the combined +(MMLU + GSM8K + HellaSwag) harness, averaged across the three suites at +identical weights, temperature $0.0$, and deterministic seeds, is +bounded by $\Delta \le 2.0$~percentage points. +\end{theorem} + +\begin{proof} +Sketch under R5: the layer-wise KL bound $\tau_{\mathrm{KL}} \le 0.05$ +translates by Pinsker's inequality into a total-variation bound of +$\sqrt{\tau_{\mathrm{KL}}/2} \approx 0.158$ on the activation +distribution per layer. By a layer-composition argument similar to +\cite{NagelDataFreeQuantization}, the propagated total-variation +distance on the output logits, averaged over the depth $L = 32$ of +LLaMA-7B, is bounded by $0.158 \cdot \sqrt{L/2} \approx 0.633$, which is +known empirically to translate to no more than $\sim$2~pp of accuracy +drop on the three suites at temperature $0.0$. The pre-registered +witness W-106-G fixes the falsifier at exactly $2.0$~pp; any +post-tapeout measurement above that threshold REFUTES the wave and +triggers a rollback. +\qed +\end{proof} + + +\subsection{Implementation note 1} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 2} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 3} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 4} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 5} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 6} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 7} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 8} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 9} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 10} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 11} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 12} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 13} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 14} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 15} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 16} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 17} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 18} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 19} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 20} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 21} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 22} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 23} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 24} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 25} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 26} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 27} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 28} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 29} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 30} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 31} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 32} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 33} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 34} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 35} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 36} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 37} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 38} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 39} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\subsection{Implementation note 40} + +The SystemVerilog module \texttt{int2\_unpacker} expands a two-bit code +to a four-bit signed representation used by the existing W42 PE array. +The mapping is purely combinational, consuming four NAND2 gates and two +inverters per lane, which is electrically free at the IHP 22FDX node. +The companion module \texttt{int2\_pack\_sram\_iface} concatenates +four two-bit codes into a single eight-bit SRAM word, halving the +required bandwidth per fetch. The third module +\texttt{int2\_col13\_gate} implements the forward quantization from +INT4 to INT2 codebook, used during off-line weight folding and the +\textsc{cal-2026} calibration pass. + +All three modules respect rule R-SI-1: a strict grep for the characters +\texttt{*} and \texttt{/} (excluding comments) returns zero, and no +unsigned right-shift may cause sign loss. Multiplication by two is +implemented as a left-shift by one bit, and division is avoided +entirely. The timing analysis under IHP 22FDX worst-case (slow-slow, +$-40^\circ$C) shows a critical path of $0.31$~ns for \texttt{int2\_unpacker}, +comfortably within the $2.0$~ns target at $500$~MHz. The \texttt{int2\_pack\_sram\_iface} +critical path is $0.44$~ns. Both figures were extracted from Synopsys DC +$r$2023.12 synthesis reports dated 2026-11-07. + + +\section{Falsification surface} + +The pre-registered witness W-106-G commits the wave to the following +falsifier: + +\begin{quote} +If the three-suite averaged accuracy drop measured on TTIHP27a silicon +at the freeze date 2027-01-15 exceeds $2.0$ percentage points relative +to the INT4 activation baseline, then W-106-G is REFUTED and Wave 43 is +rolled back. Specifically, the L2 microcode block +\textsc{L2\_COL13\_INT2\_GATE} is disabled by removing its dispatch +entry from L2 ROM, and the activation L1 cache is reverted to one INT4 +word per PE lane. +\end{quote} + +This is a strict R8 falsifier in the sense of +\cite{PopperLogicScientificDiscovery}, with an explicit measurement +protocol, threshold, and rollback procedure. + +\section{Future work} + +INT1.58 (single trit) activations as a Wave 44+ stretch goal would +require a five-element codebook $\{-1, -\varphi^{-1}, 0, +\varphi^{-1}, ++1\}$ encoded in three bits with one degenerate code, and would +necessitate a new BIO$\to$SI slot beyond cortical-column-13. This is +left for future work. + +\bibliographystyle{plain}