Linear-time language model with liquid neural mixing. O(1) memory. Infinite context. No attention.
LiquidNet replaces quadratic self-attention with a diagonal linear recurrence inspired by liquid neural networks. The state dynamics adapt per-token through input-dependent decay gates, enabling the model to compress arbitrarily long sequences into a fixed-size state vector.
Each LiquidNet block contains two sub-layers:
```
Input
├── RMSNorm → Liquid Mixer → Residual Add
└── RMSNorm → SwiGLU MLP   → Residual Add
```
Liquid Mixer (replaces self-attention):
```
v_t = tanh(W_v · z_t)                 # value
δ_t = softplus(W_δ · z_t) + δ_min     # learned positive decay
o_t = sigmoid(W_o · z_t)              # output gate
α_t = exp(-δ_t)                       # retention factor
h_t = α_t · h_{t-1} + (1-α_t) · v_t   # state update
y_t = W_y · (o_t ⊙ h_t)               # gated output
```
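The recurrence above can be sketched per-channel in a few lines of NumPy. This is a minimal single-channel sketch (the weight matrices `W_v`, `W_δ`, `W_o`, `W_y` collapsed to scalars for clarity; names are illustrative, not the repo's API):

```python
import numpy as np

def liquid_mixer_step(z_t, h_prev, W_v, W_delta, W_o, W_y, delta_min=1e-3):
    """One step of the liquid mixer recurrence, per channel."""
    v = np.tanh(W_v * z_t)                                # value, in (-1, 1)
    delta = np.log1p(np.exp(W_delta * z_t)) + delta_min   # softplus decay, always > 0
    o = 1.0 / (1.0 + np.exp(-(W_o * z_t)))                # output gate, in (0, 1)
    alpha = np.exp(-delta)                                # retention factor, in (0, 1)
    h = alpha * h_prev + (1.0 - alpha) * v                # convex state update
    y = W_y * (o * h)                                     # gated output
    return y, h
```

Because α ∈ (0, 1) and |v| ≤ 1, the state update is a convex combination, so |h_t| ≤ max(|h_{t-1}|, 1) — starting from h₀ = 0, the state can never blow up.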
Key properties:
- O(T) training via parallel associative scan (not O(T²) like attention)
- O(1) inference memory — fixed state vector, never grows regardless of sequence length
- Multi-timescale memory — channels initialized with log-uniform half-lives from 1 to 4096 tokens
- Coupled input-decay gating — the convex combination α·h + (1-α)·v ensures bounded state dynamics
Dense Logic (SwiGLU MLP):
```
m = swish(W_g · r) ⊙ (W_a · r)
u = W_d · m
```
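The two lines above translate directly to code. A minimal NumPy sketch (assuming `W_g` and `W_a` map d_model → d_ff and `W_d` maps d_ff → d_model; names illustrative):

```python
import numpy as np

def swish(x):
    # swish / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(r, W_g, W_a, W_d):
    """SwiGLU MLP: gated expansion to d_ff, then projection back to d_model."""
    m = swish(W_g @ r) * (W_a @ r)  # elementwise gate over the d_ff dimension
    return W_d @ m
```

The gate `W_a · r` modulates each expanded channel multiplicatively, which is what distinguishes SwiGLU from a plain two-layer MLP.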
| Config | d_model | d_ff | Layers | Seq Len | Params |
|---|---|---|---|---|---|
| Tiny (proof of concept) | 192 | 576 | 4 | 256 | 11.6M |
| Small (trained & tested) | 384 | 1024 | 6 | 512 | 30M |
| Base (full architecture) | 768 | 2560 | 8 | 8192 | 104.8M |
| Metric | Value |
|---|---|
| Parameters | 30,000,000 |
| Training data | TinyStories (50M tokens, 2 epochs) |
| Total training tokens | 100M |
| Training time | 29.4 minutes |
| Hardware | NVIDIA RTX A5000 (24GB) |
| Throughput | 56,668 tokens/sec |
| Final loss | 2.97 |
| Final perplexity | 35.7 |
| Inference speed | 96 tokens/sec (GTX 1650, 4GB) |
| Step | Loss | Perplexity | Sample quality |
|---|---|---|---|
| 0 | 10.80 | 49,080 | random noise |
| 500 | 3.55 | 344 | recognizable words |
| 1,000 | 3.10 | 96 | basic sentences, character names |
| 2,000 | 2.99 | 64 | coherent narratives with dialogue |
| 3,000 | 3.03 | 44 | named characters, motivations, plot |
| 4,069 | 2.97 | 36 | multi-paragraph coherent stories |
Prompt: "Once upon a time"
Once upon a time, there was a little girl named Lily. She loved to play with her toys and play with her favorite ball. One day, she saw a small boat and wanted to play with it. She took it on and ran over to the park.
Prompt: "The king looked at the"
The king looked at the fish and laughed. Mama was excited to help the bird and thanked him for the delicious cake. The man said, "You did a great job, little girl. I'm so happy that you are very strong."
Prompt: "Mom said it was time to go to bed, but"
Mom said it was time to go to bed, but Lily was very scared. She said to her mom and dad, "Mommy, I know. The ball is not too small. It's not brave. It can hurt you." Mommy nodded and said, "I want to take it too, but it is not too hard for me."
| | LiquidNet | Transformer |
|---|---|---|
| Training complexity | O(T) linear | O(T²) quadratic |
| Inference memory | O(1) constant | O(T) grows with context |
| Inference per token | O(1) constant | O(T) recomputes over KV cache |
| Max sequence length | Unlimited | Limited by memory |
| Memory type | Compressed (fuzzy) | Exact (KV lookup) |
| Scaling at 8K context | Same speed | 4-8x slower than at 512 |
| Scaling at 100K context | Same speed | Impractical |
The liquid state h is a single vector (d_model floats) that compresses all past information:
```
h_t = α_t · h_{t-1} + (1-α_t) · v_t
      ─────────────   ─────────────
      fade the old    blend in the new
```
Different channels have different decay rates (multi-timescale initialization):
- Fast channels (half-life ~1 token): track current word/syntax
- Medium channels (half-life ~50 tokens): track sentence-level context
- Slow channels (half-life ~4096 tokens): track document-level themes
This mimics biological memory — recent events are vivid, old events fade unless reinforced.
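The half-life framing can be checked directly: a channel with decay δ = ln2/τ retains exactly half of a stored value after τ tokens. A pure-Python sketch (τ values taken from the list above):

```python
import math

def retention_after(steps, half_life):
    """Fraction of a stored value surviving `steps` tokens
    for a channel with decay delta = ln2 / half_life."""
    alpha = math.exp(-math.log(2) / half_life)  # per-step retention factor
    return alpha ** steps

# fast channel: half-life 1 token -> halves every step
print(retention_after(1, 1))      # 0.5
# slow channel: half-life 4096 tokens -> barely decays over a sentence
print(retention_after(50, 4096))  # ~0.99
```

Because retention is exponential in steps/half-life, a slow channel still holds ~50% of a document-opening signal 4,096 tokens later, while a fast channel has long since forgotten it.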
```bash
git clone https://github.com/Ranjitbarnala0/LiquidNet.git
cd LiquidNet
pip install -r requirements.txt
```

```bash
# Downloads TinyStories and trains a 30M model
python train_local.py \
    --d_model 384 --d_ff 1024 --n_layers 6 \
    --seq_len 512 --batch_size 48 \
    --total_tokens 100e6 \
    --peak_lr 1e-4 --grad_clip 0.5

# Tiny model for testing (fits on any GPU)
python train_local.py \
    --d_model 192 --d_ff 576 --n_layers 4 \
    --seq_len 256 --batch_size 16 \
    --total_tokens 50e6
```

```bash
# Command line
python chat.py --checkpoint checkpoints_local/step_4069.pt

# Web UI
python ui.py
# Open http://localhost:7860
```

```python
from liquidnet import LiquidNet, LiquidNetConfig

config = LiquidNetConfig(d_model=384, d_ff=1024, n_layers=6, max_seq_len=512)
model = LiquidNet(config)
print(f"Parameters: {model.num_parameters()/1e6:.1f}M")

# Forward pass
logits, states = model(input_ids)             # Training: parallel scan
logits, states = model(token, states=states)  # Inference: O(1) state update
```

Training uses a parallel scan to compute all hidden states simultaneously:
Composition: (α₂,β₂) ∘ (α₁,β₁) = (α₂·α₁, α₂·β₁ + β₂)
This is associative, enabling O(T) parallel computation on GPU/TPU.
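A pure-Python sketch of this idea, checking that composing the affine maps h → α·h + β (with β_t = (1-α_t)·v_t) reproduces the sequential recurrence and that the composition is associative. A real implementation replaces the loop with a logarithmic-depth scan on GPU/TPU:

```python
def compose(step2, step1):
    """(α2, β2) ∘ (α1, β1) = (α2·α1, α2·β1 + β2): apply step1 first, then step2."""
    a2, b2 = step2
    a1, b1 = step1
    return (a2 * a1, a2 * b1 + b2)

def prefix_states(alphas, betas, h0=0.0):
    """Inclusive prefix of compositions applied to h0 (sequential reference).
    Because compose() is associative, these prefixes can be computed in
    O(log T) depth with a parallel scan instead of this O(T) loop."""
    acc = (1.0, 0.0)  # identity map h -> h
    states = []
    for a, b in zip(alphas, betas):
        acc = compose((a, b), acc)
        states.append(acc[0] * h0 + acc[1])
    return states
```

The identity element (1, 0) and the closed-form composition are exactly what `jax.lax.associative_scan` (or the repo's Triton kernels) needs to parallelize the recurrence.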
Two backends:
- Triton kernels — custom GPU kernels for NVIDIA GPUs
- Pure PyTorch fallback — works on any device
- Dense matmuls: bfloat16 (or float16)
- Liquid scan accumulation: float32 (numerical stability)
- Logits for loss: float32
- State h_t: float32
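A small NumPy illustration of why the scan accumulator is kept in float32: for a slow channel, half precision can round the retention factor to exactly 1.0, so the state never updates at all, while float32 tracks the true value. (Deliberately extreme example; the dtypes are the point, the numbers are illustrative.)

```python
import numpy as np

def run_scan(num_steps, alpha, dtype):
    """Iterate h = alpha*h + (1-alpha)*v with constant input v = 1, in `dtype`."""
    h = dtype(0.0)
    a = dtype(alpha)
    one = dtype(1.0)
    for _ in range(num_steps):
        h = a * h + (one - a) * one
    return float(h)

alpha = 0.9999  # slow channel: half-life ~6,900 tokens
h16 = run_scan(10_000, alpha, np.float16)  # alpha rounds to 1.0 in fp16, so (1-alpha) = 0
h32 = run_scan(10_000, alpha, np.float32)  # tracks the true value, ~1 - alpha**10000
print(h16, h32)
```

In float16 the update term vanishes entirely and the state stays at 0, whereas float32 correctly converges toward 1; this is why the matmuls can run in bfloat16/float16 while the recurrence itself accumulates in float32.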
- Linear layers: N(0, 0.02)
- Output projections (W_y, W_d): zero-initialized for residual stability
- Decay bias (W_δ): multi-timescale init — softplus⁻¹(ln2/τ) with τ log-uniform in [1, 4096]
- Residual scaling: 1/√(2L) per sub-layer
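The multi-timescale decay-bias initialization above can be sketched as follows, so that softplus(bias) = ln2/τ at init with τ log-uniform in [1, 4096] (pure-Python sketch; function names and shapes are illustrative, not the repo's API):

```python
import math
import random

def softplus_inverse(y):
    """Inverse of softplus(x) = log(1 + exp(x)): log(exp(y) - 1), via expm1 for stability."""
    return math.log(math.expm1(y))

def init_decay_bias(d_model, tau_min=1.0, tau_max=4096.0, seed=0):
    """Per-channel decay bias so that softplus(bias) = ln2 / tau,
    with half-life tau drawn log-uniformly from [tau_min, tau_max]."""
    rng = random.Random(seed)
    biases = []
    for _ in range(d_model):
        tau = math.exp(rng.uniform(math.log(tau_min), math.log(tau_max)))
        biases.append(softplus_inverse(math.log(2) / tau))
    return biases
```

At initialization every channel therefore starts with a decay δ between ln2/4096 and ln2/1, covering the token-to-document range of half-lives before any training.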
| | PyTorch (`liquidnet/`) | JAX/Flax (`hybrid_liquid_dense_jax/`) |
|---|---|---|
| Training | DDP, GradScaler | Multi-host TPU sharding |
| Scan | Triton kernels + PyTorch fallback | `jax.lax.associative_scan` |
| Precision | `torch.amp.autocast` | `param_dtype=float32`, `dtype=bfloat16` |
| Checkpointing | `torch.save` | `flax.serialization.to_bytes` |
| Gradient checkpointing | Manual | `nn.remat` per block |
| Hardware | Throughput | Time (100M tokens) | Time (20B tokens) |
|---|---|---|---|
| GTX 1650 (4GB) | 2,320 tok/s | ~12 hours | N/A |
| RTX A5000 (24GB) | 56,668 tok/s | 29 min | ~98 hours |
| TPU v6e-16 (measured) | 477,000 tok/s | 3.5 min | ~12 hours |
| H100 (estimated) | 300,000+ tok/s | 5.5 min | ~18 hours |
```
LiquidNet/
├── liquidnet/              # Core architecture
│   ├── config.py           # LiquidNetConfig dataclass
│   ├── layers.py           # RMSNorm, LiquidMixer, SwiGLU, HybridBlock
│   ├── model.py            # Full autoregressive model
│   ├── scan.py             # Parallel prefix scan (Triton + PyTorch)
│   └── train.py            # Distributed training (DDP)
├── train_local.py          # Local GPU training with auto data download
├── chat.py                 # CLI chat interface
├── ui.py                   # Gradio web UI
└── requirements.txt
```
MIT
```bibtex
@software{liquidnet2026,
  title={LiquidNet: Linear-Time Language Model with Liquid Neural Mixing},
  author={Ranjit Barnala},
  year={2026},
  url={https://github.com/Ranjitbarnala0/LiquidNet}
}
```