LiquidNet

Linear-time language model with liquid neural mixing. O(1) memory. Infinite context. No attention.

LiquidNet replaces quadratic self-attention with a diagonal linear recurrence inspired by liquid neural networks. The state dynamics adapt per-token through input-dependent decay gates, enabling the model to compress arbitrarily long sequences into a fixed-size state vector.

Architecture

Each LiquidNet block contains two sub-layers:

Input
  ├── RMSNorm → Liquid Mixer → Residual Add
  └── RMSNorm → SwiGLU MLP   → Residual Add

Liquid Mixer (replaces self-attention):

v_t     = tanh(W_v · z_t)              # value
δ_t     = softplus(W_δ · z_t) + δ_min  # learned positive decay
o_t     = sigmoid(W_o · z_t)           # output gate
α_t     = exp(-δ_t)                    # retention factor
h_t     = α_t · h_{t-1} + (1-α_t) · v_t  # state update
y_t     = W_y · (o_t ⊙ h_t)           # gated output
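The recurrence above can be sketched as a single PyTorch step. This is an illustrative sketch only — the module and weight names are assumptions, not the repo's actual API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiquidMixerStep(nn.Module):
    """One recurrent step of the liquid mixer (illustrative sketch)."""
    def __init__(self, d_model, delta_min=1e-3):
        super().__init__()
        self.W_v = nn.Linear(d_model, d_model)  # value projection
        self.W_d = nn.Linear(d_model, d_model)  # decay projection
        self.W_o = nn.Linear(d_model, d_model)  # output-gate projection
        self.W_y = nn.Linear(d_model, d_model)  # output projection
        self.delta_min = delta_min

    def forward(self, z_t, h_prev):
        v_t = torch.tanh(self.W_v(z_t))                       # value
        delta_t = F.softplus(self.W_d(z_t)) + self.delta_min  # positive decay
        o_t = torch.sigmoid(self.W_o(z_t))                    # output gate
        alpha_t = torch.exp(-delta_t)                         # retention factor
        h_t = alpha_t * h_prev + (1 - alpha_t) * v_t          # state update
        y_t = self.W_y(o_t * h_t)                             # gated output
        return y_t, h_t
```

Because v_t is tanh-bounded and h_t is a convex combination of h_{t-1} and v_t, the state can never blow up.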

Key properties:

  • O(T) training via parallel associative scan (not O(T²) like attention)
  • O(1) inference memory — fixed state vector, never grows regardless of sequence length
  • Multi-timescale memory — channels initialized with log-uniform half-lives from 1 to 4096 tokens
  • Coupled input-decay gating — the (1-α)·v input term makes h a convex combination of old state and new value, ensuring bounded state dynamics

Dense Logic (SwiGLU MLP):

m = swish(W_g · r) ⊙ W_a · r
u = W_d · m
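The MLP sub-layer maps to a few lines of PyTorch. A minimal sketch, assuming these weight names (they may differ from the repo's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU MLP sub-layer (illustrative sketch)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.W_g = nn.Linear(d_model, d_ff)  # gate projection
        self.W_a = nn.Linear(d_model, d_ff)  # activation projection
        self.W_d = nn.Linear(d_ff, d_model)  # down projection

    def forward(self, r):
        m = F.silu(self.W_g(r)) * self.W_a(r)  # m = swish(W_g·r) ⊙ W_a·r
        return self.W_d(m)                     # u = W_d·m
```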

Model Configurations

Config                     d_model   d_ff   Layers   Seq Len   Params
Tiny (proof of concept)    192       576    4        256       11.6M
Small (trained & tested)   384       1024   6        512       30M
Base (full architecture)   768       2560   8        8192      104.8M

Results

30M Model — Trained in 29 Minutes

Metric                  Value
Parameters              30,000,000
Training data           TinyStories (50M tokens, 2 epochs)
Total training tokens   100M
Training time           29.4 minutes
Hardware                NVIDIA RTX A5000 (24GB)
Throughput              56,668 tokens/sec
Final loss              2.97
Final perplexity        35.7
Inference speed         96 tokens/sec (GTX 1650, 4GB)

Training Curve

Step      0 | loss 10.80 | ppl 49,080 | random noise
Step    500 | loss  3.55 | ppl    344 | recognizable words
Step  1,000 | loss  3.10 | ppl     96 | basic sentences, character names
Step  2,000 | loss  2.99 | ppl     64 | coherent narratives with dialogue
Step  3,000 | loss  3.03 | ppl     44 | named characters, motivations, plot
Step  4,069 | loss  2.97 | ppl     36 | multi-paragraph coherent stories

Generated Samples (30M model, step 4069)

Prompt: "Once upon a time"

Once upon a time, there was a little girl named Lily. She loved to play with her toys and play with her favorite ball. One day, she saw a small boat and wanted to play with it. She took it on and ran over to the park.

Prompt: "The king looked at the"

The king looked at the fish and laughed. Mama was excited to help the bird and thanked him for the delicious cake. The man said, "You did a great job, little girl. I'm so happy that you are very strong."

Prompt: "Mom said it was time to go to bed, but"

Mom said it was time to go to bed, but Lily was very scared. She said to her mom and dad, "Mommy, I know. The ball is not too small. It's not brave. It can hurt you." Mommy nodded and said, "I want to take it too, but it is not too hard for me."

Why Liquid Mixing?

vs Self-Attention (Transformers)

                          LiquidNet            Transformer
Training complexity       O(T) linear          O(T²) quadratic
Inference memory          O(1) constant        O(T) grows with context
Inference per token       O(1) constant        O(T) attention over KV cache
Max sequence length       Unlimited            Limited by memory
Memory type               Compressed (fuzzy)   Exact (KV lookup)
Scaling at 8K context     Same speed           4-8x slower than at 512
Scaling at 100K context   Same speed           Impractical

How the State Works

The liquid state h is a single vector (d_model floats) that compresses all past information:

h_t = α_t · h_{t-1} + (1-α_t) · v_t
      ───────────────   ──────────────
      fade the old       blend in the new

Different channels have different decay rates (multi-timescale initialization):

  • Fast channels (half-life ~1 token): track current word/syntax
  • Medium channels (half-life ~50 tokens): track sentence-level context
  • Slow channels (half-life ~4096 tokens): track document-level themes

This mimics biological memory — recent events are vivid, old events fade unless reinforced.
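The half-life picture follows directly from the retention factor: with δ = ln2/τ, a channel keeps exactly half its state after τ tokens, since α^τ = exp(-τ·ln2/τ) = 1/2. A quick numeric check:

```python
import math

def retention_after(tau, steps):
    """Fraction of state a channel with half-life tau retains after `steps` tokens."""
    delta = math.log(2) / tau  # decay rate giving half-life tau
    alpha = math.exp(-delta)   # per-token retention factor
    return alpha ** steps

# A medium channel (half-life 50) keeps exactly half its state after 50 tokens:
print(retention_after(50, 50))    # → 0.5
# A slow channel (half-life 4096) barely decays over the same span:
print(retention_after(4096, 50))  # → ~0.99
```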

Quick Start

Install

git clone https://github.com/Ranjitbarnala0/LiquidNet.git
cd LiquidNet
pip install -r requirements.txt

Train (local GPU)

# Downloads TinyStories and trains a 30M model
python train_local.py \
  --d_model 384 --d_ff 1024 --n_layers 6 \
  --seq_len 512 --batch_size 48 \
  --total_tokens 100e6 \
  --peak_lr 1e-4 --grad_clip 0.5

# Tiny model for testing (fits on any GPU)
python train_local.py \
  --d_model 192 --d_ff 576 --n_layers 4 \
  --seq_len 256 --batch_size 16 \
  --total_tokens 50e6

Chat with trained model

# Command line
python chat.py --checkpoint checkpoints_local/step_4069.pt

# Web UI
python ui.py
# Open http://localhost:7860

Use in code

from liquidnet import LiquidNet, LiquidNetConfig

config = LiquidNetConfig(d_model=384, d_ff=1024, n_layers=6, max_seq_len=512)
model = LiquidNet(config)

print(f"Parameters: {model.num_parameters()/1e6:.1f}M")

# Forward pass
logits, states = model(input_ids)          # Training: parallel scan
logits, states = model(token, states=states) # Inference: O(1) state update

Architecture Details

Parallel Associative Scan

Training uses a parallel scan to compute all hidden states simultaneously:

Composition: (α₂,β₂) ∘ (α₁,β₁) = (α₂·α₁, α₂·β₁ + β₂)

This is associative, enabling O(T) parallel computation on GPU/TPU.
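The identity can be checked numerically. A minimal pure-Python sketch, writing the update as h_t = α_t·h_{t-1} + β_t (with β_t = (1-α_t)·v_t absorbed into β):

```python
import random

def compose(g2, g1):
    """(α₂,β₂) ∘ (α₁,β₁) = (α₂·α₁, α₂·β₁ + β₂): apply g1, then g2."""
    a2, b2 = g2
    a1, b1 = g1
    return (a2 * a1, a2 * b1 + b2)

random.seed(0)
T = 8
elems = [(random.uniform(0.1, 0.9), random.uniform(-1, 1)) for _ in range(T)]

# Sequential recurrence h_t = α_t·h_{t-1} + β_t, starting from h₀ = 0
h = 0.0
for a, b in elems:
    h = a * h + b

# Fold the same elements in a different grouping, as a parallel scan would.
# Associativity guarantees an identical result.
left = elems[0]
for e in elems[1:4]:
    left = compose(e, left)
right = elems[4]
for e in elems[5:]:
    right = compose(e, right)
a_tot, b_tot = compose(right, left)
h_scan = a_tot * 0.0 + b_tot  # apply the composed map to h₀ = 0
assert abs(h - h_scan) < 1e-12
```

Because any grouping gives the same answer, the T elements can be combined in O(log T) parallel rounds instead of T sequential steps.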

Two backends:

  • Triton kernels — custom GPU kernels for NVIDIA GPUs
  • Pure PyTorch fallback — works on any device

Precision Policy

  • Dense matmuls: bfloat16 (or float16)
  • Liquid scan accumulation: float32 (numerical stability)
  • Logits for loss: float32
  • State h_t: float32
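The accumulation rule can be sketched with a sequential reference scan (the repo's actual kernels are parallel Triton/PyTorch; this sketch only illustrates the dtype policy): dense activations may arrive in bfloat16, but the recurrence upcasts and accumulates in float32.

```python
import torch

def scan_fp32(alpha, v):
    """Sequential liquid scan with float32 accumulation (illustrative sketch).

    alpha, v: (T, d) tensors, possibly bfloat16 from the dense matmuls.
    Returns (T, d) float32 states.
    """
    h = torch.zeros(v.shape[-1], dtype=torch.float32)
    outs = []
    for a, x in zip(alpha.float(), v.float()):  # upcast for numerical stability
        h = a * h + (1 - a) * x                 # accumulate in float32
        outs.append(h)
    return torch.stack(outs)
```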

Weight Initialization

  • Linear layers: N(0, 0.02)
  • Output projections (W_y, W_d): zero-initialized for residual stability
  • Decay bias (W_δ): multi-timescale init — softplus⁻¹(ln2/τ) with τ log-uniform in [1, 4096]
  • Residual scaling: 1/√(2L) per sub-layer
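The decay-bias initialization above can be written out directly: sample a half-life τ log-uniformly, convert it to a decay rate δ = ln2/τ, and invert the softplus so the layer starts at that rate. A sketch under those assumptions (function names are hypothetical):

```python
import math
import random

def softplus_inverse(y):
    """Inverse of softplus: log(exp(y) - 1), so softplus(result) == y."""
    return math.log(math.expm1(y))

def decay_bias_init(n_channels, tau_min=1.0, tau_max=4096.0, seed=0):
    """Multi-timescale init: half-lives log-uniform in [tau_min, tau_max]."""
    rng = random.Random(seed)
    biases = []
    for _ in range(n_channels):
        # Log-uniform sample of the half-life tau
        tau = math.exp(rng.uniform(math.log(tau_min), math.log(tau_max)))
        delta = math.log(2) / tau            # target decay rate
        biases.append(softplus_inverse(delta))  # softplus(bias) == delta
    return biases
```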

Dual Implementation

                         PyTorch (liquidnet/)                JAX/Flax (hybrid_liquid_dense_jax/)
Training                 DDP, GradScaler                     Multi-host TPU sharding
Scan                     Triton kernels + PyTorch fallback   jax.lax.associative_scan
Precision                torch.amp.autocast                  param_dtype=float32, dtype=bfloat16
Checkpointing            torch.save                          flax.serialization.to_bytes
Gradient checkpointing   Manual                              nn.remat per block

Training Hardware Benchmarks

Hardware                Throughput       Time (100M tokens)   Time (20B tokens)
GTX 1650 (4GB)          2,320 tok/s      ~12 hours            N/A
RTX A5000 (24GB)        56,668 tok/s     29 min               ~98 hours
TPU v6e-16 (measured)   477,000 tok/s    3.5 min              ~12 hours
H100 (estimated)        300,000+ tok/s   5.5 min              ~18 hours

File Structure

LiquidNet/
├── liquidnet/              # Core architecture
│   ├── config.py           # LiquidNetConfig dataclass
│   ├── layers.py           # RMSNorm, LiquidMixer, SwiGLU, HybridBlock
│   ├── model.py            # Full autoregressive model
│   ├── scan.py             # Parallel prefix scan (Triton + PyTorch)
│   └── train.py            # Distributed training (DDP)
├── train_local.py          # Local GPU training with auto data download
├── chat.py                 # CLI chat interface
├── ui.py                   # Gradio web UI
└── requirements.txt

License

MIT

Citation

@software{liquidnet2026,
  title={LiquidNet: Linear-Time Language Model with Liquid Neural Mixing},
  author={Ranjit Barnala},
  year={2026},
  url={https://github.com/Ranjitbarnala0/LiquidNet}
}
