A from-scratch implementation of Google Research's Nested Learning paper and the HOPE (Hierarchical Optimizing Processing Ensemble) architecture.
Nested Learning treats neural networks as systems of nested optimization problems with different update frequencies - like how the human brain uses different brain waves for different types of memory.
Everything is Associative Memory!
- Attention = Associative Memory
- Momentum = Associative Memory
- Weights = Associative Memory
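To make the first bullet concrete, here is a minimal sketch of unnormalized linear attention computed two ways: as a sum over stored tokens, and as a single read from a matrix memory built from outer products. It is written in plain Python rather than the repository's PyTorch code, and the function names are illustrative assumptions, not part of this codebase.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def attend_over_tokens(query, keys, values):
    """Unnormalized linear attention: sum_i (q . k_i) * v_i."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = dot(query, k)
        out = [o + score * x for o, x in zip(out, v)]
    return out

def build_memory(keys, values):
    """Matrix memory M = sum_i v_i k_i^T (one row per value dimension)."""
    d_v, d_k = len(values[0]), len(keys[0])
    M = [[0.0] * d_k for _ in range(d_v)]
    for k, v in zip(keys, values):
        for r in range(d_v):
            for c in range(d_k):
                M[r][c] += v[r] * k[c]
    return M

def read_memory(M, query):
    """Retrieve from the memory with a matrix-vector product M @ q."""
    return [dot(row, query) for row in M]

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [1.0, 1.0, 1.0]]
query = [0.5, 2.0]

via_tokens = attend_over_tokens(query, keys, values)
via_memory = read_memory(build_memory(keys, values), query)
print(via_tokens, via_memory)  # the two results agree
```

Both paths return the same vector, which is the sense in which attention can be read as writing to and querying an associative memory. Softmax attention adds a normalization step that this sketch leaves out.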
The HOPE block combines three components that cover the full frequency spectrum:
Frequency:  ∞ ←─────────────────────────→ 0
            │                             │
        Attention     CMS modules        MLP
       (recompute)    (various τ)     (static)
- Attention - Infinite frequency (recomputes every time)
  - Fast, immediate context
  - O(n²) standard or O(n) linear attention
- CMS (Continuum Memory System) - Spectrum of frequencies
  - Multiple timescales of memory (τ from 0.01 to 0.99)
  - Fast modules forget quickly, slow modules retain information much longer
  - This is HOPE's special sauce!
- MLP - Zero frequency (fixed after training)
  - Static knowledge learned during pre-training
Associative memory is the core building block: it stores key-value pairs and retrieves values by similarity.
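As a rough illustration of storing key-value pairs and retrieving by similarity, the sketch below keeps keys and values in lists and returns a softmax-weighted blend of the stored values. The class name `TinyAssociativeMemory` and the `sharpness` parameter are hypothetical; the class in `associative.py` may work quite differently.

```python
import math

class TinyAssociativeMemory:
    """Hypothetical illustration only; not the class from associative.py."""

    def __init__(self):
        self.keys = []
        self.values = []

    def store(self, key, value):
        """Append one key-value pair."""
        self.keys.append(list(key))
        self.values.append(list(value))

    def retrieve(self, query, sharpness=1.0):
        """Blend stored values, weighted by softmax of dot-product similarity."""
        scores = [sharpness * sum(q * k for q, k in zip(query, key))
                  for key in self.keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        return [sum(w * v[i] for w, v in zip(weights, self.values)) / total
                for i in range(dim)]

memory = TinyAssociativeMemory()
memory.store([1.0, 0.0], [10.0, 0.0])
memory.store([0.0, 1.0], [0.0, 10.0])

# A query close to the first key mostly recalls the first value.
recalled = memory.retrieve([1.0, 0.1], sharpness=10.0)
print(recalled)
```

A larger `sharpness` makes retrieval closer to a nearest-key lookup, while a smaller one averages more evenly across everything stored.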
The Continuum Memory System (CMS) uses multiple memory modules with different decay rates:
- Fast memories (τ ≈ 0.01): React quickly, forget quickly
- Slow memories (τ ≈ 0.99): React slowly, retain information much longer
- Medium memories: Various timescales in between
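The effect of the decay rates above can be sketched with scalar memories. This assumes each module follows the exponential-moving-average rule `m = tau * m + (1 - tau) * x`, which matches the fast and slow descriptions in the list but is an assumption about the update, not a copy of `cms.py`. The helper name `cms_update` is hypothetical.

```python
def cms_update(memories, taus, x):
    """One step for every module: m <- tau * m + (1 - tau) * x (assumed rule)."""
    return [tau * m + (1.0 - tau) * x for m, tau in zip(memories, taus)]

taus = [0.01, 0.5, 0.99]      # fast, medium, slow
memories = [0.0, 0.0, 0.0]

# Phase 1: present a constant signal of 1.0 for 200 steps.
for _ in range(200):
    memories = cms_update(memories, taus, 1.0)
after_signal = list(memories)

# Phase 2: the signal disappears for 20 steps.
for _ in range(20):
    memories = cms_update(memories, taus, 0.0)
after_silence = list(memories)

print("after signal: ", after_signal)
print("after silence:", after_silence)
```

After the signal stops, the fast module has decayed to essentially zero while the slow module still holds most of what it accumulated. Note that the slow module also took longer to absorb the signal in the first place.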
The full HOPE model is a complete transformer-like architecture with:
- Token and position embeddings
- Stacked HOPE blocks
- Autoregressive generation support
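Autoregressive generation with a temperature, as in the `generate` call in the usage example, can be sketched as a loop that samples one token at a time and feeds it back in. This is a dependency-free illustration, not the repository's `generate` method; `toy_logits` is a hypothetical stand-in for a model.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample a token index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(next_logits_fn, prompt, max_new_tokens, temperature=1.0, rng=random):
    """Append one sampled token per step, conditioning on everything so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_logits_fn(tokens)
        tokens.append(sample_next_token(logits, temperature, rng))
    return tokens

def toy_logits(tokens):
    """Hypothetical model stand-in: favors (last token + 1) mod 5."""
    target = (tokens[-1] + 1) % 5
    return [5.0 if i == target else 0.0 for i in range(5)]

generated = generate(toy_logits, prompt=[0], max_new_tokens=8,
                     temperature=0.8, rng=random.Random(0))
print(generated)
```

Lower temperatures concentrate probability on the highest-scoring token, and higher temperatures spread it out, which is why a value such as 0.8 gives slightly more conservative samples than 1.0.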
The deep momentum optimizers treat momentum as associative memory:
- Standard momentum = weighted average of past gradients
- Deep momentum = neural network processes gradient history
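The first bullet can be checked numerically. The sketch below uses the exponential-moving-average form of momentum, `m = beta * m + (1 - beta) * g`, and shows that the recursive update equals an explicit weighted sum over the gradient history. The function names are illustrative; classical SGD momentum omits the `(1 - beta)` factor but unrolls the same way.

```python
def momentum_recursive(grads, beta):
    """Usual recursive update: m <- beta * m + (1 - beta) * g."""
    m = 0.0
    for g in grads:
        m = beta * m + (1.0 - beta) * g
    return m

def momentum_as_weighted_sum(grads, beta):
    """Same quantity unrolled: older gradients get geometrically smaller weights."""
    n = len(grads)
    return sum((1.0 - beta) * beta ** (n - 1 - i) * g
               for i, g in enumerate(grads))

grads = [0.5, -1.0, 2.0, 0.25]
recursive = momentum_recursive(grads, beta=0.9)
explicit = momentum_as_weighted_sum(grads, beta=0.9)
print(recursive, explicit)  # equal up to floating-point rounding
```

Seen this way, the momentum buffer is a very small memory that compresses the gradient history with fixed weights. Deep momentum, as described above, replaces those fixed weights with a learned network.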
- Python 3.8+
- uv package manager
# Clone the repository
git clone <your-repo-url>
cd NL
# Install dependencies
uv sync

# Train the model
uv run python train.py

# Evaluate the model
uv run python evaluate.py

# Test HOPE block
uv run python src/nested_learning/models/blocks.py
# Test attention mechanisms
uv run python src/nested_learning/models/attention.py
# Test CMS
uv run python src/nested_learning/memory/cms.py
# Test full HOPE model
uv run python src/nested_learning/models/hope.py

import torch
from nested_learning.models import HOPE

# Create model
model = HOPE(
    vocab_size=32000,
    dim=512,
    num_layers=12,
    num_heads=8,
    num_memory_modules=5,
    use_cms=True,
)

# Forward pass
tokens = torch.randint(0, 32000, (2, 128))  # batch=2, seq=128
logits, loss = model(tokens, tokens)

# Generation
prompt = torch.randint(0, 32000, (1, 10))
generated = model.generate(prompt, max_new_tokens=50, temperature=0.8)

NL/
├── src/nested_learning/          # Core implementation
│   ├── memory/                   # Associative Memory & CMS
│   │   ├── associative.py        # Basic associative memory
│   │   └── cms.py                # Continuum Memory System
│   ├── models/                   # HOPE architecture
│   │   ├── hope.py               # Full HOPE model
│   │   ├── blocks.py             # HOPE block implementation
│   │   ├── attention.py          # Multi-head & linear attention
│   │   └── embeddings.py         # Token & position embeddings
│   ├── optimizers/               # Deep Momentum optimizer
│   │   └── deep_momentum.py
│   ├── data/                     # Data loading utilities
│   │   ├── dataset.py            # Dataset classes
│   │   └── tokenizer.py          # Tokenization
│   └── utils/                    # Helper functions
│       ├── config.py             # Configuration management
│       └── helpers.py            # Utility functions
├── configs/                      # Configuration files
│   └── default.yaml
├── tests/                        # Unit tests
├── train.py                      # Training script
├── evaluate.py                   # Evaluation script
└── README.md                     # This file
- ✅ From-scratch implementation - No external transformer libraries
- ✅ Modular design - Each component can be used independently
- ✅ Comprehensive tests - Each module includes test code
- ✅ Memory system - CMS with multiple timescales
- ✅ Flexible attention - Standard O(n²) or linear O(n) attention
- ✅ Generation support - Autoregressive text generation
- ✅ Deep optimizers - Learnable momentum as associative memory
This implementation is based on the concept that neural networks can be viewed as nested optimization problems with different update frequencies, similar to how the brain processes information at different timescales.
- Gamma waves (fast): Sensory input - like attention
- Beta waves (medium): Active thinking - like CMS modules
- Theta waves (slow): Memory consolidation - like slower CMS modules
- Delta waves (slowest): Deep, permanent memory - like MLP weights
Run tests for individual components:
# Test memory systems
uv run python tests/test_memory.py
# Test individual modules (each has built-in tests)
uv run python src/nested_learning/models/blocks.py
uv run python src/nested_learning/memory/cms.py

Edit configs/default.yaml to customize:
- Model architecture (dimensions, layers, heads)
- Training hyperparameters
- Memory system settings
- Optimizer settings
Contributions are welcome! Please feel free to submit a Pull Request.
This is an educational implementation built from scratch for learning purposes.
- Google Research's Nested Learning paper
- The transformer architecture (Attention is All You Need)
- The PyTorch community
For questions or issues, please open an issue on GitHub.
Note: This is a proof-of-concept implementation for educational purposes. For production use, consider using established libraries like PyTorch's transformer modules or Hugging Face Transformers.