Abraheem13/Nested-Learning_-HOPE-

Nested Learning (HOPE) - From Scratch Implementation

A from-scratch implementation of Google Research's Nested Learning paper and the HOPE (Hierarchical Optimizing Processing Ensemble) architecture.

🎯 What is Nested Learning?

Nested Learning treats neural networks as systems of nested optimization problems with different update frequencies - like how the human brain uses different brain waves for different types of memory.

Key Insight

Everything is Associative Memory!

  • Attention = Associative Memory
  • Momentum = Associative Memory
  • Weights = Associative Memory

🏗️ Architecture Overview

HOPE Block

The HOPE block combines three components that cover the full frequency spectrum:

Frequency:  ∞ ←─────────────────────────→ 0
            │                            │
        Attention    CMS modules       MLP
        (recompute)  (various τ)    (static)

  1. Attention - Infinite frequency (recomputes every time)

    • Fast, immediate context
    • O(n²) standard or O(n) linear attention
  2. CMS (Continuum Memory System) - Spectrum of frequencies

    • Multiple timescales of memory (τ from 0.01 to 0.99)
    • Fast modules forget quickly; slow modules retain information far longer
    • This is HOPE's special sauce!
  3. MLP - Zero frequency (fixed after training)

    • Static knowledge learned during pre-training
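To make the O(n²) versus O(n) distinction concrete, here is a minimal sketch of kernel-based causal linear attention in plain Python. It assumes the `elu(x) + 1` feature map commonly used for linear attention. The function names are hypothetical and the repository's `attention.py` may work differently. The running `state` matrix acts as a small associative memory that is updated once per token, which is why the cost grows linearly with sequence length. The pairwise version computes the same result by revisiting every earlier token.

```python
import math


def feature_map(x):
    """elu(x) + 1, a positive feature map often used for linear attention."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Causal kernel attention with a running state: O(n) in sequence length."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for i in range(d_k):
            norm[i] += phi_k[i]
            for j in range(d_v):
                state[i][j] += phi_k[i] * v[j]
        phi_q = feature_map(q)
        denom = sum(phi_q[i] * norm[i] for i in range(d_k))
        outputs.append(
            [sum(phi_q[i] * state[i][j] for i in range(d_k)) / denom
             for j in range(d_v)]
        )
    return outputs


def quadratic_attention(queries, keys, values):
    """The same kernel attention computed pairwise: O(n^2) in sequence length."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(phi_q, feature_map(keys[s])))
            for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) / total
             for j in range(d_v)]
        )
    return outputs


q = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]
k = [[0.1, 0.4], [-0.7, 1.5], [0.9, -0.2]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fast = linear_attention(q, k, v)
slow = quadratic_attention(q, k, v)
```

Both functions return the same outputs up to floating-point error. Note that this equivalence holds for the kernelized form only; standard softmax attention cannot be rewritten as a fixed-size running state in this way.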

📦 Key Components

1. Associative Memory

Core building block that stores key-value pairs and retrieves by similarity.
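As a rough illustration of the idea (not the code in `associative.py`), a key-value memory with similarity-based retrieval could look like the sketch below. The class name `TinyAssociativeMemory`, its `write`/`read` methods, and the `temperature` parameter are all hypothetical, and plain Python lists stand in for tensors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyAssociativeMemory:
    """Hypothetical toy memory: stores (key, value) pairs, reads by similarity."""

    def __init__(self, temperature=0.1):
        self.keys = []
        self.values = []
        self.temperature = temperature  # lower values give sharper retrieval

    def write(self, key, value):
        self.keys.append(list(key))
        self.values.append(list(value))

    def read(self, query):
        if not self.keys:
            raise ValueError("memory is empty")
        # Softmax over query-key similarities gives the mixing weights.
        scores = [cosine(query, k) / self.temperature for k in self.keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The output is the similarity-weighted average of the stored values.
        dim = len(self.values[0])
        return [
            sum(w * v[i] for w, v in zip(weights, self.values))
            for i in range(dim)
        ]


memory = TinyAssociativeMemory()
memory.write([1.0, 0.0], [10.0, 0.0])
memory.write([0.0, 1.0], [0.0, 20.0])
result = memory.read([0.9, 0.1])  # the query is closest to the first key
```

Reading with a query near the first key returns a vector dominated by the first stored value. The softmax read is the same operation attention performs over its keys and values.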

2. Continuum Memory System (CMS)

Multiple memory modules with different decay rates:

  • Fast memories (τ ≈ 0.01): React quickly, forget quickly
  • Slow memories (τ ≈ 0.99): React slowly, retain information far longer
  • Medium memories: Various timescales in between
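The effect of the decay rate can be seen in a small sketch. It assumes each module follows an exponential-moving-average rule, `m <- τ * m + (1 - τ) * x`, which matches the description above but is not confirmed to be the exact update in `cms.py`. The function name `update_memories` and the scalar memories are simplifications made for illustration.

```python
def update_memories(memories, taus, x):
    """One EMA step per module: m <- tau * m + (1 - tau) * x (assumed rule)."""
    return [tau * m + (1.0 - tau) * x for m, tau in zip(memories, taus)]


taus = [0.01, 0.5, 0.9, 0.99]  # fast -> slow, illustrative values
memories = [0.0] * len(taus)

# Present a constant signal for 10 steps, then remove it for 10 steps.
for _ in range(10):
    memories = update_memories(memories, taus, 1.0)
after_signal = list(memories)

for _ in range(10):
    memories = update_memories(memories, taus, 0.0)
after_silence = list(memories)

# Fraction of the stored signal that survives the 10 silent steps.
retention = [kept / seen for kept, seen in zip(after_silence, after_signal)]

# Rough memory horizon of each module, in steps: 1 / (1 - tau).
horizons = [1.0 / (1.0 - tau) for tau in taus]
```

Under this rule the τ = 0.01 module tracks the input almost immediately and loses it just as fast, while the τ = 0.99 module absorbs little per step but keeps about 90% of what it stored after ten silent steps. Its horizon is roughly 100 steps, so it is long-lived rather than permanent.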

3. HOPE Model

Complete transformer-like architecture with:

  • Token and position embeddings
  • Stacked HOPE blocks
  • Autoregressive generation support

4. Deep Optimizers

Optimizers that treat momentum as associative memory:

  • Standard momentum = weighted average of past gradients
  • Deep momentum = neural network processes gradient history
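The first bullet can be checked directly: unrolling the momentum recursion gives an explicit weighted sum over past gradients. The sketch below assumes the exponential-moving-average form `m <- β * m + (1 - β) * g` with scalar gradients. The function names are hypothetical and this is not the code in `deep_momentum.py`.

```python
def momentum_recursive(grads, beta=0.9):
    """Classic momentum buffer: m <- beta * m + (1 - beta) * g."""
    m = 0.0
    for g in grads:
        m = beta * m + (1.0 - beta) * g
    return m


def momentum_as_weighted_sum(grads, beta=0.9):
    """The same buffer written as an explicit weighted sum of past gradients."""
    last = len(grads) - 1
    return sum(
        (1.0 - beta) * beta ** (last - s) * g for s, g in enumerate(grads)
    )


grads = [0.5, -1.0, 2.0, 0.25]
recursive = momentum_recursive(grads)
unrolled = momentum_as_weighted_sum(grads)
```

Both functions return the same value, with older gradients weighted by increasing powers of β. In this view the momentum buffer is a fixed, linear memory of the gradient history, and a deep momentum optimizer replaces that fixed rule with a learned one.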

🚀 Installation

Prerequisites

  • Python 3.8+
  • uv package manager

Setup

# Clone the repository
git clone <your-repo-url>
cd NL

# Install dependencies
uv sync

💻 Usage

Training

uv run python train.py

Evaluation

uv run python evaluate.py

Testing Individual Components

# Test HOPE block
uv run python src/nested_learning/models/blocks.py

# Test attention mechanisms
uv run python src/nested_learning/models/attention.py

# Test CMS
uv run python src/nested_learning/memory/cms.py

# Test full HOPE model
uv run python src/nested_learning/models/hope.py

Using the Model

import torch

from nested_learning.models import HOPE

# Create model
model = HOPE(
    vocab_size=32000,
    dim=512,
    num_layers=12,
    num_heads=8,
    num_memory_modules=5,
    use_cms=True,
)

# Forward pass
tokens = torch.randint(0, 32000, (2, 128))  # batch=2, seq=128
logits, loss = model(tokens, tokens)

# Generation
prompt = torch.randint(0, 32000, (1, 10))
generated = model.generate(prompt, max_new_tokens=50, temperature=0.8)
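The `temperature` argument in the generation call usually controls how sharply the next-token distribution is peaked. The internals of `model.generate` are not shown here, so the following is only a sketch of the common technique: divide the logits by the temperature, apply softmax, then sample. The helper name `sample_with_temperature` is hypothetical.

```python
import math
import random


def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return index, probs


rng = random.Random(0)  # seeded so the sketch is repeatable
logits = [2.0, 1.0, 0.1]
token, probs = sample_with_temperature(logits, temperature=0.8, rng=rng)
_, sharp = sample_with_temperature(logits, temperature=0.1, rng=rng)
```

Lower temperatures concentrate probability on the highest-scoring token, and higher temperatures flatten the distribution and produce more varied output.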

📁 Project Structure

NL/
├── src/nested_learning/    # Core implementation
│   ├── memory/             # Associative Memory & CMS
│   │   ├── associative.py  # Basic associative memory
│   │   └── cms.py          # Continuum Memory System
│   ├── models/             # HOPE architecture
│   │   ├── hope.py         # Full HOPE model
│   │   ├── blocks.py       # HOPE block implementation
│   │   ├── attention.py    # Multi-head & linear attention
│   │   └── embeddings.py   # Token & position embeddings
│   ├── optimizers/         # Deep Momentum optimizer
│   │   └── deep_momentum.py
│   ├── data/               # Data loading utilities
│   │   ├── dataset.py      # Dataset classes
│   │   └── tokenizer.py    # Tokenization
│   └── utils/              # Helper functions
│       ├── config.py       # Configuration management
│       └── helpers.py      # Utility functions
├── configs/                # Configuration files
│   └── default.yaml
├── tests/                  # Unit tests
├── train.py                # Training script
├── evaluate.py             # Evaluation script
└── README.md               # This file

🔬 Key Features

  • ✅ From-scratch implementation - No external transformer libraries
  • ✅ Modular design - Each component can be used independently
  • ✅ Comprehensive tests - Each module includes test code
  • ✅ Memory system - CMS with multiple timescales
  • ✅ Flexible attention - Standard O(n²) or linear O(n) attention
  • ✅ Generation support - Autoregressive text generation
  • ✅ Deep optimizers - Learnable momentum as associative memory

📚 Theoretical Background

Nested Learning Paper

This implementation is based on the concept that neural networks can be viewed as nested optimization problems with different update frequencies, similar to how the brain processes information at different timescales.

Brain Analogy

  • Gamma waves (fast): Sensory input - like attention
  • Beta waves (medium): Active thinking - like CMS modules
  • Theta waves (slow): Memory consolidation - like slower CMS modules
  • Delta waves (slowest): Deep, permanent memory - like MLP weights

🧪 Testing

Run tests for individual components:

# Test memory systems
uv run python tests/test_memory.py

# Test individual modules (each has built-in tests)
uv run python src/nested_learning/models/blocks.py
uv run python src/nested_learning/memory/cms.py

📝 Configuration

Edit configs/default.yaml to customize:

  • Model architecture (dimensions, layers, heads)
  • Training hyperparameters
  • Memory system settings
  • Optimizer settings

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This is an educational implementation built from scratch for learning purposes.

🙏 Acknowledgments

  • Google Research's Nested Learning paper
  • The transformer architecture (Attention is All You Need)
  • The PyTorch community

📧 Contact

For questions or issues, please open an issue on GitHub.


Note: This is a proof-of-concept implementation for educational purposes. For production use, consider using established libraries like PyTorch's transformer modules or Hugging Face Transformers.
