205 changes: 205 additions & 0 deletions CONTRIBUTING.md
@@ -344,6 +344,211 @@ model = OdyssNet(..., debug=True)

With `debug=True` the model checks every critical forward-pass operation (linear recurrence, memory feedback, activation, StepNorm, Hebbian correlation and accumulation) and raises a `RuntimeError` at the first operation that produces a non-finite value, reporting the operation name and step index. `debug=True` also automatically enables `torch.autograd.set_detect_anomaly(True)`, so backward-pass NaNs are caught with a full stack trace at no extra setup cost. Overhead is zero when `debug=False`.
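
As a quick illustration, here is a minimal sketch of catching the error that `debug=True` raises; it assumes `trainer`, `x`, and `y` are set up as in the earlier training examples:

```python
model = OdyssNet(..., debug=True)  # per-operation finite checks + autograd anomaly detection

try:
    loss = trainer.train_batch(x, y, thinking_steps=10)
except RuntimeError as err:
    # The message names the failing operation (e.g. StepNorm) and the step index,
    # so you can jump straight to the offending part of the forward pass.
    print(f"Non-finite value caught: {err}")
    raise
```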

### Training Not Converging

If your model trains but doesn't converge or gets stuck:

#### 1. Use TrainingHistory for Visual Diagnosis

Track and visualize all key metrics to identify patterns:

```python
from odyssnet import TrainingHistory

history = TrainingHistory()

for epoch in range(epochs):
    loss = trainer.train_batch(x, y, thinking_steps=10)
    acc = evaluate_accuracy(...)
    lr = trainer.optimizer.param_groups[0]['lr'] if hasattr(trainer.optimizer, 'param_groups') else trainer.initial_lr

    history.record(loss=loss, accuracy=acc, lr=lr)

# Visual inspection reveals patterns
history.plot(title="Training Diagnosis")
# Or save for later analysis
history.plot(save_path="diagnosis/training.png", title="Debug Run")
```

**What to look for:**
- **Flat loss:** May need more thinking steps, different initialization, or learning rate adjustment
- **Oscillating loss:** Reduce learning rate or enable gradient persistence
- **Sudden spikes:** Check for batch corruption or use `anomaly_hook` to catch them
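
As a rough numeric complement to the plots, the sketch below flags these three patterns from a plain list of recorded losses. The thresholds (a window of 20, 1e-4 flatness, 5x the median for a spike) are illustrative only, and `trainer`, `x`, `y`, and `epochs` come from the example above:

```python
import statistics

losses = []  # kept alongside TrainingHistory for simple numeric checks

for epoch in range(epochs):
    loss = trainer.train_batch(x, y, thinking_steps=10)
    losses.append(loss)

    if len(losses) >= 20:
        window = losses[-20:]
        # Flat loss: almost no movement across the window
        if max(window) - min(window) < 1e-4:
            print(f"Epoch {epoch}: loss looks flat; try more thinking steps or adjust the LR")
        # Oscillating loss: large spread relative to the window mean
        elif statistics.pstdev(window) > 0.5 * abs(statistics.mean(window)):
            print(f"Epoch {epoch}: loss is oscillating; lower the LR or enable gradient persistence")
        # Sudden spike: latest loss far above the recent median
        if loss > 5 * statistics.median(window):
            print(f"Epoch {epoch}: loss spike; inspect the batch or use anomaly_hook")
```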

#### 2. Use trainer.get_diagnostics() for Runtime Monitoring

Monitor training health in real-time:

```python
for epoch in range(epochs):
    loss = trainer.train_batch(x, y, thinking_steps=10)

    # Get comprehensive diagnostics
    diag = trainer.get_diagnostics()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}")
        print(f"  Step count: {diag['step_count']}")
        print(f"  Last loss: {diag['last_loss']:.6f}")
        print(f"  Using ChaosGrad: {diag['using_chaos_grad']}")

        if diag['using_chaos_grad']:
            # ChaosGrad-specific metrics
            opt_diag = diag['optimizer']
            print(f"  Global step: {opt_diag['global_step']}")
            print(f"  Frustration: {opt_diag['frustration']:.4f}")
            print(f"  Best loss: {opt_diag['best_loss']:.6f}")
            print(f"  Avg effective LR: {opt_diag['avg_effective_lr']:.4f}")
            print(f"  Avg init LR: {opt_diag['avg_init_lr']:.6f}")
```

**Key metrics to monitor:**
- **frustration:** An EMA typically in the ~[0, 1] range; values nearing or exceeding ~0.75 indicate the optimizer is struggling and may trigger plateau escape
- **avg_effective_lr:** Values much less than 1.0 suggest the optimizer has reduced learning rates due to difficult landscape
- **best_loss:** If this hasn't improved in many steps, consider manual intervention
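
One way to act on these metrics automatically is sketched below; the thresholds (0.7 frustration, 0.5 effective-LR ratio, 100 stale epochs) are arbitrary illustrations, not values recommended by the library:

```python
best_seen = float("inf")
stale_epochs = 0

for epoch in range(epochs):
    loss = trainer.train_batch(x, y, thinking_steps=10)
    diag = trainer.get_diagnostics()

    if diag['using_chaos_grad']:
        opt_diag = diag['optimizer']
        # Frustration is an EMA in roughly [0, 1]; ~0.75 is where plateau escape may trigger
        if opt_diag['frustration'] > 0.7:
            print(f"Epoch {epoch}: optimizer frustration is high ({opt_diag['frustration']:.2f})")
        # An effective-LR ratio well below 1.0 means the optimizer has backed off its step sizes
        if opt_diag['avg_effective_lr'] < 0.5:
            print(f"Epoch {epoch}: effective LR drifted down to {opt_diag['avg_effective_lr']:.2f}")

    # Track staleness of the best loss ourselves
    if loss < best_seen:
        best_seen, stale_epochs = loss, 0
    else:
        stale_epochs += 1
    if stale_epochs > 100:
        print(f"Epoch {epoch}: no improvement for {stale_epochs} epochs; consider trainer.trigger_plateau_escape()")
        stale_epochs = 0
```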

#### 3. Use optimizer.get_diagnostics() for Deep Analysis

For ChaosGrad-specific debugging:

```python
from odyssnet import ChaosGrad

if isinstance(trainer.optimizer, ChaosGrad):
    chaos_opt = trainer.optimizer

    opt_diag = chaos_opt.get_diagnostics()

print(f"Optimizer Health:")
print(f" Global step: {opt_diag['global_step']}")
print(f" Frustration level: {opt_diag['frustration']}")
print(f" Best loss seen: {opt_diag['best_loss']}")
print(f" Avg LR drift: {opt_diag['avg_effective_lr']:.4f}")

# avg_effective_lr interpretation:
# > 1.0 : Training is going well, LRs increased from cold start
# ≈ 1.0 : Stable, no drift from initialization
# < 1.0 : Struggling, optimizer reduced LRs to navigate difficult terrain
```
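
If you log this ratio often, a small helper keeps the interpretation next to the number. The bands of roughly 5% below are an arbitrary choice for illustration:

```python
def describe_lr_drift(avg_effective_lr: float) -> str:
    """Turn the avg_effective_lr ratio into a readable status line (illustrative bands)."""
    if avg_effective_lr > 1.05:
        return "LRs grew from cold start; training is progressing well"
    if avg_effective_lr < 0.95:
        return "LRs were reduced; the optimizer is navigating difficult terrain"
    return "stable; no meaningful drift from initialization"

print(describe_lr_drift(opt_diag['avg_effective_lr']))
```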

#### 4. Use anomaly_hook for Automated Intervention

Set up intelligent automated responses to training anomalies:

```python
class EarlyStop(Exception):
    """Raised by the anomaly hook to end training cleanly."""


def handle_anomaly(anomaly_type, loss_val):
    """Called automatically on training anomalies."""

    if anomaly_type == "spike":
        # Violent loss surge (possible gradient explosion)
        print(f"⚠️ SPIKE detected! Loss: {loss_val:.4f}")
        # Could reduce LR, reload checkpoint, etc.

    elif anomaly_type == "plateau":
        # Loss stagnated over a window
        print("🔄 PLATEAU detected. Triggering escape...")
        trainer.trigger_plateau_escape()

    elif anomaly_type == "increase":
        # Loss increased from the previous step (fires on every increase)
        # Useful for custom patience counters or early stopping
        global patience_counter
        patience_counter += 1
        if patience_counter > 50:
            print("⛔ 50 loss increases. Stopping early.")
            raise EarlyStop

# Initialize trainer with hook
patience_counter = 0
trainer = OdyssNetTrainer(
    model,
    lr=1e-3,
    anomaly_hook=handle_anomaly
)

# Now train — anomalies trigger automatic responses;
# the hook raises EarlyStop to end the loop cleanly
try:
    for epoch in range(1000):
        loss = trainer.train_batch(x, y, thinking_steps=10)
except EarlyStop:
    print("Training stopped early by the anomaly hook.")
```

**Anomaly types:**
- **"spike":** Sudden violent surge in loss (exploding gradient)
- **"plateau":** Loss stagnated and barely moving over a window
- **"increase":** Loss strictly greater than previous step (fired every time, even 0.0001 increase)

### Loss Oscillating or Unstable

If loss oscillates or training is unstable:

1. **Enable gradient persistence** for smoother optimization:
```python
trainer = OdyssNetTrainer(model, lr=1e-3, gradient_persistence=0.1)
```

2. **Reduce learning rate:**
```python
trainer = OdyssNetTrainer(model, lr=1e-4) # Lower genesis LR
```

3. **Try different initialization** if using tiny networks:
```python
model = OdyssNet(..., weight_init='xavier_uniform', activation='gelu')
```

### Model Not Learning (Loss Stuck)

If loss doesn't decrease at all:

1. **Verify data preprocessing:** Check that inputs/targets are properly normalized and on correct device
2. **Increase thinking steps:** Model may need more temporal depth
```python
trainer.train_batch(x, y, thinking_steps=20) # Was 10
```
3. **Check initialization:** For very small networks (<10 neurons), try:
```python
model = OdyssNet(..., weight_init='xavier_uniform', activation='gelu')
```
4. **Trigger manual plateau escape:**
```python
if epoch > 50 and loss > initial_loss:
    trainer.trigger_plateau_escape()
```

### Performance Issues

If training is too slow:

1. **Enable TF32 on Ampere+ GPUs:**
```python
import torch
torch.set_float32_matmul_precision('high')
```

2. **Compile the model** (PyTorch 2.0+):
```python
model = model.compile()
```

3. **Use gradient accumulation** instead of larger batches:
```python
# Simulates batch_size=128 with batch_size=32
trainer.train_batch(x, y, thinking_steps=10, gradient_accumulation_steps=4)
```

### Memory Issues

If running out of VRAM:

1. **Reduce batch size** and use gradient accumulation (see the combined sketch after this list)
2. **Enable gradient checkpointing:**
```python
model = OdyssNet(..., gradient_checkpointing=True)
```
3. **Use vocab projection** for high-dimensional inputs:
```python
# Instead of num_neurons=784 for MNIST
model = OdyssNet(num_neurons=10, vocab_size=[784, 10])
```
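
Putting these together, here is a minimal sketch of a VRAM-constrained run. It assumes `small_batches` is your own batch-size-32 data loader; the other values are taken from the examples above:

```python
# Combined memory-saving setup (illustrative values)
model = OdyssNet(
    num_neurons=10,
    vocab_size=[784, 10],            # project 784-dim inputs instead of using 784 neurons
    gradient_checkpointing=True,     # recompute activations instead of storing them
)
trainer = OdyssNetTrainer(model, lr=1e-3)

# Small per-step batches accumulated to an effective batch of 128
for x, y in small_batches:
    loss = trainer.train_batch(x, y, thinking_steps=10, gradient_accumulation_steps=4)
```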

## 🔧 Library Contributions (`odyssnet/`)

When modifying the library itself (not examples), follow these additional rules: