Reviewing and updating project documentation#22
Conversation
…tools coverage
Agent-Logs-Url: https://github.com/theomgdev/OdyssNet/sessions/19256ec6-d27b-4833-a348-adf2f8504b57
Co-authored-by: theomgdev <29312699+theomgdev@users.noreply.github.com>
Pull request overview
This PR expands the project’s contributor documentation by adding a detailed troubleshooting guide focused on diagnosing and addressing non-converging/unstable training runs.
Changes:
- Added a new “Training Not Converging” section with suggested diagnostics workflows (TrainingHistory, trainer/optimizer diagnostics, anomaly hooks).
- Added practical mitigation tips for oscillating loss, stuck loss, performance slowdowns, and VRAM pressure.
> **Key metrics to monitor:**
> - **frustration:** High values (>100) indicate the optimizer is struggling; may trigger plateau escape
The guidance for frustration value ranges appears incorrect. ChaosGrad._frustration is an EMA in the range ~[0, 1] with a burst threshold at _FRUST_THRESH = 0.75, so suggesting "High values (>100)" will mislead users. Update the text to reflect the actual scale/threshold (e.g., nearing/exceeding ~0.75).
Suggested change:
```diff
- - **frustration:** High values (>100) indicate the optimizer is struggling; may trigger plateau escape
+ - **frustration:** This is an EMA typically in the ~[0, 1] range; values nearing or exceeding ~0.75 indicate the optimizer is struggling and may trigger plateau escape
```
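To make the corrected scale concrete, here is a minimal, self-contained sketch of how a bounded frustration EMA of this kind behaves. The decay factor and update rule are illustrative assumptions, not ChaosGrad's actual implementation; only the ~[0, 1] range and the ~0.75 burst threshold come from the review comment above.

```python
# Illustrative sketch only -- NOT the ChaosGrad implementation.
# A frustration signal kept as an exponential moving average over a
# 0/1 "made no progress" indicator, so it always stays within [0, 1].
FRUST_THRESH = 0.75  # threshold from the review comment (_FRUST_THRESH)
DECAY = 0.9          # hypothetical EMA decay factor

def update_frustration(frustration, loss_improved):
    """EMA over a 0/1 'no progress' signal; result stays in [0, 1]."""
    signal = 0.0 if loss_improved else 1.0
    return DECAY * frustration + (1.0 - DECAY) * signal

frustration = 0.0
for improved in [False] * 25:  # a run of steps with no loss improvement
    frustration = update_frustration(frustration, improved)
    if frustration >= FRUST_THRESH:
        break  # a plateau-escape burst would be considered here
```

With these assumed constants the EMA crosses 0.75 after roughly a dozen consecutive non-improving steps, which is why values "nearing or exceeding ~0.75" are the meaningful signal rather than anything above 1.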
```python
if trainer._using_chaos_grad:
    from odyssnet import ChaosGrad
    chaos_opt = trainer.optimizer

    opt_diag = chaos_opt.get_diagnostics()
```
This example checks trainer._using_chaos_grad, which is a private implementation detail. Prefer using public surface area (e.g., trainer.get_diagnostics()['using_chaos_grad'] or isinstance(trainer.optimizer, ChaosGrad)) so the docs don’t encourage relying on underscored attributes.
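The pattern the comment recommends can be sketched as follows. Since the real odyssnet API is not shown in this thread, the `Trainer` and `ChaosGrad` classes below are minimal stand-ins used only to demonstrate the `isinstance` check on the public `optimizer` attribute; the actual classes may differ.

```python
# Stand-in classes, illustrative only -- the real ChaosGrad and trainer
# come from odyssnet and their interfaces may differ.

class ChaosGrad:  # stand-in for odyssnet.ChaosGrad
    def get_diagnostics(self):
        return {"frustration": 0.12}

class Trainer:  # stand-in for the project's trainer
    def __init__(self, optimizer):
        self.optimizer = optimizer

trainer = Trainer(ChaosGrad())

# Type-check the public optimizer attribute instead of reading a
# private underscored flag like trainer._using_chaos_grad:
opt_diag = None
if isinstance(trainer.optimizer, ChaosGrad):
    opt_diag = trainer.optimizer.get_diagnostics()
```

This keeps the docs on public surface area, so a rename of the internal flag cannot silently break the example.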
2. **Compile the model** (PyTorch 2.0+):
   ```python
   model.compile()
   ```
OdyssNet.compile() returns a compiled model instance (see existing examples using model = model.compile()). Calling model.compile() without assignment here implies it compiles in-place, which is not generally true and may lead to users not actually using the compiled model.
Suggested change:
```diff
- model.compile()
+ model = model.compile()
```
```python
global patience_counter
patience_counter += 1
if patience_counter > 50:
    print(f"⛔ 50 consecutive increases. Early stopping.")
    raise KeyboardInterrupt
```
The anomaly hook example uses raise KeyboardInterrupt for early stopping. Raising KeyboardInterrupt programmatically is unconventional and can interfere with cleanup/exception handling; prefer breaking out of the loop, returning from the training function, or raising a dedicated exception type.
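One way to implement the suggested dedicated-exception approach is sketched below. The exception name, the hook signature, and the driving loop are all illustrative assumptions; only the patience-of-50 behavior comes from the reviewed example.

```python
# Sketch of the reviewer's suggestion: signal early stopping with a
# dedicated exception type instead of KeyboardInterrupt, so normal
# cleanup and Ctrl-C handling are not disturbed. Names are illustrative.

class EarlyStopping(Exception):
    """Raised by an anomaly hook to request a clean end to training."""

PATIENCE = 50

def anomaly_hook(state):
    # Called once per detected loss increase (stand-in for the real hook).
    state["patience_counter"] += 1
    if state["patience_counter"] > PATIENCE:
        raise EarlyStopping(f"{PATIENCE} consecutive increases")

state = {"patience_counter": 0}
stopped = False
reason = ""
try:
    for step in range(1000):  # stand-in training loop
        anomaly_hook(state)   # every step's loss increases in this sketch
except EarlyStopping as exc:
    stopped = True            # the loop exits; cleanup can run normally
    reason = str(exc)
```

Because `EarlyStopping` subclasses `Exception` rather than `BaseException`, a broad `except Exception` cleanup handler in the training loop can still catch it, unlike `KeyboardInterrupt`.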
Pull request created by AI Agent