Feature Pack: LR Finder, Advanced Schedulers, Metric Logging & Run Management improvements#463

Open
fizista wants to merge 7 commits into tdrussell:main from fizista:main
fizista commented Nov 27, 2025

🚀 Summary
This PR introduces several significant enhancements aimed at improving the training workflow, debugging capabilities, and scheduler flexibility. It includes a native Learning Rate Finder, improved memory monitoring, dynamic scheduler loading, and better run management.

✨ Key Changes

  1. 🔍 Learning Rate Finder
    Implemented a fully functional LR Finder based on the "Linear/Exponential Range Test" method.
  • New CLI Arguments: --lr_finder, --lr_finder_steps, --lr_finder_start, --lr_finder_end.
  • Behavior: Runs a short simulation before the main training loop, increasing LR exponentially.
  • Visualization: Generates a matplotlib chart saved directly to TensorBoard (Images tab).
  • Safety: Exits automatically after the test without overwriting checkpoints.
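The exponential sweep behind an LR range test is simple to state: learning rates are spaced evenly in log-space between the start and end bounds. A minimal sketch of that schedule, with parameters mirroring `--lr_finder_start`, `--lr_finder_end`, and `--lr_finder_steps` (the real implementation also runs forward/backward passes and records the loss at each step, which is omitted here):

```python
import math

def lr_finder_schedule(start: float, end: float, steps: int) -> list[float]:
    """Exponentially spaced learning rates from `start` to `end`.

    Step i uses lr = start * (end / start) ** (i / (steps - 1)),
    i.e. the sweep is linear in log-space.
    """
    ratio = end / start
    return [start * ratio ** (i / (steps - 1)) for i in range(steps)]

# The first and last points hit the requested bounds exactly.
lrs = lr_finder_schedule(1e-7, 1e-1, 100)
```

Plotting the recorded loss against these LRs (log x-axis) gives the usual range-test curve from which a good starting LR is read off.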
  2. 📉 Scheduler Improvements
  • New Scheduler: Added CosineAnnealingWarmRestartsDecay. This fixes a limitation in the standard PyTorch implementation by adding a gamma parameter to decay the maximum LR after every restart cycle.
  • Dynamic Loading: Added support for type = 'external' in [lr_scheduler_config]. This allows loading any scheduler class dynamically (e.g., from timm, transformers, or standard torch) via a config string (e.g., class = 'torch.optim.lr_scheduler.StepLR').
  • Flexible Warmup: warmup_steps now accepts float values (0.0 - 1.0) to represent a percentage of total steps, in addition to absolute integer values.
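The decayed warm-restart schedule and the fractional-warmup rule can both be expressed without PyTorch. A pure-Python sketch of the math (function names are illustrative, not the PR's actual API): within restart cycle k, the peak LR is `base_lr * gamma ** k`, and a float `warmup_steps` in (0, 1) is interpreted as a fraction of total steps:

```python
import math

def decayed_warm_restart_lr(step, base_lr, T_0, eta_min=0.0, gamma=1.0, T_mult=1):
    """LR at `step` for cosine annealing with warm restarts, where the
    peak LR is multiplied by `gamma` after every restart."""
    # Locate the current cycle and the position within it.
    cycle, t_cur, T_i = 0, step, T_0
    while t_cur >= T_i:
        t_cur -= T_i
        T_i *= T_mult
        cycle += 1
    eta_max = base_lr * gamma ** cycle  # decayed peak for this cycle
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T_i)) / 2

def resolve_warmup_steps(warmup_steps, total_steps):
    """Interpret a float in (0.0, 1.0) as a fraction of total steps."""
    if isinstance(warmup_steps, float) and 0.0 < warmup_steps < 1.0:
        return int(warmup_steps * total_steps)
    return int(warmup_steps)
```

With `gamma = 0.9`, the second cycle peaks at 90% of the base LR, the third at 81%, and so on, which is the spike-free decay described above.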
  3. 📊 Enhanced Logging & Metrics
  • Memory Monitoring: Added logging for System RAM (memory/ram_main_process_gb) and GPU VRAM (memory/vram_allocated_gb, memory/vram_reserved_gb, memory/vram_peak_gb).
  • Parameter Norm: Added train/total_param_norm to monitor weight magnitudes during training.
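How the PR collects these values internally isn't shown here; a rough sketch of gathering comparable metrics under the same names, assuming a Unix platform (the stdlib `resource` module for RAM; the VRAM counters are included only when torch reports CUDA available):

```python
import resource
import sys

def collect_memory_metrics() -> dict[str, float]:
    """Gather RAM/VRAM metrics under the PR's metric names.

    RAM is the process peak RSS from the Unix `resource` module;
    VRAM metrics are added only when CUDA is available.
    """
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    scale = 1 if sys.platform == 'darwin' else 1024
    peak_rss_bytes = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * scale
    metrics = {'memory/ram_main_process_gb': peak_rss_bytes / 1024 ** 3}
    try:
        import torch
        if torch.cuda.is_available():
            metrics['memory/vram_allocated_gb'] = torch.cuda.memory_allocated() / 1024 ** 3
            metrics['memory/vram_reserved_gb'] = torch.cuda.memory_reserved() / 1024 ** 3
            metrics['memory/vram_peak_gb'] = torch.cuda.max_memory_allocated() / 1024 ** 3
    except ImportError:
        pass  # torch not installed; log RAM only
    return metrics

metrics = collect_memory_metrics()
```

Each key/value pair maps directly onto a TensorBoard scalar, so the dict can be passed to the existing logging path unchanged.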
  4. 📂 Run Management
  • Run ID: Added --run_id argument. This allows appending a custom suffix to the run_dir (e.g., 20241127_120000__my_experiment_v1), making it easier to identify specific experiments on disk and in TensorBoard.

⚙️ Configuration Examples

Using the new Cosine Decay Scheduler:

```toml
[lr_scheduler_config]
type = 'cosine_annealing_warm_restarts_decay'
T_0 = 1000
T_mult = 1
eta_min = 1e-7
gamma = 0.9  # Decay factor for restarts
```

Using an External Scheduler:

```toml
[lr_scheduler_config]
type = 'external'
class = 'torch.optim.lr_scheduler.StepLR'
step_size = 30
gamma = 0.1
```
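One plausible way `type = 'external'` can resolve the `class` string is via `importlib` (the helper name is illustrative; the PR's actual resolution code may differ):

```python
import importlib

def load_class(dotted_path: str):
    """Resolve a dotted path like 'torch.optim.lr_scheduler.StepLR'
    to the class (or callable) it names."""
    module_path, _, attr_name = dotted_path.rpartition('.')
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)

# The remaining [lr_scheduler_config] keys would then become constructor
# kwargs, e.g.: load_class(config.pop('class'))(optimizer, **config)
```

This is what makes schedulers from `timm`, `transformers`, or plain `torch` usable without the trainer knowing about them in advance.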

🧪 Verification

  • Tested lr_finder on Wan 2.2 14B model; generated charts correctly in TensorBoard.
  • Verified that CosineAnnealingWarmRestartsDecay correctly reduces the base LR after cycle completion without spikes.
  • Confirmed VRAM logging works correctly on RTX 3090.
  • Verified that existing configs (using standard schedulers) continue to work.
