Feature Pack: LR Finder, Advanced Schedulers, Metric Logging & Run Management improvements#463

Open
fizista wants to merge 7 commits into tdrussell:main from fizista:main
fizista commented Nov 27, 2025

🚀 Summary
This PR introduces several significant enhancements aimed at improving the training workflow, debugging capabilities, and scheduler flexibility. It includes a native Learning Rate Finder, improved memory monitoring, dynamic scheduler loading, and better run management.

✨ Key Changes

  1. 🔍 Learning Rate Finder
    Implemented a fully functional LR Finder based on the "Linear/Exponential Range Test" method.
  • New CLI Arguments: --lr_finder, --lr_finder_steps, --lr_finder_start, --lr_finder_end.
  • Behavior: Runs a short simulation before the main training loop, increasing LR exponentially.
  • Visualization: Generates a matplotlib chart saved directly to TensorBoard (Images tab).
  • Safety: Exits automatically after the test without overwriting checkpoints.
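The exponential sweep behind an LR range test is simple to state: learning rates are spaced evenly in log-space between the start and end bounds. A minimal sketch of that schedule, with parameters mirroring `--lr_finder_start`, `--lr_finder_end`, and `--lr_finder_steps` (the real implementation also runs forward/backward passes and records the loss at each step, which is omitted here):

```python
import math

def lr_finder_schedule(start: float, end: float, steps: int) -> list[float]:
    """Exponentially spaced learning rates from `start` to `end`.

    Step i uses lr = start * (end / start) ** (i / (steps - 1)),
    i.e. the sweep is linear in log-space.
    """
    ratio = end / start
    return [start * ratio ** (i / (steps - 1)) for i in range(steps)]

# The first and last points hit the requested bounds exactly.
lrs = lr_finder_schedule(1e-7, 1e-1, 100)
```

Plotting the recorded loss against these LRs (log x-axis) gives the usual range-test curve from which a good starting LR is read off.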
  2. 📉 Scheduler Improvements
  • New Scheduler: Added CosineAnnealingWarmRestartsDecay. This fixes a limitation in the standard PyTorch implementation by adding a gamma parameter to decay the maximum LR after every restart cycle.
  • Dynamic Loading: Added support for type = 'external' in [lr_scheduler_config]. This allows loading any scheduler class dynamically (e.g., from timm, transformers, or standard torch) via a config string (e.g., class = 'torch.optim.lr_scheduler.StepLR').
  • Flexible Warmup: warmup_steps now accepts float values (0.0 - 1.0) to represent a percentage of total steps, in addition to absolute integer values.
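The decayed warm-restart schedule and the fractional-warmup rule can both be expressed without PyTorch. A pure-Python sketch of the math (function names are illustrative, not the PR's actual API): within restart cycle k, the peak LR is `base_lr * gamma ** k`, and a float `warmup_steps` in (0, 1) is interpreted as a fraction of total steps:

```python
import math

def decayed_warm_restart_lr(step, base_lr, T_0, eta_min=0.0, gamma=1.0, T_mult=1):
    """LR at `step` for cosine annealing with warm restarts, where the
    peak LR is multiplied by `gamma` after every restart."""
    # Locate the current cycle and the position within it.
    cycle, t_cur, T_i = 0, step, T_0
    while t_cur >= T_i:
        t_cur -= T_i
        T_i *= T_mult
        cycle += 1
    eta_max = base_lr * gamma ** cycle  # decayed peak for this cycle
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T_i)) / 2

def resolve_warmup_steps(warmup_steps, total_steps):
    """Interpret a float in (0.0, 1.0) as a fraction of total steps."""
    if isinstance(warmup_steps, float) and 0.0 < warmup_steps < 1.0:
        return int(warmup_steps * total_steps)
    return int(warmup_steps)
```

With `gamma = 0.9`, the second cycle peaks at 90% of the base LR, the third at 81%, and so on, which is the spike-free decay described above.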
  3. 📊 Enhanced Logging & Metrics
  • Memory Monitoring: Added logging for System RAM (memory/ram_main_process_gb) and GPU VRAM (memory/vram_allocated_gb, memory/vram_reserved_gb, memory/vram_peak_gb).
  • Parameter Norm: Added train/total_param_norm to monitor weight magnitudes during training.
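How the PR collects these values internally isn't shown here; a rough sketch of gathering comparable metrics under the same names, assuming a Unix platform (the stdlib `resource` module for RAM; the VRAM counters are included only when torch reports CUDA available):

```python
import resource
import sys

def collect_memory_metrics() -> dict[str, float]:
    """Gather RAM/VRAM metrics under the PR's metric names.

    RAM is the process peak RSS from the Unix `resource` module;
    VRAM metrics are added only when CUDA is available.
    """
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    scale = 1 if sys.platform == 'darwin' else 1024
    peak_rss_bytes = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * scale
    metrics = {'memory/ram_main_process_gb': peak_rss_bytes / 1024 ** 3}
    try:
        import torch
        if torch.cuda.is_available():
            metrics['memory/vram_allocated_gb'] = torch.cuda.memory_allocated() / 1024 ** 3
            metrics['memory/vram_reserved_gb'] = torch.cuda.memory_reserved() / 1024 ** 3
            metrics['memory/vram_peak_gb'] = torch.cuda.max_memory_allocated() / 1024 ** 3
    except ImportError:
        pass  # torch not installed; log RAM only
    return metrics

metrics = collect_memory_metrics()
```

Each key/value pair maps directly onto a TensorBoard scalar, so the dict can be passed to the existing logging path unchanged.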
  4. 📂 Run Management
  • Run ID: Added --run_id argument. This allows appending a custom suffix to the run_dir (e.g., 20241127_120000__my_experiment_v1), making it easier to identify specific experiments on disk and in TensorBoard.

⚙️ Configuration Examples

Using the new Cosine Decay Scheduler:

```toml
[lr_scheduler_config]
type = 'cosine_annealing_warm_restarts_decay'
T_0 = 1000
T_mult = 1
eta_min = 1e-7
gamma = 0.9  # Decay factor for restarts
```

Using an External Scheduler:

```toml
[lr_scheduler_config]
type = 'external'
class = 'torch.optim.lr_scheduler.StepLR'
step_size = 30
gamma = 0.1
```
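One plausible way `type = 'external'` can resolve the `class` string is via `importlib` (the helper name is illustrative; the PR's actual resolution code may differ):

```python
import importlib

def load_class(dotted_path: str):
    """Resolve a dotted path like 'torch.optim.lr_scheduler.StepLR'
    to the class (or callable) it names."""
    module_path, _, attr_name = dotted_path.rpartition('.')
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)

# The remaining [lr_scheduler_config] keys would then become constructor
# kwargs, e.g.: load_class(config.pop('class'))(optimizer, **config)
```

This is what makes schedulers from `timm`, `transformers`, or plain `torch` usable without the trainer knowing about them in advance.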

🧪 Verification

  • Tested lr_finder on Wan 2.2 14B model; generated charts correctly in TensorBoard.
  • Verified that CosineAnnealingWarmRestartsDecay correctly reduces the base LR after cycle completion without spikes.
  • Confirmed VRAM logging works correctly on RTX 3090.
  • Verified that existing configs (using standard schedulers) continue to work.
