Your PyTorch training is hiding problems. Here's what you're missing:
- GPU sitting idle while DataLoader struggles
- Memory leaking 100MB per epoch
- Loss about to explode (but you'll only know in 2 hours)
TrainWatch shows you all of this in real time. One line of code.
```bash
pip install trainwatch
```

```python
from trainwatch import Watcher

watcher = Watcher()

for epoch in range(epochs):
    for images, labels in dataloader:
        loss = train_step(images, labels)
        watcher.step(loss=loss.item())
    watcher.epoch_end()
```

That's it. You now see:
- Step timing
- Loss trends (moving average)
- CPU/RAM usage
- GPU VRAM tracking
- Real-time warnings
```
Step 10 | loss=2.3045 | time=0.234s | CPU=45.2% | RAM=23.1% | VRAM=1024MB
Step 20 | loss=2.1234 | time=0.231s | CPU=46.1% | RAM=23.4% | VRAM=1025MB
Step 30 | loss=1.9876 | time=0.229s | CPU=44.8% | RAM=23.6% | VRAM=1026MB
⚠️ WARNING: Loss variance spike detected - training may be unstable
Step 40 | loss=2.5432 | time=0.235s | CPU=45.5% | RAM=23.8% | VRAM=1027MB
============================================================
Epoch 1 Summary:
Loss (avg): 2.1204 [decreasing]
VRAM delta: +3.2MB
============================================================
```
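The quickstart treats `train_step` as your own code; TrainWatch only needs the scalar loss you pass to `watcher.step()`. For reference, here is a minimal sketch of such a function, assuming a hypothetical classification setup (the model, optimizer, and criterion below are placeholders, not part of TrainWatch):

```python
import torch
import torch.nn as nn

# Placeholder setup -- swap in your own model and hyperparameters.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(images, labels):
    """Run one optimization step and return the loss tensor for watcher.step()."""
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss
```

Any function that returns the batch loss works; TrainWatch only sees the float you hand to `watcher.step(loss=...)`.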
```python
watcher = Watcher(
    window=20,                # Moving average window (default: 20)
    print_every=10,           # Print every N steps (default: 10)
    show_gpu=True,            # Show GPU metrics (default: True)
    warn_on_leak=True,        # Warn on memory leaks (default: True)
    warn_on_bottleneck=True,  # Warn on DataLoader issues (default: True)
    warn_on_variance=True,    # Warn on loss spikes (default: True)
    device='cuda:0'           # GPU device (default: 'cuda:0')
)
```

| Metric | What It Tells You |
|---|---|
| Step Time | How fast your training loop runs |
| Loss (Moving Avg) | Smoothed loss trend, easier to read than raw loss |
| Loss Variance | Detects training instability before it crashes |
| CPU/RAM | System load - high RAM often means DataLoader issues |
| GPU VRAM | Memory usage - tracks leaks across epochs |
| VRAM Delta | Memory increase per epoch - positive = leak |
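To make the VRAM delta metric concrete: the idea can be reproduced by hand with PyTorch's allocator stats. This is only an illustration of the concept, not TrainWatch's internal code:

```python
import torch

def vram_mb() -> float:
    """Currently allocated CUDA memory in MB (0 if no GPU is available)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated() / (1024 ** 2)

before = vram_mb()
# ... train for one epoch here ...
delta = vram_mb() - before
print(f"VRAM delta: {delta:+.1f}MB")  # steadily positive deltas across epochs suggest a leak
```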
TrainWatch gives you 3 critical warnings:
Loss variance spike - your loss is jumping around wildly and training might diverge.
Likely cause: learning rate too high, a bad batch, or a data issue
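If the cause is an overly high learning rate or the occasional bad batch, two standard PyTorch countermeasures are a lower learning rate and gradient clipping. A hedged sketch with a stand-in model (none of this is TrainWatch API):

```python
import torch
import torch.nn as nn

# Stand-in model/optimizer; replace with your own.
model = nn.Linear(32, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # try a lower learning rate first

def stable_train_step(inputs, targets):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # Clip gradients so a single bad batch can't blow up the weights.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss
```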
Memory leak - VRAM increasing +50MB per epoch.
Likely cause: tensors not released, gradients accumulating, or lingering Python references
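The most common "Python refs" version of this is keeping the loss tensor itself, which also keeps its autograd graph (and the GPU memory behind it) alive. A small self-contained illustration of the leaky vs. correct pattern, in the spirit of the memory leak demo further below:

```python
import torch

x = torch.randn(64, 16, requires_grad=True)
history_leaky, history_ok = [], []

for _ in range(3):
    loss = (x ** 2).mean()          # stand-in for a real training loss

    history_leaky.append(loss)      # leaky: the tensor pins the whole graph in memory
    history_ok.append(loss.item())  # fixed: .item() stores a plain Python float instead
```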
DataLoader bottleneck - the GPU sits idle while waiting for data.
Likely cause: num_workers=0, slow data augmentation, or I/O issues
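For the `num_workers=0` case, a typical starting point looks like the sketch below; the exact values depend on your machine and dataset, so treat them as assumptions rather than recommendations:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; replace with your own.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

dataloader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,            # load batches in background workers instead of the training loop
    pin_memory=True,          # faster host-to-GPU copies when training on CUDA
    persistent_workers=True,  # keep workers alive across epochs (requires num_workers > 0)
)
```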
✅ Use it when:
- Training a new model for the first time
- Debugging slow or unstable training
- Running long experiments overnight
- You just want peace of mind that your training is healthy
❌ Don't need it when:
- Training is working perfectly (lucky you!)
- You're using WandB/TensorBoard for detailed logging
- You want fancy visualizations (we're CLI-only for now)
- No cloud required - everything runs locally
- No fancy UI - just clean terminal output
- No metric logging - just real-time monitoring (v0.2 will add CSV export; a DIY sketch follows this list)
- No distributed training - single GPU only for now
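Until the v0.2 CSV export ships, a do-it-yourself log next to the watcher is a few lines of user code (reusing the `dataloader`, `train_step`, and `watcher` names from the quickstart; nothing here is a TrainWatch API):

```python
import csv

with open('train_log.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['step', 'loss'])
    for step, (images, labels) in enumerate(dataloader):
        loss = train_step(images, labels)
        watcher.step(loss=loss.item())
        writer.writerow([step, loss.item()])
```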
See the examples/ directory for complete working examples:
Perfect for getting started. Shows basic TrainWatch integration with a simple 2-layer CNN.
```bash
python examples/cifar10_simple.py
```

Tested on: Kaggle CPU, GPU T4, GPU P100
Training time: ~2 min (GPU)
Results: examples/cifar10_results.md
Real PyTorch model from torchvision.models, training from scratch.
```bash
python examples/densenet_cifar10.py
```

Model: DenseNet121 (weights=None, ~7M params)
Image size: 224×224 (CIFAR resized)
VRAM: ~850MB
Shows: Production architecture, gradient clipping
Production-ready example with ResNet-18, data augmentation, and LR scheduling.
```bash
python examples/resnet_fashion_mnist.py
```

Model: 11M parameters
Training time: ~5 min (GPU)
Interactive demo showing memory leak detection in action.
```bash
python examples/memory_leak_demo.py
```

Shows: Intentional leak vs correct implementation
TrainWatch detects the leak and warns you automatically!
Full examples documentation: examples/README.md
All examples tested on Kaggle with real GPUs. Full results in examples/*_results.md.
| Example | GPU | Step Time | Accuracy | VRAM | Notes |
|---|---|---|---|---|---|
| Simple CNN | T4 | ~5ms | 75% | 25MB | 12x faster than CPU |
| Simple CNN | P100 | ~4ms | 75% | 25MB | 15x faster than CPU |
| DenseNet121 | T4 | 331ms | 81.76% | 115MB | 224×224 images |
| DenseNet121 | P100 | 175ms | 82.15% | 115MB | 1.9x faster than T4 |
| ResNet-18 | T4 | 85ms | 92.28% | 147MB | Fashion-MNIST |
| ResNet-18 | P100 | 47ms | 91.86% | 148MB | 1.8x faster than T4 |
| Memory Leak | Both | - | - | +1.2MB | Leak detected! |
- ✅ TrainWatch Overhead: <1ms per step (negligible)
- ✅ Memory Leak Detection: Perfect - caught +1.2MB leak in 3 epochs
- ✅ VRAM Tracking: Accurate across all models (25MB - 4GB range)
- ✅ Cross-GPU Consistency: Identical behavior on T4 and P100
- ✅ No False Positives: 0 false alarms on healthy training runs
Try it yourself: TrainWatch Examples on Kaggle
All examples ready to run with one click! Includes:
- Simple CNN (CPU, T4, P100 tested)
- DenseNet121 (production model)
- ResNet-18 (Fashion-MNIST)
- Memory Leak Demo (educational)
- Python 3.8+
- PyTorch 1.9+
- psutil
- numpy
From PyPI:
```bash
pip install trainwatch
```

From source (for development):

```bash
git clone https://github.com/Hords01/trainwatch.git
cd trainwatch
pip install -e .  # Editable install
```

See examples/cifar10_demo.py for a complete working example.
Found a bug? Have a feature request?
Open an issue or PR on GitHub
Emirkan Beyaz
- Email: emirkanbeyaz01@gmail.com
- LinkedIn: linkedin.com/in/emirkan-beyaz-07732933b
- GitHub: @Hords01
Built with ❤️ for the PyTorch community
MIT License - see LICENSE file
Because watching loss=2.3456 scroll by for 3 hours, only to find out your DataLoader was the bottleneck all along, is painful.
TrainWatch catches problems while you can still fix them.