feat: Add auto-save checkpoint on training interruption #139
facok wants to merge 2 commits into tdrussell:main
Conversation
facok
commented
Mar 8, 2025
- Add signal handlers for Ctrl+C(SIGINT) and SIGTERM
- Auto-save training progress on interruption
- Add exception handling for safe exit
- Support synchronized exit in distributed training
- Prevent duplicate save triggers
- Add a `triggered` flag to the interrupt handler to avoid multiple saves when Ctrl+C is pressed
- Ensure the checkpoint is only saved once during program termination
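Taken together, the two commits amount to a handler along these lines. This is a minimal sketch, assuming the loop polls a `should_save` flag after each step; the `saver` parameter and the exact flag name are assumptions, not the PR's actual identifiers.

```python
import signal

def setup_interrupt_handler(saver):
    # Guard flag so repeated Ctrl+C presses don't queue multiple saves.
    state = {"triggered": False}

    def handler(signum, frame):
        if state["triggered"]:
            return  # a save is already pending; ignore duplicates
        state["triggered"] = True
        # Don't checkpoint inside the handler: just set a flag that the
        # training loop checks after each step, so the save happens at a
        # safe point (and can be synchronized across distributed ranks).
        setup_interrupt_handler.should_save = True

    # Ctrl+C (SIGINT) and polite kills (SIGTERM) both request a save.
    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)

setup_interrupt_handler.should_save = False
```

The training loop then saves when `setup_interrupt_handler.should_save` is set, resets the flag, and exits cleanly.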
```python
num_steps += 1
train_dataloader.sync_epoch()

# Check if checkpoint save is needed
```
Theoretically you could still save a checkpoint twice on the same step, since saver.process_step() could checkpoint on this step. I don't think it matters much (it would just overwrite the checkpoint dir) but it would be nice to gracefully handle this case.
- Modify saver.process_step to return whether a checkpoint was saved.
- Move this block below saver.process_step.
- Add an extra condition to only save the checkpoint if process_step didn't. Make sure to still set the should_save flag to False regardless.
```python
setup_checkpoint_signal.should_save = False
```

```python
def setup_interrupt_handler(saver):
```
I'm fine with always setting up a handler for the USR1 signal to checkpoint. But the interrupt handler should be guarded behind a new parameter in the TOML config file, which is false by default. For testing I am often ctrl+c killing the program, and sometimes you launch training and immediately realize something is wrong and just want to terminate it right away. If my understanding is correct, the current code will always wait until you get through the next step and then save a checkpoint.
An alternative, maybe, would be to set up something where double ctrl+c instantly kills the program. I've seen other tools do that, and I think that would be an okay option as well. Or maybe ctrl+c blocks and waits for user input with a "want to checkpoint [y]/n?" option, and a second ctrl+c at that point instantly kills it?
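The double-Ctrl+C idea could be sketched like this. It is only an illustration of the pattern the reviewer mentions, not code from the PR; the `saver` parameter and `should_save` flag mirror the handler above and are assumed names.

```python
import signal
import sys

def install_double_interrupt_handler(saver):
    """First Ctrl+C requests a checkpoint; a second one exits immediately."""
    state = {"hits": 0}

    def handler(signum, frame):
        state["hits"] += 1
        if state["hits"] >= 2:
            # Second interrupt: restore default handling and quit right away.
            signal.signal(signal.SIGINT, signal.SIG_DFL)
            sys.exit(1)
        print("Interrupt received; will checkpoint after this step. "
              "Press Ctrl+C again to quit immediately.")
        install_double_interrupt_handler.should_save = True

    signal.signal(signal.SIGINT, handler)

install_double_interrupt_handler.should_save = False
```

This keeps the graceful path as the default while still giving an instant escape hatch, which matches how tools like pytest and docker-compose treat a repeated interrupt.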
Hi, I'm a noob here. I've successfully trained 5 LoRAs with diffusion-pipe, but now I can't, and I don't know why. During my last successful run the training was interrupted partway through. I later resumed from the previous checkpoint and it finished successfully, but since then I can't train any new LoRA: it just gets stuck. What happened? Thanks in advance. I got this error: Traceback (most recent call last):