
feat: Add auto-save checkpoint on training interruption #139

Open
facok wants to merge 2 commits into tdrussell:main from facok:main

Conversation

@facok facok commented Mar 8, 2025

  • Add signal handlers for Ctrl+C (SIGINT) and SIGTERM
  • Auto-save training progress on interruption
  • Add exception handling for safe exit
  • Support synchronized exit in distributed training
  • Prevent duplicate save triggers
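
The mechanism described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `setup_interrupt_handler` and `should_save` names appear in the diff below, but the body here is a reconstruction, and `saver` is left unused.

```python
import signal


def setup_interrupt_handler(saver):
    """Install SIGINT/SIGTERM handlers that request a checkpoint save.

    Sketch only: in the PR the training loop checks the should_save
    flag once per step and calls into `saver`; here the handler just
    sets the flag and ignores repeated signals.
    """
    triggered = {"value": False}

    def handler(signum, frame):
        if triggered["value"]:
            return  # a save is already pending; prevent duplicate triggers
        triggered["value"] = True
        setup_interrupt_handler.should_save = True

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)


# Flag polled by the training loop at the end of each step.
setup_interrupt_handler.should_save = False
```

Deferring the actual save to the end of the current step (rather than saving inside the handler) keeps the handler async-signal-safe and, in distributed training, lets all ranks reach the save point together.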

facok added 2 commits March 8, 2025 19:29
- Add signal handlers for Ctrl+C (SIGINT) and SIGTERM
- Auto-save training progress on interruption
- Add exception handling for safe exit
- Support synchronized exit in distributed training
- Prevent duplicate save triggers
- Add triggered flag to interrupt handler to avoid multiple saves when Ctrl+C is pressed
- Ensure checkpoint is only saved once during program termination
num_steps += 1
train_dataloader.sync_epoch()

# Check if checkpoint save is needed
Owner

Theoretically you could still save a checkpoint twice on the same step, since saver.process_step() could checkpoint on this step. I don't think it matters much (it would just overwrite the checkpoint dir) but it would be nice to gracefully handle this case.

  1. Modify saver.process_step to return whether a checkpoint was saved.
  2. Move this block below saver.process_step.
  3. Add an extra condition to only save the checkpoint if process_step didn't. Make sure to still set the should_save flag to False regardless.
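
A sketch of that reordering, with hypothetical stand-ins (`Saver`, `end_of_step`, `request_save`, and a module-level `should_save` flag are illustrative names, not the repository's API):

```python
class Saver:
    """Minimal stand-in for the trainer's checkpoint saver."""

    def __init__(self, save_every):
        self.save_every = save_every
        self.saves = []

    def process_step(self, step):
        # Step 1: report whether a periodic checkpoint was written.
        if step % self.save_every == 0:
            self.save_checkpoint(step)
            return True
        return False

    def save_checkpoint(self, step):
        self.saves.append(step)


should_save = False  # set by the interrupt handler


def request_save():
    global should_save
    should_save = True


def end_of_step(saver, step):
    global should_save
    # Step 2: the interrupt check now runs after process_step.
    saved = saver.process_step(step)
    # Step 3: only save if process_step didn't already.
    if should_save and not saved:
        saver.save_checkpoint(step)
    # Clear the flag regardless of which path saved.
    should_save = False
```

With this ordering, an interrupt arriving on a step that was going to checkpoint anyway results in a single save instead of two writes to the same checkpoint dir.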

setup_checkpoint_signal.should_save = False


def setup_interrupt_handler(saver):
Owner

I'm fine with always setting up a handler for the USR1 signal to checkpoint. But the interrupt handler should be guarded behind a new parameter in the TOML config file which is false by default. For testing things I am often ctrl+c killing the program. And sometimes you launch training and immediately realize something is wrong and just want to terminate it immediately. If my understanding is correct the current code will always wait until you get through the next step, and then save a checkpoint.

An alternative, maybe, would be to set up something where double ctrl+c instantly kills the program. I've seen other tools do that, and I think that would be an okay option as well. Or maybe ctrl+c blocks and waits for user input with a "want to checkpoint [y]/n?" option, and a second ctrl+c at that point instantly kills it?
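
The "double Ctrl+C kills instantly" alternative could look roughly like this. Again a hypothetical sketch (`install_double_interrupt_handler` is not a name from the PR): the first SIGINT requests a checkpoint, a second one exits immediately.

```python
import signal
import sys


def install_double_interrupt_handler():
    """First Ctrl+C requests a checkpoint; second Ctrl+C exits at once."""
    state = {"interrupted": False}

    def handler(signum, frame):
        if state["interrupted"]:
            # Second Ctrl+C: terminate immediately, no checkpoint.
            sys.exit(1)
        state["interrupted"] = True
        print("Interrupt received; will checkpoint after this step. "
              "Press Ctrl+C again to quit immediately.")

    signal.signal(signal.SIGINT, handler)
    return state
```

This keeps the fast-kill escape hatch without needing a config flag, though a `[y]/n` prompt as suggested above would require reading stdin from the handler's caller rather than the handler itself.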


birunram commented May 18, 2025

Hi. I'm a noob here. I've successfully trained 5 LoRAs with this diffusion pipe, but now I can't, and I don't know why. During my last successful run, training was interrupted in the middle; I later resumed from the previous checkpoint and finished successfully. But since then I can't train any new LoRA at all; it just gets stuck. What happened? Thanks in advance.

I got this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/dist-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/content/diffusion-pipe/utils/dataset.py", line 741, in _cache_fn
    ds.cache_latents(latents_map_fn, regenerate_cache=regenerate_cache, caching_batch_size=caching_batch_size)
  File "/content/diffusion-pipe/utils/dataset.py", line 687, in cache_latents
    ds.cache_latents(map_fn, regenerate_cache=regenerate_cache, caching_batch_size=caching_batch_size)
  File "/content/diffusion-pipe/utils/dataset.py", line 561, in cache_latents
    ds.cache_latents(map_fn, regenerate_cache=regenerate_cache, caching_batch_size=caching_batch_size)
  File "/content/diffusion-pipe/utils/dataset.py", line 243, in cache_latents
    ds.cache_latents(map_fn, regenerate_cache=regenerate_cache, caching_batch_size=caching_batch_size)
  File "/content/diffusion-pipe/utils/dataset.py", line 135, in cache_latents
    for example in self.latent_dataset.select_columns(['image_file', 'caption']):
  File "/usr/local/lib/python3.11/dist-packages/datasets/arrow_dataset.py", line 2384, in __iter__
    formatted_output = format_table(
  File "/usr/local/lib/python3.11/dist-packages/datasets/formatting/formatting.py", line 629, in format_table
    return formatter(pa_table, query_type=query_type)
  File "/usr/local/lib/python3.11/dist-packages/datasets/formatting/formatting.py", line 396, in __call__
    return self.format_row(pa_table)
  File "/usr/local/lib/python3.11/dist-packages/datasets/formatting/torch_formatter.py", line 88, in format_row
    row = self.numpy_arrow_extractor().extract_row(pa_table)
  File "/usr/local/lib/python3.11/dist-packages/datasets/formatting/formatting.py", line 158, in extract_row
    return _unnest(self.extract_batch(pa_table))
  File "/usr/local/lib/python3.11/dist-packages/datasets/formatting/formatting.py", line 164, in extract_batch
    return {col: self._arrow_array_to_numpy(pa_table[col]) for col in pa_table.column_names}
  File "/usr/local/lib/python3.11/dist-packages/datasets/formatting/formatting.py", line 164, in <dictcomp>
    return {col: self._arrow_array_to_numpy(pa_table[col]) for col in pa_table.column_names}
  File "/usr/local/lib/python3.11/dist-packages/datasets/formatting/formatting.py", line 196, in _arrow_array_to_numpy
    return np.array(array, copy=False)
ValueError: Unable to avoid copy while creating an array as requested.
If using np.array(obj, copy=False) replace it with np.asarray(obj) to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.
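
For context on that error (not related to this PR's changes): it is the NumPy 2.0 behavior change for `copy=False`, hit by the installed `datasets` version, whose `_arrow_array_to_numpy` still calls `np.array(array, copy=False)`. Upgrading `datasets` or pinning `numpy<2` typically resolves it. A minimal sketch of the migration the error message itself recommends, with a hypothetical helper name:

```python
import numpy as np


def to_numpy(obj):
    """NumPy 2.x-safe conversion.

    np.asarray copies only when it must, which is the documented
    replacement for np.array(obj, copy=False). Under NumPy 2.x the
    latter raises ValueError whenever a copy cannot be avoided, e.g.
    when converting a plain Python list.
    """
    return np.asarray(obj)
```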

3 participants