Dl hardening by jacksonjacobs1 · Pull Request #138 · choosehappy/QuickAnnotator

jacksonjacobs1 · 2026-05-11T16:27:07Z

QuickAnnotator will now save the last N model checkpoints for each class, and will automatically load the latest checkpoint.

Additionally, we ensure that all loss output is wrapped in safe_loss to prevent model corruption due to NaN/inf values.
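For readers unfamiliar with this kind of guard, here is a minimal sketch of the idea. It is not the project's `safe_loss`, whose implementation is not shown in this thread. The names `guard_loss` and `training_step` are hypothetical, and plain floats stand in for tensors so the example stays self-contained.

```python
import math
from typing import List, Optional


def guard_loss(loss_value: float) -> Optional[float]:
    """Return the loss if it is finite, otherwise None.

    Hypothetical helper for illustration only.
    """
    if math.isfinite(loss_value):
        return loss_value
    return None


def training_step(loss_value: float, weights: List[float],
                  grad: List[float], lr: float = 0.1) -> bool:
    """Apply one SGD-style update in place; skip it when the loss is NaN/inf.

    Returns True if the weights were updated, False if the step was skipped.
    """
    if guard_loss(loss_value) is None:
        # A non-finite loss would propagate NaN/inf into the weights,
        # so the update is dropped and the weights stay as they were.
        return False
    for i, g in enumerate(grad):
        weights[i] -= lr * g
    return True
```

Skipping the update is only one possible policy; substituting a zero loss or aborting training are alternatives this sketch does not cover.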

- save the last N checkpoints
- load latest checkpoint

choosehappy

overall good, but take a look at these thoughts - they describe a pattern that could simplify things and ensure better consistency. does it fit here?

choosehappy · 2026-05-11T18:01:06Z

+        if not checkpoint_files:
+            return None
+        latest_checkpoint = max(checkpoint_files, key=lambda f: os.path.getctime(os.path.join(savepath, f)))
+        return os.path.join(savepath, latest_checkpoint)


i see here you're using ctime to define the ordering of files. this is an implicit ordering rather than an explicit one. if, e.g., things get restored from a backup, they may all get the same ctime, and then this measure becomes unreliable. however, you are also the one writing the filename, which allows for an explicit ordering - why not name the files so that the filename itself gives the correct order, e.g. with a timestamp prefix like yyyymmddhhmm, which then sorts correctly as a plain string?
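A small sketch of the naming scheme suggested above. The helper `timestamped_name` and the base name `checkpoint.pth` are hypothetical (the actual value of `constants.CHECKPOINT_FILENAME` is not shown in this thread), and seconds are added to the prefix as an assumption to reduce the chance of two checkpoints sharing a name.

```python
from datetime import datetime


def timestamped_name(base_name: str, when: datetime) -> str:
    """Prefix base_name with a zero-padded YYYYMMDDHHMMSS timestamp."""
    return f"{when.strftime('%Y%m%d%H%M%S')}_{base_name}"


names = [
    timestamped_name("checkpoint.pth", datetime(2026, 5, 11, 18, 1, 6)),
    timestamped_name("checkpoint.pth", datetime(2025, 12, 31, 23, 59, 0)),
    timestamped_name("checkpoint.pth", datetime(2026, 5, 11, 9, 30, 0)),
]

# Plain string sorting yields chronological order because every field is
# fixed-width and ordered from most to least significant.
ordered = sorted(names)
latest = max(names)
```

Because the ordering lives in the name, it survives copies and backup restores that reset file timestamps.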

choosehappy · 2026-05-11T18:07:50Z

+        savepath = self.get_class_checkpoint_path(annotation_class_id)
+        if not os.path.exists(savepath):
+            return
+        checkpoint_files = [os.path.basename(f) for f in glob.glob(os.path.join(savepath, f"*{constants.CHECKPOINT_FILENAME}"))]


this pattern feels a bit clunky - why not use a deque?

it gets populated at startup, and then it maintains itself organically?

    from collections import deque
    import os

    class RollingFileQueue:
        def __init__(self, maxlen=10):
            self.queue = deque(maxlen=maxlen)

        def push(self, filename):
            if len(self.queue) == self.queue.maxlen:
                # The oldest file is about to be evicted
                oldest = self.queue[0]
                os.remove(oldest)
            self.queue.append(filename)

        def pop(self):
            return self.queue.popleft()

btw, i realized that if you want to do it via strings, the strings sort themselves - an easier way, similar concept

    import bisect
    import os

    class RollingFilenameQueue:
        def __init__(self, maxlen=10):
            self.queue = []
            self.maxlen = maxlen

        def push(self, filename):
            bisect.insort(self.queue, filename)  # insert in lexicographic sorted order
            if len(self.queue) > self.maxlen:
                oldest = self.queue.pop(0)  # evict the lexicographically smallest (oldest date)
                os.remove(oldest)

        def pop(self):
            return self.queue.pop(0)  # FIFO: take the oldest (smallest) first

        def peek(self):
            return self.queue[0] if self.queue else None

        def __len__(self):
            return len(self.queue)

        def __repr__(self):
            return f"RollingFilenameQueue({self.queue})"

Why this works with your filenames
Since your filenames are YYYYMMDDHHMM, lexicographic order is chronological order — so bisect.insort naturally keeps them oldest → newest:
["202501010800", "202501011200", "202501011600"]
oldest → → → → → → → → newest
When the queue exceeds 10, pop(0) evicts the oldest file and deletes it from disk.

jacksonjacobs1 · 2026-05-12

Definitely agree that sorting should be done by filename rather than ctime.

I'm not against the deque structure, but I think it should be a thin client for the file system rather than an in-memory structure. Otherwise it might fall out of sync if the file system changes while the application is still running (e.g., a user deletes checkpoints manually).

Thoughts?
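One way to read the "thin client" idea is sketched below: nothing is cached, and every call re-reads the directory, so manual deletions are picked up immediately. This is an illustration under assumptions, not the PR's code. `CHECKPOINT_SUFFIX` is a hypothetical stand-in for `constants.CHECKPOINT_FILENAME`, and the functions assume timestamp-prefixed filenames that sort chronologically.

```python
import glob
import os
from typing import List, Optional

# Hypothetical stand-in for constants.CHECKPOINT_FILENAME.
CHECKPOINT_SUFFIX = "checkpoint.pth"


def list_checkpoints(savepath: str) -> List[str]:
    """Return checkpoint paths oldest-first, read fresh from disk on every call."""
    # All matches share one directory, so sorting full paths orders by filename.
    return sorted(glob.glob(os.path.join(savepath, f"*{CHECKPOINT_SUFFIX}")))


def get_latest_checkpoint(savepath: str) -> Optional[str]:
    """Return the newest checkpoint path by filename, or None if there is none."""
    checkpoints = list_checkpoints(savepath)
    return checkpoints[-1] if checkpoints else None
```

The trade-off is a directory listing and a sort on each call, in exchange for never disagreeing with what is actually on disk.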

choosehappy · 2026-05-12

that's okay for me as well - just note that it comes with the overhead of having to do a sort every time a checkpoint is created. that said, if there are only e.g., 5 or 10 checkpoints, the computational overhead is trivial, and your preference for direct alignment with reality is preferred.

however, in that case, a deque is overkill : ) instead it should be: save checkpoint -> glob -> sort (newest first) -> delete everything in checkpoints[max_checkpoints:]
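The save -> glob -> sort -> delete flow described above could look roughly like this. It is a sketch, not the code merged in the PR. The function name `prune_checkpoints` and the default suffix are hypothetical, and the slice only keeps the newest files if the list is sorted newest first, which in turn assumes timestamp-prefixed filenames.

```python
import glob
import os
from typing import List


def prune_checkpoints(savepath: str, max_checkpoints: int,
                      suffix: str = "checkpoint.pth") -> List[str]:
    """Delete all but the newest max_checkpoints files; return the deleted paths."""
    if max_checkpoints < 0:
        raise ValueError("max_checkpoints must be non-negative")
    # Newest first: timestamp-prefixed names sort chronologically as strings.
    checkpoints = sorted(glob.glob(os.path.join(savepath, f"*{suffix}")), reverse=True)
    stale = checkpoints[max_checkpoints:]
    for path in stale:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # already removed by someone else; nothing left to do
    return stale
```

Calling this right after each save keeps the directory bounded without any in-memory state or save counter.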

choosehappy · 2026-05-11T18:08:47Z

@@ -314,7 +304,7 @@ def _to_scalar(val):
            #print ("losses:\t",loss_total,positive_mask.sum(),positive_loss,unlabeled_loss)

            last_save+=1


approach i mentioned would avoid this type of bookkeeping, which can be a bit fragile

jacksonjacobs1 added 4 commits May 11, 2026 08:42

added safe loss to seg_loss

726472f

Checkpoint functionality

260d31e

- save the last N checkpoints
- load latest checkpoint

moved checkpoint management to fsmanager. Added unit tests.

a5c55b0

clean up

b273d48

jacksonjacobs1 requested a review from choosehappy May 11, 2026 16:29

choosehappy requested changes May 11, 2026

Merge branch 'v2.0' into dl-hardening

82df308

jacksonjacobs1 wants to merge 5 commits into choosehappy:v2.0 from jacksonjacobs1:dl-hardening
