Skip to content

Dl hardening#138

Open
jacksonjacobs1 wants to merge 5 commits into
choosehappy:v2.0from
jacksonjacobs1:dl-hardening
Open

Dl hardening#138
jacksonjacobs1 wants to merge 5 commits into
choosehappy:v2.0from
jacksonjacobs1:dl-hardening

Conversation

@jacksonjacobs1
Copy link
Copy Markdown
Collaborator

@jacksonjacobs1 jacksonjacobs1 commented May 11, 2026

QuickAnnotator will now save the last N model checkpoints for each class, and will automatically load the latest checkpoint.

Additionally, we ensure that all loss output is wrapped in safe_loss to prevent model corruption due to NaN/inf values.

Copy link
Copy Markdown
Owner

@choosehappy choosehappy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall good, but take a look at these thoughts - a good pattern to use if possible to simplify things and ensure better consistency. does it fit here?

if not checkpoint_files:
return None
latest_checkpoint = max(checkpoint_files, key=lambda f: os.path.getctime(os.path.join(savepath, f)))
return os.path.join(savepath, latest_checkpoint)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see here you're using ctime to define the ordering of files. this feels more like an implicit definition than an explicit definition. generally, if e.g., things get restored from a backup, they may all get the same ctime, and then this measure becomes chaotic. however, you are also the one writing the filename, which allows for an explicit ordering - why not name the files in a way such that the explicit filename results in the correct ordering such as prefix = [yyyy,mm,dd,hh,mm], which should then sort organically correctly?

savepath = self.get_class_checkpoint_path(annotation_class_id)
if not os.path.exists(savepath):
return
checkpoint_files = [os.path.basename(f) for f in glob.glob(os.path.join(savepath, f"*{constants.CHECKPOINT_FILENAME}"))]
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this pattern feels a bit clunky - why not use a deque?

it gets populated at startup, and then it maintains itself organically?

from collections import deque
import os

class RollingFileQueue:
    def __init__(self, maxlen=10):
        self.queue = deque(maxlen=maxlen)

    def push(self, filename):
        if len(self.queue) == self.queue.maxlen:
            # The oldest file is about to be evicted
            oldest = self.queue[0]
            os.remove(oldest)
        self.queue.append(filename)

    def pop(self):
        return self.queue.popleft()

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, i realized that ifyou want to do it via strings, the strings self sort, easier way, similar concept

import bisect
import os

class RollingFilenameQueue:
    def __init__(self, maxlen=10):
        self.queue = []
        self.maxlen = maxlen

    def push(self, filename):
        bisect.insort(self.queue, filename)  # insert in lexographic sorted order
        if len(self.queue) > self.maxlen:
            oldest = self.queue.pop(0)       # evict the lexographically smallest (oldest date)
            os.remove(oldest)

    def pop(self):
        return self.queue.pop(0)             # FIFO — take the oldest (smallest) first

    def peek(self):
        return self.queue[0] if self.queue else None

    def __len__(self):
        return len(self.queue)

    def __repr__(self):
        return f"RollingFilenameQueue({self.queue})"

Why this works with your filenames
Since your filenames are YYYYMMDDHHMM, lexicographic order is chronological order — so bisect.insort naturally keeps them oldest → newest:
["202501010800", "202501011200", "202501011600"]
oldest → → → → → → → → newest
When the queue exceeds 10, pop(0) evicts the oldest file and deletes it from disk.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree that sorting should be done by filename rather than ctime.

I'm not against the dequeue structure, but I think it should be a thin client for the file system rather than an in-memory structure. Otherwise it might fall out of sync if the file system changes while the application is still running (e.g., user deletes checkpoints manually)

Thoughts?

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats okay for me as well - just note that it comes with the overhead of having to do a sort everytime a checkpoint is created. that said, if there are only e.g., 5 or 10 checkpoints, the comptuational overhead is trivial , and your preference for direct alignment with reality is prefered.

however, in that case, a deque is overkill : ) instead it should be: save checkpoint -> glob - > sort -> delete all checkpoints checkpoins[max_checkpoints:]

@@ -314,7 +304,7 @@ def _to_scalar(val):
#print ("losses:\t",loss_total,positive_mask.sum(),positive_loss,unlabeled_loss)

last_save+=1
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approach i mentioned would avoid this type of bookkeeping, which can be a bit fragile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants