Dl hardening#138
Conversation
- save the last N checkpoints - load latest checkpoint
choosehappy
left a comment
There was a problem hiding this comment.
overall good, but take a look at these thoughts - a good pattern to use if possible to simplify things and ensure better consistency. does it fit here?
| if not checkpoint_files: | ||
| return None | ||
| latest_checkpoint = max(checkpoint_files, key=lambda f: os.path.getctime(os.path.join(savepath, f))) | ||
| return os.path.join(savepath, latest_checkpoint) |
There was a problem hiding this comment.
i see here you're using ctime to define the ordering of files. this feels more like an implicit definition than an explicit definition. generally, if e.g., things get restored from a backup, they may all get the same ctime, and then this measure becomes chaotic. however, you are also the one writing the filename, which allows for an explicit ordering - why not name the files in a way such that the explicit filename results in the correct ordering such as prefix = [yyyy,mm,dd,hh,mm], which should then sort organically correctly?
| savepath = self.get_class_checkpoint_path(annotation_class_id) | ||
| if not os.path.exists(savepath): | ||
| return | ||
| checkpoint_files = [os.path.basename(f) for f in glob.glob(os.path.join(savepath, f"*{constants.CHECKPOINT_FILENAME}"))] |
There was a problem hiding this comment.
this pattern feels a bit clunky - why not use a deque?
it gets populated at startup, and then it maintains itself organically?
from collections import deque
import os
class RollingFileQueue:
def __init__(self, maxlen=10):
self.queue = deque(maxlen=maxlen)
def push(self, filename):
if len(self.queue) == self.queue.maxlen:
# The oldest file is about to be evicted
oldest = self.queue[0]
os.remove(oldest)
self.queue.append(filename)
def pop(self):
return self.queue.popleft()There was a problem hiding this comment.
btw, i realized that ifyou want to do it via strings, the strings self sort, easier way, similar concept
import bisect
import os
class RollingFilenameQueue:
def __init__(self, maxlen=10):
self.queue = []
self.maxlen = maxlen
def push(self, filename):
bisect.insort(self.queue, filename) # insert in lexographic sorted order
if len(self.queue) > self.maxlen:
oldest = self.queue.pop(0) # evict the lexographically smallest (oldest date)
os.remove(oldest)
def pop(self):
return self.queue.pop(0) # FIFO — take the oldest (smallest) first
def peek(self):
return self.queue[0] if self.queue else None
def __len__(self):
return len(self.queue)
def __repr__(self):
return f"RollingFilenameQueue({self.queue})"Why this works with your filenames
Since your filenames are YYYYMMDDHHMM, lexicographic order is chronological order — so bisect.insort naturally keeps them oldest → newest:
["202501010800", "202501011200", "202501011600"]
oldest → → → → → → → → newest
When the queue exceeds 10, pop(0) evicts the oldest file and deletes it from disk.
There was a problem hiding this comment.
Definitely agree that sorting should be done by filename rather than ctime.
I'm not against the dequeue structure, but I think it should be a thin client for the file system rather than an in-memory structure. Otherwise it might fall out of sync if the file system changes while the application is still running (e.g., user deletes checkpoints manually)
Thoughts?
There was a problem hiding this comment.
thats okay for me as well - just note that it comes with the overhead of having to do a sort everytime a checkpoint is created. that said, if there are only e.g., 5 or 10 checkpoints, the comptuational overhead is trivial , and your preference for direct alignment with reality is prefered.
however, in that case, a deque is overkill : ) instead it should be: save checkpoint -> glob - > sort -> delete all checkpoints checkpoins[max_checkpoints:]
| @@ -314,7 +304,7 @@ def _to_scalar(val): | |||
| #print ("losses:\t",loss_total,positive_mask.sum(),positive_loss,unlabeled_loss) | |||
|
|
|||
| last_save+=1 | |||
There was a problem hiding this comment.
approach i mentioned would avoid this type of bookkeeping, which can be a bit fragile
QuickAnnotator will now save the last N model checkpoints for each class, and will automatically load the latest checkpoint.
Additionally, we ensure that all loss output is wrapped in safe_loss to prevent model corruption due to NaN/inf values.