transformerless_lm: continuous self-distillation + cycle checkpointing #5
Draft
RandomCoder-lab wants to merge 1 commit into
After PR #4 closed the train/inference omniweight asymmetry, the natural follow-up: don't stop at cycle 6. The active_base ratchet (seed + appended best refined outputs) is exactly the kind of process where compounding past a fixed budget might find regimes the 6-cycle window can't reach.

--continuous: replaces `for cycle in range(n_cycles)` with an unbounded loop. n_cycles still controls steps_per_cycle (args.steps // n_cycles) so per-cycle training budget stays calibrated; the cycle counter just keeps going. K-shrink schedule clamps to K_min once global_step exceeds args.steps, which is the standard end state of the curriculum anyway.

--checkpoint PATH: serializes the entire distillation state every cycle (model state_dict, FibAdamW optimizer state, active_base, cycle counter, global_step, best_creativity, best_val/step, cycle_summary, rejection counters, best_refined_seq). Atomic write via tmp + os.replace so an interrupt mid-save can't corrupt the file. If the checkpoint exists at startup, training resumes from the saved cycle + 1 with the active_base fully intact -- the ratchet picks up exactly where it stopped.

Default behavior unchanged: omitting both flags reproduces the v88 + omniweight-loss bounded 6-cycle run.

Run a forever-distillation with omniweight-loss:

    python3 train_self_recursive.py --omniweight-loss \
        --continuous --checkpoint omniweight_distill.pt

Resume after Ctrl-C: re-run the same command. Checkpoint state restored, next cycle is start_cycle.
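The --continuous scheduling described above can be sketched roughly as follows. This is an illustration, not the code in train_self_recursive.py: `cycle_schedule`, `k_at`, `max_cycles`, `k_max`, and the linear shape of the K-shrink are all assumptions made for the example. Only the `steps // n_cycles` budget, the unbounded cycle counter, and the clamp to K_min past `args.steps` come from the description.

```python
import itertools


def cycle_schedule(total_steps, n_cycles, continuous, max_cycles=None):
    """Yield (cycle, first_step, end_step) for each distillation cycle.

    n_cycles only sets the per-cycle budget; with continuous=True the
    cycle counter keeps going past n_cycles. max_cycles is a hypothetical
    cap so this demo terminates; a real run would stop on Ctrl-C.
    """
    steps_per_cycle = total_steps // n_cycles
    cycles = itertools.count() if continuous else range(n_cycles)
    for cycle in cycles:
        if max_cycles is not None and cycle >= max_cycles:
            return
        first_step = cycle * steps_per_cycle
        yield cycle, first_step, first_step + steps_per_cycle


def k_at(global_step, total_steps, k_max, k_min):
    """Assumed linear K-shrink, clamped to k_min once the budget is spent."""
    if global_step >= total_steps:
        return k_min
    frac = global_step / total_steps
    return max(k_min, round(k_max - frac * (k_max - k_min)))


if __name__ == "__main__":
    for cycle, first, end in cycle_schedule(6000, 6, continuous=True, max_cycles=8):
        print(cycle, first, end, k_at(first, 6000, k_max=8, k_min=2))
```

Under these assumptions, every cycle past the sixth trains with the same step budget and with K pinned at its minimum, which matches the "standard end state of the curriculum" wording.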
Summary
Follow-up to PR #4 (omniweight loss on training data). After closing the train/inference asymmetry, the natural question was whether the self-distillation ratchet (`active_base` ← seed + appended best refined outputs) compounds further than the bounded 6-cycle window allows. This PR makes the loop unbounded and Ctrl-C-resumable.

Changes
Two new flags on `train_self_recursive.py`:

- `--continuous`: replaces `for cycle in range(n_cycles)` with an unbounded loop. `n_cycles` still controls `steps_per_cycle = args.steps // n_cycles` so per-cycle training budget stays calibrated; the cycle counter just keeps going. K-shrink schedule clamps to `K_min` once `global_step` exceeds `args.steps`, which is the standard end state of the curriculum.
- `--checkpoint PATH`: serializes the entire distillation state every cycle:
  - model `state_dict` + `FibAdamW` optimizer state
  - `active_base` tensor (the growing self-distilled corpus)

Atomic write via tmp + `os.replace` so an interrupt mid-save can't corrupt the file. If the checkpoint exists at startup, training resumes from `saved_cycle + 1` with `active_base` fully intact.

Behavior
- `--continuous --checkpoint X.pt`: runs forever, saves after every cycle. Ctrl-C followed by a resume works.

Usage
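Resuming by re-running the same command depends on the checkpoint file surviving an interrupt. A minimal sketch of the tmp + `os.replace` pattern and the `saved_cycle + 1` resume rule is shown here. It is an assumption-laden stand-in rather than the PR's code: `pickle` is used in place of whatever serializer the script calls (likely `torch.save`), and the function names and the `"cycle"` key are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` atomically: dump to a temp file, then os.replace it.

    The temp file lives in the same directory as `path` so the final
    rename stays on one filesystem. An interrupt before os.replace
    leaves any previous checkpoint at `path` untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file, then re-raise (covers Ctrl-C).
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def resume_cycle(path):
    """First cycle to run: 0 on a fresh start, saved cycle + 1 on resume."""
    state = load_checkpoint(path)
    return 0 if state is None else state["cycle"] + 1
```

In a real run the state dictionary would carry the model and optimizer state plus `active_base` and the counters listed in the commit message; the sketch only needs the cycle number to show the resume rule.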
Test plan
- `--steps 6000 --n-cycles 6` and let it ride 20+ cycles, log creativity trajectory

Generated by Claude Code