CheckpointManager
Decide WHEN to save and WRITE the manifest row when we do.
The decision policy is intentionally tiny - should_save(step) returns
True every save_every steps (after the optimizer step at that index).
The actual save call is dispatched by the loop, which has the live
Backend handle; the manager only records the resulting paths.
The final-step save is unconditional: the loop calls save_final(...) after
its for-loop completes, so even when save_every doesn't land exactly
on the last step, the final sampler is always recorded. That is the URI
downstream eval consumes.
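
The split of responsibilities described above can be sketched roughly as follows. This is an illustrative stand-in, not the SDK's code: SketchManager, fake_save, run_loop, and the row fields are all hypothetical names, rows are kept in memory rather than written to disk, and the real save_final(...) signature is not shown in this page.

```python
from dataclasses import dataclass, field


@dataclass
class SketchManager:
    """Hypothetical stand-in for CheckpointManager; keeps rows in memory only."""

    save_every: int
    rows: list = field(default_factory=list)

    def should_save(self, step: int) -> bool:
        # Documented convention: save after the optimizer step at index `step`.
        return self.save_every > 0 and (step + 1) % self.save_every == 0

    def record(self, row: dict) -> None:
        self.rows.append(row)


def fake_save(step: int, kind: str) -> dict:
    """Hypothetical backend save; a real backend would return actual paths."""
    return {"step": step, "kind": kind, "path": f"fake://{kind}/{step}"}


def run_loop(manager: SketchManager, total_steps: int) -> None:
    for step in range(total_steps):
        # ... forward/backward pass and optimizer step would happen here ...
        if manager.should_save(step):
            # The loop performs the save; the manager only records the result.
            manager.record(fake_save(step, "periodic"))
    # Unconditional final save, mirroring the save_final(...) call after the loop.
    manager.record(fake_save(total_steps - 1, "final"))


manager = SketchManager(save_every=4)
run_loop(manager, total_steps=10)
print([(r["step"], r["kind"]) for r in manager.rows])
```

With save_every=4 and 10 steps, the periodic saves land on steps 3 and 7, and the final save still records step 9.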
Attributes
attribute log_path = Path(log_path)
attribute save_every = max(0, int(save_every))
attribute manifest_path = self.log_path / MANIFEST_NAME
attribute rows: list[ManifestRow]
Functions
func __init__(self, *, log_path, save_every) -> None
param self
param log_path: Path
param save_every: int
Returns
None
func should_save(self, step) -> bool
Save after the optimizer step at index step (zero-based).
The convention matches tinker_cookbook: (step + 1) % save_every == 0.
Disabled when save_every == 0.
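
As a worked example of the convention, the rule can be written as a standalone function. This is a sketch only: the free function and its save_every argument are assumptions made for illustration, since the real method reads save_every from the instance.

```python
def should_save(step: int, save_every: int) -> bool:
    """Standalone version of the documented rule; `step` is zero-based."""
    if save_every <= 0:
        # save_every == 0 disables periodic saves.
        return False
    return (step + 1) % save_every == 0


save_steps = [s for s in range(10) if should_save(s, save_every=3)]
print(save_steps)  # [2, 5, 8]

disabled_steps = [s for s in range(10) if should_save(s, save_every=0)]
print(disabled_steps)  # []
```

Because step is zero-based, save_every=3 triggers after the 3rd, 6th, and 9th optimizer steps, which are indices 2, 5, and 8.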
param self
param step: int
Returns
bool
func record(self, row) -> None
Append one row to the manifest on disk and remember it.
param self
param row: ManifestRow
Returns
None
func find_resume(self) -> Checkpoint | None
Find the most-recent recorded checkpoint to resume training from.
Reads the existing checkpoints.jsonl (if any) under log_path and
picks the last row that has a state_path. The loop hands the
state_path back to the backend to recreate the training client with
optimizer state intact.
Returns None when no resumable checkpoint is on disk - caller
should start fresh.
param self
Returns
evsys_sdk.checkpoint.Checkpoint | None
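
The "last row with a state_path" lookup could look roughly like the sketch below. It assumes the manifest holds one JSON object per line, as the .jsonl extension suggests, and it returns a plain dict rather than a Checkpoint. The function name find_last_resumable and every row field other than state_path are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def find_last_resumable(manifest_path: Path) -> Optional[dict]:
    """Return the last manifest row that has a state_path, or None."""
    if not manifest_path.exists():
        return None
    last = None
    with manifest_path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            if row.get("state_path"):
                last = row  # later rows override earlier ones
    return last


with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "checkpoints.jsonl"
    missing = find_last_resumable(manifest)  # no manifest on disk yet
    rows = [
        {"step": 3, "state_path": "fake://state/3"},
        {"step": 7, "state_path": "fake://state/7"},
        {"step": 9, "sampler_path": "fake://sampler/9"},  # no state_path
    ]
    manifest.write_text(
        "".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8"
    )
    resumable = find_last_resumable(manifest)

print(missing)
print(resumable["step"])
```

Here the final row is skipped because it carries no state_path, so the step-7 row is the one a caller would resume from. When the file is absent, the result is None and the caller starts fresh.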