CheckpointManager

Decide WHEN to save and WRITE the manifest row when we do.

The decision policy is intentionally tiny: should_save(step) returns True every save_every steps, evaluated after the optimizer step at that index. The loop, which holds the live Backend handle, dispatches the actual save call; the manager only records the resulting paths.

Final-step save is unconditional: the loop calls save_final(...) after its for-loop completes, so the final sampler is always recorded even when save_every does not land exactly on the last step. This is the URI that downstream eval consumes.
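The interplay between periodic and final saves can be sketched as below. This is a minimal illustration, not the actual loop: run_loop and the string labels are hypothetical, and the real loop would call the backend and save_final(...) where this sketch only appends a label.

```python
def run_loop(n_steps: int, save_every: int) -> list[str]:
    """Return labels for the checkpoints a loop of this shape would record."""
    saved: list[str] = []
    for step in range(n_steps):
        # ... forward/backward pass and optimizer step would happen here ...
        if save_every > 0 and (step + 1) % save_every == 0:
            saved.append(f"step-{step}")  # periodic save, after the optimizer step
    # Unconditional final save, issued once the for-loop has completed.
    saved.append("final")
    return saved


print(run_loop(n_steps=10, save_every=4))  # ['step-3', 'step-7', 'final']
```

In this sketch a periodic save that lands on the last step is followed by a separate final entry; whether the real loop deduplicates the two is not stated on this page.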

Attributes

attribute log_path

= Path(log_path)

attribute save_every

= max(0, int(save_every))

attribute manifest_path

= self.log_path / MANIFEST_NAME

attribute rows list[ManifestRow]

Functions

func __init__(self, *, log_path, save_every) -> None

param self

param log_path Path

param save_every int

Returns

None

func should_save(self, step) -> bool

Save after the optimizer step at index step (zero-based).

The convention matches tinker_cookbook: (step + 1) % save_every == 0. Disabled when save_every == 0.

param self

param step int

Returns

bool
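A minimal standalone sketch of this convention, written as free functions rather than methods. The name normalize_save_every is hypothetical; it only mirrors the max(0, int(save_every)) expression listed under Attributes.

```python
def normalize_save_every(save_every) -> int:
    # Negative values collapse to 0, which means "periodic saves disabled".
    return max(0, int(save_every))


def should_save(step: int, save_every: int) -> bool:
    if save_every == 0:  # disabled; this guard also avoids a modulo-by-zero
        return False
    # step is zero-based, so step + 1 is the number of completed optimizer steps.
    return (step + 1) % save_every == 0


print([s for s in range(9) if should_save(s, 3)])  # [2, 5, 8]
```

With save_every = 3, the first save happens after the third optimizer step (index 2), not at index 0 or index 3.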

func record(self, row) -> None

Append one row to the manifest on disk and remember it.

param self

param row ManifestRow

Returns

None
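One way an append-only manifest write could look is sketched below. The ManifestRow fields here are hypothetical, since the real dataclass is not documented on this page, and the JSON-lines layout is an assumption inferred from the checkpoints.jsonl file name mentioned under find_resume.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional

MANIFEST_NAME = "checkpoints.jsonl"  # assumed value, taken from the find_resume text


@dataclass
class ManifestRow:
    # Hypothetical fields, for illustration only.
    step: int
    state_path: Optional[str] = None
    sampler_path: Optional[str] = None


def record(manifest_path: Path, rows: list, row: ManifestRow) -> None:
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier rows intact; one JSON object per line.
    with manifest_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(row)) + "\n")
    rows.append(row)  # keep the in-memory list in sync with the file


with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / MANIFEST_NAME
    rows: list = []
    record(manifest, rows, ManifestRow(step=3, state_path="state/3"))
    record(manifest, rows, ManifestRow(step=7, sampler_path="sampler/7"))
    lines = manifest.read_text(encoding="utf-8").splitlines()
```

Appending one line per row means a crash mid-run leaves every previously written row readable, which is what a later resume lookup relies on.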

func find_resume(self) -> Checkpoint | None

Find the most recent recorded checkpoint to resume training from.

Reads the existing checkpoints.jsonl (if any) under log_path and picks the last row that has a state_path. The loop hands the state_path back to the backend to recreate the training client with optimizer state intact.

Returns None when no resumable checkpoint is on disk; the caller should then start fresh.

param self

Returns

evsys_sdk.checkpoint.Checkpoint | None
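The lookup described for find_resume can be sketched as a scan over the manifest that keeps the last row carrying a state_path. This is an illustration under assumptions: it returns a plain dict rather than an evsys_sdk Checkpoint, and the row keys other than state_path are made up for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def find_resume(log_path: Path) -> Optional[dict]:
    manifest = log_path / "checkpoints.jsonl"
    if not manifest.exists():
        return None  # nothing on disk: the caller starts fresh
    latest = None
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        if row.get("state_path"):
            latest = row  # later rows replace earlier ones
    return latest


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    fresh = find_resume(log_dir)  # no manifest yet
    (log_dir / "checkpoints.jsonl").write_text(
        '{"step": 3, "state_path": "state/3"}\n'
        '{"step": 7, "state_path": "state/7"}\n'
        '{"step": 9, "sampler_path": "sampler/9"}\n',
        encoding="utf-8",
    )
    resumed = find_resume(log_dir)  # the step-9 row has no state_path
```

Here the sampler-only row at step 9 is skipped, so the step-7 row is the one handed back for resuming.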
