Callbacks
Lifecycle + logging hooks fired through the experiment and training loop.
A callback is an object that gets called at fixed moments - when the
experiment starts, before/after each run, after every training step, on each
eval, on checkpoint, at the end. Callbacks are how logging, dashboards,
progress printing, and early stopping plug in without touching the loop. You
list them on ExperimentConfig.callbacks; you write your own when you need a
side effect (ship to S3, push to a service, stop early) the built-ins don't
cover.
Two safety properties matter: a raising callback never kills the run (the exception is logged at WARNING and the next callback runs), and every hook has a no-op default so you override only the ones you care about.
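The first property can be pictured as a dispatcher that wraps every hook call in a try/except. The sketch below is an illustration under assumptions, not the SDK's actual dispatcher: `dispatch`, `Noisy`, `Recorder`, and the logger name are hypothetical, and the `Callback` class here is a stand-in. Only the behaviour (log at WARNING, move on to the next callback, no-op defaults) comes from the text above.

```python
import logging

logger = logging.getLogger("callbacks_sketch")  # hypothetical logger name


class Callback:
    """Stand-in base class: every hook is a no-op by default."""

    def on_step_end(self, state, step_idx, batch, metrics) -> None:
        pass


class Noisy(Callback):
    def on_step_end(self, state, step_idx, batch, metrics) -> None:
        raise RuntimeError("boom")  # a misbehaving callback


class Recorder(Callback):
    def __init__(self) -> None:
        self.seen = []

    def on_step_end(self, state, step_idx, batch, metrics) -> None:
        self.seen.append(step_idx)


def dispatch(callbacks, hook: str, *args, **kwargs) -> None:
    """Call `hook` on each callback; log and continue if one raises."""
    for cb in callbacks:
        try:
            getattr(cb, hook)(*args, **kwargs)
        except Exception:
            logger.warning("callback %s failed in %s", type(cb).__name__, hook, exc_info=True)


recorder = Recorder()
# Noisy raises, the bare Callback no-ops, and Recorder still receives the step.
dispatch([Noisy(), Callback(), recorder], "on_step_end", None, 0, None, {"loss": 1.0})
```

The failing callback comes first in the list on purpose: the later callbacks still run, which is the guarantee a logger placed after a flaky integration relies on.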
The contract
The base class is evsys_sdk.training.callbacks.Callback. There is no Config
requirement on the base, but built-ins carry name + Config ClassVars so they
work from YAML. Two kinds of state object are threaded in:
- LogState (called LoopState in code) - live training-loop state: .step, .num_steps, .output_dir, .backend, .log_store, .checkpoint_mgr, .stop_requested, .ctx. Call state.request_stop() from any loop hook to break the loop after the current step finishes.
- LogContext - experiment-wide context shared across all hooks: .output_dir, .config, .store, .run_key, .run_config, .group_name, .ids (dashboard ids populated by logger callbacks), .extras (scratch shared between callbacks within a run).
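To make the request_stop() semantics concrete, here is a minimal sketch of a loop that checks the stop flag only after a step has finished. It assumes stand-ins throughout: this `LoopState` dataclass carries just three of the fields listed above, and `StopAtLoss` and `run` are hypothetical names, not SDK code.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    """Stand-in with a few of the fields listed above; not the SDK class."""

    num_steps: int
    step: int = 0
    stop_requested: bool = False

    def request_stop(self) -> None:
        self.stop_requested = True


class StopAtLoss:
    """Hypothetical callback: ask the loop to stop once loss is low enough."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def on_step_end(self, state, step_idx, batch, metrics) -> None:
        if metrics["loss"] <= self.threshold:
            state.request_stop()


def run(state: LoopState, callback, losses) -> list:
    """Toy loop: the stop flag is checked only after the step has finished."""
    completed = []
    for step_idx in range(state.num_steps):
        state.step = step_idx
        callback.on_step_end(state, step_idx, None, {"loss": losses[step_idx]})
        completed.append(step_idx)
        if state.stop_requested:
            break
    return completed


completed_steps = run(LoopState(num_steps=5), StopAtLoss(0.5), [0.9, 0.7, 0.4, 0.3, 0.2])
```

The step that triggers the stop (index 2 here) still counts as completed; the loop simply does not start the next one.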
Here is every hook, with its exact signature and when it fires.
Loop-scope hooks (dispatched by the TrainingLoop):
- on_train_start(self, state: LoopState) -> None - once, before the for-loop starts. Open log files, init a wandb run, snapshot config.
- on_step_end(self, state: LoopState, step_idx: int, batch: TrainingBatch, metrics: dict[str, float]) -> None - after every train step's metric row is written. step_idx is the 0-based step; batch is the step's training batch; metrics is that step's metric dict. The universal per-step hook.
- on_eval(self, state: LoopState, step_idx: int, eval_name: str, metrics: dict[str, float]) -> None - once per evaluator after each in-loop eval. eval_name identifies the evaluator; metrics is its scalar output. Drives early stopping and val curves.
- on_train_data(self, ctx: LogContext, rows: list[dict]) -> None - once in setup with the final training examples (after the chat template / rendering), so a logger can persist exactly what went into training.
- on_rollout(self, state: LoopState, step_idx: int, rollouts: list) -> None - per step with the algorithm's on-policy rollouts (RL/SDFT TrajectoryGroups) - fired only when log_rollouts / --dry is on and the algorithm set batch.rollouts (SFT never does, so it never fires for SFT).
- on_checkpoint(self, state: LoopState, row: ManifestRow) -> None - after each checkpoint manifest row is recorded. row carries the checkpoint's name, step, and paths. Useful for shipping/pruning checkpoints.
- on_train_end(self, state: LoopState, artifacts: LoopArtifacts) -> None - once after the loop completes (including the final checkpoint save). artifacts carries totals (requested steps, train seconds, run dir, checkpoints). Flush summaries, close files.
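The open / write / close lifecycle described for on_train_start, on_step_end, and on_train_end can be sketched as a small file logger. This is an illustration under assumptions: `JsonlStepLogger` and the `steps.jsonl` filename are hypothetical, the class is duck-typed rather than derived from the SDK's Callback, and the hooks are driven by hand with a stand-in state object that only has .output_dir.

```python
import json
import tempfile
from pathlib import Path
from types import SimpleNamespace


class JsonlStepLogger:
    """Hypothetical callback: one JSON line per train step."""

    def __init__(self, filename: str = "steps.jsonl") -> None:
        self.filename = filename
        self._fh = None

    def on_train_start(self, state) -> None:
        # Fired once before the loop: open the sink under the run's output dir.
        path = Path(state.output_dir) / self.filename
        path.parent.mkdir(parents=True, exist_ok=True)
        self._fh = path.open("w", encoding="utf-8")

    def on_step_end(self, state, step_idx, batch, metrics) -> None:
        # Fired after every step: append that step's metric dict.
        self._fh.write(json.dumps({"step": step_idx, **metrics}) + "\n")

    def on_train_end(self, state, artifacts) -> None:
        # Fired once after the loop: flush and close the file.
        self._fh.close()
        self._fh = None


# Drive the hooks by hand with a stand-in state object.
with tempfile.TemporaryDirectory() as tmp:
    state = SimpleNamespace(output_dir=tmp)
    cb = JsonlStepLogger()
    cb.on_train_start(state)
    for i, loss in enumerate([0.9, 0.6]):
        cb.on_step_end(state, i, None, {"loss": loss})
    cb.on_train_end(state, None)
    rows = [json.loads(line) for line in (Path(tmp) / "steps.jsonl").read_text().splitlines()]
```

In a real run the class would subclass Callback and the loop would call the hooks; the signatures above mirror the ones listed, minus the type annotations.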
Experiment-scope hooks (dispatched by the Experiment, not the loop - these
let one logger own the full lifecycle, threading dashboard ids through
ctx.ids):
- on_experiment_start(self, ctx: LogContext) -> None - once at experiment start, before any arm. A logger creates the experiment record here and sets ctx.ids['experiment_id'].
- on_group_start(self, ctx: LogContext, group_name: str) -> None - when a new run-group is needed (n_repeats replicates or continual stages). A logger creates the group → ctx.ids[f'group:{group_name}'].
- on_run_start(self, ctx: LogContext) -> None - per arm, before training. ctx.run_config is set. A logger opens its run-scoped sink (e.g. wandb.init) → ctx.ids['run_id'], reading the parent ids to link it.
- on_benchmark_eval(self, ctx: LogContext, eval_result: EvalResult, predictions: list[dict], *, step: int | None = None) -> None - per benchmark scored, in-loop (step = the train step) or post-training (step=None). eval_result carries metrics + breakdowns + tags; predictions is the per-task prediction rows. step is keyword-only.
- on_run_end(self, ctx: LogContext, run_result: RunResult, arm: ArmResult) -> None - per arm, after eval, before the run is marked completed. run_result is the algorithm's result; arm is the arm record. Loggers close their sink and record final status here.
- on_experiment_end(self, ctx: LogContext, result: ExperimentResult) -> None - once at experiment end. result carries best_arm, best_score, status, conclusion. Final summary / flush.
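The way one logger threads ids through ctx.ids across these hooks can be sketched as follows. Everything here is a stand-in chosen for illustration: this `LogContext` dataclass has only the two fields the sketch touches, `InMemoryLogger` is hypothetical, and integer ids replace whatever a real dashboard would return. The key names (`experiment_id`, `group:<name>`, `run_id`) are the ones given above.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class LogContext:
    """Stand-in with only the fields this sketch touches; not the SDK class."""

    run_key: str = ""
    ids: dict = field(default_factory=dict)


class InMemoryLogger:
    """Hypothetical logger that 'creates' records by handing out integer ids."""

    def __init__(self) -> None:
        self._next = count(1)
        self.records = []

    def on_experiment_start(self, ctx) -> None:
        ctx.ids["experiment_id"] = next(self._next)

    def on_group_start(self, ctx, group_name) -> None:
        ctx.ids[f"group:{group_name}"] = next(self._next)

    def on_run_start(self, ctx) -> None:
        run_id = next(self._next)
        ctx.ids["run_id"] = run_id
        # Read the parent id written earlier to link the run to its experiment.
        self.records.append(
            {"run_id": run_id, "experiment_id": ctx.ids["experiment_id"], "run_key": ctx.run_key}
        )


ctx = LogContext()
run_logger = InMemoryLogger()
run_logger.on_experiment_start(ctx)
run_logger.on_group_start(ctx, "repeats")
for key in ("arm_a", "arm_b"):
    ctx.run_key = key
    run_logger.on_run_start(ctx)
```

Because the ids live on the shared context rather than on the logger, a second callback could read `ctx.ids["run_id"]` without knowing which logger wrote it.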
Use a built-in
callbacks:
  - kind: local_logger
    params: { print_every: 10 }
  - kind: early_stopping
    params: { metric: pass_rate, eval_name: val, patience: 3 }

| Built-in | What it does / where it writes |
|---|---|
| print_progress | Compact one-liner per step to stdout. Params: every (print every Nth step), keys (restrict printed metric keys). |
| csv_metrics | Mirrors per-step metric writes into a CSV next to metrics.jsonl. Params: out_path (parent created if missing), delimiter (default ,). New metric keys append columns. |
| early_stopping | Watches a metric on on_eval and calls state.request_stop() after N evals without improvement. Params: metric, eval_name (or None = any), patience (default 3), mode (max/min), min_delta. |
| wandb_logger | One Weights & Biases run per arm (opened on_run_start, closed on_run_end). Logs per-step + eval + benchmark metrics and a predictions wandb.Table. wandb is imported lazily; if missing, every hook no-ops. Surfaces the run URL on ctx.extras['wandb_url']. Params: project, entity, name, mode, log_every, max_pred_rows. |
| tensorboard_logger | One TensorBoard event dir per arm. Logs scalar + eval + benchmark metrics via SummaryWriter. Lazy torch import; no-ops if missing. Params: log_dir (default <output_dir>/tb/<run_key>), flush_secs. |
| local_logger | Human-readable local mirror - see exact outputs below. |
| evsys_logger | "Callbacks own the store" mode: persists experiment/group/run records, per-step metrics, checkpoints, benchmark evals + predictions to the evsys dashboard, keyed off ctx.ids. Disables itself if the Experiment already has a store= (to avoid double-writes). Params: project_id, flush_every. |
| debug_logger | Pretty-prints every hook and its arguments to stdout - pure introspection, no persistence. Drop it in to see exactly what each logger callback receives and in what order. Params: max_len, max_pred_rows. |
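The patience logic behind early_stopping can be sketched in a few lines. This is a guess at reasonable semantics, not the built-in's source: `PatienceTracker` is a hypothetical helper, and how the real callback applies min_delta (strict versus non-strict comparison, for instance) is an assumption here.

```python
class PatienceTracker:
    """Hypothetical helper mirroring the early_stopping params listed above."""

    def __init__(self, patience: int = 3, mode: str = "max", min_delta: float = 0.0) -> None:
        if mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")
        self.patience = patience
        self.mode = mode
        self.min_delta = min_delta
        self.best = None
        self.bad_evals = 0

    def update(self, value: float) -> bool:
        """Record one eval score; return True when the loop should stop."""
        if self.best is None:
            improved = True
        elif self.mode == "max":
            improved = value > self.best + self.min_delta
        else:
            improved = value < self.best - self.min_delta
        if improved:
            self.best = value
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


tracker = PatienceTracker(patience=3, mode="max", min_delta=0.01)
scores = [0.50, 0.60, 0.605, 0.59, 0.60, 0.70]
stop_flags = [tracker.update(s) for s in scores]
```

With these scores the fifth eval is the third in a row that fails to beat 0.60 by at least 0.01, so that is where a callback would call state.request_stop(); the sixth score is fed in only to show that an improvement resets the counter.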
What local_logger writes, exactly
LocalLoggerCallback is the single local writer (no duplicate metrics file).
It prints a per-step one-liner (cadence print_every, 0 = silent) and persists
a per-run tree under <output_dir>/<run_key>/logs/, organized by concern:
<output_dir>/<run_key>/
  logs/
    data/          training_data.jsonl   # final examples fed to the model (post chat-template) - on_train_data
    training/      metrics.jsonl         # per-step train metrics - on_step_end
                   rollouts.jsonl        # training rollouts (reward · usage · decoded text) - on_rollout, --dry only
    validation/    metrics.jsonl         # in-loop val scores - on_eval
                   rollouts.jsonl        # validation predictions/rollouts
    test/          metrics.jsonl         # final benchmark scores - on_benchmark_eval (split=test)
                   rollouts.jsonl        # test predictions
    hypothesis.md  conclusion.md         # per-run mirror of the experiment hypothesis + conclusion
  .harbor/{train,val,test}/              # harbor's verbose rollout workspace - hidden, OUTSIDE logs/

Notes:
- Metric rows are {step, split, metrics}; the split (train/val/test) routes the row to the right folder.
- rollouts.jsonl rows carry the trajectory's reward, token/cost usage, and its own decoded completion text (paired at harvest, so text always matches its reward). Training rollouts are written only under --dry / log_rollouts.
- hypothesis.md / conclusion.md are written into every run's logs/ (hypothesis at experiment start, conclusion at end), so the local mirror carries them with no dashboard.
- Harbor's raw rollout artifacts live in a hidden .harbor/ outside logs/, so logs/ stays clean.
- output_dir defaults to the directory of the config file (where config.yaml / run.py live), so logs land next to the experiment that produced them.
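The split-based routing of metric rows can be illustrated with a short sketch. It assumes the split-to-folder mapping implied by the tree (train → training/, val → validation/, test → test/); `append_metric_row`, `SPLIT_DIRS`, and the `run_0` run key are hypothetical names, and this is not how LocalLoggerCallback is necessarily implemented.

```python
import json
import tempfile
from pathlib import Path

# Folder per split, following the tree layout described above (assumed mapping).
SPLIT_DIRS = {"train": "training", "val": "validation", "test": "test"}


def append_metric_row(logs_dir: Path, step, split: str, metrics: dict) -> Path:
    """Append one {step, split, metrics} row to <logs_dir>/<folder>/metrics.jsonl."""
    target = logs_dir / SPLIT_DIRS[split] / "metrics.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"step": step, "split": split, "metrics": metrics}) + "\n")
    return target


with tempfile.TemporaryDirectory() as tmp:
    logs = Path(tmp) / "run_0" / "logs"  # hypothetical run_key "run_0"
    append_metric_row(logs, 0, "train", {"loss": 0.9})
    append_metric_row(logs, 0, "val", {"pass_rate": 0.4})
    append_metric_row(logs, 1, "train", {"loss": 0.7})
    train_text = (logs / "training" / "metrics.jsonl").read_text()
    train_rows = [json.loads(line) for line in train_text.splitlines()]
    written = sorted(p.relative_to(logs).as_posix() for p in logs.rglob("metrics.jsonl"))
```

Reading the files back is the same operation in reverse: one `json.loads` per line, with the `split` field available if rows from several folders are later merged.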
Create your own
Subclass Callback, override only the hooks you need, carry name + Config,
and decorate with @register_callback("<name>"):
from typing import ClassVar

from pydantic import BaseModel, ConfigDict

from evsys_sdk.training.callbacks import Callback
from evsys_sdk.registry import register_callback


class SlackNotifyConfig(BaseModel):
    # Reject unknown keys so a typo in the YAML params fails validation.
    model_config = ConfigDict(extra="forbid")
    webhook_url: str


@register_callback("slack_notify")
class SlackNotifyCallback(Callback):
    name: ClassVar[str] = "slack_notify"  # the YAML `kind`
    Config: ClassVar[type] = SlackNotifyConfig

    def __init__(self, *, webhook_url: str) -> None:
        self.webhook_url = webhook_url

    def on_experiment_end(self, ctx, result) -> None:
        import requests  # imported lazily so the dependency is only needed when the hook fires

        best = getattr(getattr(result, "best_arm", None), "name", None)
        # Bound the request so a slow webhook cannot stall the end of the experiment.
        requests.post(self.webhook_url, timeout=10, json={
            "text": f"Experiment done - best arm {best}, score {getattr(result, 'best_score', None)}",
        })

Then list it on the experiment:
callbacks:
  - kind: slack_notify
    params:
      webhook_url: https://hooks.slack.com/services/...

Ship it in a package
Callbacks are constructed from {kind, params} specs via build_callbacks,
which validates params against your Config and resolves kind through the
callback registry. To distribute a callback without copying code, import the
module that runs your @register_callback decorator at startup (e.g. from your
project's package __init__). Note that callbacks are not among the entry-point
groups evsys_sdk auto-loads (those cover algorithms, verifiers, metrics, data
stores, log stores, backends, inference, transforms), so make sure your
registering module is imported before the experiment runs.
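Why the import has to happen first becomes clear from a toy version of the registry. The sketch below is an assumption-laden stand-in, not evsys_sdk's code: `_REGISTRY`, this `register_callback`, and this `build_callbacks` are simplified re-implementations, a stdlib dataclass plays the role of the pydantic Config, and `EchoCallback` is hypothetical.

```python
from dataclasses import dataclass

_REGISTRY = {}  # hypothetical stand-in for the callback registry


def register_callback(name: str):
    """Decorator that records a class under `name` when its module is imported."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


def build_callbacks(specs: list) -> list:
    """Resolve each {kind, params} spec and construct the callback."""
    built = []
    for spec in specs:
        kind = spec["kind"]
        if kind not in _REGISTRY:
            raise KeyError(f"unknown callback kind {kind!r}; was its module imported?")
        cls = _REGISTRY[kind]
        config = cls.Config(**spec.get("params", {}))  # unexpected keys raise TypeError
        built.append(cls(**vars(config)))
    return built


@dataclass
class EchoConfig:
    prefix: str = ""


@register_callback("echo")
class EchoCallback:
    Config = EchoConfig

    def __init__(self, *, prefix: str) -> None:
        self.prefix = prefix


callbacks = build_callbacks([{"kind": "echo", "params": {"prefix": ">> "}}])
```

The decorator only runs when the defining module is executed, so a kind whose module was never imported is simply absent from the registry at build time.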