Experiment
Bundle ExperimentConfig + dashboard writes + per-arm orchestration.
Attributes
attribute config = config
attribute store = store
attribute train_fn = train_fn or _default_train_fn
attribute inference_factory = inference_factory
Functions

func __init__(self, config, *, store=None, train_fn=None, benchmark=None, inference_factory=None) -> None
param self
param config: ExperimentConfig
param store: Any | None = None
param train_fn: TrainFn | None = None
param benchmark: Benchmark | None = None
param inference_factory: InferenceFactory | None = None
Returns
None

func from_yaml(cls, path, **kwargs) -> Experiment
param cls
param path: str | Path
param kwargs: Any = {}
Returns
evsys_sdk.experiment.Experiment

func run(self) -> ExperimentResult
param self
Returns
evsys_sdk.experiment.ExperimentResult

func _iter_runs(self) -> list[RunConfig]
Expand sweep matrix / single-run / multi-run to a flat list of primaries.
Each entry is a "group" when n_repeats > 1. See
_replicates_for for the per-primary seeded replicates.
param self
Returns
list[evsys_sdk.config.RunConfig]

func _replicates_for(self, primary) -> list[tuple[RunConfig, str | None]]
Per-primary seed replicates.
For n_repeats == 1: returns [(primary, None)] - no group.
For n_repeats > 1: returns N tuples of
(<RunConfig with name=primary.name__s<seed> and seed=<seed>>, primary.name).
Seeds run [base_seed, base_seed+1, ...] when base_seed is set,
else [primary.seed, primary.seed+1, ...].
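The replicate rules above can be sketched in a few lines. This is a minimal illustration, not the SDK's implementation: the RunConfig class here is a hypothetical two-field stand-in for evsys_sdk.config.RunConfig, and n_repeats and base_seed are passed as plain arguments because the source does not say where the real method reads them from.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical stand-in; the real RunConfig carries more fields."""
    name: str
    seed: int


def replicates_for(primary, n_repeats, base_seed=None):
    """Expand one primary run into (config, group_name) pairs."""
    if n_repeats == 1:
        # Single run: keep the primary untouched and attach no group.
        return [(primary, None)]
    # Seeds count up from base_seed when it is set, else from primary.seed.
    start = base_seed if base_seed is not None else primary.seed
    return [
        (replace(primary, name=f"{primary.name}__s{seed}", seed=seed), primary.name)
        for seed in range(start, start + n_repeats)
    ]
```

Under these assumptions, a primary named lr1e-4 with n_repeats=3 and base_seed=100 yields three configs named lr1e-4__s100, lr1e-4__s101 and lr1e-4__s102, all grouped under lr1e-4.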
param self
param primary: RunConfig
Returns
list[tuple[evsys_sdk.config.RunConfig, str | None]]

func _resolve_benchmarks(self, raw) -> list[tuple[Benchmark, dict]]
Normalize metadata.benchmark to a list of (Benchmark, spec).
Accepts three shapes:
- None / empty → [] (no eval).
- single dict (legacy single-benchmark form) → wrapped into a one-element list with name defaulting to "benchmark" (or the spec's own name if present).
- list[dict] (new multi-benchmark form) → each entry must carry a name and may carry tags and run_every.
self._benchmark_override (test seam) bypasses everything and
returns a single-entry list.
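The three accepted shapes can be illustrated with a small normalizer. This sketch covers only the spec side: it returns plain dicts rather than (Benchmark, spec) tuples, because the source does not show how a Benchmark is built. Raising ValueError for a list entry without a name is an assumption, as is the path key used in the usage note.

```python
def normalize_benchmark_specs(raw):
    """Normalize a metadata.benchmark value to a list of spec dicts."""
    if not raw:
        # None, {} and [] all mean "no eval".
        return []
    if isinstance(raw, dict):
        # Legacy single-benchmark form: default the name, but let the
        # spec's own name win when it is present.
        return [{"name": "benchmark", **raw}]
    if isinstance(raw, list):
        specs = []
        for index, entry in enumerate(raw):
            if not isinstance(entry, dict) or "name" not in entry:
                # Assumed failure mode; the real method may react differently.
                raise ValueError(f"benchmark entry {index} must be a dict with a 'name'")
            specs.append(dict(entry))
        return specs
    raise TypeError(f"unsupported benchmark spec type: {type(raw).__name__}")
```

For example, normalize_benchmark_specs({"path": "x"}) gives a one-element list whose entry is named "benchmark", while a list of named entries passes through unchanged as copies.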
param self
param raw: dict | list | None
Returns
list[tuple[evsys_sdk.benchmark.Benchmark, dict]]

func _create_experiment(self, hypothesis, tags, meta) -> str | None
param self
param hypothesis: str | None
param tags: list[str]
param meta: dict
Returns
str | None

func _execute_arm(self, experiment_id, run_cfg, benchmarks, meta, *, group_id=None, group_name=None, score_api_models=True) -> ArmResult
param self
param experiment_id: str | None
param run_cfg: RunConfig
param benchmarks: list[tuple[Benchmark, dict]]
param meta: dict
param group_id: str | None = None
param group_name: str | None = None
param score_api_models: bool = True
Returns
evsys_sdk.experiment.ArmResult

func _train_arm(self, arm, run_cfg) -> ArmResult
param self
param arm: ArmResult
param run_cfg: RunConfig
Returns
evsys_sdk.experiment.ArmResult

func _forward_step_metrics(self, arm) -> None
Push the arm's local metrics.jsonl rows to the store.
Runner-time logging writes locally; this batch-forwards to the dashboard so the script doesn't have to call backfill_step_metrics manually after training.
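The read-then-batch-forward pattern described above might look roughly like the following. Both helpers are hypothetical: send_batch stands in for whatever store call the SDK actually makes, and the batch size of 500 is an arbitrary choice for the sketch.

```python
import json
from pathlib import Path


def read_metric_rows(path):
    """Parse one JSON object per non-empty line of a metrics.jsonl file."""
    rows = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def forward_in_batches(rows, send_batch, batch_size=500):
    """Call send_batch on consecutive chunks of rows; return the chunk count."""
    sent = 0
    for start in range(0, len(rows), batch_size):
        # send_batch is a hypothetical callable standing in for the store write.
        send_batch(rows[start:start + batch_size])
        sent += 1
    return sent
```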
param self
param arm: ArmResult
Returns
None

func _resolve_run_dir(self, arm) -> Path | None
Reconstruct the run output dir the runner wrote into.
param self
param arm: ArmResult
Returns
pathlib.Path | None

func _resolve_inference_factory(self, run_cfg) -> InferenceFactory | None
User-supplied factory wins; otherwise pick a default by backend kind.
Falls back to the registry's get_default_inference_factory so we
don't have to import backend-specific inference modules here - e.g.
tinker registers its own default at module load.
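The resolution order can be sketched with a tiny registry. The names below are illustrative only: get_default_inference_factory appears in the text above, but its real signature is not shown, so this version (keyed by a backend-kind string) and the register_default_inference_factory helper are assumptions.

```python
# Hypothetical registry: backend kind -> default inference factory.
_DEFAULT_FACTORIES = {}


def register_default_inference_factory(kind, factory):
    """A backend module would call this once when it is imported."""
    _DEFAULT_FACTORIES[kind] = factory


def get_default_inference_factory(kind):
    """Return the registered default for this backend kind, or None."""
    return _DEFAULT_FACTORIES.get(kind)


def resolve_inference_factory(user_factory, backend_kind):
    """User-supplied factory wins; otherwise fall back to the registry."""
    if user_factory is not None:
        return user_factory
    return get_default_inference_factory(backend_kind)
```

Keeping the lookup behind a registry means the orchestration code only needs the backend's kind, and a backend that was never imported simply resolves to None.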
param self
param run_cfg: RunConfig
Returns
evsys_sdk.experiment.InferenceFactory | None

func _eval_arm(self, arm, run_cfg, benchmarks, meta, *, score_api_models=True) -> ArmResult
Score each post-training benchmark and attach an EvalResult per entry.
score_api_models=False skips the closed/API-model (benchmark.models)
evals. This is used for continual stages after the first: a closed model's
weights are fixed across stages, so one scoring suffices (only the trained
checkpoint, which changes per stage, is re-scored).
Entries flagged with run_every are in-loop and skipped here (their
scoring happens during training in the algorithm wrapper - task
commit 2). Entries without run_every get a single post-training
EvalResult appended to arm.evals.
After all benchmarks are scored, the flat back-compat fields
(arm.eval_metrics / eval_breakdowns / eval_seconds)
mirror the primary eval - the first test-tagged
post-training row, else the first post-training row, else nothing.
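The primary-eval rule in the last paragraph can be written as a short selection function. This is a sketch under assumptions: eval rows are modeled as plain dicts with optional tags and run_every keys, whereas the real EvalResult type and its field names are not shown in this reference.

```python
def pick_primary_eval(evals):
    """Return the row that should fill the flat back-compat fields, or None."""
    # Rows carrying run_every are in-loop evals and never count as primary.
    post_training = [row for row in evals if row.get("run_every") is None]
    for row in post_training:
        if "test" in (row.get("tags") or ()):
            # First test-tagged post-training row wins.
            return row
    # Otherwise fall back to the first post-training row, if any.
    return post_training[0] if post_training else None
```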
param self
param arm: ArmResult
param run_cfg: RunConfig
param benchmarks: list[tuple[Benchmark, dict]]
param meta: dict
param score_api_models: bool = True
Returns
evsys_sdk.experiment.ArmResult

func _eval_arm_harbor(self, arm, run_cfg, bench, bench_meta, *, api_model=None) -> None
Score one benchmark through harbor's rollout engine and upload the
eval rollouts (kind='eval'). Opt-in via benchmark.engine: harbor.
api_model (a litellm string) scores a closed / API model instead of
the trained checkpoint - same rollout path, model_client='litellm',
recorded as its own per-model eval. None → the trained checkpoint.
param self
param arm: ArmResult
param run_cfg: RunConfig
param bench: Benchmark
param bench_meta: dict
param api_model: str | None = None
Returns
None

func _final_checkpoint(arm) -> str | None
The trained sampler checkpoint URI from the arm's artifacts.
param arm: ArmResult
Returns
str | None

func _run_continual(self, experiment_id, benchmarks, meta) -> list[ArmResult]
Train the base run once per dataset in continual.datasets, in
order, chaining each stage's final weights (fresh optimizer) into the
next. Each completed stage is scored on every benchmark via
_execute_arm; a chain stops at the first stage that does not
complete.
With n_repeats > 1 the whole chain is replicated once per seed
([base_seed, base_seed+1, ...] or [run.seed, ...]). Stage i
across all repeats shares one dashboard group, so variance is aggregated
per stage. Within a chain, weights are chained only between that chain's
own stages.
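The chaining and early-stop behaviour can be sketched as a nested loop over seeds and datasets. Everything here is illustrative: train_stage is a hypothetical callable returning a dict with completed and final_weights keys, and including the incomplete stage in the returned list is an assumption the source does not confirm.

```python
def run_continual(datasets, seeds, train_stage):
    """Run one chain per seed; return (seed, stage_index, result) tuples."""
    results = []
    for seed in seeds:
        # Each chain starts from the base model and never sees another
        # chain's weights.
        weights = None
        for stage, dataset in enumerate(datasets):
            result = train_stage(dataset=dataset, seed=seed, init_weights=weights)
            results.append((seed, stage, result))
            if not result["completed"]:
                # The chain stops at the first stage that does not complete.
                break
            # Only the weights carry over; the optimizer starts fresh.
            weights = result["final_weights"]
    return results
```

Grouping the returned tuples by stage_index gives one set of results per stage across seeds, which mirrors the per-stage dashboard groups described above.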
param self
param experiment_id: str | None
param benchmarks: list[tuple[Benchmark, dict]]
param meta: dict
Returns
list[evsys_sdk.experiment.ArmResult]

func _final_state_checkpoint(arm) -> str | None
The full training-state path (weights + optimizer) of an arm's final
checkpoint. Continual learning loads only the weights from this path into
the next stage. Distinct from _final_checkpoint, which returns the
inference-only sampler path.
param arm: ArmResult
Returns
str | None

func _create_group(self, experiment_id, name) -> str | None
Register a run group for variance studies; returns its id (or None).
param self
param experiment_id: str | None
param name: str
Returns
str | None

func _create_run(self, experiment_id, run_cfg, *, group_id=None) -> str | None
param self
param experiment_id: str | None
param run_cfg: RunConfig
param group_id: str | None = None
Returns
str | None

func _mark_run_completed(self, run_id, arm) -> None
param self
param run_id: str | None
param arm: ArmResult
Returns
None

func _mark_run_failed(self, run_id, error) -> None
param self
param run_id: str | None
param error: str
Returns
None

func _record_eval(self, arm, benchmark, bench_meta, score) -> None
param self
param arm: ArmResult
param benchmark: Benchmark
param bench_meta: dict
param score: BenchmarkScore
Returns
None

func _finalize_experiment(self, experiment_id, status, best_score, conclusion) -> None
param self
param experiment_id: str | None
param status: str
param best_score: float | None
param conclusion: str
Returns
None

func _pick_best(self, arms, metric) -> ArmResult | None
param self
param arms: list[ArmResult]
param metric: str
Returns
evsys_sdk.experiment.ArmResult | None

func _build_conclusion(self, arms, best_arm, success_metric) -> str
param self
param arms: list[ArmResult]
param best_arm: ArmResult | None
param success_metric: str | None
Returns
str