Experiments
The organizing unit - a hypothesis, one or more training runs, and an auto-synthesized conclusion.
The Experiment is the scientific container: a hypothesis, one or more
training runs, and an auto-synthesized conclusion. It's the top of the
spine - config.yaml → Experiment → arms → backend → eval → artifacts.

    from evsys_sdk import Experiment

    Experiment.from_yaml("config.yaml").run()

Vocabulary
- Experiment - the top-level study. `Experiment.from_yaml(...).run()` owns dashboard experiment/run creation, sweep expansion, per-arm failure isolation, post-train scoring, metric forwarding, and conclusion building.
- Training run (arm) - one `RunConfig` = one concrete training job (one cell of a sweep). `runs`/`matrix` produce many arms.
- Run group - `n_repeats > 1` replicates an arm across seeds (shared `group_id`) so variance is a config field, not a bespoke script.
- Hypothesis → success_metric → conclusion - the hypothesis is the question; `success_metric` ranks arms into `best_arm`; the `conclusion` summarizes the outcome. All recorded on the dashboard.
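To make the run-group and ranking vocabulary concrete, here is a minimal sketch in plain Python. It does not use the SDK: the result records, the `accuracy` metric, and the helpers `summarize_groups` and `pick_best` are hypothetical, and ranking groups by their mean across seeds is an assumption about how `success_metric` might be applied, not documented behaviour.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical per-arm results: replicates of the same arm share a group_id.
results = [
    {"arm": "lr=1e-5/seed0", "group_id": "lr=1e-5", "accuracy": 0.71},
    {"arm": "lr=1e-5/seed1", "group_id": "lr=1e-5", "accuracy": 0.73},
    {"arm": "lr=1e-4/seed0", "group_id": "lr=1e-4", "accuracy": 0.78},
    {"arm": "lr=1e-4/seed1", "group_id": "lr=1e-4", "accuracy": 0.74},
]


def summarize_groups(results, metric):
    """Collect one metric per run group and report its mean and spread."""
    by_group = defaultdict(list)
    for record in results:
        by_group[record["group_id"]].append(record[metric])
    return {
        group: {"mean": mean(values), "std": pstdev(values), "n": len(values)}
        for group, values in by_group.items()
    }


def pick_best(summary, higher_is_better=True):
    """Rank groups by mean metric (an assumed rule) and return the winner."""
    chooser = max if higher_is_better else min
    return chooser(summary, key=lambda group: summary[group]["mean"])


summary = summarize_groups(results, "accuracy")
best_arm = pick_best(summary)
print(best_arm, summary[best_arm])
```

Because replicates carry a shared group identifier, the variance estimate falls out of a simple group-by rather than a separate script per seed.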
Config objects
| Object | Role |
|---|---|
| `ExperimentConfig` | top level: `name`, `output_dir`, stores, one of `run`/`runs`/`matrix`, `n_repeats`/`base_seed`, `parent_experiment_id`, `metadata` |
| `RunConfig` | one run: data, model, algorithm, backend, eval, validation, seed, tags - the cell the two surfaces plug into |
| `MatrixSpec` / `Sweep` | cartesian expansion over dotted-path axes → many `RunConfig`s via one `expand_runs()` |
| `ExperimentResult` / `ArmResult` | outputs: per-arm metrics + the experiment-level `best_arm`, `conclusion`, `hypothesis` |
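The cartesian expansion over dotted-path axes can be sketched as follows. This is an illustration of the idea only, not the SDK's `expand_runs()`: the helpers `set_dotted` and `expand_matrix`, the plain-dict configs, and the axis names are all hypothetical.

```python
import copy
import itertools


def set_dotted(cfg: dict, path: str, value) -> None:
    """Set cfg["a"]["b"] = value for path "a.b", creating dicts as needed."""
    *parents, leaf = path.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def expand_matrix(base: dict, axes: dict) -> list:
    """Return one config per combination of axis values (cartesian product)."""
    names = list(axes)
    arms = []
    for combo in itertools.product(*(axes[name] for name in names)):
        arm = copy.deepcopy(base)  # never mutate the shared base recipe
        for path, value in zip(names, combo):
            set_dotted(arm, path, value)
        arms.append(arm)
    return arms


# Hypothetical base recipe and axes, for illustration only.
base = {"model": {"name": "toy-model"}, "algorithm": {"kind": "sft"}}
axes = {"algorithm.lr": [1e-5, 1e-4], "seed": [0, 1, 2]}

arms = expand_matrix(base, axes)
print(len(arms))  # 2 learning rates x 3 seeds = 6 arms
```

Deep-copying the base for each combination keeps arms independent, so an override applied to one cell of the sweep cannot leak into another.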
Two runners
Experiment (OOP, with dashboard bookkeeping) wraps
run_experiment(cfg) (the inner per-arm runner). Use run_experiment
directly to train without bookkeeping.
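The split between an outer bookkeeping runner and an inner per-arm runner can be illustrated with a toy wrapper that isolates failures per arm. Everything here is a stand-in: `run_arm` and `run_all` are hypothetical names, the "training" is a formula, and the record layout is an assumption rather than the SDK's `ArmResult`.

```python
import traceback


def run_arm(cfg: dict) -> dict:
    """Stand-in inner runner: 'trains' one arm and returns its metrics."""
    if cfg.get("lr", 0) <= 0:
        raise ValueError(f"invalid learning rate: {cfg.get('lr')}")
    return {"loss": 1.0 / (1.0 + 1000 * cfg["lr"])}


def run_all(arm_cfgs: list) -> list:
    """Stand-in outer runner: run every arm, recording failures instead of aborting."""
    records = []
    for index, cfg in enumerate(arm_cfgs):
        try:
            metrics = run_arm(cfg)
            records.append({"arm": index, "status": "ok", "metrics": metrics})
        except Exception as exc:  # isolate the failure to this one arm
            records.append({
                "arm": index,
                "status": "failed",
                "error": repr(exc),
                "trace": traceback.format_exc(),
            })
    return records


records = run_all([{"lr": 1e-3}, {"lr": -1.0}, {"lr": 1e-2}])
print([record["status"] for record in records])
```

Calling the inner function directly gives the training behaviour without the surrounding records, which mirrors the choice between the two runners described above.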
Continual learning
`continual` is a modifier (not a fourth mode alongside `run`/`runs`/`matrix`) on top of a single run: give
it an ordered list of datasets, and the base run trains once per dataset, in
order. Each stage starts from the previous stage's weights with a fresh
optimizer, initialised via `init_from_checkpoint`. All stages live in one
experiment and each is scored on every benchmark, so you can watch the model
accumulate skills dataset by dataset.

    run:                      # the base recipe; its own `data` is ignored
      model: { name: Qwen/Qwen3.5-4B }
      algorithm: { kind: sft }
    continual:
      datasets:               # one training stage per entry, chained in order
        - { dataset_name: corpus_a, transforms: [...] }
        - { dataset_name: corpus_b, transforms: [...] }
        - { dataset_name: corpus_c, transforms: [...] }

This is the SDK's built-in continual-learning primitive: weights chain
forward so each stage builds on the last instead of restarting from the base
model. With `n_repeats > 1` the whole chain is replicated per seed (stage i's
replicates are grouped for variance). It requires a single `run` base (not
`runs`/`matrix`).
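The chaining behaviour can be sketched as a loop in plain Python. This is a toy model under stated assumptions, not the SDK's implementation: `train_stage`, `evaluate`, `run_chain`, and `run_continual` are hypothetical, a checkpoint is represented only by the tuple of datasets it has seen, and deriving each replicate's seed as `base_seed + r` is an assumption.

```python
def train_stage(init_from: tuple, dataset: str, seed: int) -> tuple:
    """Stand-in for one training job.

    A real job would load the weights named by init_from, build a new
    optimizer, and train with the given seed; here we only extend the lineage.
    """
    return init_from + (dataset,)


def evaluate(checkpoint: tuple, benchmarks: list) -> dict:
    """Toy scorer: a benchmark passes once its matching dataset is in the lineage."""
    return {name: float(name in checkpoint) for name in benchmarks}


def run_chain(base_model: str, datasets: list, benchmarks: list, seed: int) -> list:
    """Train one stage per dataset, each starting from the previous checkpoint."""
    checkpoint = (base_model,)
    stages = []
    for index, dataset in enumerate(datasets):
        checkpoint = train_stage(checkpoint, dataset, seed)
        stages.append({
            "stage": index,
            "dataset": dataset,
            "checkpoint": checkpoint,
            "scores": evaluate(checkpoint, benchmarks),  # every benchmark, every stage
        })
    return stages


def run_continual(base_model, datasets, benchmarks, n_repeats=1, base_seed=0):
    """Replicate the whole chain once per seed (seed derivation is assumed)."""
    return {
        base_seed + r: run_chain(base_model, datasets, benchmarks, base_seed + r)
        for r in range(n_repeats)
    }


datasets = ["corpus_a", "corpus_b", "corpus_c"]
runs = run_continual("base", datasets, benchmarks=datasets, n_repeats=2)
for stage in runs[0]:
    print(stage["stage"], stage["dataset"], stage["scores"])
```

In this toy, comparing stage i across the seeds in `runs` corresponds to grouping stage i's replicates for variance.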
Next: Data · Algorithms ·
full ExperimentConfig reference.