Experiment - the top-level OOP orchestrator.

What it replaces in researcher scripts: the manual create_experiment → create_group → per-arm create_run → run_experiment(cfg) → create_eval → set_conclusion choreography. Today every sweep script hand-rolls that loop. An Experiment collapses it to one declarative .run() call.

Usage:

    # run.py
    from evsys_sdk import Experiment
    import src  # registers custom verifiers/transforms

    Experiment.from_yaml("config.yaml").run()

Per-arm failure isolation: if one sweep arm raises during training, that arm is marked status=failed on the dashboard and the remaining arms continue. The experiment finishes with status completed if at least one arm succeeded.

Config carries the project-shaped fields under metadata:

    metadata:
      hypothesis: "..."
      tags: ["sft", "qwen3_4b"]
      project_goal_id: "..."
      success_metric: "pass_rate"   # which metric ranks arms for best_score
      benchmark:                    # post-training eval (optional)
        path: "data/benchmark/<name>"
        id: "<dashboard benchmark id>"
        breakdown_keys: ["toolkit"]
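To illustrate how success_metric might rank arms for best_score, here is a small sketch. The function name `best_arm`, the shape of the per-arm metrics dict, and the "higher is better" ordering are all assumptions made for the example; the source does not specify them.

```python
from typing import Optional


def best_arm(
    arm_metrics: dict[str, dict[str, float]],
    success_metric: str,
) -> Optional[tuple[str, float]]:
    """Return (arm_name, score) for the arm with the highest success_metric.

    Arms that did not report the metric are skipped. Assumes higher is better.
    """
    scored = {
        arm: metrics[success_metric]
        for arm, metrics in arm_metrics.items()
        if success_metric in metrics
    }
    if not scored:
        return None
    winner = max(scored, key=lambda arm: scored[arm])
    return winner, scored[winner]


# Illustrative numbers only, not real results.
example = {
    "arm_a": {"pass_rate": 0.25, "loss": 1.5},
    "arm_b": {"pass_rate": 0.75, "loss": 1.0},
    "arm_c": {"loss": 2.0},  # failed before eval, so no pass_rate
}
picked = best_arm(example, "pass_rate")
```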

Dependencies are injectable for testing:

store: EvsysStore (None → skip dashboard records, run locally)
train_fn: (cfg) -> list[RunResult] (default: runner.run_experiment)
benchmark: Benchmark (overrides metadata.benchmark.path)
inference_factory: (RunResult, RunConfig) -> InferenceClient (called once per completed arm to build the eval client)
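For tests, train_fn and inference_factory can be replaced by plain callables with the signatures listed above. The sketch below uses stand-in types (`StubRunResult`, `StubInferenceClient`) rather than the SDK's RunResult and InferenceClient, whose fields are not shown here; the `generate` method is hypothetical. How the callables are handed to Experiment (for example as keyword arguments) is likewise an assumption, so the sketch stops short of constructing one.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StubRunResult:
    """Stand-in for RunResult; the real fields are not documented here."""
    run_name: str
    metrics: dict[str, float] = field(default_factory=dict)


@dataclass
class StubInferenceClient:
    """Stand-in for InferenceClient with a hypothetical generate() method."""
    checkpoint: str

    def generate(self, prompt: str) -> str:
        return f"[{self.checkpoint}] echo: {prompt}"


def fake_train_fn(cfg: Any) -> list[StubRunResult]:
    """Matches the (cfg) -> list[RunResult] shape without training anything."""
    return [
        StubRunResult("arm_0", {"pass_rate": 0.5}),
        StubRunResult("arm_1", {"pass_rate": 0.75}),
    ]


def fake_inference_factory(result: StubRunResult, run_cfg: Any) -> StubInferenceClient:
    """Matches the (RunResult, RunConfig) -> InferenceClient shape."""
    return StubInferenceClient(checkpoint=result.run_name)


# One client per completed arm, as described for inference_factory.
results = fake_train_fn(cfg=None)
clients = [fake_inference_factory(r, run_cfg=None) for r in results]
```

With store left as None, a test built on stubs like these would presumably run without touching the dashboard.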

attribute logger = logging.getLogger(__name__)

attribute TrainFn = Callable[[ExperimentConfig], list[RunResult]]

attribute InferenceFactory = Callable[[RunResult, RunConfig], InferenceClient]

attribute __all__ = ['ArmResult', 'EvalResult', 'Experiment', 'ExperimentResult', 'TrainFn', 'InferenceFactory']

EvalResult

ArmResult

ExperimentResult

Experiment

func _benchmark_models(bench_meta) -> list[str]

Returns the API models a benchmark should also be scored on, read from models (a list) or model (a single string) on the benchmark spec. Returns an empty list when neither is set, i.e. checkpoint-only scoring, which is the existing behavior.

param bench_meta: dict

Returns

list[str]
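The lookup described for _benchmark_models could look roughly like the sketch below. It is not the module's source: the name `benchmark_models_sketch` is hypothetical, and giving models precedence when both keys are present is an assumption, since the description does not cover that case.

```python
def benchmark_models_sketch(bench_meta: dict) -> list[str]:
    """Collect extra API models from a benchmark spec.

    Reads `models` (list) first, then `model` (single string).
    Returns [] when neither is set, i.e. checkpoint-only scoring.
    """
    models = bench_meta.get("models")
    if isinstance(models, list) and models:
        return [str(m) for m in models]
    model = bench_meta.get("model")
    if isinstance(model, str) and model:
        return [model]
    return []


# Placeholder model names, for illustration only.
from_list = benchmark_models_sketch({"models": ["api-model-a", "api-model-b"]})
from_single = benchmark_models_sketch({"model": "api-model-a"})
checkpoint_only = benchmark_models_sketch({"path": "data/benchmark/<name>"})
```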

func _predictions_from_score(score) -> list[dict]

Builds prediction rows for the in-process (non-harbor) eval path, one per task, in the same shape that harbor_eval.eval_predictions produces, so logger callbacks receive the model output and reward regardless of engine.

param score: BenchmarkScore

Returns

list[dict]

func _default_train_fn(cfg) -> list[RunResult]

param cfg: ExperimentConfig

Returns

list[evsys_sdk.protocols.RunResult]
