experiment
Experiment - the top-level OOP orchestrator.
What it replaces in researcher scripts: the manual
create_experiment → create_group → per-arm create_run →
run_experiment(cfg) → create_eval → set_conclusion choreography.
Today every sweep script hand-rolls that loop. An Experiment collapses it
to one declarative .run() call.
Usage:
experiments/<date>_<slug>/run.py
from evsys_sdk import Experiment import src # registers custom verifiers/transforms
Experiment.from_yaml("config.yaml").run()
Per-arm failure isolation: if one sweep arm raises during training, it gets
marked status=failed on the dashboard and the remaining arms continue.
The experiment finishes completed if any arm succeeded.
Config carries the project-shaped fields under metadata:
metadata: hypothesis: "..." tags: ["sft", "qwen3_4b"] project_goal_id: "..." success_metric: "pass_rate" # which metric ranks arms for best_score benchmark: # post-training eval (optional) path: "data/benchmark/<name>" id: "<dashboard benchmark id>" breakdown_keys: ["toolkit"]
Dependencies are injectable for testing:
store: EvsysStore (None → skip dashboard records, run locally)train_fn:(cfg) -> list[RunResult](default:runner.run_experiment)benchmark:Benchmark(overrides metadata.benchmark.path)inference_factory:(RunResult, RunConfig) -> InferenceClient(called once per completed arm to build the eval client)
attributelogger= logging.getLogger(__name__)attributeTrainFn= Callable[[ExperimentConfig], list[RunResult]]attributeInferenceFactory= Callable[[RunResult, RunConfig], InferenceClient]attribute__all__= ['ArmResult', 'EvalResult', 'Experiment', 'ExperimentResult', 'TrainFn', 'InferenceFactory']func_benchmark_models(bench_meta) -> list[str]API models a benchmark should also be scored on, from models (list)
or model (single string) on the benchmark spec. Empty when neither -
i.e. checkpoint-only, the existing behavior.
parambench_metadictReturns
list[str]func_predictions_from_score(score) -> list[dict]Prediction rows for the in-process (non-harbor) eval path, one per task,
in the same shape harbor_eval.eval_predictions produces - so logger
callbacks get the model output + reward regardless of engine.
paramscoreBenchmarkScoreReturns
list[dict]func_default_train_fn(cfg) -> list[RunResult]paramcfgExperimentConfigReturns
list[evsys_sdk.protocols.RunResult]