EvSys

experiment

Experiment - the top-level OOP orchestrator.

What it replaces in researcher scripts: the manual create_experiment → create_group → per-arm create_run → run_experiment(cfg) → create_eval → set_conclusion choreography. Today every sweep script hand-rolls that loop. An Experiment collapses it into one declarative .run() call.

Usage:

experiments/<date>_<slug>/run.py

from evsys_sdk import Experiment
import src  # registers custom verifiers/transforms

Experiment.from_yaml("config.yaml").run()

Per-arm failure isolation: if one sweep arm raises during training, it is marked status=failed on the dashboard and the remaining arms continue. The experiment finishes with status completed as long as at least one arm succeeded.
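The isolation rule above can be sketched in a few lines. This is a minimal illustration, not the SDK's implementation: ArmOutcome and run_arms are hypothetical names, and the overall "failed" status when no arm succeeds is an assumption (the page only states the completed case).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ArmOutcome:
    # Hypothetical record for this sketch; not the SDK's ArmResult.
    name: str
    status: str  # "completed" or "failed"
    error: Optional[str] = None


def run_arms(arms: dict[str, Callable[[], None]]) -> tuple[str, list[ArmOutcome]]:
    """Run every arm, catching per-arm exceptions so the sweep continues."""
    outcomes: list[ArmOutcome] = []
    for name, train in arms.items():
        try:
            train()
            outcomes.append(ArmOutcome(name, "completed"))
        except Exception as exc:
            # Isolate the failure: record it and move on to the next arm.
            outcomes.append(ArmOutcome(name, "failed", error=repr(exc)))
    # Assumption: the experiment is "failed" only when no arm completed.
    any_ok = any(o.status == "completed" for o in outcomes)
    return ("completed" if any_ok else "failed"), outcomes


def ok() -> None:
    pass


def boom() -> None:
    raise RuntimeError("simulated training failure")


status, outcomes = run_arms({"arm_a": ok, "arm_b": boom, "arm_c": ok})
print(status, [(o.name, o.status) for o in outcomes])
```

Catching Exception rather than BaseException keeps KeyboardInterrupt able to stop the whole sweep, which is usually what a researcher wants when cancelling a run by hand.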

Config carries the project-shaped fields under metadata:

metadata:
  hypothesis: "..."
  tags: ["sft", "qwen3_4b"]
  project_goal_id: "..."
  success_metric: "pass_rate"   # which metric ranks arms for best_score
  benchmark:                    # post-training eval (optional)
    path: "data/benchmark/<name>"
    id: "<dashboard benchmark id>"
    breakdown_keys: ["toolkit"]
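To make the role of success_metric concrete, here is one way arms could be ranked by it. This is a sketch under assumptions: best_arm is a hypothetical helper, the per-arm metrics are made-up illustration values, and "higher is better" is assumed since the page does not say how ties or direction are handled.

```python
from typing import Optional


def best_arm(
    arm_metrics: dict[str, dict[str, float]], success_metric: str
) -> Optional[tuple[str, float]]:
    """Return (arm name, score) for the arm with the highest success_metric."""
    # Skip arms that did not report the metric (for example, failed arms).
    scored = {
        arm: metrics[success_metric]
        for arm, metrics in arm_metrics.items()
        if success_metric in metrics
    }
    if not scored:
        return None
    name = max(scored, key=scored.get)  # assumption: higher is better
    return name, scored[name]


# Illustrative values only.
metrics = {
    "arm_a": {"pass_rate": 0.41, "loss": 1.2},
    "arm_b": {"pass_rate": 0.62, "loss": 0.9},
    "arm_c": {"loss": 1.5},  # no pass_rate reported
}
winner = best_arm(metrics, "pass_rate")
print(winner)
```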

Dependencies are injectable for testing:

  • store: EvsysStore (None → skip dashboard records, run locally)
  • train_fn: (cfg) -> list[RunResult] (default: runner.run_experiment)
  • benchmark: Benchmark (overrides metadata.benchmark.path)
  • inference_factory: (RunResult, RunConfig) -> InferenceClient (called once per completed arm to build the eval client)
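For tests, the injectable callables listed above can be replaced by small fakes. The sketch below uses stand-in dataclasses so it runs without evsys_sdk installed; their fields (arms, arm, status) are assumptions, not the real definitions, and the final loop only imitates the "once per completed arm" contract. How the fakes are passed to Experiment is left out because the constructor signature is not shown on this page.

```python
from dataclasses import dataclass, field
from typing import Callable


# Stand-ins for the SDK types; field names are hypothetical.
@dataclass
class ExperimentConfig:
    arms: list[str] = field(default_factory=list)


@dataclass
class RunConfig:
    arm: str


@dataclass
class RunResult:
    arm: str
    status: str


class FakeInferenceClient:
    def __init__(self, arm: str) -> None:
        self.arm = arm


# Same shapes as the TrainFn and InferenceFactory aliases on this page.
TrainFn = Callable[[ExperimentConfig], list[RunResult]]
InferenceFactory = Callable[[RunResult, RunConfig], FakeInferenceClient]


def fake_train_fn(cfg: ExperimentConfig) -> list[RunResult]:
    # No training: return one canned result per arm, the last one failed.
    return [
        RunResult(arm=a, status="failed" if i == len(cfg.arms) - 1 else "completed")
        for i, a in enumerate(cfg.arms)
    ]


built: list[str] = []


def fake_inference_factory(result: RunResult, run_cfg: RunConfig) -> FakeInferenceClient:
    built.append(result.arm)  # record calls so a test can assert on them
    return FakeInferenceClient(result.arm)


train_fn: TrainFn = fake_train_fn
factory: InferenceFactory = fake_inference_factory

results = train_fn(ExperimentConfig(arms=["a", "b", "c"]))
clients = [factory(r, RunConfig(arm=r.arm)) for r in results if r.status == "completed"]
print(built)
```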
attribute logger = logging.getLogger(__name__)
attribute TrainFn = Callable[[ExperimentConfig], list[RunResult]]
attribute InferenceFactory = Callable[[RunResult, RunConfig], InferenceClient]
attribute __all__ = ['ArmResult', 'EvalResult', 'Experiment', 'ExperimentResult', 'TrainFn', 'InferenceFactory']
func _benchmark_models(bench_meta) -> list[str]

API models that a benchmark should also be scored on, read from models (a list) or model (a single string) on the benchmark spec. Returns an empty list when neither is set, i.e. checkpoint-only scoring, the existing behavior.

param bench_meta: dict

Returns

list[str]
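The described behavior of _benchmark_models could look roughly like the following. This is a reconstruction from the docstring, not the actual source: the function name here drops the leading underscore to mark it as a sketch, the precedence of models over model when both are present is an assumption, and the model names are placeholders.

```python
def benchmark_models(bench_meta: dict) -> list[str]:
    """Collect extra API models to score a benchmark on."""
    models = bench_meta.get("models")
    if models:
        # List form: copy and normalize entries to strings.
        return [str(m) for m in models]
    model = bench_meta.get("model")
    if model:
        # Single-string form: wrap it in a list.
        return [str(model)]
    # Neither key set: checkpoint-only scoring.
    return []


print(benchmark_models({"models": ["model-a", "model-b"]}))
print(benchmark_models({"model": "model-a"}))
print(benchmark_models({"path": "data/benchmark/<name>"}))
```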
func _predictions_from_score(score) -> list[dict]

Prediction rows for the in-process (non-harbor) eval path, one per task, in the same shape harbor_eval.eval_predictions produces - so logger callbacks get the model output + reward regardless of engine.

param score: BenchmarkScore

Returns

list[dict]
func _default_train_fn(cfg) -> list[RunResult]
param cfg: ExperimentConfig

Returns

list[evsys_sdk.protocols.RunResult]
