EvSys

benchmark_run

Standalone benchmark run - score a benchmark on a model, no training.

Reuses the same harbor rollout path as in-training eval (evsys_sdk.training.harbor_eval.score_via_harbor), plus the same metrics and eval-rollout upload. There are only two differences from a post-training eval: there is no checkpoint, and the rollout LLM is a closed / API model served through litellm (model_client="litellm"). Rollouts persist under the local .evsys/ workspace and, when both store and run_id are given, are pushed to the dashboard.

from evsys_sdk import run_benchmark

# by local path, dashboard id, or name - same resolver as the config
metrics = run_benchmark(path="data/benchmark/tool-search", model="anthropic/claude-opus-4-1")
metrics = run_benchmark(id="bench_abc123", model="openai/gpt-4o", store=store)

API keys come from the standard provider env vars (ANTHROPIC_API_KEY, OPENAI_API_KEY, …).
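Because a missing key only surfaces once rollouts start, it can help to check the environment first. The following is a minimal sketch, not part of evsys_sdk: the helper names and the provider-to-variable mapping are assumptions for illustration, and the mapping covers only the two providers named on this page.

```python
from __future__ import annotations

import os
from typing import Mapping

# Assumed mapping from the provider prefix of a litellm model string to the
# env var that provider reads. Only the providers named on this page are listed.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def required_env_var(model: str) -> str | None:
    """Return the env var name for a "provider/model" string, or None if unknown."""
    provider, sep, _ = model.partition("/")
    if not sep:
        # No "provider/" prefix, so this sketch cannot tell which key is needed.
        return None
    return PROVIDER_ENV_VARS.get(provider)


def missing_api_key(model: str, environ: Mapping[str, str] | None = None) -> str | None:
    """Return the name of a required but unset (or empty) env var, else None."""
    env = os.environ if environ is None else environ
    var = required_env_var(model)
    if var is not None and not env.get(var):
        return var
    return None
```

Calling missing_api_key("anthropic/claude-opus-4-1") before run_benchmark returns the variable name to report if the key is absent, and None when the key is set or the provider is not in the mapping.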

attribute __all__ = ['run_benchmark']

func run_benchmark(benchmark=None, *, model, path=None, id=None, name=None, num_samples=1, max_tokens=512, temperature=0.0, system_prompt=None, limit=None, n_concurrent=8, workspace_dir=None, store=None, run_id=None, eval_id=None) -> dict[str, float]

Score a benchmark on a closed / API model through harbor - no training.

The benchmark is either passed directly (benchmark=) or resolved by path / id / name through the shared Benchmark.load resolver, which accepts the same references as the config. model is a litellm model string, e.g. "anthropic/claude-opus-4-1" or "openai/gpt-4o". Each task is sampled num_samples times, and all repeats run in a single async harbor job. Returns the eval metric dict (mean_reward, pass_rate, n_tasks, and per-task time / tokens / cost) and uploads the per-task eval rollouts when both store and run_id are set.
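The argument rules above can be sketched as a small pre-flight check. This is a hypothetical helper, not part of evsys_sdk: it only mirrors the documented behavior (a benchmark reference may come from benchmark, path, id, or name; upload needs both store and run_id) and assumes that "set" means "not None".

```python
from __future__ import annotations

from typing import Any


def preflight(
    benchmark: Any = None,
    *,
    path: str | None = None,
    id: str | None = None,
    name: str | None = None,
    store: Any = None,
    run_id: str | None = None,
) -> dict[str, Any]:
    """Report which benchmark references were supplied and whether upload applies."""
    candidates = {"benchmark": benchmark, "path": path, "id": id, "name": name}
    # Keep only the references the caller actually supplied.
    given = [key for key, value in candidates.items() if value is not None]
    return {
        "references": given,
        # Per the description above, rollouts upload only when both are set.
        "will_upload": store is not None and run_id is not None,
    }
```

Passing store without run_id yields will_upload False, which is worth checking before a long run whose rollouts are expected on the dashboard.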

param benchmark: Benchmark | None = None
param model: str
param path: str | None = None
param id: str | None = None
param name: str | None = None
param num_samples: int = 1
param max_tokens: int = 512
param temperature: float = 0.0
param system_prompt: str | None = None
param limit: int | None = None
param n_concurrent: int = 8
param workspace_dir: str | Path | None = None
param store: Any = None
param run_id: str | None = None
param eval_id: str | None = None

Returns

dict[str, float]
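A returned dict of this shape can be condensed into a one-line summary for logs. The sketch below is illustrative only: summarize is a hypothetical helper, and it relies on just the three key names documented above (mean_reward, pass_rate, n_tasks). The exact names of the time / tokens / cost keys are not given here, so any other keys are listed by name rather than interpreted.

```python
from __future__ import annotations

# Key names taken from the description of the returned metric dict.
HEADLINE_KEYS = ("mean_reward", "pass_rate", "n_tasks")


def summarize(metrics: dict[str, float]) -> str:
    """Format the documented headline metrics and list any remaining keys."""
    parts = [f"{key}={metrics[key]:g}" for key in HEADLINE_KEYS if key in metrics]
    # Other keys (e.g. time, tokens, cost) are reported by name only, because
    # their exact names and units are not specified on this page.
    extra = sorted(key for key in metrics if key not in HEADLINE_KEYS)
    if extra:
        parts.append("other keys: " + ", ".join(extra))
    return " | ".join(parts)
```

For example, print(summarize(metrics)) after a run_benchmark call gives a compact line suitable for a CI log.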
