Standalone benchmark run - score a benchmark on a model, no training.

Reuses the same harbor rollout path as in-training eval (:func:`evsys_sdk.training.harbor_eval.score_via_harbor`), plus the same metrics and eval-rollout upload. It differs from a post-training eval in only two ways: there is no checkpoint, and the rollout LLM is a closed / API model reached via litellm (model_client="litellm"). Rollouts persist under the local .evsys/ workspace and, when both store and run_id are given, are also pushed to the dashboard.

    from evsys_sdk import run_benchmark

    # by local path, dashboard id, or name - same resolver as the config
    metrics = run_benchmark(path="data/benchmark/tool-search", model="anthropic/claude-opus-4-1")
    metrics = run_benchmark(id="bench_abc123", model="openai/gpt-4o", store=store)

API keys come from the standard provider env vars (ANTHROPIC_API_KEY, OPENAI_API_KEY, …).
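Because a missing key would only surface once rollouts start, a small pre-flight check can be handy. The sketch below is illustrative only: `missing_api_key` and `PROVIDER_KEY_VARS` are hypothetical names, not part of evsys_sdk, and the mapping assumes the provider is the part of the litellm model string before the first slash. Only the two variables named above are listed.

```python
import os
from typing import Mapping, Optional

# Assumed mapping from the model-string prefix to the provider env var.
# Only the two variables mentioned in this page are included; extend as needed.
PROVIDER_KEY_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_api_key(model: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the env var that is unset for `model`, or None.

    Hypothetical helper: unknown providers are not checked and yield None.
    """
    provider = model.split("/", 1)[0]
    var = PROVIDER_KEY_VARS.get(provider)
    if var is None:
        return None  # provider not in the mapping: nothing to verify here
    # An empty string counts as missing.
    return None if env.get(var) else var


if __name__ == "__main__":
    problem = missing_api_key("anthropic/claude-opus-4-1")
    if problem:
        print(f"Set {problem} before running the benchmark")
```

Passing `env` explicitly keeps the check easy to exercise without touching the real environment.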

attribute __all__ = ['run_benchmark']

func run_benchmark(benchmark=None, *, model, path=None, id=None, name=None, num_samples=1, max_tokens=512, temperature=0.0, system_prompt=None, limit=None, n_concurrent=8, workspace_dir=None, store=None, run_id=None, eval_id=None) -> dict[str, float]

Score a benchmark on a closed / API model through harbor - no training.

The benchmark is given directly (benchmark=) or resolved by path / id / name via the shared :meth:`Benchmark.load` resolver (the same references the config accepts). model is a litellm string, e.g. "anthropic/claude-opus-4-1" or "openai/gpt-4o". Repeats use the per-task num_samples and run in one async harbor job. Returns the eval metric dict (mean_reward, pass_rate, n_tasks, and time/tokens/cost per task), and uploads the per-task eval rollouts when both store and run_id are set.
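Since the function takes a model string and returns a flat metric dict, scoring the same benchmark on several models is a short loop. The following is a minimal sketch, not part of the SDK: `compare_models` is a hypothetical helper, and the runner is passed in as an argument so the snippet does not depend on evsys_sdk being importable. It relies only on the mean_reward and pass_rate keys named above.

```python
from typing import Callable, List, Mapping, Tuple

Metrics = Mapping[str, float]


def compare_models(
    run: Callable[..., Metrics],
    models: List[str],
    **benchmark_kwargs,
) -> List[Tuple[str, float, float]]:
    """Score one benchmark on each model and rank by mean_reward (best first).

    `run` is expected to behave like run_benchmark: it accepts model= plus
    the benchmark reference keywords and returns the metric dict.
    """
    rows = []
    for model in models:
        metrics = run(model=model, **benchmark_kwargs)
        rows.append((model, metrics["mean_reward"], metrics["pass_rate"]))
    # Highest mean_reward first.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

With the real function this would be called as `compare_models(run_benchmark, ["anthropic/claude-opus-4-1", "openai/gpt-4o"], path="data/benchmark/tool-search")`. Each model is scored in its own call, so every call is a separate harbor job.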

param benchmark: Benchmark | None = None
param model: str
param path: str | None = None
param id: str | None = None
param name: str | None = None
param num_samples: int = 1
param max_tokens: int = 512
param temperature: float = 0.0
param system_prompt: str | None = None
param limit: int | None = None
param n_concurrent: int = 8
param workspace_dir: str | Path | None = None
param store: Any = None
param run_id: str | None = None
param eval_id: str | None = None

Returns

dict[str, float]
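The exact key names for the per-task time, tokens, and cost entries are not listed here, so a generic formatter that prints whatever keys come back is a safe way to inspect the result. This is a sketch under that assumption; `format_metrics` is a hypothetical helper, not an SDK function.

```python
from typing import Mapping


def format_metrics(metrics: Mapping[str, float]) -> str:
    """Render a metric dict as aligned `key  value` lines, sorted by key."""
    # default=0 keeps an empty dict from raising in max().
    width = max((len(key) for key in metrics), default=0)
    lines = []
    for key in sorted(metrics):
        # .4g keeps small floats readable without trailing zeros.
        lines.append(f"{key:<{width}}  {metrics[key]:.4g}")
    return "\n".join(lines)
```

Calling `print(format_metrics(metrics))` on the value returned by run_benchmark would list every metric it reports, including any keys not named in this page.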
