benchmark_run
Standalone benchmark run - score a benchmark on a model, no training.
Reuses the same harbor rollout path as in-training eval
(:func:evsys_sdk.training.harbor_eval.score_via_harbor) plus the same metrics
and eval-rollout upload. The only differences from a post-training eval: there's
no checkpoint, and the rollout LLM is a closed / API model via litellm
(model_client="litellm"). Rollouts persist under the local .evsys/
workspace and, when both store and run_id are given, are pushed to the dashboard.
from evsys_sdk import run_benchmark

# by local path, dashboard id, or name - same resolver as the config
metrics = run_benchmark(path="data/benchmark/tool-search", model="anthropic/claude-opus-4-1")
metrics = run_benchmark(id="bench_abc123", model="openai/gpt-4o", store=store)

API keys come from the standard provider env vars (ANTHROPIC_API_KEY,
OPENAI_API_KEY, …).
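
Because a missing key only surfaces once rollouts start, a pre-flight check can be useful. The following is a minimal sketch, not part of evsys_sdk: the helper name missing_api_key and the PROVIDER_ENV_VARS table are hypothetical, and the table covers only the two variables named above, assuming the provider is the prefix of the litellm model string.

```python
import os
from typing import Mapping, Optional

# Assumed mapping from litellm provider prefix to env var. Only the two
# variables named in the docs are listed; other providers are not covered.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_api_key(
    model: str, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return the name of the unset env var for `model`, or None.

    Hypothetical helper: returns None both when the key is present and
    when the provider prefix is not in PROVIDER_ENV_VARS.
    """
    env = os.environ if env is None else env
    provider = model.split("/", 1)[0]  # "anthropic/claude-..." -> "anthropic"
    var = PROVIDER_ENV_VARS.get(provider)
    if var is None:
        return None  # unknown provider: nothing to check here
    return None if env.get(var) else var
```

A caller could run this before run_benchmark and raise a clear error if it returns a variable name.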
attribute __all__ = ['run_benchmark']

func run_benchmark(benchmark=None, *, model, path=None, id=None, name=None, num_samples=1, max_tokens=512, temperature=0.0, system_prompt=None, limit=None, n_concurrent=8, workspace_dir=None, store=None, run_id=None, eval_id=None) -> dict[str, float]

Score a benchmark on a closed / API model through harbor - no training.
The benchmark is given directly (benchmark=) or resolved by
path / id / name via the shared :meth:Benchmark.load resolver
(same references the config accepts). model is a litellm string, e.g.
"anthropic/claude-opus-4-1" or "openai/gpt-4o". Each task is sampled
num_samples times, and all repeats run in one async harbor job. Returns the eval metric
dict (mean_reward, pass_rate, n_tasks, and time/tokens/cost per task), and
uploads the per-task eval rollouts when both store and run_id are set.
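
As an illustration of consuming the returned dict, here is a small sketch of a one-line summary formatter. The helper summarize_metrics is hypothetical and not part of evsys_sdk. It relies only on the mean_reward, pass_rate and n_tasks keys named above, and prints any other keys as-is, since the exact names of the time/tokens/cost entries are not specified here.

```python
from typing import Mapping


def summarize_metrics(metrics: Mapping[str, float]) -> str:
    """Format an eval metric dict as a single log-friendly line.

    Hypothetical helper: the three headline keys come from the docs above,
    any remaining keys are appended in sorted order without interpretation.
    """
    head = ["mean_reward", "pass_rate", "n_tasks"]
    parts = []
    for key in head:
        if key in metrics:
            value = metrics[key]
            # n_tasks is a count, so show it as an integer
            if key == "n_tasks":
                parts.append(f"{key}={int(value)}")
            else:
                parts.append(f"{key}={value:.3f}")
    # Remaining keys (e.g. time/tokens/cost entries) under their own names
    for key in sorted(k for k in metrics if k not in head):
        parts.append(f"{key}={metrics[key]:.3f}")
    return " ".join(parts)
```

With the dict returned by run_benchmark, print(summarize_metrics(metrics)) would give a compact line suitable for logs.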
param benchmark: Benchmark | None = None
param model: str
param path: str | None = None
param id: str | None = None
param name: str | None = None
param num_samples: int = 1
param max_tokens: int = 512
param temperature: float = 0.0
param system_prompt: str | None = None
param limit: int | None = None
param n_concurrent: int = 8
param workspace_dir: str | Path | None = None
param store: Any = None
param run_id: str | None = None
param eval_id: str | None = None

Returns
dict[str, float]