Benchmark
A harbor-format eval suite loaded into memory.
Two builders:
from_dir(path) - loads data/benchmark/<name>/{tasks.jsonl,metadata.yaml}.
from_iterable(name, rows, metadata=...) - for tests / programmatic use.
score(client) runs inference + verification and returns a BenchmarkScore.
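As a rough illustration of the directory layout from_dir reads, the sketch below writes tasks.jsonl and metadata.yaml under data/benchmark/<name>/ using only the standard library. The row fields (id, instruction, metadata) and the single metadata.yaml key are assumptions for illustration. This page only mentions task.instruction and task.metadata, so the real HarborTask schema may differ.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows: field names are assumptions, not the confirmed
# HarborTask schema.
rows = [
    {"id": "t1", "instruction": "Add 2 and 3.", "metadata": {"topic": "math"}},
    {"id": "t2", "instruction": "Reverse 'abc'.", "metadata": {"topic": "strings"}},
]


def write_benchmark_dir(base: Path, name: str, rows: list) -> Path:
    """Write <base>/data/benchmark/<name>/{tasks.jsonl,metadata.yaml}."""
    root = base / "data" / "benchmark" / name
    root.mkdir(parents=True, exist_ok=True)
    with (root / "tasks.jsonl").open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")  # one JSON object per line
    # Minimal hand-written YAML; the key used here is a placeholder.
    (root / "metadata.yaml").write_text(f"name: {name}\n", encoding="utf-8")
    return root


root = write_benchmark_dir(Path(tempfile.mkdtemp()), "toy", rows)
loaded = [
    json.loads(line)
    for line in (root / "tasks.jsonl").read_text(encoding="utf-8").splitlines()
]
```

With the SDK installed, a directory like this would presumably be handed to Benchmark.from_dir(root), and the same rows to Benchmark.from_iterable("toy", rows) in tests.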
Attributes
attribute name: str
attribute tasks: list[HarborTask]
attribute metadata: dict = field(default_factory=dict)
attribute root: Path | None = None
Filesystem dir the benchmark was loaded from (None if in-memory).
Functions
func from_dir(cls, path) -> Benchmark
param cls
param path: str | Path
Returns
evsys_sdk.benchmark.Benchmark

func from_iterable(cls, name, rows, *, metadata=None) -> Benchmark
param cls
param name: str
param rows: list[dict] | list[HarborTask]
param metadata: dict | None = None
Returns
evsys_sdk.benchmark.Benchmark

func load(cls, spec, *, store=None) -> Benchmark | None
Resolve a benchmark spec {path | id | name} to a Benchmark.
The single resolver shared by the experiment config, the standalone
run_benchmark, and the CLI - so all accept the same references:
path → local harbor dir (delegates to from_dir); offline / dev.
id → dashboard benchmark id, pulled into the local .evsys/.
name → resolved to the latest version's id, then pulled.
Returns None when the spec carries none of those. store is
needed only for the id / name paths.
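The routing described above can be pictured with a small stand-in. This is only a sketch, not the SDK's implementation: classify_spec and needs_store are hypothetical helpers, and the order in which the keys are checked (path, then id, then name) is an assumption, since the text does not say what happens when a spec carries more than one of them.

```python
from __future__ import annotations

from typing import Any


def classify_spec(spec: dict[str, Any]) -> str | None:
    """Return the route a spec would take: 'path', 'id', 'name', or None.

    Assumption: keys are checked in the order path, id, name.
    """
    for key in ("path", "id", "name"):
        if spec.get(key):
            return key
    return None  # mirrors load() returning None for an empty spec


def needs_store(spec: dict[str, Any]) -> bool:
    """Per the text above, only the id / name routes need a store."""
    return classify_spec(spec) in ("id", "name")
```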
param cls
param spec: dict[str, Any]
param store: Any = None
Returns
evsys_sdk.benchmark.Benchmark | None

func score(self, client, *, max_tokens=512, temperature=0.0, stop=None, prompt_builder=None, breakdown_keys=None, limit=None, metrics=None, num_samples=1) -> BenchmarkScore
Run each task through client and score the completion.
Sequential - wrap in a thread/process pool externally if you need concurrency. (Most local clients are GPU-bound and don't benefit.)
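One way to add concurrency externally is to shard the tasks and score each shard in a thread pool, then merge the per-shard results. The sketch below shows that pattern with the standard library only. fake_score is a stand-in for "build a sub-benchmark from these rows and call score(client)", and the result shape it returns is an assumption, not the BenchmarkScore type.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items: list, n_chunks: int) -> list:
    """Split items into up to n_chunks contiguous, order-preserving shards."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_shards(score_fn, shards: list, max_workers: int = 4) -> list:
    """Run score_fn on each shard in a thread pool; results keep shard order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_fn, shards))


def fake_score(shard: list) -> dict:
    # Hypothetical scorer: reward 1.0 for tasks whose name contains "ok".
    rewards = [1.0 if "ok" in task else 0.0 for task in shard]
    return {"n": len(rewards), "reward_sum": sum(rewards)}


tasks = ["ok-1", "bad-2", "ok-3", "ok-4", "bad-5"]
parts = score_shards(fake_score, chunk(tasks, 2))

# Merge with sums rather than averaging shard means, so uneven shards
# are weighted correctly.
total_n = sum(p["n"] for p in parts)
mean_reward = sum(p["reward_sum"] for p in parts) / total_n
```

Threads suit clients that wait on a remote API. For a GPU-bound local client, as noted above, the extra workers are unlikely to help.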
prompt_builder(task) -> str lets callers shape the model input;
default is task.instruction verbatim.
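A prompt builder is just a function from a task to a string. The sketch below uses FakeTask, a stand-in for HarborTask that carries only the two fields named on this page. The template text is illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTask:
    """Stand-in for HarborTask with only the fields this page mentions."""
    instruction: str
    metadata: dict = field(default_factory=dict)


def default_prompt(task) -> str:
    # Mirrors the documented default: the instruction, verbatim.
    return task.instruction


def step_by_step_prompt(task) -> str:
    # A custom builder: wrap the instruction in a fixed template.
    return f"Solve the task step by step.\n\nTask: {task.instruction}\n\nAnswer:"


task = FakeTask(instruction="Add 2 and 3.")
prompt = step_by_step_prompt(task)
```

Such a function would then be passed as prompt_builder=step_by_step_prompt.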
breakdown_keys are dotted attribute paths into task.metadata. Each
key produces {value -> {n, mean_reward, pass_rate}} in the result.
limit caps how many tasks are scored: the first limit tasks in
self.tasks (deterministic, in benchmark order). Useful for fast
smoke runs on large benchmarks. None means score everything.
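To make the breakdown shape and the limit behaviour concrete, here is a minimal sketch that groups rewards by one dotted key and slices to the first limit items. It rests on three assumptions that this page does not confirm: dotted paths walk nested dicts, a missing key groups under None, and a task counts as passed when its reward is at least 1.0.

```python
from collections import defaultdict


def resolve(metadata: dict, dotted: str):
    """Follow a dotted path such as 'source.difficulty' through nested dicts."""
    node = metadata
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None  # assumption: missing keys group under None
        node = node[part]
    return node


def breakdown(results: list, key: str, limit=None) -> dict:
    """results: list of (metadata, reward) pairs in benchmark order."""
    if limit is not None:
        results = results[:limit]  # first `limit` items, deterministic
    groups = defaultdict(list)
    for metadata, reward in results:
        groups[resolve(metadata, key)].append(reward)
    return {
        value: {
            "n": len(rs),
            "mean_reward": sum(rs) / len(rs),
            # assumption: "pass" means reward >= 1.0
            "pass_rate": sum(r >= 1.0 for r in rs) / len(rs),
        }
        for value, rs in groups.items()
    }


results = [
    ({"source": {"difficulty": "easy"}}, 1.0),
    ({"source": {"difficulty": "hard"}}, 0.5),
    ({"source": {"difficulty": "easy"}}, 0.0),
    ({"source": {"difficulty": "hard"}}, 1.0),
]
table = breakdown(results, "source.difficulty")
smoke = breakdown(results, "source.difficulty", limit=2)
```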
param self
param client: InferenceClient
param max_tokens: int = 512
param temperature: float = 0.0
param stop: list[str] | None = None
param prompt_builder: 'callable | None' = None
param breakdown_keys: list[str] | None = None
param limit: int | None = None
param metrics: list[str] | None = None
param num_samples: int = 1
Returns
evsys_sdk.benchmark.BenchmarkScore

func score_via_harbor(self, *, model_name, model_path=None, model_client='tinker', workspace_dir, renderer_name=None, num_samples=1, max_tokens=512, temperature=0.0, system_prompt=None, limit=None, breakdown_keys=None, metrics=None, n_concurrent=8, agent_import_path=None, max_retries=2, _job_factory=None) -> BenchmarkScore
Score this benchmark through harbor's rollout engine - the harbor
counterpart of score, returning the same BenchmarkScore.
Each task is rolled out num_samples times (one verifier reward per
sample); metrics (registry names like pass@3) reduce the per-task
sample rewards, and time/tokens/cost_per_task come from harbor usage.
model_client is "tinker" (on-policy checkpoint, needs
model_path) or "litellm" (closed/API model; model_name is a
litellm string). The result's BenchmarkScore.rollouts carries the
raw per-(task, sample) rollouts so callers can upload eval predictions
without re-running.
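As an illustration of how a metric such as pass@3 can reduce per-task sample rewards, the sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the metrics registry computes pass@k this way is an assumption, as is the rule that a sample counts as correct when its reward is at least 1.0.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n samples of which c are correct, is correct."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(rewards_per_task: list, k: int) -> float:
    """Average per-task pass@k over the benchmark.

    Assumption: a sample is correct when its reward is >= 1.0.
    """
    scores = [
        pass_at_k(len(rs), sum(r >= 1.0 for r in rs), k)
        for rs in rewards_per_task
    ]
    return sum(scores) / len(scores)


# Two tasks, num_samples=4 each: one correct sample, then none.
rewards = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
pass_at_3 = benchmark_pass_at_k(rewards, k=3)
```

The estimator needs num_samples to be at least k, so pass@3 only makes sense with num_samples of 3 or more.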
param self
param model_name: str
param model_path: str | None = None
param model_client: str = 'tinker'
param workspace_dir: Path
param renderer_name: str | None = None
param num_samples: int = 1
param max_tokens: int = 512
param temperature: float = 0.0
param system_prompt: str | None = None
param limit: int | None = None
param breakdown_keys: list[str] | None = None
param metrics: list[str] | None = None
param n_concurrent: int = 8
param agent_import_path: str | None = None
param max_retries: int = 2
param _job_factory: Any | None = None
Returns
evsys_sdk.benchmark.BenchmarkScore

func __init__(self, name, tasks, metadata=dict(), root=None) -> None
param self
param name: str
param tasks: list[HarborTask]
param metadata: dict = dict()
param root: Path | None = None
Returns
None