Benchmark

A harbor-format eval suite loaded into memory.

Two builders:

from_dir(path) - loads a directory laid out as data/benchmark/<name>/{tasks.jsonl,metadata.yaml}.
from_iterable(name, rows, metadata=...) - for tests and programmatic construction.

score(client) runs inference + verification and returns a BenchmarkScore.
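
The directory layout that from_dir expects can be sanity-checked before loading. The sketch below is not part of evsys_sdk: check_benchmark_dir and make_demo_dir are hypothetical helpers, the row field "instruction" is an assumption about the task schema, and the only facts taken from this page are the two file names.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FILES = ("tasks.jsonl", "metadata.yaml")  # names from the layout above


def check_benchmark_dir(path):
    """Return the number of task rows, or raise if the layout is incomplete.

    Hypothetical helper: it only checks that both files exist and that every
    non-blank line of tasks.jsonl parses as a JSON object.
    """
    root = Path(path)
    missing = [f for f in REQUIRED_FILES if not (root / f).is_file()]
    if missing:
        raise FileNotFoundError(f"{root} is missing: {', '.join(missing)}")
    n_rows = 0
    with open(root / "tasks.jsonl", encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            row = json.loads(line)
            if not isinstance(row, dict):
                raise ValueError(f"tasks.jsonl line {line_no} is not a JSON object")
            n_rows += 1
    return n_rows


def make_demo_dir():
    """Write a two-task demo benchmark into a temp dir (row fields are assumed)."""
    root = Path(tempfile.mkdtemp()) / "demo"
    root.mkdir()
    rows = [{"instruction": "Say hi"}, {"instruction": "Say bye"}]
    (root / "tasks.jsonl").write_text(
        "".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8"
    )
    (root / "metadata.yaml").write_text("name: demo\n", encoding="utf-8")
    return root
```

A directory that passes this check could then be handed to Benchmark.from_dir; whether from_dir accepts these particular row fields is not confirmed by this page.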

Attributes

attribute name: str

attribute tasks: list[HarborTask]

attribute metadata: dict = field(default_factory=dict)

attribute root: Path | None = None

Filesystem dir the benchmark was loaded from (None if in-memory).

Functions

func from_dir(cls, path) -> Benchmark

param cls

param path: str | Path

Returns

evsys_sdk.benchmark.Benchmark

func from_iterable(cls, name, rows, *, metadata=None) -> Benchmark

param cls

param name: str

param rows: list[dict] | list[HarborTask]

param metadata: dict | None = None

Returns

evsys_sdk.benchmark.Benchmark

func load(cls, spec, *, store=None) -> Benchmark | None

Resolve a benchmark spec {path | id | name} to a Benchmark.

The single resolver shared by the experiment config, the standalone run_benchmark, and the CLI - so all accept the same references:

path → local harbor dir (delegates to from_dir); offline / dev.
id → dashboard benchmark id, pulled into the local .evsys/.
name → resolved to the latest version's id, then pulled.

Returns None when the spec carries none of those. store is needed only for the id / name paths.
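
The dispatch described above can be sketched as a small classifier. This is an illustration, not the SDK's implementation: resolve_kind and needs_store are hypothetical names, and the precedence when a spec carries more than one key (path, then id, then name) is an assumption based only on the order in which the branches are listed.

```python
from typing import Any, Optional

SPEC_KEYS = ("path", "id", "name")  # checked in this order (assumed precedence)


def resolve_kind(spec: dict[str, Any]) -> Optional[str]:
    """Return which branch a spec would take, or None if it carries no key."""
    for key in SPEC_KEYS:
        if spec.get(key):
            return key
    return None


def needs_store(spec: dict[str, Any]) -> bool:
    """Only the id / name branches need a store, per the text above."""
    return resolve_kind(spec) in ("id", "name")
```

Under these assumptions, a spec of {"path": "data/benchmark/demo"} resolves offline with no store, while {"name": "demo"} requires one.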

param cls

param spec: dict[str, Any]

param store: Any = None

Returns

evsys_sdk.benchmark.Benchmark | None

func score(self, client, *, max_tokens=512, temperature=0.0, stop=None, prompt_builder=None, breakdown_keys=None, limit=None, metrics=None, num_samples=1) -> BenchmarkScore

Run each task through client and score the completion.

Sequential - wrap in a thread/process pool externally if you need concurrency. (Most local clients are GPU-bound and don't benefit.)
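
One way to add concurrency externally is a thread pool around the per-task work. The sketch below is generic and does not call evsys_sdk: score_concurrently is a hypothetical helper, and score_one is a hypothetical callable (task -> reward) standing in for one inference call plus verification.

```python
from concurrent.futures import ThreadPoolExecutor


def score_concurrently(tasks, score_one, max_workers=4):
    """Apply score_one to every task in a thread pool, preserving task order.

    score_one is a hypothetical callable (task -> reward) standing in for
    one inference call plus verification.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(score_one, tasks))
```

Threads mainly help when the client blocks on network I/O (for example, a remote API); as noted above, a GPU-bound local client is unlikely to gain much.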

prompt_builder(task) -> str lets callers shape the model input; default is task.instruction verbatim.
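
A prompt builder is just a function from task to string. In the sketch below, DemoTask is a stand-in for HarborTask that carries only the two attributes this page mentions (instruction and metadata), and the preamble text is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DemoTask:
    """Stand-in for HarborTask; only the attributes named on this page."""
    instruction: str
    metadata: dict = field(default_factory=dict)


def default_prompt(task) -> str:
    # Mirrors the documented default: the instruction, unchanged.
    return task.instruction


def preamble_prompt(task) -> str:
    """Hypothetical builder: prepend a fixed preamble to the instruction."""
    preamble = "Answer with the final result only.\n\n"
    return preamble + task.instruction
```

A builder like preamble_prompt would be passed as prompt_builder=preamble_prompt in the call to score.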

breakdown_keys are dotted attribute paths into task.metadata. Each key produces {value -> {n, mean_reward, pass_rate}} in the result.
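
The shape of a breakdown can be illustrated with a small grouping routine. This is a sketch under assumptions, not the SDK's code: lookup and breakdown are hypothetical helpers, metadata is assumed to be nested dicts, and a task is assumed to pass when its reward is at least 1.0.

```python
from collections import defaultdict


def lookup(metadata, dotted_key, default=None):
    """Follow a dotted path such as 'source.difficulty' through nested dicts."""
    node = metadata
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def breakdown(rows, dotted_key, pass_threshold=1.0):
    """Group (metadata, reward) pairs by the value found at dotted_key.

    pass_threshold is an assumption: a task counts as passed when its
    reward is >= 1.0. The SDK's actual pass criterion may differ.
    """
    groups = defaultdict(list)
    for metadata, reward in rows:
        groups[lookup(metadata, dotted_key)].append(reward)
    return {
        value: {
            "n": len(rewards),
            "mean_reward": sum(rewards) / len(rewards),
            "pass_rate": sum(r >= pass_threshold for r in rewards) / len(rewards),
        }
        for value, rewards in groups.items()
    }
```

For example, two "easy" tasks with rewards 1.0 and 0.0 would yield {"n": 2, "mean_reward": 0.5, "pass_rate": 0.5} under the key "easy".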

limit caps how many tasks are scored: only the first limit tasks in self.tasks are used (deterministic, in benchmark order). Useful for fast smoke runs on large benchmarks. None means score everything.

param self

param client: InferenceClient

param max_tokens: int = 512

param temperature: float = 0.0

param stop: list[str] | None = None

param prompt_builder: callable | None = None

param breakdown_keys: list[str] | None = None

param limit: int | None = None

param metrics: list[str] | None = None

param num_samples: int = 1

Returns

evsys_sdk.benchmark.BenchmarkScore

func score_via_harbor(self, *, model_name, model_path=None, model_client='tinker', workspace_dir, renderer_name=None, num_samples=1, max_tokens=512, temperature=0.0, system_prompt=None, limit=None, breakdown_keys=None, metrics=None, n_concurrent=8, agent_import_path=None, max_retries=2, _job_factory=None) -> BenchmarkScore

Score this benchmark through harbor's rollout engine. This is the harbor counterpart of score and returns the same BenchmarkScore.

Each task is rolled out num_samples times, with one verifier reward per sample. metrics (registry names like pass@3) reduce the per-task sample rewards, and time/tokens/cost_per_task come from harbor usage. model_client is "tinker" (on-policy checkpoint; requires model_path) or "litellm" (closed/API model; model_name is a litellm string). The result's BenchmarkScore.rollouts carries the raw per-(task, sample) rollouts so callers can upload eval predictions without re-running.
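
To make the reduction step concrete, here is a sketch of a pass@k metric over per-task sample rewards. It uses the widely cited unbiased estimator 1 - C(n - c, k) / C(n, k); whether the SDK's metric registry computes pass@3 this way is an assumption, as is the rule that a sample passes when its reward is at least 1.0. pass_at_k and count_passes are hypothetical names.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one pass.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 then.
    return 1.0 - comb(n - c, k) / comb(n, k)


def count_passes(rewards, pass_threshold=1.0):
    """Assumed pass rule: a sample passes when reward >= pass_threshold."""
    return sum(r >= pass_threshold for r in rewards)
```

With this estimator, k cannot exceed the number of samples, so a metric such as pass@3 would need num_samples of at least 3.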

param self

param model_name: str

param model_path: str | None = None

param model_client: str = 'tinker'

param workspace_dir: Path

param renderer_name: str | None = None

param num_samples: int = 1

param max_tokens: int = 512

param temperature: float = 0.0

param system_prompt: str | None = None

param limit: int | None = None

param breakdown_keys: list[str] | None = None

param metrics: list[str] | None = None

param n_concurrent: int = 8

param agent_import_path: str | None = None

param max_retries: int = 2

param _job_factory: Any | None = None

Returns

evsys_sdk.benchmark.BenchmarkScore

func __init__(self, name, tasks, metadata=dict(), root=None) -> None

param self

param name: str

param tasks: list[HarborTask]

param metadata: dict = dict()

param root: Path | None = None

Returns

None
