EvSys

Benchmark

A harbor-format eval suite loaded into memory.

Two builders:

  • from_dir(path) - data/benchmark/<name>/{tasks.jsonl,metadata.yaml}
  • from_iterable(name, rows, metadata=...) - for tests / programmatic use.
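As a rough illustration of the on-disk layout that from_dir expects, the sketch below writes a tiny benchmark directory using only the standard library. The benchmark name "demo", the row fields, and the metadata.yaml contents are assumptions made for the example; this page only confirms that tasks expose instruction and metadata. The SDK calls appear as comments because they require evsys_sdk to be installed.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows: only `instruction` and `metadata` are mentioned on this
# page; the exact HarborTask schema is an assumption.
rows = [
    {"instruction": "Add 2 and 3.", "metadata": {"difficulty": "easy"}},
    {"instruction": "Reverse the string 'abc'.", "metadata": {"difficulty": "easy"}},
]

# data/benchmark/<name>/ with <name> = "demo" (hypothetical name).
root = Path(tempfile.mkdtemp()) / "data" / "benchmark" / "demo"
root.mkdir(parents=True)

# tasks.jsonl: one JSON object per line.
with (root / "tasks.jsonl").open("w", encoding="utf-8") as fh:
    for row in rows:
        fh.write(json.dumps(row) + "\n")

# metadata.yaml: written as plain text to avoid a YAML dependency.
(root / "metadata.yaml").write_text("description: demo benchmark\n", encoding="utf-8")

layout = sorted(p.name for p in root.iterdir())
n_lines = len((root / "tasks.jsonl").read_text(encoding="utf-8").splitlines())

# With the SDK installed, the two builders would then be called roughly as:
#   bench = Benchmark.from_dir(root)
#   bench = Benchmark.from_iterable("demo", rows, metadata={"description": "demo benchmark"})
```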

score(client) runs inference + verification and returns a BenchmarkScore.

Attributes

attribute name: str
attribute tasks: list[HarborTask]
attribute metadata: dict = field(default_factory=dict)
attribute root: Path | None = None

Filesystem dir the benchmark was loaded from (None if in-memory).

Functions

func from_dir(cls, path) -> Benchmark

param cls
param path: str | Path

Returns

evsys_sdk.benchmark.Benchmark

func from_iterable(cls, name, rows, *, metadata=None) -> Benchmark

param cls
param name: str
param rows: list[dict] | list[HarborTask]
param metadata: dict | None = None

Returns

evsys_sdk.benchmark.Benchmark

func load(cls, spec, *, store=None) -> Benchmark | None

Resolve a benchmark spec {path | id | name} to a Benchmark.

The single resolver shared by the experiment config, the standalone run_benchmark, and the CLI - so all accept the same references:

  • path → local harbor dir (delegates to from_dir); offline / dev.
  • id → dashboard benchmark id, pulled into the local .evsys/.
  • name → resolved to the latest version's id, then pulled.

Returns None when the spec carries none of those. store is needed only for the id / name paths.
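The dispatch described above can be sketched as a small classifier over the spec's keys. classify_spec and needs_store are hypothetical helpers written for this example, not part of the SDK, and the path > id > name precedence when several keys are present is an assumption.

```python
from typing import Any, Optional


def classify_spec(spec: dict[str, Any]) -> Optional[str]:
    """Return which reference kind a spec carries, or None if it has none.

    Hypothetical helper: the precedence order below is an assumption.
    """
    for key in ("path", "id", "name"):
        if spec.get(key):
            return key
    return None


def needs_store(spec: dict[str, Any]) -> bool:
    """Per the text, a store is only required for the id / name paths."""
    return classify_spec(spec) in ("id", "name")


kinds = [
    classify_spec({"path": "data/benchmark/demo"}),
    classify_spec({"id": "bench_123"}),
    classify_spec({"name": "demo"}),
    classify_spec({}),
]
```

A caller can use the same check to decide whether it has to construct a store before calling load.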

param cls
param spec: dict[str, Any]
param store: Any = None

Returns

evsys_sdk.benchmark.Benchmark | None

func score(self, client, *, max_tokens=512, temperature=0.0, stop=None, prompt_builder=None, breakdown_keys=None, limit=None, metrics=None, num_samples=1) -> BenchmarkScore

Run each task through client and score the completion.

Sequential - wrap in a thread/process pool externally if you need concurrency. (Most local clients are GPU-bound and don't benefit.)
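One way to add concurrency externally is to score several benchmarks in parallel threads, which may help when the client is a remote API rather than a local GPU-bound model. The sketch below uses a StubBenchmark stand-in so it runs without the SDK; the stub's return value is illustrative and does not reflect the real BenchmarkScore.

```python
from concurrent.futures import ThreadPoolExecutor


class StubBenchmark:
    """Stand-in exposing only a name and a score(client) method."""

    def __init__(self, name, n_tasks):
        self.name = name
        self.n_tasks = n_tasks

    def score(self, client):
        # A real call would run inference and verification sequentially.
        return {"benchmark": self.name, "n": self.n_tasks}


def score_all(benchmarks, client, max_workers=4):
    """Score several benchmarks concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the input iterable.
        return list(pool.map(lambda b: b.score(client), benchmarks))


results = score_all([StubBenchmark("a", 3), StubBenchmark("b", 5)], client=None)
```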

prompt_builder(task) -> str lets callers shape the model input; default is task.instruction verbatim.
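A prompt builder is just a function from a task to a string. The sketch below shows the documented default (the instruction verbatim) next to a custom builder; StubTask is a stand-in for HarborTask carrying only the fields this page mentions, and the prompt wording is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class StubTask:
    """Stand-in for HarborTask with only the fields this page mentions."""
    instruction: str
    metadata: dict = field(default_factory=dict)


def default_prompt(task) -> str:
    """Mirrors the documented default: the instruction, unchanged."""
    return task.instruction


def concise_prompt(task) -> str:
    """Custom builder that wraps the instruction in a short template."""
    return f"Answer concisely.\n\nTask: {task.instruction}\nAnswer:"


task = StubTask(instruction="Add 2 and 3.")
prompt = concise_prompt(task)

# With the SDK installed, this would be passed as:
#   bench.score(client, prompt_builder=concise_prompt)
```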

breakdown_keys are dotted attribute paths into task.metadata. Each key produces {value -> {n, mean_reward, pass_rate}} in the result.
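To make the shape of a breakdown concrete, here is a minimal sketch of grouping rewards by a dotted metadata path. It treats metadata as nested dicts and counts a task as passed when its reward is at least 1.0; both choices, along with the get_dotted and breakdown helpers, are assumptions for illustration rather than the SDK's actual logic.

```python
from collections import defaultdict


def get_dotted(mapping, path, default=None):
    """Follow a dotted path such as "source.difficulty" through nested dicts."""
    current = mapping
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def breakdown(records, key, pass_threshold=1.0):
    """Group (metadata, reward) pairs by the value found at `key`.

    Assumption: a task "passes" when reward >= pass_threshold.
    """
    groups = defaultdict(list)
    for metadata, reward in records:
        groups[get_dotted(metadata, key)].append(reward)
    return {
        value: {
            "n": len(rewards),
            "mean_reward": sum(rewards) / len(rewards),
            "pass_rate": sum(r >= pass_threshold for r in rewards) / len(rewards),
        }
        for value, rewards in groups.items()
    }


records = [
    ({"source": {"difficulty": "easy"}}, 1.0),
    ({"source": {"difficulty": "easy"}}, 0.0),
    ({"source": {"difficulty": "hard"}}, 0.5),
]
result = breakdown(records, "source.difficulty")
```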

limit caps how many tasks are scored - the first limit in self.tasks (deterministic, in benchmark order). Useful for fast smoke-runs on large benchmarks. None means score everything.

param self
param client: InferenceClient
param max_tokens: int = 512
param temperature: float = 0.0
param stop: list[str] | None = None
param prompt_builder: callable | None = None
param breakdown_keys: list[str] | None = None
param limit: int | None = None
param metrics: list[str] | None = None
param num_samples: int = 1

Returns

evsys_sdk.benchmark.BenchmarkScore

func score_via_harbor(self, *, model_name, model_path=None, model_client='tinker', workspace_dir, renderer_name=None, num_samples=1, max_tokens=512, temperature=0.0, system_prompt=None, limit=None, breakdown_keys=None, metrics=None, n_concurrent=8, agent_import_path=None, max_retries=2, _job_factory=None) -> BenchmarkScore

Score this benchmark through harbor's rollout engine - the harbor counterpart of score, returning the same BenchmarkScore.

Each task is rolled out num_samples times (one verifier reward per sample). metrics (registry names like pass@3) reduce the per-task sample rewards, and time/tokens/cost_per_task come from harbor usage. model_client is "tinker" (on-policy checkpoint, needs model_path) or "litellm" (closed/API model; model_name is a litellm string). The result's BenchmarkScore.rollouts carries the raw per-(task, sample) rollouts so callers can upload eval predictions without re-running.
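As an illustration of how a metric such as pass@3 might reduce per-task sample rewards, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over tasks. Whether the metric registry uses this exact estimator is an assumption, as is the rule that a sample passes when its reward is at least 1.0; the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for n samples of which c passed.

    This is the common estimator; it may differ from the registry's definition.
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def reduce_pass_at_k(per_task_rewards, k, pass_threshold=1.0):
    """Average pass@k over tasks, given one list of sample rewards per task.

    Assumption: a sample passes when reward >= pass_threshold.
    """
    scores = []
    for rewards in per_task_rewards:
        c = sum(r >= pass_threshold for r in rewards)
        scores.append(pass_at_k(len(rewards), c, k))
    return sum(scores) / len(scores)


# Two tasks, three samples each (num_samples=3): one task solved once, one never.
suite_score = reduce_pass_at_k([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]], k=3)
```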

param self
param model_name: str
param model_path: str | None = None
param model_client: str = 'tinker'
param workspace_dir: Path
param renderer_name: str | None = None
param num_samples: int = 1
param max_tokens: int = 512
param temperature: float = 0.0
param system_prompt: str | None = None
param limit: int | None = None
param breakdown_keys: list[str] | None = None
param metrics: list[str] | None = None
param n_concurrent: int = 8
param agent_import_path: str | None = None
param max_retries: int = 2
param _job_factory: Any | None = None

Returns

evsys_sdk.benchmark.BenchmarkScore

func __init__(self, name, tasks, metadata=dict(), root=None) -> None

param self
param name: str
param tasks: list[HarborTask]
param metadata: dict = dict()
param root: Path | None = None

Returns

None
