EvSys

benchmark

Benchmark - load a harbor-format eval suite and score a model against it.

A benchmark on disk is a directory:

data/benchmark/<name>/
    tasks.jsonl      # required - one HarborTask per line
    metadata.yaml    # optional - description, version, source, splits, ...
    images/          # optional - referenced by relative path from tasks
    raw/             # optional - pre-harbor source for traceability
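
As a rough illustration of the layout above, the sketch below reads a tasks.jsonl file one JSON object per line. The load_tasks_jsonl helper and the id / instruction field names are assumptions made for illustration; the actual HarborTask schema and the loader behind Benchmark.from_dir are not shown on this page.

```python
import json
import tempfile
from pathlib import Path


def load_tasks_jsonl(bench_dir: Path) -> list[dict]:
    """Hypothetical helper: parse tasks.jsonl into plain dicts, one per line."""
    tasks_path = bench_dir / "tasks.jsonl"
    if not tasks_path.is_file():
        # tasks.jsonl is the only required file in the layout.
        raise FileNotFoundError(f"missing required file: {tasks_path}")
    tasks = []
    with tasks_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                tasks.append(json.loads(line))
    return tasks


# Build a throwaway benchmark directory to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    bench_dir = Path(tmp) / "demo"
    bench_dir.mkdir()
    rows = [
        {"id": "t1", "instruction": "Say hello."},      # field names are assumed
        {"id": "t2", "instruction": "Add 2 and 2."},
    ]
    (bench_dir / "tasks.jsonl").write_text(
        "\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8"
    )
    demo_tasks = load_tasks_jsonl(bench_dir)
```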

Benchmark.from_dir(path) loads the suite. bench.score(client) runs each task's instruction through an InferenceClient, then scores the completion via the SDK's in-process verifier-fn registry (verifiers/fns.py). The result collects per-task rows, top-level aggregates (mean_reward, pass_rate, n_tasks), and optional breakdown buckets (e.g. per-toolkit pass rate) keyed by an attribute path into metadata.
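
The top-level aggregates can be pictured with a minimal sketch like the one below. It does not use the real BenchmarkScore or BenchmarkTaskResult classes: TaskRow, aggregate, and the pass threshold of 1.0 are all assumptions chosen for illustration, since this page does not state how a "pass" is defined.

```python
from dataclasses import dataclass


@dataclass
class TaskRow:
    """Hypothetical stand-in for a per-task result row."""
    task_id: str
    reward: float


def aggregate(rows: list[TaskRow], pass_threshold: float = 1.0) -> dict[str, float]:
    """Compute n_tasks, mean_reward, and pass_rate (threshold is an assumption)."""
    n = len(rows)
    if n == 0:
        # Avoid dividing by zero on an empty suite.
        return {"n_tasks": 0, "mean_reward": 0.0, "pass_rate": 0.0}
    mean_reward = sum(r.reward for r in rows) / n
    pass_rate = sum(1 for r in rows if r.reward >= pass_threshold) / n
    return {"n_tasks": n, "mean_reward": mean_reward, "pass_rate": pass_rate}


summary = aggregate(
    [TaskRow("t1", 1.0), TaskRow("t2", 0.0), TaskRow("t3", 0.5), TaskRow("t4", 1.0)]
)
```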

E2B and LLM-judge verifiers are recognized but not executed here - they need network / sandboxes and live behind their own runners. Tasks carrying those verifier kinds raise a clear error so callers don't silently mis-score.
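
The fail-loudly behavior described above might look roughly like the following sketch. Everything here is hypothetical: the registry contents, the kind strings "e2b" and "llm_judge", and the UnsupportedVerifierError class are placeholders, not the names used in verifiers/fns.py.

```python
from typing import Callable

# Hypothetical in-process registry: verifier kind -> scoring function.
IN_PROCESS_VERIFIERS: dict[str, Callable[[str, str], float]] = {
    "exact_match": lambda completion, expected: float(
        completion.strip() == expected.strip()
    ),
}

# Kinds that are recognized but need a sandbox or network (names assumed).
EXTERNAL_KINDS = {"e2b", "llm_judge"}


class UnsupportedVerifierError(RuntimeError):
    """Raised when a task's verifier cannot run in-process."""


def score_in_process(kind: str, completion: str, expected: str) -> float:
    if kind in EXTERNAL_KINDS:
        # Refuse rather than return a misleading reward of 0.0.
        raise UnsupportedVerifierError(
            f"verifier kind {kind!r} needs its own runner and is not executed here"
        )
    try:
        fn = IN_PROCESS_VERIFIERS[kind]
    except KeyError:
        raise ValueError(f"unknown verifier kind: {kind!r}") from None
    return fn(completion, expected)
```

Raising instead of assigning a zero reward keeps unsupported tasks from quietly dragging down mean_reward and pass_rate.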

attribute logger = logging.getLogger(__name__)
attribute __all__ = ['Benchmark', 'BenchmarkScore', 'BenchmarkTaskResult']
func _score_task(task, completion) -> tuple[float, Any]
param task: HarborTask
param completion: str

Returns

tuple[float, typing.Any]
func _read_metadata_yaml(path) -> dict
param path: Path

Returns

dict
func _compute_breakdowns(per_task, breakdown_keys) -> dict[str, dict[str, dict[str, float]]]

Bucket per-task rewards by each dotted metadata key → {key: {value: {n, mean_reward, pass_rate}}}. Shared by the in-process and harbor scoring paths.

param per_task: list[BenchmarkTaskResult]
param breakdown_keys: list[str]

Returns

dict[str, dict[str, dict[str, float]]]
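
A minimal sketch of this kind of bucketing follows, under stated assumptions: rows are plain dicts with reward and metadata keys rather than BenchmarkTaskResult objects, a pass means reward >= 1.0, and tasks missing the key fall into an "unknown" bucket. None of these choices is confirmed by this page; the dotted lookup is likewise an illustrative guess at what a helper such as _dotted_get does.

```python
from collections import defaultdict
from typing import Any


def dotted_get(d: dict, dotted_key: str, default: Any = None) -> Any:
    """Walk nested dicts along 'a.b.c'; return default if any step is missing."""
    cur: Any = d
    for part in dotted_key.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return default
        cur = cur[part]
    return cur


def compute_breakdowns(
    rows: list[dict], breakdown_keys: list[str], pass_threshold: float = 1.0
) -> dict[str, dict[str, dict[str, float]]]:
    """Group rewards by the value found at each dotted key, then summarize."""
    out: dict[str, dict[str, dict[str, float]]] = {}
    for key in breakdown_keys:
        buckets: dict[str, list[float]] = defaultdict(list)
        for row in rows:
            value = dotted_get(row["metadata"], key, "unknown")  # assumed fallback
            buckets[str(value)].append(row["reward"])
        out[key] = {
            value: {
                "n": float(len(rewards)),
                "mean_reward": sum(rewards) / len(rewards),
                "pass_rate": sum(r >= pass_threshold for r in rewards) / len(rewards),
            }
            for value, rewards in buckets.items()
        }
    return out


rows = [
    {"reward": 1.0, "metadata": {"toolkit": {"name": "git"}}},
    {"reward": 0.0, "metadata": {"toolkit": {"name": "git"}}},
    {"reward": 1.0, "metadata": {"toolkit": {"name": "http"}}},
]
breakdowns = compute_breakdowns(rows, ["toolkit.name"])
```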
func _dotted_get(d, dotted_key, default) -> Any
param d: dict
param dotted_key: str
param default: Any

Returns

typing.Any