benchmark
Benchmark - load a harbor-format eval suite and score a model against it.
A benchmark on disk is a directory:
data/benchmark/<name>/
    tasks.jsonl      # required - one HarborTask per line
    metadata.yaml    # optional - description, version, source, splits, ...
    images/          # optional - referenced by relative path from tasks
    raw/             # optional - pre-harbor source for traceability
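The layout above can be inspected with the standard library alone. The following is a minimal sketch, not the SDK's loader: `load_task_rows` and `optional_parts` are hypothetical helper names, rows are kept as plain dicts rather than parsed into `HarborTask`, and skipping blank lines is an assumption.

```python
import json
from pathlib import Path


def load_task_rows(bench_dir: Path) -> list[dict]:
    """Read tasks.jsonl into plain dicts, one per non-empty line.

    Hypothetical helper: the real loader builds HarborTask objects instead.
    """
    tasks_path = bench_dir / "tasks.jsonl"
    if not tasks_path.is_file():
        # tasks.jsonl is the only required file in the layout.
        raise FileNotFoundError(f"missing required file: {tasks_path}")

    rows: list[dict] = []
    with tasks_path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are tolerated
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{tasks_path}:{line_no}: invalid JSON") from exc
    return rows


def optional_parts(bench_dir: Path) -> dict[str, bool]:
    """Report which of the optional entries exist in the benchmark directory."""
    return {
        "metadata.yaml": (bench_dir / "metadata.yaml").is_file(),
        "images": (bench_dir / "images").is_dir(),
        "raw": (bench_dir / "raw").is_dir(),
    }
```

Reporting the line number on a parse failure makes a malformed `tasks.jsonl` easier to repair than a bare JSON error would.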
Benchmark.from_dir(path) loads the suite. bench.score(client) runs each
task's instruction through an InferenceClient, then scores the completion
via the SDK's in-process verifier-fn registry (verifiers/fns.py). The result
collects per-task rows, top-level aggregates (mean_reward, pass_rate,
n_tasks), and optional breakdown buckets (e.g. per-toolkit pass rate) keyed
by an attribute path into metadata.
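To make the top-level aggregates concrete, here is a rough sketch of how `mean_reward`, `pass_rate`, and `n_tasks` could be derived from per-task rewards. `TaskRow` is a hypothetical stand-in for `BenchmarkTaskResult`, and the pass threshold of 1.0 is an assumption; the SDK may define a pass differently.

```python
from dataclasses import dataclass


@dataclass
class TaskRow:
    """Hypothetical stand-in for a per-task result row."""
    task_id: str
    reward: float


def aggregate(rows: list[TaskRow], pass_threshold: float = 1.0) -> dict[str, float]:
    """Compute n_tasks, mean_reward, and pass_rate over per-task rows.

    Assumption: a task passes when its reward reaches pass_threshold.
    """
    n = len(rows)
    if n == 0:
        # Avoid dividing by zero on an empty suite.
        return {"n_tasks": 0, "mean_reward": 0.0, "pass_rate": 0.0}

    mean_reward = sum(r.reward for r in rows) / n
    pass_rate = sum(1 for r in rows if r.reward >= pass_threshold) / n
    return {"n_tasks": n, "mean_reward": mean_reward, "pass_rate": pass_rate}
```

Keeping `mean_reward` and `pass_rate` separate matters when verifiers award partial credit: a suite can have a high mean reward and still a low pass rate.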
E2B and LLM-judge verifiers are recognized but not executed here - they need network / sandboxes and live behind their own runners. Tasks carrying those verifier kinds raise a clear error so callers don't silently mis-score.
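The fail-loudly behavior can be illustrated with a small guard. This is only a sketch under stated assumptions: the kind strings (`"fn"`, `"e2b"`, `"llm_judge"`), the exception class, and `check_verifier_kind` are all hypothetical names, not the SDK's own.

```python
# Hypothetical kind names; the SDK's actual identifiers may differ.
IN_PROCESS_KINDS = {"fn"}
REMOTE_KINDS = {"e2b", "llm_judge"}


class UnsupportedVerifierError(RuntimeError):
    """Raised when a task needs a verifier that cannot run in-process."""


def check_verifier_kind(task_id: str, kind: str) -> None:
    """Raise instead of returning a default reward for unscorable tasks."""
    if kind in REMOTE_KINDS:
        raise UnsupportedVerifierError(
            f"task {task_id!r}: verifier kind {kind!r} is recognized but needs "
            "its own runner (network or sandbox); it is not executed in-process"
        )
    if kind not in IN_PROCESS_KINDS:
        raise ValueError(f"task {task_id!r}: unknown verifier kind {kind!r}")
```

Raising rather than assigning a reward of 0 keeps unsupported tasks from dragging down `mean_reward` without anyone noticing.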
attribute logger = logging.getLogger(__name__)

attribute __all__ = ['Benchmark', 'BenchmarkScore', 'BenchmarkTaskResult']

func _score_task(task, completion) -> tuple[float, Any]
    param task: HarborTask
    param completion: str
    Returns: tuple[float, typing.Any]

func _read_metadata_yaml(path) -> dict
    param path: Path
    Returns: dict

func _compute_breakdowns(per_task, breakdown_keys) -> dict[str, dict[str, dict[str, float]]]
    Bucket per-task rewards by each dotted metadata key → {key: {value: {n, mean_reward, pass_rate}}}. Shared by the in-process and harbor scoring paths.
    param per_task: list[BenchmarkTaskResult]
    param breakdown_keys: list[str]
    Returns: dict[str, dict[str, dict[str, float]]]

func _dotted_get(d, dotted_key, default) -> Any
    param d: dict
    param dotted_key: str
    param default: Any
    Returns: typing.Any
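The dotted-key lookup and the breakdown bucketing can be sketched as follows. This is an illustrative reimplementation, not the module's private code: rows are plain dicts with `reward` and `metadata` keys (an assumption), the `"<missing>"` bucket label and the pass threshold of 1.0 are assumptions, and the public-style names `dotted_get` and `compute_breakdowns` are stand-ins.

```python
from collections import defaultdict
from typing import Any


def dotted_get(d: dict, dotted_key: str, default: Any = None) -> Any:
    """Walk nested dicts along a dotted path such as "toolkit.name"."""
    current: Any = d
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def compute_breakdowns(
    rows: list[dict],
    breakdown_keys: list[str],
    pass_threshold: float = 1.0,  # assumption: reward >= threshold counts as a pass
) -> dict[str, dict[str, dict[str, float]]]:
    """Bucket rewards by metadata value: {key: {value: {n, mean_reward, pass_rate}}}."""
    out: dict[str, dict[str, dict[str, float]]] = {}
    for key in breakdown_keys:
        buckets: dict[str, list[float]] = defaultdict(list)
        for row in rows:
            # Rows lacking the key land in a "<missing>" bucket (an assumption).
            value = dotted_get(row["metadata"], key, default="<missing>")
            buckets[str(value)].append(row["reward"])
        out[key] = {
            value: {
                "n": float(len(rewards)),
                "mean_reward": sum(rewards) / len(rewards),
                "pass_rate": sum(r >= pass_threshold for r in rewards) / len(rewards),
            }
            for value, rewards in buckets.items()
        }
    return out
```

For example, calling `compute_breakdowns(rows, ["toolkit.name"])` groups rows by the `name` field nested under `toolkit` in each row's metadata, which matches the per-toolkit pass rate mentioned in the module description.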