runner

High-level eval runner: loads a dataset, runs the model eval, and computes a summary.

Generic, domain-agnostic. Project-specific eval harnesses (e.g. an API search eval) live in their own repos and reuse this infra (score_rows, AliasMatcher, load_eval_dataset, …).

EvalArtifacts

func load_eval_dataset(path) -> list[dict[str, Any]]

Load an eval JSON file. Supports both:

a list of rows [{tool_slug, toolkit, queries}, ...] (v2 shape), or
a dict with results: [...] (older 3-query shape).

param path: str | Path

Returns

list[dict[str, typing.Any]]
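
The two accepted shapes can be illustrated with a minimal sketch. This is not the library's implementation: `load_rows_sketch` is a hypothetical stand-in, the row values are made up, and the sketch assumes the older shape keeps its rows under a top-level `results` key as described above. It does not attempt to convert row fields between the two shapes.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Union


def load_rows_sketch(path: Union[str, Path]) -> list[dict[str, Any]]:
    """Hypothetical stand-in for load_eval_dataset: return rows from either shape."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, list):
        # v2 shape: the file is already a list of rows.
        return data
    if isinstance(data, dict) and isinstance(data.get("results"), list):
        # Older shape: rows are nested under the "results" key.
        return data["results"]
    raise ValueError(f"Unrecognized eval dataset shape in {path}")


# Made-up row for illustration; only the field names come from the docs above.
row = {"tool_slug": "example_tool", "toolkit": "example_toolkit", "queries": ["q1"]}

with tempfile.TemporaryDirectory() as tmp:
    v2_file = Path(tmp) / "v2.json"
    old_file = Path(tmp) / "old.json"
    v2_file.write_text(json.dumps([row]), encoding="utf-8")
    old_file.write_text(json.dumps({"results": [row]}), encoding="utf-8")
    rows_v2 = load_rows_sketch(v2_file)
    rows_old = load_rows_sketch(old_file)

# Both files yield the same list of rows.
print(rows_v2 == rows_old)
```

Raising on an unrecognized shape is a choice made for this sketch; the real loader's error behavior is not documented here.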

func evaluate_model(*, dataset_path, aliases_path, client, secondary_aliases_path=None, config=None, output_dir=None, progress=True) -> EvalArtifacts

param dataset_path: str | Path

param aliases_path: str | Path

param client: InferenceClient

param secondary_aliases_path: str | Path | None = None

param config: ModelEvalConfig | None = None

param output_dir: str | Path | None = None

param progress: bool = True

Returns

evsys_sdk.eval.runner.EvalArtifacts
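
The leading `*` in the signature makes every argument keyword-only. The sketch below shows what that means at the call site using a local stand-in, `evaluate_model_stub`, which copies the documented parameter names but only echoes its arguments. The file names and the `object()` client are placeholders; the real function expects an `InferenceClient` and returns `EvalArtifacts`.

```python
from pathlib import Path
from typing import Any, Optional, Union

PathLike = Union[str, Path]


def evaluate_model_stub(
    *,
    dataset_path: PathLike,
    aliases_path: PathLike,
    client: Any,
    secondary_aliases_path: Optional[PathLike] = None,
    config: Any = None,
    output_dir: Optional[PathLike] = None,
    progress: bool = True,
) -> dict[str, Any]:
    """Stand-in with the documented parameter names; it runs no evaluation."""
    return {
        "dataset_path": dataset_path,
        "aliases_path": aliases_path,
        "client": client,
        "secondary_aliases_path": secondary_aliases_path,
        "config": config,
        "output_dir": output_dir,
        "progress": progress,
    }


# Keyword arguments are required; optional ones fall back to their defaults.
call = evaluate_model_stub(
    dataset_path="eval.json",     # hypothetical file name
    aliases_path="aliases.json",  # hypothetical file name
    client=object(),              # placeholder for an InferenceClient
    progress=False,
)

# Passing the same values positionally is rejected by Python itself.
positional_error = None
try:
    evaluate_model_stub("eval.json", "aliases.json", object())
except TypeError as exc:
    positional_error = exc
```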

func _write_artifacts(artifacts, output_dir, *, kind) -> None

param artifacts: EvalArtifacts

param output_dir: str | Path

param kind: str

Returns

None