EvSys

BenchmarkScore

Aggregate output of Benchmark.score.

Attributes

attribute metrics: dict[str, float]

Top-level aggregates: mean_reward, pass_rate, n_tasks.

attribute per_task: list[BenchmarkTaskResult]

One entry per task in the same order as Benchmark.tasks.

attribute breakdowns: dict[str, dict[str, dict[str, float]]] = field(default_factory=dict)

Nested mapping of the form {bucket_field: {bucket_value: {n, mean_reward, pass_rate}}}.

Populated when breakdown_keys=[...] is passed to score(...). Each bucket field is an attribute path into a task's metadata (e.g. "toolkit").
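A minimal sketch of how a mapping of this shape could be assembled, assuming each per-task result carries a reward, a pass flag, and a metadata dict. The TaskRow class, its field names, and the compute_breakdowns helper are hypothetical stand-ins for illustration and are not part of the documented API. The sketch also treats each bucket field as a flat metadata key rather than a nested path.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TaskRow:
    """Hypothetical stand-in for a per-task result (not the real BenchmarkTaskResult)."""
    reward: float
    passed: bool
    metadata: dict = field(default_factory=dict)


def compute_breakdowns(rows, breakdown_keys):
    """Group rows by each metadata key and aggregate n, mean_reward, pass_rate."""
    breakdowns = {}
    for key in breakdown_keys:
        buckets = defaultdict(list)
        for row in rows:
            if key in row.metadata:  # skip tasks that lack this field
                buckets[str(row.metadata[key])].append(row)
        breakdowns[key] = {
            value: {
                "n": float(len(group)),
                "mean_reward": sum(r.reward for r in group) / len(group),
                "pass_rate": sum(r.passed for r in group) / len(group),
            }
            for value, group in buckets.items()
        }
    return breakdowns


rows = [
    TaskRow(1.0, True, {"toolkit": "browser"}),
    TaskRow(0.0, False, {"toolkit": "browser"}),
    TaskRow(0.5, True, {"toolkit": "shell"}),
]
result = compute_breakdowns(rows, ["toolkit"])
print(result["toolkit"]["browser"])
# {'n': 2.0, 'mean_reward': 0.5, 'pass_rate': 0.5}
```

Storing n as a float keeps every leaf value consistent with the documented dict[str, float] type.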

attribute rollouts: list['TrajectoryGroup'] = field(default_factory=list)

Raw per-(task, sample) harbor rollouts (token ids, reward, and usage), in task order. Populated by score_via_harbor; empty for in-process score(). Lets callers upload per-sample eval predictions without re-running the evaluation.

Functions

func __init__(self, metrics, per_task, breakdowns=dict(), rollouts=list()) -> None
param self
param metrics: dict[str, float]
param per_task: list[BenchmarkTaskResult]
param breakdowns: dict[str, dict[str, dict[str, float]]] = dict()
param rollouts: list['TrajectoryGroup'] = list()

Returns

None
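The field(default_factory=...) defaults listed under Attributes suggest a dataclass-style constructor. The sketch below mirrors the documented fields in a hypothetical stand-in class, BenchmarkScoreSketch, which is not the library class. It shows that only metrics and per_task are required and that, with default factories, each instance gets its own empty breakdowns and rollouts containers.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkScoreSketch:
    """Hypothetical mirror of the documented fields; not the library class."""
    metrics: dict[str, float]
    per_task: list
    breakdowns: dict[str, dict[str, dict[str, float]]] = field(default_factory=dict)
    rollouts: list = field(default_factory=list)


# Only the two required fields are supplied; the rest fall back to empty containers.
a = BenchmarkScoreSketch(
    metrics={"mean_reward": 0.5, "pass_rate": 0.5, "n_tasks": 2.0}, per_task=[]
)
b = BenchmarkScoreSketch(
    metrics={"mean_reward": 1.0, "pass_rate": 1.0, "n_tasks": 1.0}, per_task=[]
)

# Mutating one instance's breakdowns does not leak into the other.
a.breakdowns["toolkit"] = {"browser": {"n": 2.0, "mean_reward": 0.5, "pass_rate": 0.5}}
print(b.breakdowns)              # {}
print(a.rollouts is b.rollouts)  # False
```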
