BenchmarkScore
Aggregate output of Benchmark.score.
Attributes
attribute metrics: dict[str, float]
Top-level aggregates: mean_reward, pass_rate, n_tasks.
attribute per_task: list[BenchmarkTaskResult]
One entry per task, in the same order as Benchmark.tasks.
attribute breakdowns: dict[str, dict[str, dict[str, float]]] = field(default_factory=dict)
Shape: {bucket_field: {bucket_value: {n, mean_reward, pass_rate}}}.
Populated when breakdown_keys=[...] is passed to score(...). Each bucket
field is an attribute path into a task's metadata (e.g. "toolkit").
attribute rollouts: list['TrajectoryGroup'] = field(default_factory=list)
Raw per-(task, sample) harbor rollouts (token ids + reward + usage), in
task order; populated by score_via_harbor, empty for in-process score().
Lets callers upload per-sample eval predictions without re-running.
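The nested shape of breakdowns can be walked as in the sketch below. It uses a plain dict literal with made-up values rather than a real BenchmarkScore, and the helper name flatten_breakdowns is hypothetical, not part of the library.

```python
from typing import Dict, List, Tuple

# Illustrative data shaped like BenchmarkScore.breakdowns.
# The bucket values and numbers are invented for this example.
breakdowns: Dict[str, Dict[str, Dict[str, float]]] = {
    "toolkit": {
        "search": {"n": 2.0, "mean_reward": 0.5, "pass_rate": 0.5},
        "browser": {"n": 1.0, "mean_reward": 1.0, "pass_rate": 1.0},
    }
}


def flatten_breakdowns(
    data: Dict[str, Dict[str, Dict[str, float]]],
) -> List[Tuple[str, str, float, float, float]]:
    """Flatten {field: {value: stats}} into rows sorted by field, then value.

    Hypothetical helper written for this example only.
    """
    rows = []
    for bucket_field in sorted(data):
        for bucket_value in sorted(data[bucket_field]):
            stats = data[bucket_field][bucket_value]
            rows.append(
                (
                    bucket_field,
                    bucket_value,
                    stats["n"],
                    stats["mean_reward"],
                    stats["pass_rate"],
                )
            )
    return rows


rows = flatten_breakdowns(breakdowns)
for row in rows:
    print(row)
```

Flat rows like these are convenient for writing to a CSV file or a logging table, one line per bucket.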
Functions
func __init__(self, metrics, per_task, breakdowns=dict(), rollouts=list()) -> None
param self
param metrics: dict[str, float]
param per_task: list[BenchmarkTaskResult]
param breakdowns: dict[str, dict[str, dict[str, float]]] = dict()
param rollouts: list['TrajectoryGroup'] = list()
Returns
None
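A minimal sketch of construction, assuming BenchmarkScore is a dataclass (which the field(default_factory=...) defaults suggest). The stand-in class BenchmarkScoreSketch below is hypothetical: it mirrors only the documented fields and uses Any in place of BenchmarkTaskResult and TrajectoryGroup, whose definitions are not shown here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class BenchmarkScoreSketch:
    """Hypothetical stand-in mirroring the documented BenchmarkScore fields."""

    metrics: Dict[str, float]
    per_task: List[Any]  # stands in for list[BenchmarkTaskResult]
    breakdowns: Dict[str, Dict[str, Dict[str, float]]] = field(default_factory=dict)
    rollouts: List[Any] = field(default_factory=list)  # stands in for list['TrajectoryGroup']


# Illustrative values only.
a = BenchmarkScoreSketch(
    metrics={"mean_reward": 0.75, "pass_rate": 0.5, "n_tasks": 2.0},
    per_task=[],
)
b = BenchmarkScoreSketch(
    metrics={"mean_reward": 0.0, "pass_rate": 0.0, "n_tasks": 0.0},
    per_task=[],
)

# With default_factory, each instance gets its own empty containers,
# so mutating one instance's rollouts does not affect another's.
a.rollouts.append("placeholder-rollout")
print(a.rollouts, b.rollouts)
```

Under that assumption, omitting breakdowns and rollouts yields fresh empty containers per instance, so the dict() and list() defaults shown in the signature are not shared between objects.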