basic
Built-in benchmark metrics - reduce per-task rollout rewards to a scalar.
A benchmark scores each task by running its verifier on num_samples rollouts,
yielding a list of per-sample rewards per task. A metric reduces that
list[list[float]] (one inner list per task, holding that task's sample
rewards) to a single number. Metrics are referenced by string name in a
benchmark's metrics: list and registered with @register_metric; register
your own the same way in your project.
Built-ins:
- mean_reward / avg - macro mean reward (mean over tasks of each task's mean sample reward).
- pass_rate - micro pass rate (passing samples / total samples, pooled).
- pass@k - a task is solved if any of its first k samples passes.
- pass^k - a task is solved only if all of its first k samples pass (consistency / "pass-hat-k").
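The four reductions can be sketched as plain functions to make the macro/micro and any/all distinctions concrete. This is an illustrative sketch, not the module's implementation: the function names are hypothetical, and it assumes that a sample passes when its reward is at least PASS_THRESHOLD and that pass@k and pass^k report the fraction of solved tasks.

```python
from typing import Sequence

PASS_THRESHOLD = 1.0  # value listed by the module; the ">=" test below is an assumption

Rewards = Sequence[Sequence[float]]


def _passed(reward: float) -> bool:
    # Assumption: a sample passes when it reaches the threshold.
    return reward >= PASS_THRESHOLD


def mean_reward(task_rewards: Rewards) -> float:
    # Macro mean: average each task first, then average the task means.
    per_task = [sum(task) / len(task) for task in task_rewards]
    return sum(per_task) / len(per_task)


def pass_rate(task_rewards: Rewards) -> float:
    # Micro rate: pool every sample from every task, then count passes.
    pooled = [r for task in task_rewards for r in task]
    return sum(_passed(r) for r in pooled) / len(pooled)


def pass_at_k(task_rewards: Rewards, k: int) -> float:
    # A task is solved if ANY of its first k samples passes.
    solved = [any(_passed(r) for r in task[:k]) for task in task_rewards]
    return sum(solved) / len(solved)


def pass_hat_k(task_rewards: Rewards, k: int) -> float:
    # A task is solved only if ALL of its first k samples pass.
    solved = [all(_passed(r) for r in task[:k]) for task in task_rewards]
    return sum(solved) / len(solved)


# Three tasks, three samples each (num_samples = 3).
rewards = [
    [1.0, 1.0, 1.0],  # always passes
    [0.0, 1.0, 0.5],  # passes once; 0.5 is partial credit, not a pass
    [0.0, 0.0, 0.0],  # never passes
]

print(mean_reward(rewards))    # 0.5
print(pass_rate(rewards))      # 4 passing samples out of 9
print(pass_at_k(rewards, 1))   # 1 of 3 tasks
print(pass_at_k(rewards, 3))   # 2 of 3 tasks
print(pass_hat_k(rewards, 3))  # 1 of 3 tasks
```

The second task shows why the metrics disagree: its partial-credit sample raises mean_reward but not pass_rate, and it counts as solved under pass@3 but not under pass^3.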
The interface is one method::
    def compute(self, task_rewards: Sequence[Sequence[float]]) -> float
attribute PASS_THRESHOLD = 1.0
attribute __all__ = ['MeanReward', 'Avg', 'PassRate', 'PassAt1', 'PassAt3', 'PassHat3']