EvSys

basic

Built-in benchmark metrics - reduce per-task rollout rewards to a scalar.

A benchmark scores each task by running its verifier on num_samples rollouts, which yields one list of per-sample rewards per task. A metric reduces the resulting list[list[float]] (one inner list per task, holding that task's sample rewards) to a single number. Metrics are registered with @register_metric and referenced by string name in a benchmark's metrics: list; a project can add its own metrics the same way.

Built-ins:

  • mean_reward / avg - macro mean reward (mean over tasks of each task's mean sample reward).
  • pass_rate - micro pass rate (passing samples / total samples, pooled).
  • pass@k - a task is solved if any of its first k samples passes.
  • pass^k - a task is solved only if all of its first k samples pass (consistency / "pass-hat-k").
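The four reductions listed above can be sketched in plain Python as follows. This is an illustrative reading of the descriptions, not the library's implementation. The function names are hypothetical, and three points are assumptions: a sample passes when its reward is at least PASS_THRESHOLD, tasks with no samples are skipped, and pass@k and pass^k are reported as the fraction of tasks solved.

```python
from typing import Sequence

PASS_THRESHOLD = 1.0  # value documented on this page


def passes(reward: float) -> bool:
    # Assumption: a sample passes when its reward reaches the threshold.
    return reward >= PASS_THRESHOLD


def nonempty(task_rewards: Sequence[Sequence[float]]) -> list[Sequence[float]]:
    # Assumption: tasks that produced no samples are ignored.
    return [r for r in task_rewards if len(r) > 0]


def mean_reward(task_rewards: Sequence[Sequence[float]]) -> float:
    # Macro mean: average within each task, then average the task means.
    tasks = nonempty(task_rewards)
    if not tasks:
        return 0.0
    return sum(sum(r) / len(r) for r in tasks) / len(tasks)


def pass_rate(task_rewards: Sequence[Sequence[float]]) -> float:
    # Micro rate: pool every sample from every task, then count passes.
    samples = [x for r in task_rewards for x in r]
    if not samples:
        return 0.0
    return sum(passes(x) for x in samples) / len(samples)


def pass_at_k(task_rewards: Sequence[Sequence[float]], k: int) -> float:
    # A task counts as solved if ANY of its first k samples passes.
    tasks = nonempty(task_rewards)
    if not tasks:
        return 0.0
    return sum(any(passes(x) for x in r[:k]) for r in tasks) / len(tasks)


def pass_hat_k(task_rewards: Sequence[Sequence[float]], k: int) -> float:
    # A task counts as solved only if ALL of its first k samples pass.
    tasks = nonempty(task_rewards)
    if not tasks:
        return 0.0
    return sum(all(passes(x) for x in r[:k]) for r in tasks) / len(tasks)


# Three tasks, three samples each.
rewards = [[0.5, 1.0, 1.0], [0.0, 0.5, 0.0], [1.0, 1.0, 1.0]]
print(mean_reward(rewards))    # about 0.667 (task means 5/6, 1/6, 1)
print(pass_rate(rewards))      # 5 passing samples out of 9
print(pass_at_k(rewards, 1))   # 1 of 3 tasks passes on its first sample
print(pass_at_k(rewards, 3))   # 2 of 3 tasks pass at least once
print(pass_hat_k(rewards, 3))  # 1 of 3 tasks passes every time
```

On this input the macro mean (about 0.667) and the micro pass rate (5/9) differ because partial rewards such as 0.5 contribute to the mean but do not count as passes.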

The interface is one method::

    def compute(self, task_rewards: Sequence[Sequence[float]]) -> float
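As a rough illustration of that interface, a custom metric only needs a compute method with this signature. The class below is hypothetical and is not one of the built-ins. Registration is deliberately left out: the import path and arguments of @register_metric are not shown on this page, so the decorator appears only in a comment.

```python
from statistics import median
from typing import Sequence


# In a real project this class would presumably be decorated with
# @register_metric; its exact import path and arguments are not
# documented here, so the decorator is omitted from this sketch.
class MedianTaskReward:
    """Hypothetical metric: median over tasks of each task's mean reward."""

    def compute(self, task_rewards: Sequence[Sequence[float]]) -> float:
        # Assumption: tasks with no samples are skipped rather than scored 0.
        task_means = [sum(r) / len(r) for r in task_rewards if len(r) > 0]
        if not task_means:
            return 0.0
        return float(median(task_means))


metric = MedianTaskReward()
print(metric.compute([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]))  # 0.5
```

Returning a plain float keeps the result consistent with the single-number contract described above.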

attribute PASS_THRESHOLD = 1.0

attribute __all__ = ['MeanReward', 'Avg', 'PassRate', 'PassAt1', 'PassAt3', 'PassHat3']

func _passes(reward) -> bool

  param reward: float

  Returns: bool

func _nonempty(task_rewards) -> list[Sequence[float]]

  param task_rewards: Sequence[Sequence[float]]

  Returns: list[typing.Sequence[float]]