harbor_eval

Benchmark / validation evaluation through harbor's rollout engine.

Eval reuses the same engine as training: a benchmark is a set of :class:`~evsys_sdk.data_types.HarborTask`\s (instruction + verifier), so scoring it is just :func:`~evsys_sdk.training.harbor_engine.run_harbor_rollouts` over those tasks. The verifier reward is the eval score.

Unlike training rollouts, eval rollouts are uploaded to the dashboard (Supabase) with kind='eval' via :func:`upload_eval_rollouts`. (Training rollouts stay on disk in the run workspace and are never uploaded.)

The metrics / prediction builders are pure functions over :class:`TrajectoryGroup`\s: they are harbor-free and directly testable.

attribute logger

= logging.getLogger(__name__)

attribute __all__

= ['eval_metrics', 'eval_predictions', 'upload_eval_rollouts']

func eval_metrics(groups, *, metrics=None) -> dict[str, float]

Reduce per-task rollout rewards to the benchmark's declared metrics, plus per-task economics.

metrics is a list of registered metric names (e.g. ["pass@3", "pass^3", "avg"]); each is looked up via :func:`get_metric` and applied to the per-task sample rewards (one inner list per task, holding that task's num_samples rewards). n_tasks is always included; when no metrics are declared, the result defaults to mean_reward and pass_rate.

Independently, {time_per_task, tokens_per_task, cost_per_task} are added whenever harbor reported the underlying usage (cost is omitted for runs with no API price, e.g. on-policy tinker).

param groups: Sequence[TrajectoryGroup]

param metrics: Sequence[str] | None

= None

Returns

dict[str, float]
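The reduction described above can be sketched without harbor, working directly on the per-task sample rewards. This is a minimal illustration, not the SDK's implementation: sketch_eval_metrics, SKETCH_METRICS, the 1.0 pass threshold, and the definition of pass_rate as the fraction of samples with full reward are all assumptions, and the real metrics are resolved through get_metric, whose definitions may differ.

```python
from statistics import mean

PASS_THRESHOLD = 1.0  # assumption: a sample "passes" at full verifier reward


def avg(per_task):
    # Mean over tasks of each task's mean sample reward.
    return mean(mean(rewards) for rewards in per_task)


def pass_any(per_task):
    # Fraction of tasks where at least one sample passes (pass@k style).
    return mean(
        1.0 if any(r >= PASS_THRESHOLD for r in rewards) else 0.0
        for rewards in per_task
    )


def pass_all(per_task):
    # Fraction of tasks where every sample passes (pass^k style).
    return mean(
        1.0 if all(r >= PASS_THRESHOLD for r in rewards) else 0.0
        for rewards in per_task
    )


# Hypothetical stand-in for the registry behind get_metric.
SKETCH_METRICS = {"avg": avg, "pass@3": pass_any, "pass^3": pass_all}


def sketch_eval_metrics(per_task, metrics=None):
    """per_task: one non-empty inner list of sample rewards per task."""
    out = {"n_tasks": float(len(per_task))}
    if not per_task:
        return out
    if not metrics:
        # Assumed defaults: mean reward plus fraction of passing samples.
        samples = [r for rewards in per_task for r in rewards]
        out["mean_reward"] = avg(per_task)
        out["pass_rate"] = mean(
            1.0 if r >= PASS_THRESHOLD else 0.0 for r in samples
        )
        return out
    for name in metrics:
        out[name] = SKETCH_METRICS[name](per_task)
    return out


per_task = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
declared = sketch_eval_metrics(per_task, metrics=["pass@3", "pass^3", "avg"])
defaults = sketch_eval_metrics(per_task)
```

With two tasks of three samples each, only the first task has any passing sample, so the any-pass metric is 0.5 and the all-pass metric is 0.0.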

func _task_usage_means(group) -> dict[str, float | None]

Per-task mean latency / token count / cost over the group's trajectories, read from the metadata['usage'] that harbor_engine stamps on each rollout. A field is None when no trajectory reported it.

param group: TrajectoryGroup

Returns

dict[str, float | None]
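The "None when no trajectory reported it" rule can be illustrated with plain dicts standing in for trajectories. This is a sketch under stated assumptions: the usage key names ("time", "tokens", "cost") and the dict-shaped trajectories are hypothetical, since the actual layout of metadata['usage'] is not shown here.

```python
from statistics import mean

# Hypothetical usage keys; the real names stamped by harbor_engine may differ.
USAGE_FIELDS = ("time", "tokens", "cost")


def sketch_task_usage_means(trajectories):
    """Mean of each usage field over the trajectories that reported it."""
    out = {}
    for field in USAGE_FIELDS:
        values = []
        for traj in trajectories:
            usage = (traj.get("metadata") or {}).get("usage") or {}
            value = usage.get(field)
            if value is not None:
                values.append(value)
        # No trajectory reported this field: surface None, not 0.
        out[field] = mean(values) if values else None
    return out


group = [
    {"metadata": {"usage": {"time": 2.0, "tokens": 100}}},
    {"metadata": {"usage": {"time": 4.0, "tokens": 300}}},
    {"metadata": {}},  # rollout with no usage stamped
]
usage_means = sketch_task_usage_means(group)
```

Returning None rather than 0 keeps "not reported" distinguishable from "free", which matters for the cost field on runs with no API price.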

func eval_predictions(tasks, groups, *, eval_id=None, step=None) -> list[dict]

Build dashboard prediction rows (kind='eval'), one per (task, sample). Each row carries the token-level rollout and its reward for the eval.

param tasks: Sequence[HarborTask]

param groups: Sequence[TrajectoryGroup]

param eval_id: str | None

= None

param step: int | None

= None

Returns

list[dict]
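The "one row per (task, sample)" shape can be sketched as a nested loop. Everything about the row layout here is an assumption made for illustration: the field names (task_id, sample, tokens), the dict-shaped tasks and trajectories, and the pairing of tasks with groups by position are hypothetical and may not match what the dashboard expects.

```python
def sketch_eval_predictions(tasks, groups, *, eval_id=None, step=None):
    """Flatten (task, group) pairs into one row per sampled trajectory."""
    rows = []
    # Assumption: tasks[i] corresponds to groups[i].
    for task, group in zip(tasks, groups):
        for sample_idx, traj in enumerate(group):
            rows.append(
                {
                    "kind": "eval",
                    "task_id": task["id"],      # hypothetical field
                    "sample": sample_idx,       # hypothetical field
                    "tokens": traj["tokens"],   # token-level rollout
                    "reward": traj["reward"],   # verifier reward
                    "eval_id": eval_id,
                    "step": step,
                }
            )
    return rows


tasks = [{"id": "t0"}, {"id": "t1"}]
groups = [
    [{"tokens": [1, 2], "reward": 1.0}, {"tokens": [3], "reward": 0.0}],
    [{"tokens": [4, 5, 6], "reward": 0.5}],
]
rows = sketch_eval_predictions(tasks, groups, eval_id="bench-a", step=10)
```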

func upload_eval_rollouts(store, run_id, predictions) -> None

Upload eval predictions to the dashboard. Accepts either a DashboardClient (uploaded via log_predictions) or an EvsysStore (uploaded via add_prediction, once per row). No-op when store or run_id is falsy.

param store: Any

param run_id: str

param predictions: list[dict]

Returns

None
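The dual-backend behavior reads like duck-typed dispatch, which can be sketched as follows. This is only an illustration: the argument order of log_predictions and add_prediction is assumed, and FakeClient / FakeStore are hypothetical stand-ins for DashboardClient and EvsysStore, whose real signatures are not documented here.

```python
def sketch_upload_eval_rollouts(store, run_id, predictions):
    """Dispatch on whichever upload method the store exposes."""
    if not store or not run_id:
        return  # nothing to upload to, or no run to attach rows to
    if hasattr(store, "log_predictions"):
        # Client-style backend: one batched call (assumed signature).
        store.log_predictions(run_id, predictions)
    else:
        # Store-style backend: one call per row (assumed signature).
        for row in predictions:
            store.add_prediction(run_id, row)


class FakeClient:
    """Hypothetical stand-in for DashboardClient."""

    def __init__(self):
        self.batches = []

    def log_predictions(self, run_id, predictions):
        self.batches.append((run_id, list(predictions)))


class FakeStore:
    """Hypothetical stand-in for EvsysStore."""

    def __init__(self):
        self.rows = []

    def add_prediction(self, run_id, row):
        self.rows.append((run_id, row))


preds = [{"kind": "eval", "reward": 1.0}, {"kind": "eval", "reward": 0.0}]
client, row_store = FakeClient(), FakeStore()
sketch_upload_eval_rollouts(client, "run-1", preds)
sketch_upload_eval_rollouts(row_store, "run-1", preds)
sketch_upload_eval_rollouts(None, "run-1", preds)     # no-op: no store
sketch_upload_eval_rollouts(FakeStore(), "", preds)   # no-op: no run_id
```

Checking for the method rather than the class keeps the sketch independent of either backend's import path.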