harbor_eval
Benchmark / validation evaluation through harbor's rollout engine.

Eval reuses the same engine as training: a benchmark is a set of
evsys_sdk.data_types.HarborTask objects (instruction + verifier), so
scoring it amounts to running evsys_sdk.training.harbor_engine.run_harbor_rollouts
over those tasks; the verifier reward is the eval score.

Unlike training, eval rollouts are uploaded to the dashboard (Supabase)
with kind='eval' via upload_eval_rollouts. (Training rollouts stay
on disk in the run workspace and are never uploaded.)

The metrics and prediction builders are pure functions over
TrajectoryGroup objects: harbor-free and directly testable.

attribute logger = logging.getLogger(__name__)
attribute __all__ = ['eval_metrics', 'eval_predictions', 'upload_eval_rollouts']

func eval_metrics(groups, *, metrics=None) -> dict[str, float]
Reduce per-task rollout rewards to the benchmark's declared metrics, plus per-task economics.
metrics is a list of registered metric names (e.g. ["pass@3", "pass^3", "avg"]);
each is looked up via get_metric and applied to the per-task sample rewards
(one inner list per task, holding that task's num_samples rewards). n_tasks
is always included; when no metrics are declared, the result defaults to
mean_reward + pass_rate.

Independently, {time_per_task, tokens_per_task, cost_per_task} are added
whenever harbor reported the underlying usage (cost is omitted for runs with
no API price, e.g. on-policy tinker).
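
The reduction described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: the REGISTRY dict stands in for whatever get_metric consults, the metric definitions (a sample "passes" when its reward reaches a threshold of 1.0) are assumptions, and reduce_metrics is a hypothetical name. The default path (mean_reward + pass_rate) and the usage fields are not modeled.

```python
from statistics import mean
from typing import Callable, Sequence

# One inner sequence per task, holding that task's sample rewards.
Rewards = Sequence[Sequence[float]]


def pass_at_k(rewards: Rewards, k: int, threshold: float = 1.0) -> float:
    """Fraction of tasks where at least one of the first k samples passes."""
    return mean(
        1.0 if any(r >= threshold for r in task[:k]) else 0.0 for task in rewards
    )


def pass_hat_k(rewards: Rewards, k: int, threshold: float = 1.0) -> float:
    """Fraction of tasks where all of the first k samples pass."""
    return mean(
        1.0 if all(r >= threshold for r in task[:k]) else 0.0 for task in rewards
    )


def avg(rewards: Rewards) -> float:
    """Mean reward over every sample of every task."""
    return mean(r for task in rewards for r in task)


# Hypothetical stand-in for the registry behind get_metric.
REGISTRY: dict[str, Callable[[Rewards], float]] = {
    "pass@3": lambda r: pass_at_k(r, 3),
    "pass^3": lambda r: pass_hat_k(r, 3),
    "avg": avg,
}


def reduce_metrics(rewards: Rewards, metrics: Sequence[str]) -> dict[str, float]:
    """Apply each declared metric to the per-task rewards; n_tasks is always set."""
    out: dict[str, float] = {"n_tasks": float(len(rewards))}
    for name in metrics:
        out[name] = REGISTRY[name](rewards)
    return out
```

Keeping the reducers as plain functions over nested reward lists is what makes them testable without harbor: a test only needs to construct the lists by hand.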

param groups: Sequence[TrajectoryGroup]
param metrics: Sequence[str] | None = None
Returns: dict[str, float]

func _task_usage_means(group) -> dict[str, float | None]
Per-task mean latency / token count / cost over the group's
trajectories, reading the metadata['usage'] harbor_engine stamps on
each rollout. A field is None when no trajectory reported it.
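
A minimal sketch of the "None when no trajectory reported it" rule, using plain dicts in place of real trajectories. The usage key names (latency_s, tokens, cost_usd) and the function name usage_means are hypothetical; the source only states that usage lives under metadata['usage'].

```python
from statistics import mean
from typing import Optional

# Hypothetical field names; the real keys inside metadata['usage'] may differ.
USAGE_FIELDS = ("latency_s", "tokens", "cost_usd")


def usage_means(trajectories: list[dict]) -> dict[str, Optional[float]]:
    """Average each usage field over the trajectories that reported it."""
    out: dict[str, Optional[float]] = {}
    for field in USAGE_FIELDS:
        values = []
        for traj in trajectories:
            usage = traj.get("metadata", {}).get("usage") or {}
            value = usage.get(field)
            if value is not None:
                values.append(value)
        # A field nobody reported stays None rather than becoming 0.
        out[field] = mean(values) if values else None
    return out
```

Returning None instead of 0 lets a caller drop the field entirely, which matches the note that cost is omitted for runs with no API price.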

param group: TrajectoryGroup
Returns: dict[str, float | None]

func eval_predictions(tasks, groups, *, eval_id=None, step=None) -> list[dict]
Build dashboard prediction rows (kind='eval') - one per
(task, sample). Carries the token-level rollout + reward for the eval.
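
The one-row-per-(task, sample) shape might look like the sketch below. Only kind='eval', eval_id, and step come from the text above; the other row keys (task_id, sample, reward) and the build_rows helper are assumptions for illustration, and the token-level rollout payload is left out.

```python
from typing import Any, Optional, Sequence


def build_rows(
    task_ids: Sequence[str],
    sample_rewards: Sequence[Sequence[float]],
    *,
    eval_id: Optional[str] = None,
    step: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Flatten per-task sample rewards into one row per (task, sample)."""
    rows: list[dict[str, Any]] = []
    for task_id, rewards in zip(task_ids, sample_rewards):
        for sample_idx, reward in enumerate(rewards):
            rows.append(
                {
                    "kind": "eval",        # stated in the docs above
                    "task_id": task_id,    # hypothetical key
                    "sample": sample_idx,  # hypothetical key
                    "reward": reward,      # hypothetical key
                    "eval_id": eval_id,
                    "step": step,
                }
            )
    return rows
```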

param tasks: Sequence[HarborTask]
param groups: Sequence[TrajectoryGroup]
param eval_id: str | None = None
param step: int | None = None
Returns: list[dict]

func upload_eval_rollouts(store, run_id, predictions) -> None
Upload eval predictions to the dashboard. Accepts either a
DashboardClient (log_predictions) or an EvsysStore
(add_prediction per row). No-op when store/run_id is falsy.

param store: Any
param run_id: str
param predictions: list[dict]
Returns: None
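
The dispatch between the two store types could be sketched with duck typing, as below. The argument order of log_predictions and add_prediction is assumed, not taken from the real DashboardClient or EvsysStore, and FakeClient / FakeStore are hypothetical stand-ins that only record calls.

```python
from typing import Any


class FakeClient:
    """Stand-in for a client exposing a batch log_predictions method."""

    def __init__(self) -> None:
        self.batches: list[tuple[str, list[dict]]] = []

    def log_predictions(self, run_id: str, predictions: list[dict]) -> None:
        self.batches.append((run_id, predictions))


class FakeStore:
    """Stand-in for a store exposing a per-row add_prediction method."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, dict]] = []

    def add_prediction(self, run_id: str, row: dict) -> None:
        self.rows.append((run_id, row))


def upload(store: Any, run_id: str, predictions: list[dict]) -> None:
    """Send predictions in one batch if possible, else row by row."""
    if not store or not run_id:
        return  # no-op when either is falsy
    if hasattr(store, "log_predictions"):
        store.log_predictions(run_id, predictions)
    else:
        for row in predictions:
            store.add_prediction(run_id, row)
```

Checking for the batch method first means a client that supports both paths makes a single call rather than one call per row.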