EvSys

harbor_engine

Harbor rollout runner - hands rollouts to harbor's Job engine (0.13.2).

Import-safe without the harbor package: all harbor imports are lazy (inside the runners), and the agent / environment classes are referenced only by string import path (they live in :mod:evsys_sdk.training.harbor_agents, which harbor loads at trial runtime). So rl / sdft can import this module, and tests can mock the runners, with no harbor install.

Flow (harbor 0.13.2): a producer's adapter (:class:HarborTaskAdapter for scored rollouts, :class:PromptAdapter for generation) writes each task dir (instruction.md + task.toml, plus an evsys_verifier.json spec and a dummy tests/test.sh when scored) and returns harbor-native TaskConfigs → :func:run_harbor_rollouts builds a JobConfig over those TaskConfigs × one agent, with n_attempts = num_samples → Job.run() → harvest each trial's agent_result (rollout_details + completion + token/cost usage) and verifier_result (reward) into a :class:Trajectory.

The reward is produced by harbor running our :class:~evsys_sdk.training.harbor_agents.EvsysVerifier (the job-level verifier) host-side, with no container: it applies the task's registered verifier function to the completion the agent wrote. SHARED verifier mode (the default) keeps it in the agent's no-op environment; the dummy tests/test.sh only satisfies harbor's task-load check and is never executed. (Generation-only rollouts disable the verifier and use environment_mode="separate", so no test.sh is needed.)
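The scored task-dir layout described above can be sketched as follows. This is a minimal illustration, not the adapter's code: the helper name `write_scored_task_dir` is hypothetical, and the contents of task.toml and evsys_verifier.json are placeholders, since their real schemas are not shown here.

```python
import json
import tempfile
from pathlib import Path


def write_scored_task_dir(root: Path, task_id: str, instruction: str) -> Path:
    """Write an illustrative scored task dir; all file contents are placeholders."""
    task_dir = root / task_id
    (task_dir / "tests").mkdir(parents=True, exist_ok=True)
    (task_dir / "instruction.md").write_text(instruction, encoding="utf-8")
    # Placeholder only: the real task.toml schema belongs to harbor.
    (task_dir / "task.toml").write_text("# harbor task config goes here\n", encoding="utf-8")
    # Hypothetical spec shape; the real evsys_verifier.json fields are not shown above.
    spec = {"task_id": task_id}
    (task_dir / "evsys_verifier.json").write_text(json.dumps(spec), encoding="utf-8")
    # Dummy script: present to satisfy the task-load check, never executed.
    (task_dir / "tests" / "test.sh").write_text("#!/bin/sh\nexit 0\n", encoding="utf-8")
    return task_dir


with tempfile.TemporaryDirectory() as tmp:
    task_dir = write_scored_task_dir(Path(tmp), "demo_task", "Say hello.")
    written = sorted(
        p.relative_to(task_dir).as_posix() for p in task_dir.rglob("*") if p.is_file()
    )
```

A generation-only dir would presumably drop the last two files, since those rollouts disable the verifier.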

attribute __all__ = ['HarborTaskAdapter', 'PromptAdapter', 'run_harbor_rollouts']
func _agent_import_and_kwargs(model_client, *, agent_import_path, model_name, model_path, renderer_name, max_tokens, temperature, max_turns, system_prompt) -> tuple[str, dict[str, Any]]

Pick the harbor agent + its kwargs for a rollout. Pure + harbor-free so the agent-selection logic is unit-testable.

An explicit agent_import_path wins (fully self-configured agent, no kwargs). Otherwise it's always :class:BasicLoopAgent, parameterized by model_client: "tinker" (on-policy TinkerLLM, needs model_path) or "litellm" (closed/API model; model_name is a litellm string, the tinker-only model_path/renderer_name are ignored).

param model_client: str
param agent_import_path: str | None
param model_name: str
param model_path: str | None
param renderer_name: str | None
param max_tokens: int
param temperature: float
param max_turns: int
param system_prompt: str | None

Returns

tuple[str, dict[str, typing.Any]]
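The selection rule for _agent_import_and_kwargs can be sketched like this. It is an illustration under assumptions, not the module's code: the `BASIC_LOOP_AGENT` path string, the kwarg names, and the error handling are hypothetical; only the branching (explicit path wins, otherwise BasicLoopAgent keyed on model_client) comes from the description above.

```python
from typing import Any, Optional

# Hypothetical import-path string; the text only says the agent classes
# live in evsys_sdk.training.harbor_agents.
BASIC_LOOP_AGENT = "evsys_sdk.training.harbor_agents:BasicLoopAgent"


def pick_agent(
    model_client: str,
    *,
    agent_import_path: Optional[str],
    model_name: str,
    model_path: Optional[str],
    renderer_name: Optional[str],
    max_tokens: int,
    temperature: float,
    max_turns: int,
    system_prompt: Optional[str],
) -> tuple[str, dict[str, Any]]:
    # An explicit agent wins and is fully self-configured: no kwargs.
    if agent_import_path is not None:
        return agent_import_path, {}
    # Kwarg names below are assumptions made for this sketch.
    kwargs: dict[str, Any] = {
        "model_client": model_client,
        "model_name": model_name,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "max_turns": max_turns,
        "system_prompt": system_prompt,
    }
    if model_client == "tinker":
        if model_path is None:
            raise ValueError("model_client='tinker' needs model_path")
        kwargs["model_path"] = model_path
        kwargs["renderer_name"] = renderer_name
    elif model_client != "litellm":
        raise ValueError(f"unknown model_client: {model_client!r}")
    # For "litellm", the tinker-only model_path / renderer_name are ignored.
    return BASIC_LOOP_AGENT, kwargs


common = dict(max_tokens=512, temperature=1.0, max_turns=1, system_prompt=None)
explicit = pick_agent("tinker", agent_import_path="pkg.mod:MyAgent", model_name="m",
                      model_path=None, renderer_name=None, **common)
api = pick_agent("litellm", agent_import_path=None, model_name="provider/model",
                 model_path="ignored", renderer_name="ignored", **common)
```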
func _to_agent_config(AgentConfig, import_path, kwargs) -> Any

Map (import_path, agent kwargs) → a harbor AgentConfig.

harbor 0.13.2 passes model_name to the agent constructor from the top-level AgentConfig.model_name field, so it must NOT also live in kwargs (else the agent gets model_name twice). Lift it out here so the agent-selection logic above can stay a flat kwargs dict.

param AgentConfig: Any
param import_path: str
param kwargs: dict[str, Any]

Returns

typing.Any
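The model_name lift described for _to_agent_config can be sketched as below. `FakeAgentConfig` is a stand-in written for this example, and its `import_path` and `kwargs` field names are assumptions; only the top-level `model_name` field is confirmed by the text above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FakeAgentConfig:
    """Stand-in for harbor's AgentConfig (field names partly assumed)."""
    import_path: Optional[str] = None
    model_name: Optional[str] = None
    kwargs: dict[str, Any] = field(default_factory=dict)


def to_agent_config(agent_config_cls: Any, import_path: str, kwargs: dict[str, Any]) -> Any:
    agent_kwargs = dict(kwargs)  # copy so the caller's dict is left untouched
    # Lift model_name to the top-level field so the agent does not receive it twice.
    model_name = agent_kwargs.pop("model_name", None)
    return agent_config_cls(import_path=import_path, model_name=model_name, kwargs=agent_kwargs)


flat = {"model_name": "provider/model", "max_turns": 1}
cfg = to_agent_config(FakeAgentConfig, "pkg.mod:MyAgent", flat)
```

Passing the config class in as an argument, as the documented signature does, keeps the function free of harbor imports.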
func run_harbor_rollouts(items, *, outcome_reward=True, model_name, model_path, workspace_dir, model_client='tinker', renderer_name=None, num_samples=1, max_turns=1, max_tokens=512, temperature=1.0, system_prompt=None, agent_import_path=None, n_concurrent=4, max_retries=2, _job_factory=None) -> list[TrajectoryGroup]

Roll out items (× num_samples) through harbor's Job engine - one :class:TrajectoryGroup per item, in order.

outcome_reward is the agent-meaningful knob: is the rollout scored by an outcome verifier? The runner is adapter-aware: it runs the matching adapter to write the task dirs (the step materialize_task used to perform), so no caller ever touches an adapter:

  • outcome_reward=True (default) - items are :class:HarborTasks; :class:HarborTaskAdapter writes scored task dirs and the host-side :class:EvsysVerifier produces each outcome reward. (RL + benchmark eval.)
  • outcome_reward=False - items are prompt strings; :class:PromptAdapter writes generation-only dirs (no verifier, reward=0). (SDFT students.)

model_client - "tinker" (on-policy TinkerLLM, needs model_path) or "litellm" (closed/API model; model_name is a litellm string, e.g. "anthropic/claude-opus-4-1").

_job_factory is the test seam: async (job_config) -> job_result. When None, harbor is imported and Job.create(...).run() is used.

param items: Sequence[Any]
param outcome_reward: bool = True
param model_name: str
param model_path: str | None
param workspace_dir: Path
param model_client: str = 'tinker'
param renderer_name: str | None = None
param num_samples: int = 1
param max_turns: int = 1
param max_tokens: int = 512
param temperature: float = 1.0
param system_prompt: str | None = None
param agent_import_path: str | None = None
param n_concurrent: int = 4
param max_retries: int = 2
param _job_factory: Any | None = None

Returns

list[evsys_sdk.training.trajectory.TrajectoryGroup]
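The _job_factory seam described above, async (job_config) -> job_result, can be exercised with a small test double. This sketch does not call run_harbor_rollouts itself: `run_job_with_seam` is a hypothetical stand-in for the runner's dispatch, and the fake result's `config` and `trial_results` attributes are assumptions.

```python
import asyncio
from types import SimpleNamespace


async def fake_job_factory(job_config):
    """Test double for the seam: async (job_config) -> job_result."""
    # Hypothetical result shape; a real harbor job result carries more fields.
    return SimpleNamespace(config=job_config, trial_results=[])


async def run_job_with_seam(job_config, _job_factory=None):
    """Stand-in for the runner's dispatch, not the real implementation."""
    if _job_factory is not None:
        return await _job_factory(job_config)
    # The real path imports harbor lazily and uses Job.create(...).run();
    # it is left out so this sketch runs without harbor installed.
    raise NotImplementedError("harbor path omitted in this sketch")


result = asyncio.run(run_job_with_seam({"n_attempts": 2}, _job_factory=fake_job_factory))
```

Because the factory receives the built job config, a test can assert on what the runner would have submitted without harbor being installed.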
func _run_job(Job, config) -> Any
param Job: Any
param config: Any

Returns

typing.Any
func _trials_by_task(job_result) -> dict[str, list[Any]]

Group a job's trial results by task_name (the materialized dir's basename = _safe(task_id)). n_attempts trials share a task_name.

param job_result: Any

Returns

dict[str, list[typing.Any]]
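Grouping trials by task_name, as _trials_by_task is described, might look like the sketch below. The `trial_results` attribute on the job result is an assumption made for this example; `task_name` comes from the description above.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any


def trials_by_task(job_result: Any) -> dict[str, list[Any]]:
    grouped: dict[str, list[Any]] = defaultdict(list)
    # `trial_results` is an assumed attribute name for the job's trial list.
    for trial in job_result.trial_results:
        grouped[trial.task_name].append(trial)
    return dict(grouped)


# Two attempts of task "a" and one of task "b" (n_attempts trials share a task_name).
job_result = SimpleNamespace(
    trial_results=[
        SimpleNamespace(task_name="a", attempt=0),
        SimpleNamespace(task_name="b", attempt=0),
        SimpleNamespace(task_name="a", attempt=1),
    ]
)
groups = trials_by_task(job_result)
```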
func _harvest(job_result, task_configs) -> list[TrajectoryGroup]
param job_result: Any
param task_configs: Sequence[Any]

Returns

list[evsys_sdk.training.trajectory.TrajectoryGroup]
func _trial_to_trajectory(tr) -> Trajectory | None

Convert a harbor TrialResult → our multi-turn :class:Trajectory, reading the rollout off agent_result (AgentContext).

Token-level turns come from rollout_details (tinker on-policy rollouts). Closed/API models (litellm) return no token ids, so for an eval trial - one that produced a verifier reward - we still build a token-less Trajectory carrying the reward + usage so it isn't dropped from scoring. Errored trials, and generation-only trials with neither tokens nor a reward, return None.

param tr: Any

Returns

evsys_sdk.training.trajectory.Trajectory | None
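The keep-or-drop rule for a trial can be summarized as a small decision function. This sketch only classifies a trial rather than building a Trajectory, and the attribute names `exception_info` and `reward` are assumptions; `agent_result`, `rollout_details`, and `verifier_result` are the names used in the description above.

```python
from types import SimpleNamespace
from typing import Any, Optional


def classify_trial(tr: Any) -> Optional[str]:
    """Return which kind of Trajectory the trial would yield, or None if dropped."""
    # Assumed attribute: how an errored trial is flagged is not shown above.
    if getattr(tr, "exception_info", None) is not None:
        return None
    agent_result = getattr(tr, "agent_result", None)
    verifier_result = getattr(tr, "verifier_result", None)
    if getattr(agent_result, "rollout_details", None):
        return "token-level"  # tinker on-policy rollout with token ids
    if verifier_result is not None:
        return "token-less"  # litellm eval trial: keep reward + usage
    return None  # generation-only trial with neither tokens nor a reward


tinker_trial = SimpleNamespace(
    agent_result=SimpleNamespace(rollout_details=[{"tokens": [1, 2]}]),
    verifier_result=SimpleNamespace(reward=1.0),
)
api_eval_trial = SimpleNamespace(
    agent_result=SimpleNamespace(rollout_details=None),
    verifier_result=SimpleNamespace(reward=0.0),
)
api_generation_trial = SimpleNamespace(
    agent_result=SimpleNamespace(rollout_details=None),
    verifier_result=None,
)
errored_trial = SimpleNamespace(exception_info="boom", agent_result=None, verifier_result=None)
```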
func _trial_usage(tr) -> dict[str, Any]

Pull harbor's native cost / token / timing info off a trial result.

Harbor records cost_usd + token counts on agent_result (AgentContext) and per-phase wall-clock timing on the trial (agent_execution, whole-trial span as fallback). Any field harbor didn't populate stays None - on-policy tinker has no API cost_usd, and the caller backfills token counts from the turns. Pure + harbor-free.

param tr: Any

Returns

dict[str, typing.Any]
func _phase_seconds(phase) -> float | None

Wall-clock seconds for a harbor timing phase - anything carrying started_at / finished_at datetimes. None when either is missing.

param phase: Any

Returns

float | None
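The timing rule above, together with the agent_execution-then-whole-trial fallback mentioned under _trial_usage, can be sketched as follows. The helper `agent_seconds` is hypothetical, and the assumption that the whole-trial span is carried as started_at / finished_at on the trial object itself is made for this example.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace
from typing import Any, Optional


def phase_seconds(phase: Any) -> Optional[float]:
    """Wall-clock seconds for anything carrying started_at / finished_at."""
    started = getattr(phase, "started_at", None)
    finished = getattr(phase, "finished_at", None)
    if started is None or finished is None:
        return None
    return (finished - started).total_seconds()


def agent_seconds(trial: Any) -> Optional[float]:
    """Prefer the agent_execution phase; fall back to the whole-trial span."""
    seconds = phase_seconds(getattr(trial, "agent_execution", None))
    # Compare against None explicitly so a legitimate 0.0 is not discarded.
    return seconds if seconds is not None else phase_seconds(trial)


t0 = datetime(2024, 1, 1, 12, 0, 0)
trial = SimpleNamespace(
    started_at=t0,
    finished_at=t0 + timedelta(seconds=30),
    # The agent phase never recorded a finish time, so the fallback applies.
    agent_execution=SimpleNamespace(started_at=t0, finished_at=None),
)
elapsed = agent_seconds(trial)
```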
func _safe(name) -> str
param name: str

Returns

str