harbor_engine
Harbor rollout runner - hands rollouts to harbor's Job engine (0.13.2).
Import-safe without the harbor package: all harbor imports are lazy
(inside the runners), and the agent / environment classes are referenced only by
string import path (they live in :mod:evsys_sdk.training.harbor_agents,
which harbor loads at trial runtime). So rl / sdft can import this
module, and tests can mock the runners, with no harbor install.
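The lazy-import arrangement described above can be sketched roughly as follows. This is an illustration only, not the module's code: `run_rollouts` and its `runner` argument are hypothetical names, and the real harbor call is deliberately elided.

```python
def run_rollouts(config, runner=None):
    """Hypothetical entry point that stays importable without harbor."""
    if runner is not None:
        # Test path: an injected runner means harbor is never imported.
        return runner(config)
    # Real path: the import happens here, not at module import time,
    # so a missing harbor install only fails when a rollout is actually run.
    import harbor  # noqa: F401
    raise NotImplementedError("the real harbor call is elided in this sketch")


# Works with no harbor install because the runner is injected.
result = run_rollouts({"num_samples": 2}, runner=lambda cfg: cfg["num_samples"])
```

Because the import sits inside the function body, importing the enclosing module never touches harbor; only the un-mocked call path does.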
Flow (harbor 0.13.2): a producer's adapter (:class:HarborTaskAdapter for
scored rollouts, :class:PromptAdapter for generation) writes each task dir
(instruction.md + task.toml [+ evsys_verifier.json spec + a dummy
tests/test.sh when scored]) and returns harbor-native TaskConfigs →
:func:run_harbor_rollouts builds a JobConfig over those TaskConfigs ×
one agent, n_attempts = num_samples → Job.run() → harvest each
trial's agent_result (rollout_details + completion + token/cost usage)
and verifier_result (reward) into a :class:Trajectory.
The reward is produced by harbor running our
:class:~evsys_sdk.training.harbor_agents.EvsysVerifier (the job-level verifier)
host-side, no container: it wraps the task's registered verifier fn over the
completion the agent wrote. SHARED verifier mode (the default) keeps it in the
agent's no-op environment; the dummy tests/test.sh only satisfies harbor's
task-load check and is never executed. (Generation-only rollouts disable the
verifier and use environment_mode="separate" so no test.sh is needed.)
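A rough sketch of the task-dir layout described above, assuming nothing about harbor's actual schemas: `write_task_dir` is a hypothetical helper, the `task.toml` body and the verifier spec are placeholders, and only the file names come from the description.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def write_task_dir(root: Path, task_id: str, instruction: str,
                   verifier_spec: Optional[dict] = None) -> Path:
    """Write one task dir; pass verifier_spec only for scored rollouts."""
    task_dir = root / task_id
    task_dir.mkdir(parents=True, exist_ok=True)
    (task_dir / "instruction.md").write_text(instruction, encoding="utf-8")
    # Placeholder only: the real task.toml schema is defined by harbor.
    (task_dir / "task.toml").write_text("# placeholder task config\n", encoding="utf-8")
    if verifier_spec is not None:
        (task_dir / "evsys_verifier.json").write_text(
            json.dumps(verifier_spec), encoding="utf-8")
        tests_dir = task_dir / "tests"
        tests_dir.mkdir(exist_ok=True)
        # Dummy script: present only to pass the task-load check, never executed.
        (tests_dir / "test.sh").write_text("#!/bin/sh\nexit 0\n", encoding="utf-8")
    return task_dir


with tempfile.TemporaryDirectory() as tmp:
    # The spec content below is a made-up placeholder.
    scored = write_task_dir(Path(tmp), "scored-task", "Add 2 + 2.", {"verifier": "exact_match"})
    prompt_only = write_task_dir(Path(tmp), "prompt-task", "Write a haiku.")
    scored_files = sorted(
        p.relative_to(scored).as_posix() for p in scored.rglob("*") if p.is_file())
    prompt_files = sorted(p.name for p in prompt_only.iterdir())
```

The scored directory carries the verifier spec and the dummy script; the generation-only directory carries neither.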
attribute __all__ = ['HarborTaskAdapter', 'PromptAdapter', 'run_harbor_rollouts']

func _agent_import_and_kwargs(model_client, *, agent_import_path, model_name, model_path, renderer_name, max_tokens, temperature, max_turns, system_prompt) -> tuple[str, dict[str, Any]]

Pick the harbor agent + its kwargs for a rollout. Pure + harbor-free so the agent-selection logic is unit-testable.
An explicit agent_import_path wins (fully self-configured agent, no
kwargs). Otherwise it's always :class:BasicLoopAgent, parameterized by
model_client: "tinker" (on-policy TinkerLLM, needs model_path)
or "litellm" (closed/API model; model_name is a litellm string, the
tinker-only model_path/renderer_name are ignored).
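The selection rules above might look something like the sketch below. It is not the module's implementation: `pick_agent` is a hypothetical name, the import-path string format and the kwargs keys are assumptions, and raising `ValueError` on bad input is a choice made here for illustration.

```python
from typing import Any, Optional

# Assumed import-path string; the format harbor actually expects may differ.
BASIC_LOOP_AGENT = "evsys_sdk.training.harbor_agents:BasicLoopAgent"


def pick_agent(model_client: str, *, agent_import_path: Optional[str],
               model_name: str, model_path: Optional[str],
               renderer_name: Optional[str], max_tokens: int,
               temperature: float, max_turns: int,
               system_prompt: Optional[str]) -> tuple[str, dict[str, Any]]:
    if agent_import_path is not None:
        # An explicit agent is fully self-configured: no kwargs at all.
        return agent_import_path, {}
    if model_client not in ("tinker", "litellm"):
        raise ValueError(f"unknown model_client: {model_client!r}")
    kwargs: dict[str, Any] = {
        "model_client": model_client,
        "model_name": model_name,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "max_turns": max_turns,
        "system_prompt": system_prompt,
    }
    if model_client == "tinker":
        if model_path is None:
            raise ValueError("model_client='tinker' needs model_path")
        # Only the tinker client receives these two; litellm ignores them.
        kwargs.update(model_path=model_path, renderer_name=renderer_name)
    return BASIC_LOOP_AGENT, kwargs


common = dict(model_name="m", renderer_name=None, max_tokens=64,
              temperature=1.0, max_turns=1, system_prompt=None)
explicit = pick_agent("tinker", agent_import_path="pkg.mod:MyAgent",
                      model_path=None, **common)
api_path, api_kwargs = pick_agent("litellm", agent_import_path=None,
                                  model_path="ignored", **common)
```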
param model_client: str
param agent_import_path: str | None
param model_name: str
param model_path: str | None
param renderer_name: str | None
param max_tokens: int
param temperature: float
param max_turns: int
param system_prompt: str | None
Returns tuple[str, dict[str, typing.Any]]

func _to_agent_config(AgentConfig, import_path, kwargs) -> Any

Map (import_path, agent kwargs) → a harbor AgentConfig.
harbor 0.13.2 passes model_name to the agent constructor from the
top-level AgentConfig.model_name field, so it must NOT also live in
kwargs (else the agent gets model_name twice). Lift it out here so
the agent-selection logic above can stay a flat kwargs dict.
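A minimal sketch of that lift, under stated assumptions: `StubAgentConfig` is a stand-in dataclass (harbor's real AgentConfig is not imported, and its field names other than `model_name` are guesses), and `to_agent_config` is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class StubAgentConfig:
    """Stand-in for harbor's AgentConfig; field names are assumed."""
    import_path: str
    model_name: Optional[str] = None
    kwargs: dict = field(default_factory=dict)


def to_agent_config(config_cls: Any, import_path: str, kwargs: dict) -> Any:
    rest = dict(kwargs)  # copy so the caller's flat dict is not mutated
    # Move model_name to the top-level field so the agent receives it once.
    model_name = rest.pop("model_name", None)
    return config_cls(import_path=import_path, model_name=model_name, kwargs=rest)


flat = {"model_name": "some-model", "max_tokens": 64}
cfg = to_agent_config(StubAgentConfig, "pkg.mod:Agent", flat)
```

Passing the config class in as an argument mirrors the signature above and keeps the function importable without harbor.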
param AgentConfig: Any
param import_path: str
param kwargs: dict[str, Any]
Returns typing.Any

func run_harbor_rollouts(items, *, outcome_reward=True, model_name, model_path, workspace_dir, model_client='tinker', renderer_name=None, num_samples=1, max_turns=1, max_tokens=512, temperature=1.0, system_prompt=None, agent_import_path=None, n_concurrent=4, max_retries=2, _job_factory=None) -> list[TrajectoryGroup]

Roll out items (× num_samples) through harbor's Job engine -
one :class:TrajectoryGroup per item, in order.
outcome_reward is the agent-meaningful knob - does the rollout get scored
by an outcome verifier? The runner is adapter-aware (it runs the matching
adapter to write the task dirs, where materialize_task used to be), so no
caller ever touches an adapter:
- outcome_reward=True (default) - items are :class:HarborTask objects; :class:HarborTaskAdapter writes scored task dirs and the host-side :class:EvsysVerifier produces each outcome reward. (RL + benchmark eval.)
- outcome_reward=False - items are prompt strings; :class:PromptAdapter writes generation-only dirs (no verifier, reward=0). (SDFT students.)
model_client - "tinker" (on-policy TinkerLLM, needs model_path)
or "litellm" (closed/API model; model_name a litellm string, e.g.
"anthropic/claude-opus-4-1").
_job_factory is the test seam: async (job_config) -> job_result.
When None, harbor is imported and Job.create(...).run() is used.
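For tests, a fake factory matching the async (job_config) -> job_result shape could be sketched as below. `make_fake_job_factory` is a hypothetical helper, and the canned result is an opaque placeholder because the job-result structure is not specified here.

```python
import asyncio


def make_fake_job_factory(canned_result):
    """Build an async factory that records each config and returns a canned result."""
    seen_configs = []

    async def factory(job_config):
        seen_configs.append(job_config)
        return canned_result

    return factory, seen_configs


factory, seen = make_fake_job_factory(canned_result={"placeholder": True})
# In a real test the factory would be passed as _job_factory=factory;
# here it is awaited directly to show the contract.
job_result = asyncio.run(factory({"n_attempts": 2}))
```

Recording the configs lets a test assert on what the runner built without importing harbor.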
param items: Sequence[Any]
param outcome_reward: bool = True
param model_name: str
param model_path: str | None
param workspace_dir: Path
param model_client: str = 'tinker'
param renderer_name: str | None = None
param num_samples: int = 1
param max_turns: int = 1
param max_tokens: int = 512
param temperature: float = 1.0
param system_prompt: str | None = None
param agent_import_path: str | None = None
param n_concurrent: int = 4
param max_retries: int = 2
param _job_factory: Any | None = None
Returns list[evsys_sdk.training.trajectory.TrajectoryGroup]

func _run_job(Job, config) -> Any

param Job: Any
param config: Any
Returns typing.Any

func _trials_by_task(job_result) -> dict[str, list[Any]]

Group a job's trial results by task_name (the materialized dir's
basename = _safe(task_id)). n_attempts trials share a task_name.
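The grouping can be sketched as follows. How trials hang off the job result is not stated here, so this hypothetical `group_by_task_name` takes an iterable of trials directly; `SimpleNamespace` objects stand in for harbor trial results.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_task_name(trials):
    """Group trial-like objects by their task_name, preserving input order."""
    grouped = defaultdict(list)
    for trial in trials:
        grouped[trial.task_name].append(trial)
    return dict(grouped)


# Two attempts of task "a" and one of task "b".
trials = [
    SimpleNamespace(task_name="a", attempt=0),
    SimpleNamespace(task_name="b", attempt=0),
    SimpleNamespace(task_name="a", attempt=1),
]
groups = group_by_task_name(trials)
```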
param job_result: Any
Returns dict[str, list[typing.Any]]

func _harvest(job_result, task_configs) -> list[TrajectoryGroup]

param job_result: Any
param task_configs: Sequence[Any]
Returns list[evsys_sdk.training.trajectory.TrajectoryGroup]

func _trial_to_trajectory(tr) -> Trajectory | None

Convert a harbor TrialResult → our multi-turn :class:Trajectory,
reading the rollout off agent_result (AgentContext).
Token-level turns come from rollout_details (tinker on-policy rollouts).
Closed/API models (litellm) return no token ids, so for an eval trial - one
that produced a verifier reward - we still build a token-less Trajectory
carrying the reward + usage so it isn't dropped from scoring. Errored trials,
and generation-only trials with neither tokens nor a reward, return None.
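The cases above reduce to a small decision table, sketched here. `classify_trial` and its string labels are hypothetical, and the order in which the checks run is an assumption.

```python
from typing import Optional


def classify_trial(*, errored: bool, has_token_turns: bool,
                   reward: Optional[float]) -> str:
    """Return which kind of Trajectory (if any) a trial would yield."""
    if errored:
        return "none"  # errored trials are dropped
    if has_token_turns:
        return "token_level"  # tinker on-policy rollout with token ids
    if reward is not None:
        return "token_less"  # API-model eval trial: keep reward + usage
    return "none"  # generation-only, no tokens and no reward


cases = {
    "tinker": classify_trial(errored=False, has_token_turns=True, reward=1.0),
    "litellm_eval": classify_trial(errored=False, has_token_turns=False, reward=0.0),
    "litellm_generation": classify_trial(errored=False, has_token_turns=False, reward=None),
    "errored": classify_trial(errored=True, has_token_turns=True, reward=1.0),
}
```

Note that a reward of 0.0 still counts as a reward, which is why the check is `is not None` rather than truthiness.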
param tr: Any
Returns evsys_sdk.training.trajectory.Trajectory | None

func _trial_usage(tr) -> dict[str, Any]

Pull harbor's native cost / token / timing info off a trial result.
Harbor records cost_usd + token counts on agent_result
(AgentContext) and per-phase wall-clock timing on the trial
(agent_execution, whole-trial span as fallback). Any field harbor didn't
populate stays None - on-policy tinker has no API cost_usd, and the
caller backfills token counts from the turns. Pure + harbor-free.
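A sketch of the "stays None, then backfilled" behaviour, under loud assumptions: `read_usage` and `backfill_output_tokens` are hypothetical names, `output_tokens` is an invented field name (only `cost_usd` is named above), and turns are modelled as plain lists of token ids.

```python
from types import SimpleNamespace
from typing import Any, Sequence


def read_usage(agent_result: Any) -> dict:
    """Read usage fields defensively; anything missing stays None."""
    return {
        "cost_usd": getattr(agent_result, "cost_usd", None),
        # Hypothetical field name, used only for illustration.
        "output_tokens": getattr(agent_result, "output_tokens", None),
    }


def backfill_output_tokens(usage: dict, turns: Sequence[Sequence[int]]) -> dict:
    """Fill a missing token count from the turns; leave other fields alone."""
    filled = dict(usage)
    if filled["output_tokens"] is None:
        filled["output_tokens"] = sum(len(turn) for turn in turns)
    return filled


# An on-policy result with no API cost and no token count recorded.
usage = read_usage(SimpleNamespace())
filled = backfill_output_tokens(usage, turns=[[1, 2, 3], [4, 5]])
```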
param tr: Any
Returns dict[str, typing.Any]

func _phase_seconds(phase) -> float | None

Wall-clock seconds for a harbor timing phase - anything carrying
started_at / finished_at datetimes. None when either is missing.
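The computation described above is small enough to sketch in full. `phase_seconds` here is an illustrative stand-in rather than the module's code, and `SimpleNamespace` plays the role of a harbor timing phase.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace
from typing import Any, Optional


def phase_seconds(phase: Any) -> Optional[float]:
    """Elapsed seconds between started_at and finished_at, or None if either is missing."""
    started = getattr(phase, "started_at", None)
    finished = getattr(phase, "finished_at", None)
    if started is None or finished is None:
        return None
    return (finished - started).total_seconds()


start = datetime(2024, 1, 1, 12, 0, 0)
complete = SimpleNamespace(started_at=start, finished_at=start + timedelta(seconds=2.5))
unfinished = SimpleNamespace(started_at=start, finished_at=None)

elapsed = phase_seconds(complete)      # 2.5
missing = phase_seconds(unfinished)    # None
```

Using `getattr` with a default keeps the helper duck-typed, so it also returns None for objects that lack the attributes entirely.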
param phase: Any
Returns float | None

func _safe(name) -> str

param name: str
Returns str