BenchmarkEvaluator
Score a :class:~evsys_sdk.Benchmark against the live sampler.
The :class:~evsys_sdk.training.loop.TrainingLoop checks
run_every per evaluator (see :meth:TrainingLoop._is_due);
a value of 0 disables the evaluator (it never fires).
chat_template mirrors the post-training eval spec
(system_prompt + user_template + enable_thinking) so the
same YAML knob configures both the in-loop val and the final test set.
Attributes
attributenamestrattributebenchmarkBenchmarkattributetokenizerAnyattributerun_everyint= 0attributemax_tokensint= 256attributetemperaturefloat= 0.0attributebreakdown_keyslist[str]= field(default_factory=list)attributemetricslist[str]= field(default_factory=list)Registered metric names to compute (harbor engine), e.g.
["pass@3", "pass^3", "avg"]. Empty → mean_reward + pass_rate.
attributechat_templatedict[str, Any]= field(default_factory=dict)attributelimitint | None= NoneCap the number of tasks scored per eval - useful when the benchmark is large and you want quick in-loop snapshots.
attributeenginestr= ''"harbor" → score through harbor's rollout engine (off the eval
checkpoint). Anything else → the live-sampler InferenceClient path.
attributemodel_namestr | None= Noneattributeworkspace_dirAny= Noneattributenum_samplesint= 1attributen_concurrentint= 8Concurrent harbor trials (harbor engine only). Higher = more rollouts in flight against the sampler; all share one cached sampling client.
attributestoreAny= Noneattributerun_idstr | None= Noneattributebenchmark_idstr | None= NoneFunctions
funcevaluate(self, sampler, *, model_path=None, step=None) -> dict[str, float]paramselfparamsamplerAnyparammodel_pathstr | None= Noneparamstepint | None= NoneReturns
dict[str, float]func_evaluate_harbor(self, model_path, *, step=None) -> dict[str, float]Score the validation benchmark through harbor (same engine as
training); reward = each task's verifier. Returns the metric dict and,
when store + run_id are set, uploads the eval rollouts.
paramselfparammodel_pathstrparamstepint | None= NoneReturns
dict[str, float]func_upload(self, tasks, groups, metrics, step) -> NoneRecord one eval (per step) + its per-task rollout predictions on
the dashboard. Best-effort: a dashboard hiccup must not kill training.
paramselfparamtaskslist[Any]paramgroupslist[Any]parammetricsdict[str, float]paramstepint | NoneReturns
Nonefunc__init__(self, name, benchmark, tokenizer, run_every=0, max_tokens=256, temperature=0.0, breakdown_keys=list(), metrics=list(), chat_template=dict(), limit=None, engine='', model_name=None, workspace_dir=None, num_samples=1, n_concurrent=8, store=None, run_id=None, benchmark_id=None) -> NoneparamselfparamnamestrparambenchmarkBenchmarkparamtokenizerAnyparamrun_everyint= 0parammax_tokensint= 256paramtemperaturefloat= 0.0parambreakdown_keyslist[str]= list()parammetricslist[str]= list()paramchat_templatedict[str, Any]= dict()paramlimitint | None= Noneparamenginestr= ''parammodel_namestr | None= Noneparamworkspace_dirAny= Noneparamnum_samplesint= 1paramn_concurrentint= 8paramstoreAny= Noneparamrun_idstr | None= Noneparambenchmark_idstr | None= NoneReturns
None