EvSys

BenchmarkEvaluator

Score a :class:~evsys_sdk.Benchmark against the live sampler.

The :class:~evsys_sdk.training.loop.TrainingLoop checks run_every per evaluator (see :meth:TrainingLoop._is_due); a value of 0 disables the evaluator (it never fires).

chat_template mirrors the post-training eval spec (system_prompt + user_template + enable_thinking) so the same YAML knob configures both the in-loop val and the final test set.

Attributes

attributenamestr
attributebenchmarkBenchmark
attributetokenizerAny
attributerun_everyint
= 0
attributemax_tokensint
= 256
attributetemperaturefloat
= 0.0
attributebreakdown_keyslist[str]
= field(default_factory=list)
attributemetricslist[str]
= field(default_factory=list)

Registered metric names to compute (harbor engine), e.g. ["pass@3", "pass^3", "avg"]. Empty → mean_reward + pass_rate.

attributechat_templatedict[str, Any]
= field(default_factory=dict)
attributelimitint | None
= None

Cap the number of tasks scored per eval - useful when the benchmark is large and you want quick in-loop snapshots.

attributeenginestr
= ''

"harbor" → score through harbor's rollout engine (off the eval checkpoint). Anything else → the live-sampler InferenceClient path.

attributemodel_namestr | None
= None
attributeworkspace_dirAny
= None
attributenum_samplesint
= 1
attributen_concurrentint
= 8

Concurrent harbor trials (harbor engine only). Higher = more rollouts in flight against the sampler; all share one cached sampling client.

attributestoreAny
= None
attributerun_idstr | None
= None
attributebenchmark_idstr | None
= None

Functions

funcevaluate(self, sampler, *, model_path=None, step=None) -> dict[str, float]
paramself
paramsamplerAny
parammodel_pathstr | None
= None
paramstepint | None
= None

Returns

dict[str, float]
func_evaluate_harbor(self, model_path, *, step=None) -> dict[str, float]

Score the validation benchmark through harbor (same engine as training); reward = each task's verifier. Returns the metric dict and, when store + run_id are set, uploads the eval rollouts.

paramself
parammodel_pathstr
paramstepint | None
= None

Returns

dict[str, float]
func_upload(self, tasks, groups, metrics, step) -> None

Record one eval (per step) + its per-task rollout predictions on the dashboard. Best-effort: a dashboard hiccup must not kill training.

paramself
paramtaskslist[Any]
paramgroupslist[Any]
parammetricsdict[str, float]
paramstepint | None

Returns

None
func__init__(self, name, benchmark, tokenizer, run_every=0, max_tokens=256, temperature=0.0, breakdown_keys=list(), metrics=list(), chat_template=dict(), limit=None, engine='', model_name=None, workspace_dir=None, num_samples=1, n_concurrent=8, store=None, run_id=None, benchmark_id=None) -> None
paramself
paramnamestr
parambenchmarkBenchmark
paramtokenizerAny
paramrun_everyint
= 0
parammax_tokensint
= 256
paramtemperaturefloat
= 0.0
parambreakdown_keyslist[str]
= list()
parammetricslist[str]
= list()
paramchat_templatedict[str, Any]
= dict()
paramlimitint | None
= None
paramenginestr
= ''
parammodel_namestr | None
= None
paramworkspace_dirAny
= None
paramnum_samplesint
= 1
paramn_concurrentint
= 8
paramstoreAny
= None
paramrun_idstr | None
= None
parambenchmark_idstr | None
= None

Returns

None

On this page