BenchmarkEvaluator

Score a :class:~evsys_sdk.Benchmark against the live sampler.

The :class:~evsys_sdk.training.loop.TrainingLoop checks run_every per evaluator (see :meth:TrainingLoop._is_due); a value of 0 disables the evaluator (it never fires).

chat_template mirrors the post-training eval spec (system_prompt + user_template + enable_thinking) so the same YAML knob configures both the in-loop val and the final test set.

Attributes

attributenamestr

attributebenchmarkBenchmark

attributetokenizerAny

attributerun_everyint

= 0

attributemax_tokensint

= 256

attributetemperaturefloat

= 0.0

attributebreakdown_keyslist[str]

= field(default_factory=list)

attributemetricslist[str]

= field(default_factory=list)

Registered metric names to compute (harbor engine), e.g. ["pass@3", "pass^3", "avg"]. Empty → mean_reward + pass_rate.

attributechat_templatedict[str, Any]

= field(default_factory=dict)

attributelimitint | None

= None

Cap the number of tasks scored per eval - useful when the benchmark is large and you want quick in-loop snapshots.

attributeenginestr

= ''

"harbor" → score through harbor's rollout engine (off the eval checkpoint). Anything else → the live-sampler InferenceClient path.

attributemodel_namestr | None

= None

attributeworkspace_dirAny

= None

attributenum_samplesint

= 1

attributen_concurrentint

= 8

Concurrent harbor trials (harbor engine only). Higher = more rollouts in flight against the sampler; all share one cached sampling client.

attributestoreAny

= None

attributerun_idstr | None

= None

attributebenchmark_idstr | None

= None

Functions

funcevaluate(self, sampler, *, model_path=None, step=None) -> dict[str, float]

paramself

paramsamplerAny

parammodel_pathstr | None

= None

paramstepint | None

= None

Returns

dict[str, float]

func_evaluate_harbor(self, model_path, *, step=None) -> dict[str, float]

Score the validation benchmark through harbor (same engine as training); reward = each task's verifier. Returns the metric dict and, when store + run_id are set, uploads the eval rollouts.

paramself

parammodel_pathstr

paramstepint | None

= None

Returns

dict[str, float]

func_upload(self, tasks, groups, metrics, step) -> None

Record one eval (per step) + its per-task rollout predictions on the dashboard. Best-effort: a dashboard hiccup must not kill training.

paramself

paramtaskslist[Any]

paramgroupslist[Any]

parammetricsdict[str, float]

paramstepint | None

Returns

None

func__init__

(self, name, benchmark, tokenizer, run_every=0, max_tokens=256, temperature=0.0, breakdown_keys=list(), metrics=list(), chat_template=dict(), limit=None, engine='', model_name=None, workspace_dir=None, num_samples=1, n_concurrent=8, store=None, run_id=None, benchmark_id=None) -> None

paramself

paramnamestr

parambenchmarkBenchmark

paramtokenizerAny

paramrun_everyint

= 0

parammax_tokensint

= 256

paramtemperaturefloat

= 0.0

parambreakdown_keyslist[str]

= list()

parammetricslist[str]

= list()

paramchat_templatedict[str, Any]

= dict()

paramlimitint | None

= None

paramenginestr

= ''

parammodel_namestr | None

= None

paramworkspace_dirAny

= None

paramnum_samplesint

= 1

paramn_concurrentint

= 8

paramstoreAny

= None

paramrun_idstr | None

= None

parambenchmark_idstr | None

= None

Returns

None

BenchmarkEvaluator

Attributes

Functions

On this page