
Algorithms & Evaluation

One contract - train(ctx) -> RunResult - plus an optional gradient toolkit and a test/validation firewall.

The algorithm surface has one required contract, train(ctx) -> RunResult, plus an optional gradient toolkit and pluggable evaluation. It runs over any tinker-compatible Backend: mock for tests, local for TRL+peft, and tinker for hosted runs.

Algorithm

Algorithm is a protocol with three members: name, Config, and train(ctx) -> RunResult. It is the only required contract, and train() may do anything. The toolkit is optional and has three parts: TrainingLoop drives the gradient loop; a StepBuilder (build_batch -> TrainingBatch) is the unit of "a new gradient method"; and a loss is either a named string or a LossCallable, which runs client-side via forward_backward_custom.
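A minimal sketch of what satisfying the contract could look like. Only name, Config, and train(ctx) -> RunResult come from the description above. The RunResult fields, the NoOpConfig and NoOpAlgorithm classes, and passing ctx=None are assumptions made for illustration, not the library's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Protocol, runtime_checkable


@dataclass
class RunResult:
    # Hypothetical fields; the real RunResult may carry different data.
    steps: int
    metrics: Dict[str, float] = field(default_factory=dict)


@runtime_checkable
class Algorithm(Protocol):
    """The one required contract: name, Config, train(ctx) -> RunResult."""

    name: str
    Config: type

    def train(self, ctx: Any) -> RunResult: ...


@dataclass
class NoOpConfig:
    # Hypothetical config with a single illustrative knob.
    max_steps: int = 3


class NoOpAlgorithm:
    """Satisfies the contract without touching the optional gradient toolkit."""

    name = "noop"
    Config = NoOpConfig

    def __init__(self, config: Optional[NoOpConfig] = None) -> None:
        self.config = config or NoOpConfig()

    def train(self, ctx: Any) -> RunResult:
        # train() may do anything; this one only reports a step count.
        return RunResult(steps=self.config.max_steps, metrics={"loss": 0.0})


algo = NoOpAlgorithm()
result = algo.train(ctx=None)  # a real ctx would come from the framework
```

Because the contract is structural, NoOpAlgorithm does not need to inherit from anything. An algorithm that does use gradients would call into TrainingLoop and a StepBuilder from inside train().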

Evaluation - a test/validation firewall

  • Benchmark - the test set, scored once after training. Model selection must never key off it.
  • Validation - scored in-loop every N steps to drive selection.
  • Both are harbor-format and scored via the Metric / Verifier registries.
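The split above can be sketched as a loop that scores validation every N steps, selects on those scores alone, and touches the benchmark exactly once at the end. Every name here (run_with_firewall, FirewallResult, the scorer callables) is hypothetical, and the scorers are toy functions standing in for real harbor-format evaluation.

```python
from typing import Callable, List, NamedTuple, Tuple

Scorer = Callable[[int], float]  # checkpoint step -> score (toy stand-in)


class FirewallResult(NamedTuple):
    best_step: int
    validation: float
    benchmark: float


def run_with_firewall(
    total_steps: int,
    eval_every: int,
    score_validation: Scorer,
    score_benchmark: Scorer,
) -> FirewallResult:
    history: List[Tuple[int, float]] = []
    for step in range(1, total_steps + 1):
        # ... one training step would happen here ...
        if step % eval_every == 0:
            # Validation is scored in-loop and is the only selection signal.
            history.append((step, score_validation(step)))

    if not history:
        raise ValueError("eval_every is larger than total_steps; nothing to select from")

    # Model selection keys off validation scores only.
    best_step, best_val = max(history, key=lambda item: item[1])

    # The benchmark is scored once, after selection is already final.
    return FirewallResult(best_step, best_val, score_benchmark(best_step))


benchmark_calls: List[int] = []


def toy_benchmark(step: int) -> float:
    benchmark_calls.append(step)
    return 0.5


result = run_with_firewall(
    total_steps=100,
    eval_every=20,
    score_validation=lambda s: 1.0 - abs(s - 40) / 100,  # toy curve peaking at step 40
    score_benchmark=toy_benchmark,
)
```

In this toy run the validation curve peaks at step 40, so that checkpoint is selected and the benchmark scorer is called a single time.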

Metrics & Verifiers

  • Metric: compute(predictions, targets) -> float.
  • Verifier: verify(prompt, completion, target) -> reward (RL reward or per-task scoring).
  • InferenceClient: generate(...) - how eval/RL query a model.
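As an illustration of the first two signatures, here is a sketch of an exact-match Metric and Verifier. The method names and argument order follow the list above. The argument types (strings and sequences of strings), the float reward, and the ExactMatch class names are assumptions, and registration with the Metric / Verifier registries is not shown.

```python
from typing import Protocol, Sequence


class Metric(Protocol):
    def compute(self, predictions: Sequence[str], targets: Sequence[str]) -> float: ...


class Verifier(Protocol):
    def verify(self, prompt: str, completion: str, target: str) -> float: ...


class ExactMatchAccuracy:
    """Fraction of predictions equal to their target, ignoring outer whitespace."""

    def compute(self, predictions: Sequence[str], targets: Sequence[str]) -> float:
        if len(predictions) != len(targets):
            raise ValueError("predictions and targets must have the same length")
        if not targets:
            return 0.0
        hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
        return hits / len(targets)


class ExactMatchVerifier:
    """Reward of 1.0 when the completion matches the target, else 0.0."""

    def verify(self, prompt: str, completion: str, target: str) -> float:
        # The prompt is unused here; a task-specific verifier might parse it.
        return 1.0 if completion.strip() == target.strip() else 0.0


accuracy = ExactMatchAccuracy().compute(["4", "5 ", "cat"], ["4", "5", "dog"])
reward = ExactMatchVerifier().verify("2+2=", " 4", "4")
```

A Metric aggregates over a whole set, while a Verifier scores one completion at a time, which is why the same Verifier can serve as an RL reward or as per-task scoring.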

Keep the firewall intact

A Benchmark is your held-out test. If you select the best arm using benchmark scores, you've leaked the test set - use Validation for selection.

Next: Plugins · algorithms API.
