EvSys

Introduction

A declarative framework for LLM training experiments - SFT, RL, and distillation from a single YAML file, built so coding agents can launch experiments, learn from them, and train models continuously.

We believe there will not be a few generally intelligent models that everyone uses, but thousands of models adapted for every task, continually learning from every interaction. To enable this we need infrastructure that lets coding agents launch experiments, learn from them, and train models that learn continuously. evsys-sdk is the first step toward that.

This is the whole point of evsys-sdk: turn your favourite coding agent into a model lab.

To enable autoresearch we need standardisation of experiments - and evsys-sdk standardises every experiment while staying flexible enough to run any algorithm on any data.

Tinker-protocol training infrastructure has made the hard part - provisioning GPUs, sharding, the training loop - someone else's problem. What's left is the research: which data and which recipe produce a model that wins on your task. evsys-sdk makes that research the only thing you (and your agent) touch, and it standardises and centralises every experiment so the agent always has the context of what has been tried.

So the loop becomes: your coding agent - Claude Code, Codex, anything - writes a config, launches it on the tinker backend, reads the structured result, and writes the next one. Run a series of educated experiments and you can land a small, task-specific model that beats frontier models on your task at a fraction of the cost - driven entirely by the agent.
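The loop above can be sketched in a few lines. This is a minimal illustration, not the evsys-sdk API: `Result`, `launch`, `propose_next`, and `research_loop` are hypothetical names, the backend is replaced by a toy scoring function, and the "agent" is a fixed rule standing in for whatever reasoning a real coding agent would do.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Structured outcome of one run (hypothetical shape, not the SDK's)."""
    config: dict
    score: float


def launch(config: dict) -> Result:
    """Stand-in for launching a run on a backend.

    A toy score that peaks at lr = 3e-4 replaces real training and evaluation.
    """
    lr = config["lr"]
    return Result(config=config, score=1.0 - abs(lr - 3e-4) / 3e-4)


def propose_next(history: list[Result]) -> dict:
    """Stand-in for the agent: halve the learning rate of the best run so far."""
    best = max(history, key=lambda r: r.score)
    return {**best.config, "lr": best.config["lr"] / 2}


def research_loop(first_config: dict, budget: int) -> Result:
    """Write a config, launch it, read the result, write the next one."""
    history = [launch(first_config)]
    for _ in range(budget - 1):
        history.append(launch(propose_next(history)))
    return max(history, key=lambda r: r.score)


best = research_loop({"algorithm": "sft", "lr": 2.4e-3}, budget=5)
```

The point of the sketch is that every step reads the full history of structured results, which is why standardised, centralised experiments matter to the agent.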

Continually learning - built in

The mission isn't just a slogan. The continual modifier trains a model across an ordered list of datasets, each stage resuming from the previous stage's weights - so the model accumulates skills instead of restarting from base. See Continual learning.
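As a rough picture of what "each stage resuming from the previous stage's weights" means, here is a small sketch. It assumes nothing about how the continual modifier is implemented: `train_stage` and `train_continual` are hypothetical names, and "weights" are modelled as a plain dict that records which datasets the model has seen.

```python
def train_stage(weights: dict, dataset: str) -> dict:
    """Stand-in for one training stage.

    Returns new weights that remember every dataset seen so far,
    leaving the input weights untouched.
    """
    return {"skills": weights["skills"] + [dataset]}


def train_continual(base: dict, datasets: list[str]) -> dict:
    """Train across an ordered list of datasets, chaining the weights."""
    weights = base
    for dataset in datasets:
        # Each stage starts from the previous stage's output, not from base.
        weights = train_stage(weights, dataset)
    return weights


base = {"skills": []}
final = train_continual(base, ["math", "code", "tools"])
```

Training each dataset from `base` instead would yield three separate models with one skill each; chaining is what lets skills accumulate.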

An experiment is a single YAML file.

[Diagram: Data (raw → data format rows) → Training (build_batch()) → Evaluation (validation + benchmark). The coding agent edits the Data and Training stages.]

🧑‍💻 Coding agents customise

  • Data ablations - what data goes into training (sources, mixes, transforms).
  • Algorithmic changes - only the heavy math: override build_batch() and the loss.
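To make "override build_batch() and the loss" concrete, here is a self-contained sketch of what such an override could look like for SFT. The `Algorithm` base class, the row fields (`prompt`, `completion`), the batch layout, and the whitespace tokenisation are all assumptions made for illustration; the real SDK signatures may differ.

```python
class Algorithm:
    """Hypothetical base class: subclasses supply the batch and the loss."""

    def build_batch(self, rows: list[dict]) -> list[dict]:
        raise NotImplementedError

    def loss(self, logprobs: list[list[float]], batch: list[dict]) -> float:
        raise NotImplementedError


class SFT(Algorithm):
    def build_batch(self, rows: list[dict]) -> list[dict]:
        batch = []
        for row in rows:
            # Whitespace split stands in for a real tokenizer.
            prompt = row["prompt"].split()
            completion = row["completion"].split()
            batch.append({
                "tokens": prompt + completion,
                # Train only on completion tokens: prompt weights are zero.
                "weights": [0.0] * len(prompt) + [1.0] * len(completion),
            })
        return batch

    def loss(self, logprobs: list[list[float]], batch: list[dict]) -> float:
        # Weighted negative log-likelihood, averaged over weighted tokens.
        total, count = 0.0, 0.0
        for token_logprobs, example in zip(logprobs, batch):
            for lp, w in zip(token_logprobs, example["weights"]):
                total -= w * lp
                count += w
        return total / count


algo = SFT()
batch = algo.build_batch([{"prompt": "2 + 2 =", "completion": "4"}])
value = algo.loss([[-1.0, -1.0, -1.0, -1.0, -0.5]], batch)
```

Under this split, swapping SFT for another recipe means changing how the batch and its weights are built, while the loop that consumes them stays the same.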

⚙️ The SDK handles everything else

  • Experiment orchestration - sweeps, arms, seeds, failure isolation, conclusions.
  • Data - loading, caching & lineage → data format rows.
  • The training loop - forward/backward, optimizer steps, save cadence.
  • Checkpointing & artifacts.
  • Evaluation - in-loop validation + benchmark scoring, metrics & verifiers.
  • Backends & compute - mock / local / tinker, rollouts via Harbor.
  • Logging, dashboard & storage.
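Two of the orchestration duties listed above, expanding a sweep into arms and seeds and isolating failures, can be sketched as follows. This is an illustration under assumptions, not the SDK's orchestrator: `expand_sweep`, `run_all`, and `toy_run` are hypothetical, and the result records are an invented shape.

```python
from itertools import product


def expand_sweep(base: dict, sweep: dict, seeds: list[int]):
    """Yield one config per (arm, seed): the cross product of sweep values."""
    keys = sorted(sweep)
    for values in product(*(sweep[k] for k in keys)):
        for seed in seeds:
            yield {**base, **dict(zip(keys, values)), "seed": seed}


def run_all(configs: list[dict], run_fn) -> list[dict]:
    """Run every config; a failing run is recorded instead of aborting the sweep."""
    results = []
    for cfg in configs:
        try:
            results.append({"config": cfg, "status": "ok", "metric": run_fn(cfg)})
        except Exception as exc:  # failure isolation: keep going
            results.append({"config": cfg, "status": "failed", "error": str(exc)})
    return results


def toy_run(cfg: dict) -> float:
    """Stand-in for a training run that diverges at high learning rates."""
    if cfg["lr"] > 1e-3:
        raise ValueError("diverged")
    return cfg["lr"] * 1000 + cfg["seed"]


configs = list(expand_sweep({"algorithm": "sft"}, {"lr": [1e-4, 1e-2]}, seeds=[0, 1]))
results = run_all(configs, toy_run)
```

Recording failures alongside successes keeps the experiment history complete, so a later conclusion can say which arms failed as well as which one won.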

What you can plug in

Every moving part is a registry: reference a built-in by name in the YAML, or register your own with a one-line @register_* decorator - no fork.
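The registry idea can be illustrated with a few lines of plain Python. The sketch below is not the SDK's implementation: `TRANSFORMS`, `register_transform`, `apply_chain`, and the two example transforms are hypothetical names chosen to mirror the `@register_*` pattern described above.

```python
from typing import Callable

# Name → function table; a YAML config would reference entries by name.
TRANSFORMS: dict[str, Callable[[dict], dict]] = {}


def register_transform(name: str):
    """Decorator that adds a function to the registry under `name`."""
    def decorator(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        if name in TRANSFORMS:
            raise ValueError(f"transform {name!r} is already registered")
        TRANSFORMS[name] = fn
        return fn
    return decorator


@register_transform("lowercase_prompt")
def lowercase_prompt(row: dict) -> dict:
    return {**row, "prompt": row["prompt"].lower()}


@register_transform("strip_completion")
def strip_completion(row: dict) -> dict:
    return {**row, "completion": row["completion"].strip()}


def apply_chain(names: list[str], row: dict) -> dict:
    """Look up each transform by name and apply them in order."""
    for name in names:
        row = TRANSFORMS[name](row)
    return row


row = apply_chain(
    ["lowercase_prompt", "strip_completion"],
    {"prompt": "HELLO", "completion": " hi "},
)
```

Because lookup is by name, a config only needs a string to select a component, and adding a new one is a decorated function rather than a fork.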

  • Tinker-compatible backends - train on anything that exposes the Tinker training server: Fireworks, TML, SkyRL, … one line in the config switches provider.
  • Algorithms - any recipe under the sun (SFT, RL, distillation, or your own), LoRA on any model - defined by overriding a single build_batch().
  • Agent harnesses - train a model inside your own multi-turn harness (any harbor BaseAgent) so it learns to use your tools - the way Opus is trained inside the Claude Code harness.
  • Transforms - turn any raw data into the standard data-format rows; chain built-ins or register your own.
  • Verifiers - the reward / correctness check for RL and eval (exact_match, contains, LLM-judge, or custom).
  • Metrics - score rollouts however you measure success: pass@1, pass@3, cost, latency, your own.
  • Validation & evaluation - score against benchmarks via metadata.benchmark. Setting run_every: N scores in-loop every N steps (omit it to score once after training); the split tag (val/test) only separates the metrics.
  • Inference clients - how eval / RL query a model to generate.
  • Data & log stores - where data is read and where metrics/artifacts are written (local files, the dashboard, TensorBoard).
  • Callbacks - hook the training lifecycle for logging, early stopping, or dashboard upload.
  • Deployers - ship the trained adapter (e.g. Fireworks) with one line.
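To show how a verifier and a metric fit together, here is a small sketch. The verifier names `exact_match` and `contains` come from the list above, but their behaviour here is an assumption; `pass_at_k` is a hypothetical helper that uses the simple empirical definition (a task counts as solved if any of its first k rollouts passes), not necessarily the estimator the SDK uses.

```python
from typing import Callable

Verifier = Callable[[str, str], bool]


def exact_match(output: str, expected: str) -> bool:
    """Pass when the output equals the expected answer, ignoring outer whitespace."""
    return output.strip() == expected.strip()


def contains(output: str, expected: str) -> bool:
    """Pass when the expected answer appears anywhere in the output."""
    return expected in output


def pass_at_k(rollouts: list[list[str]], expected: list[str],
              verifier: Verifier, k: int) -> float:
    """Fraction of tasks where at least one of the first k rollouts passes."""
    solved = sum(
        any(verifier(out, answer) for out in outs[:k])
        for outs, answer in zip(rollouts, expected)
    )
    return solved / len(expected)


# Three tasks, three rollouts each.
rollouts = [["4", "5", "4"], ["7", "8", "9"], ["no", "no", "12"]]
expected = ["4", "9", "12"]

p1 = pass_at_k(rollouts, expected, exact_match, k=1)
p3 = pass_at_k(rollouts, expected, exact_match, k=3)
```

The same verifier can serve as the reward check in RL and as the correctness check in evaluation, which is why it is kept separate from the metric that aggregates it.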

