Introduction
A declarative framework for LLM training experiments - SFT, RL, and distillation from a single YAML, built so coding agents can launch experiments, learn from them, and train models continuously.
We believe there will not be a few generally intelligent models that everyone
uses, but thousands of models adapted for every task, continually learning
from every interaction. To enable this we need infrastructure that lets
coding agents launch experiments, learn from them, and train models that
learn continuously. evsys-sdk is the first step toward that.
This is the whole point of evsys-sdk: turn your favourite coding agent into
a model lab.
To enable autoresearch we need standardisation of experiments - and
evsys-sdk standardises every experiment while staying flexible enough to run
any algorithm on any data.
Tinker-protocol training infrastructure has made the hard part - provisioning
GPUs, sharding, the training loop - someone else's problem. What's left is the
research: which data and which recipe produce a model that wins on
your task. evsys-sdk makes that research the only thing you (and your agent)
touch, and it standardises and centralises every experiment so the agent always
has the context of what's been tried.
So the loop becomes: your coding agent - Claude Code, Codex, anything - writes a config, launches it on the tinker backend, reads the structured result, and writes the next one. Run a series of informed experiments and you can arrive at a small, task-specific model that beats frontier models on your task at a fraction of the cost - driven entirely by the agent.
Continually learning - built in
The mission isn't just a slogan. The continual modifier trains a model across
an ordered list of datasets, each stage resuming from the previous stage's
weights - so the model accumulates skills instead of restarting from base.
See Continual learning.
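The staging behaviour described above can be sketched in a few lines of Python. This is a toy illustration, not the SDK's implementation: `Checkpoint`, `train_stage`, and `train_continual` are hypothetical names, and the "weights" are only a record of which datasets a checkpoint has seen.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Checkpoint:
    """Stand-in for model weights: records the datasets trained on so far."""
    lineage: List[str] = field(default_factory=list)


def train_stage(start: Checkpoint, dataset: str) -> Checkpoint:
    """Hypothetical single-stage trainer: builds a new checkpoint on top of `start`."""
    return Checkpoint(lineage=start.lineage + [dataset])


def train_continual(base: Checkpoint, datasets: List[str]) -> List[Checkpoint]:
    """Train over an ordered list of datasets, each stage resuming from the last."""
    checkpoints = []
    current = base
    for dataset in datasets:
        # Resume from the previous stage's checkpoint rather than from `base`.
        current = train_stage(current, dataset)
        checkpoints.append(current)
    return checkpoints


stages = train_continual(Checkpoint(), ["math", "code", "tool_use"])
```

The point of the sketch is the loop variable: `current` is carried from one stage to the next, so the final checkpoint reflects every dataset in order rather than only the last one.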
An experiment is a single YAML file.
🧑‍💻 Coding agents customise
- Data ablations - what data goes into training (sources, mixes, transforms).
- Algorithmic changes - only the heavy math: override build_batch() and the loss.
⚙️ The SDK handles everything else
- Experiment orchestration - sweeps, arms, seeds, failure isolation, conclusions.
- Data - loading, caching & lineage, producing the standard data-format rows.
- The training loop - forward/backward, optimizer steps, save cadence.
- Checkpointing & artifacts.
- Evaluation - in-loop validation + benchmark scoring, metrics & verifiers.
- Backends & compute - mock / local / tinker, rollouts via Harbor.
- Logging, dashboard & storage.
What you can plug in
Every moving part is a registry: reference a built-in by name in the YAML, or
register your own with a one-line @register_* decorator - no fork.
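To make the registry idea concrete, here is a minimal sketch of how a name-to-callable registry with a decorator could work. It is an assumption-laden illustration, not the SDK's API: `VERIFIERS`, `register_verifier`, `resolve`, and `ends_with_answer` are hypothetical names, and the real `@register_*` decorators may take different arguments.

```python
from typing import Callable, Dict

# Hypothetical registry: maps the name used in a config to a Python callable.
VERIFIERS: Dict[str, Callable[[str, str], bool]] = {}


def register_verifier(name: str):
    """Illustrative decorator; the SDK's real @register_* signature may differ."""
    def decorator(fn: Callable[[str, str], bool]) -> Callable[[str, str], bool]:
        if name in VERIFIERS:
            raise ValueError(f"verifier {name!r} is already registered")
        VERIFIERS[name] = fn
        return fn  # return the function unchanged so it stays directly callable
    return decorator


@register_verifier("ends_with_answer")
def ends_with_answer(output: str, expected: str) -> bool:
    """Custom check: the model output must end with the expected answer."""
    return output.strip().endswith(expected.strip())


def resolve(name: str) -> Callable[[str, str], bool]:
    """Look up a verifier by the name a config would reference."""
    try:
        return VERIFIERS[name]
    except KeyError:
        raise KeyError(f"unknown verifier {name!r}; known: {sorted(VERIFIERS)}") from None


check = resolve("ends_with_answer")
```

With a pattern like this, adding a component means registering a function under a name, and a config only ever refers to that name.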
- Tinker-compatible backends - train on anything that exposes the Tinker training server: Fireworks, TML, SkyRL, … one line in the config switches provider.
- Algorithms - any recipe under the sun (SFT, RL, distillation, or your own), LoRA on any model - defined by overriding a single build_batch().
- Agent harnesses - train a model inside your own multi-turn harness (any harbor BaseAgent) so it learns to use your tools - the way Opus is trained inside the Claude Code harness.
- Transforms - turn any raw data into the standard data-format rows; chain built-ins or register your own.
- Verifiers - the reward / correctness check for RL and eval (exact_match, contains, LLM-judge, or custom).
- Metrics - score rollouts however you measure success: pass@1, pass@3, cost, latency, your own.
- Validation & evaluation - score against benchmarks via metadata.benchmark: run_every: N scores in-loop every N steps (omit it to score once after training); the split tag (val/test) just separates the metrics.
- Inference clients - how eval / RL query a model to generate.
- Data & log stores - where data is read and where metrics/artifacts are written (local files, the dashboard, TensorBoard).
- Callbacks - hook the training lifecycle for logging, early stopping, or dashboard upload.
- Deployers - ship the trained adapter (e.g. Fireworks) with one line.
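As a rough illustration of how the verifiers and metrics named above fit together, the sketch below grades rollouts with two simple verifiers and aggregates them into pass@k. The function bodies are assumptions: the built-in exact_match and contains may normalise text differently, and pass@k is computed here as the fraction of tasks where any of the first k rollouts passes, which is one common definition and not necessarily the one the SDK uses.

```python
from typing import Callable, Dict, List, Sequence


def exact_match(output: str, expected: str) -> bool:
    """Pass only if the output equals the expected answer, ignoring outer whitespace."""
    return output.strip() == expected.strip()


def contains(output: str, expected: str) -> bool:
    """Pass if the expected answer appears anywhere in the output."""
    return expected.strip() in output


def pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k rollouts passed."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        return 0.0
    solved = sum(any(attempts[:k]) for attempts in results)
    return solved / len(results)


def grade(rollouts: Dict[str, List[str]],
          verifier: Callable[[str, str], bool]) -> List[List[bool]]:
    """Apply a verifier to every rollout; keys are expected answers."""
    return [[verifier(out, expected) for out in outs]
            for expected, outs in rollouts.items()]


# Two toy tasks with three rollouts each (made-up data for illustration).
rollouts = {
    "4": ["5", "4", "4"],
    "Paris": ["It is Paris.", "Paris", "Lyon"],
}

strict = grade(rollouts, exact_match)
lenient = grade(rollouts, contains)
```

On this toy data the choice of verifier changes the score: under exact_match neither first rollout passes, while contains accepts "It is Paris." on the first try. That is why the verifier and the metric are kept as separate plug-in points.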
Next steps
🚀 Quickstart
Train your first model in five minutes - no GPU, no network.
📦 Installation
Install evsys-sdk and the optional backends.
📐 Architecture
A guided tour of the five layers and the eight registries.
🔁 Continual learning
Chain a model across datasets - each stage builds on the last stage's weights.