Design & Architecture
Layout, protocols, and the rationale behind the SDK.
evsys-sdk - design notes
Goals
- Single declarative YAML drives a full experiment (data → train → eval), so an evolutionary algorithm can mutate it without writing Python.
- Modular - adding a new algorithm/verifier/metric is a decorator + a Pydantic Config class. No library fork.
- Backend-pluggable - same YAML can run locally on TRL or remotely on Tinker; backends are interchangeable.
- Decoupled storage - the library imports zero Supabase code by default; Supabase is an optional adapter.
Why protocols, not ABCs
PEP 544 protocols mean any class with the right methods satisfies the contract,
with no inheritance from us. This is critical for third-party extensions: if you
have to subclass evsys_sdk.algorithms.BaseAlgorithm, you've imported the world.
With protocols, your MyDPO class is just plain Python.
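As a minimal sketch of the structural-typing idea, the snippet below defines a stand-in protocol and a class that satisfies it without inheriting from anything. The `Algorithm` protocol here and its `train(ctx)` signature are assumptions for illustration; the SDK's real protocol may declare different methods.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Algorithm(Protocol):
    """Hypothetical contract: any object with a train() method qualifies."""

    def train(self, ctx: Any) -> dict: ...


class MyDPO:
    """Third-party class: no SDK import, no base class."""

    def train(self, ctx: Any) -> dict:
        return {"status": "ok", "algorithm": "my_dpo"}


class NotAnAlgorithm:
    pass


# isinstance() against a runtime_checkable Protocol only checks that the
# method exists; it does not validate the signature.
is_algo = isinstance(MyDPO(), Algorithm)
is_not_algo = isinstance(NotAnAlgorithm(), Algorithm)
result = MyDPO().train(ctx=None)
```

Static type checkers apply the same rule at check time, so `MyDPO` can live in a package that never imports the SDK.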
Why a registry per kind
Each extension point has its own registry (_algorithms, _verifiers, …).
Reasons:
- The YAML loader knows which registry to look a kind: up in, based on the surrounding context. No string-prefix tricks.
- schema_for(kind, name) exposes the per-extension JSON schema, which is exactly what an evolution algorithm needs to mutate the YAML safely.
- Per-extension entry-point groups mean external packages declare their contributions cleanly.
YAML schema
Strict Pydantic v2: every field has a type, extra='forbid' everywhere, no
defaults that hide misspellings. The kind: discriminator selects which
registered class's Config validates the corresponding params: block.
The matrix: shorthand is a convenience that expands at load time into
runs: entries; the result has the same runs[] shape, so the runner doesn't care.
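One plausible way to expand a matrix is a Cartesian product over the listed values. The sketch below assumes a matrix shaped as parameter name to list of values, and a base run dict with a `params` block; both shapes are assumptions, not the SDK's documented schema.

```python
from itertools import product


def expand_matrix(base: dict, matrix: dict[str, list]) -> list[dict]:
    """Expand parameter lists into one run dict per combination."""
    keys = list(matrix)
    runs = []
    for values in product(*(matrix[k] for k in keys)):
        run = dict(base)
        # Matrix values override any same-named base params.
        run["params"] = {**base.get("params", {}), **dict(zip(keys, values))}
        runs.append(run)
    return runs


runs = expand_matrix(
    base={"kind": "sft", "params": {"epochs": 1}},
    matrix={"lora_rank": [8, 16], "learning_rate": [1e-4, 3e-4]},
)
```

Two values for each of two parameters yield four runs, each a plain run dict that a runner can consume without knowing a matrix was involved.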
Lifecycle of a run
run_experiment(cfg) → for each RunConfig:
- Build data_store, log_store from top-level specs.
- Read raw rows via data_store.
- Apply data.transforms[] in order.
- Build backend, call backend.prepare(model=..., run_dir=...) → handles dict.
- Build algorithm with params.
- Construct RunContext carrying data_store, log_store, backend, extras.
- algorithm.train(ctx) → RunResult.
- backend.teardown(handles).
- Best-effort eval (skips on failure; eval errors don't fail the run).
- Persist run_result.json.
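The lifecycle above can be sketched with stub objects. This is not the SDK's runner: `FakeBackend`, `FakeAlgorithm`, `run_one`, and the simplified `RunContext` are hypothetical, stores and persistence are omitted, and running teardown in a `finally` block is an assumption about how cleanup is guaranteed.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RunContext:
    rows: list
    backend: Any
    handles: dict
    extras: dict = field(default_factory=dict)


class FakeBackend:
    def __init__(self) -> None:
        self.torn_down = False

    def prepare(self, model: str, run_dir: str) -> dict:
        return {"model": model, "run_dir": run_dir}

    def teardown(self, handles: dict) -> None:
        self.torn_down = True


class FakeAlgorithm:
    def train(self, ctx: RunContext) -> dict:
        return {"steps": len(ctx.rows), "eval": None}


def run_one(rows, transforms, backend, algorithm, evaluate):
    for transform in transforms:  # apply transforms in order
        rows = [transform(r) for r in rows]
    handles = backend.prepare(model="stub-model", run_dir="/tmp/run")
    try:
        result = algorithm.train(RunContext(rows, backend, handles))
    finally:
        backend.teardown(handles)  # release resources even if train() raises
    try:
        result["eval"] = evaluate(result)  # best-effort: failures are recorded
    except Exception as exc:
        result["eval_error"] = str(exc)
    return result


def failing_eval(result):
    raise RuntimeError("eval service unavailable")


backend = FakeBackend()
result = run_one(
    rows=[{"text": " a "}, {"text": "b"}],
    transforms=[lambda r: {"text": r["text"].strip()}],
    backend=backend,
    algorithm=FakeAlgorithm(),
    evaluate=failing_eval,
)
```

The run still returns a result even though the eval raised, which mirrors the "eval errors don't fail the run" rule.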
Researcher-project layout
Every project the training-decider agent bootstraps follows the same shape so
scripts, benchmarks, and extensions land in predictable places. Scaffold a new
project with evsys init-project <name>; the tree is:
<project>/
├── pyproject.toml # declares src/ as an importable pkg
├── README.md
├── data/
│ ├── raw/ # untouched source dumps (gitignored)
│ ├── fetch/ # Python scripts that populate raw/
│ ├── process/ # raw → datasets/<name>/v<N>/
│ ├── datasets/ # versioned train/test JSONL
│ │ └── <name>/v1/{train,test}.jsonl + metadata.yaml
│ ├── benchmark/ # harbor-format TEST suites (final goal)
│ │ └── <name>/tasks.jsonl + metadata.yaml [+ images/ + raw/]
│ └── validation/ # harbor-format VALIDATION sets (in-loop)
│ └── <name>/tasks.jsonl + metadata.yaml
├── src/ # project-specific SDK extensions
│ ├── __init__.py # imports verifiers/metrics/transforms
│ ├── verifiers.py # @register_verifier(_fn) classes/fns
│ ├── metrics.py # @register_metric
│ └── transforms.py # @register_transform
├── experiments/
│ └── <yyyymmdd>_<slug>/ # `evsys new-experiment <slug>`
│ ├── config.yaml # ExperimentConfig - model, data, sweep, metadata
│ └── run.py # Experiment.from_yaml("config.yaml").run()
└── .evsys/ # local mirror + checkpoints + log_store output
Each config.yaml is self-contained; there is no project-root yaml that
experiments inherit from. The experiment-level fields (hypothesis, tags,
success_metric, benchmark) live under metadata: and are read by
Experiment.run():
metadata:
  hypothesis: "Higher LoRA rank improves pass@1"
  tags: [sft, qwen3_4b]
  success_metric: pass_rate
  benchmark: # TEST set - scored once, after training
    id: <dashboard benchmark id from `evsys benchmark upload`> # preferred
    # name: composio_eval_v2 # alt: resolves to the latest version's id
    # path: data/benchmark/composio_eval_v2 # offline / dev fallback
    breakdown_keys: [toolkit]
Referencing data by id / name (preferred over local paths)
Stored experiment scripts should reference data by dashboard id (or
name, which resolves to the latest version's id) rather than a local path.
The SDK pulls the rows once into the local .evsys/ workspace cache and
trains/scores from there, so a committed script is portable and doesn't depend
on anyone's local file layout. path stays as an offline / dev fallback.
run:
  data:
    dataset_id: <dashboard dataset id> # or: dataset_name: sft_overdose
    transforms: [...] # applied to the pulled rows
    # source_kind/path are ignored when dataset_id/name is set
Same for the benchmark block above (id / name / path). Under the hood
both go through Workspace.pull_dataset / pull_benchmark, which cache to
.evsys/<datasets|benchmarks>/<id>.jsonl (version-immutable, so a given
id never changes).
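The pull-once-then-cache behavior can be sketched as follows. `pull_cached` and `fake_fetch` are hypothetical helpers written for this example; they are not the real `Workspace.pull_dataset` signature, and only the `.evsys/<datasets|benchmarks>/<id>.jsonl` path layout is taken from the text.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def pull_cached(root: Path, kind: str, item_id: str,
                fetch: Callable[[str], list[dict]]) -> list[dict]:
    """Return rows for item_id, fetching only when no cached file exists."""
    cache_file = root / ".evsys" / kind / f"{item_id}.jsonl"
    if not cache_file.exists():
        rows = fetch(item_id)
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text("".join(json.dumps(r) + "\n" for r in rows))
    # Always read from the cache so training sees exactly what is on disk.
    return [json.loads(line) for line in cache_file.read_text().splitlines() if line]


calls = []


def fake_fetch(item_id: str) -> list[dict]:
    calls.append(item_id)  # record remote calls to show the cache hit
    return [{"id": 1, "prompt": "hi"}]


with tempfile.TemporaryDirectory() as tmp:
    first = pull_cached(Path(tmp), "datasets", "ds_123", fake_fetch)
    second = pull_cached(Path(tmp), "datasets", "ds_123", fake_fetch)
```

Skipping any freshness check is safe only because ids are version-immutable; a mutable id would need invalidation.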
Benchmark (test) vs. validation (in-loop)
Both test and validation are entries in the same metadata.benchmark list;
tags declare the role and run_every declares the cadence.
A test benchmark is the final goal: scored once after training (no
run_every), tagged [test]. Model selection must never key off it.
A validation benchmark is scored during training to drive model
selection - same harbor format, tagged [val] with a positive
run_every:
metadata:
  benchmark:
    - name: val_set # in-loop - scored every N steps
      id: <benchmark id> # or path: data/benchmark/<name>
      tags: [val]
      run_every: 50 # score every 50 training steps off the live model
      engine: harbor
    - name: full_test # post-training only
      id: <benchmark id>
      tags: [test]
      engine: harbor
Every run_every steps the training loop scores the live model on the
[val] entry and records val/<name>/<metric> curves under
split="val" - separate from the post-training [test] eval.
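The cadence rule can be illustrated with a toy loop. `train_with_validation` and the `score` callback are hypothetical; the sketch only shows how `tags` and `run_every` could select which entries are scored in-loop and how the `val/<name>/<metric>` keys are formed.

```python
def train_with_validation(total_steps, benchmarks, score):
    """Record val/<name>/<metric> every run_every steps for [val] entries."""
    curves = {}
    val_entries = [
        b for b in benchmarks
        if "val" in b["tags"] and b.get("run_every", 0) > 0
    ]
    for step in range(1, total_steps + 1):
        # ... one training step would happen here ...
        for bench in val_entries:
            if step % bench["run_every"] == 0:
                for metric, value in score(bench["name"], step).items():
                    key = f"val/{bench['name']}/{metric}"
                    curves.setdefault(key, []).append((step, value))
    return curves


benchmarks = [
    {"name": "val_set", "tags": ["val"], "run_every": 50},
    {"name": "full_test", "tags": ["test"]},  # never scored in-loop
]
curves = train_with_validation(
    total_steps=120,
    benchmarks=benchmarks,
    score=lambda name, step: {"pass_rate": step / 1000},
)
```

With 120 steps and `run_every: 50`, the validation entry is scored at steps 50 and 100, and the test entry never appears in the curves.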
OOP entry points
The high-level path is one class with declarative inputs:
from evsys_sdk import Experiment
import src # registers project verifiers / metrics / transforms
Experiment.from_yaml("config.yaml").run()
Experiment owns:
- creating the dashboard experiment + per-arm run records;
- expanding matrix / Sweep into one RunConfig per arm;
- per-arm failure isolation (one arm raising doesn't kill the sweep);
- post-train benchmark scoring via Benchmark.score(client);
- auto-forwarding the local metrics.jsonl to the store (no manual backfill_step_metrics call);
- aggregating best_score + conclusion and finalizing the experiment.
The legacy run_experiment(cfg) is still the inner runner that
Experiment calls per arm - bypass Experiment only when you need to do
training without dashboard bookkeeping.
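Per-arm failure isolation and best-score aggregation can be sketched as below. `run_sweep` and `fake_run_arm` are hypothetical, and taking the maximum as `best_score` assumes a higher-is-better metric; the real `Experiment` may aggregate differently.

```python
def run_sweep(arms, run_arm):
    """Run every arm; record failures instead of aborting the sweep."""
    results = []
    for arm in arms:
        try:
            results.append({"arm": arm["name"], "status": "ok", "score": run_arm(arm)})
        except Exception as exc:
            results.append({"arm": arm["name"], "status": "failed", "error": str(exc)})
    scores = [r["score"] for r in results if r["status"] == "ok"]
    best_score = max(scores) if scores else None
    return results, best_score


def fake_run_arm(arm):
    # Simulate one arm blowing up mid-sweep.
    if arm["lora_rank"] > 32:
        raise MemoryError("out of memory")
    return arm["lora_rank"] / 100


arms = [
    {"name": "r8", "lora_rank": 8},
    {"name": "r64", "lora_rank": 64},
    {"name": "r16", "lora_rank": 16},
]
results, best_score = run_sweep(arms, fake_run_arm)
```

The failing middle arm is recorded with its error, and the arms after it still run.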
What's NOT in v0.1
- Supabase adapters (planned: evsys_sdk.adapters.supabase).
- Evolutionary loop (kept in backend/api/experiments/loop.py for now).
- Distributed launchers (Modal, Slurm).
- Streaming / checkpoint resumption beyond what tinker_cookbook provides.
These all extend the same protocol surface and should arrive incrementally without breaking changes to the public API.
Backwards compatibility
- version: 1 in the YAML root is currently advisory; bumped on schema breaks.
- Public symbols re-exported from evsys_sdk/__init__.py are the stable surface. Anything else may move.