SFT - Supervised Fine-Tuning
A complete, run-it-yourself SFT walkthrough - JSONL data, a transform, the built-in sft algorithm, in-loop validation, and local logging, explained line by line.
This is a full walkthrough for someone who has never used the SDK. We'll train a
small model to pick the right tool for a user's query - emitting
<answer>TOOL_SLUG</answer> - and explain every moving part: the raw data,
the transform that shapes it, the built-in algorithm, in-loop validation, and
where the logs land.
Everything here is a real, runnable example in the repo at
examples/sft_walkthrough/. You can run it directly (see step 6).
1. The raw data - JSONL rows you point the config at
Training data is a JSONL file: one JSON object per line. You bring whatever shape your data is in - here, each line is a query and the tool that answers it:
{"query": "save a contact from this email", "tool_slug": "OUTLOOK_CREATE_CONTACT"}
{"query": "edit a slack message I sent earlier", "tool_slug": "SLACK_UPDATES_A_SLACK_MESSAGE"}
{"query": "create a new event on my calendar", "tool_slug": "GOOGLECALENDAR_CREATE_EVENT"}
You don't load this yourself. You point the config at the file - source_kind: jsonl tells the SDK it's a JSONL file, and path is where to find it (relative
to where you run from, or absolute):
data:
  source_kind: jsonl
  path: examples/sft_walkthrough/data/train.jsonl
The SDK reads the file through a data store and hands the rows to the next
stage. At this point they're still raw dicts - {"query": ..., "tool_slug": ...}.
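To make "raw dicts" concrete, here is a minimal sketch of what reading a JSONL source amounts to, using only the standard library. The helper name read_jsonl is hypothetical and is not part of the SDK; the SDK's data store is not shown in this walkthrough and may do more (streaming, validation, caching).

```python
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line (hypothetical helper, not the SDK loader)."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Two rows copied from the training file shown in step 1.
sample = [
    '{"query": "save a contact from this email", "tool_slug": "OUTLOOK_CREATE_CONTACT"}',
    '{"query": "create a new event on my calendar", "tool_slug": "GOOGLECALENDAR_CREATE_EVENT"}',
]
rows = list(read_jsonl(sample))
print(rows[0]["tool_slug"])
```

Because a file object iterates over its lines, the same helper works on an open file handle as well as on a list of strings.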
2. The transform - shape raw rows into a chat conversation
An algorithm doesn't train on arbitrary dicts. SFT trains on a
ChatMessagesRow - a conversation of {role, content} messages. A
transform bridges the two.
A transform is just a function with one contract: rows -> rows. You list
transforms in the config and they run in order. The built-in jsonl_to_chat
renders each raw dict into a chat turn using templates you provide:
transforms:
  - kind: jsonl_to_chat
    params:
      system_prompt: "You select the single best tool for the user's query."
      user_template: "Query: {query}\nReply with the tool slug inside <answer></answer>."
      assistant_template: "<answer>{tool_slug}</answer>"
So this raw row:
{"query": "edit a slack message I sent earlier", "tool_slug": "SLACK_UPDATES_A_SLACK_MESSAGE"}
becomes this ChatMessagesRow:
{ "messages": [
  { "role": "system", "content": "You select the single best tool for the user's query." },
  { "role": "user", "content": "Query: edit a slack message I sent earlier\nReply with the tool slug inside <answer></answer>." },
  { "role": "assistant", "content": "<answer>SLACK_UPDATES_A_SLACK_MESSAGE</answer>" }
] }
The conversation carries only data - it does not say which tokens to learn from. That's the algorithm's decision (next step).
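The rendering step can be sketched in a few lines of plain Python. This assumes the templates use str.format-style placeholders, which the {query} syntax suggests but the walkthrough does not confirm; render_chat_row is a hypothetical name, not the SDK's implementation of jsonl_to_chat.

```python
def render_chat_row(row: dict, system_prompt: str, user_template: str,
                    assistant_template: str) -> dict:
    """Render one raw dict into a {"messages": [...]} row (illustrative sketch)."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            # Assumption: placeholders such as {query} are filled from the row's keys.
            {"role": "user", "content": user_template.format(**row)},
            {"role": "assistant", "content": assistant_template.format(**row)},
        ]
    }


raw = {
    "query": "edit a slack message I sent earlier",
    "tool_slug": "SLACK_UPDATES_A_SLACK_MESSAGE",
}
chat = render_chat_row(
    raw,
    system_prompt="You select the single best tool for the user's query.",
    user_template="Query: {query}\nReply with the tool slug inside <answer></answer>.",
    assistant_template="<answer>{tool_slug}</answer>",
)
print(chat["messages"][2]["content"])  # <answer>SLACK_UPDATES_A_SLACK_MESSAGE</answer>
```

Under this assumption, a row missing a key that a template names would raise KeyError, which is one reason to validate the config and a few rows before a paid run.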
Registering your own transform
When no built-in fits, write your own and reference it by kind - no SDK fork:
from pydantic import BaseModel
from evsys_sdk import register_transform

@register_transform("strip_pii")  # ← the name you use in the config
class StripPII:
    name = "strip_pii"

    class Config(BaseModel, extra="forbid"):
        field: str = "query"

    def __init__(self, **kw):
        self.cfg = self.Config(**kw)

    def __call__(self, rows):  # ← the rows -> rows contract
        for r in rows:
            # redact() is your own function; it is not defined in this snippet
            r[self.cfg.field] = redact(r[self.cfg.field])
        return rows
Then add { kind: strip_pii, params: { field: query } } to transforms. The
SDK finds it in the registry by that name. See Plugins.
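The snippet above calls redact without defining it, and "they run in order" is easy to picture with a small stand-alone sketch. Everything below is hypothetical and independent of the SDK: redact masks only email addresses (real PII handling needs far more), and run_transforms simply illustrates the rows -> rows chaining, not the SDK's registry or runner.

```python
import re
from typing import Callable, List

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    """Mask email addresses only (illustrative; not a complete PII scrubber)."""
    return EMAIL_RE.sub("[EMAIL]", text)


def strip_pii(rows: List[dict], field: str = "query") -> List[dict]:
    """rows -> rows: return copies with one field redacted, leaving the input untouched."""
    return [{**row, field: redact(row[field])} for row in rows]


def run_transforms(rows: List[dict],
                   transforms: List[Callable[[List[dict]], List[dict]]]) -> List[dict]:
    """Apply each transform in order, feeding the output of one into the next."""
    for transform in transforms:
        rows = transform(rows)
    return rows


rows = [{"query": "save a contact for bob@example.com", "tool_slug": "OUTLOOK_CREATE_CONTACT"}]
out = run_transforms(rows, [strip_pii])
print(out[0]["query"])  # save a contact for [EMAIL]
```

This sketch returns new dicts instead of mutating in place, so the raw rows stay available for debugging; the in-place version shown earlier satisfies the same contract.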
3. The built-in sft algorithm
You don't implement training - you select a built-in algorithm by kind and
pass its knobs. sft tokenizes each chat row (masking the loss to the assistant
turns) and runs the SDK training loop on the hosted tinker backend:
model:
  name: Qwen/Qwen3.5-4B
backend:
  kind: tinker
algorithm:
  kind: sft
  params:
    learning_rate: 1.0e-4
    max_steps: 2              # bump to 100-500 for a real run
    batch_size: 1
    lora_rank: 1
    supervise: all_assistant  # which turns carry a loss - the ALGORITHM's call
supervise: all_assistant is what makes this supervised - the loss is on the
assistant tokens (the <answer>…</answer>), not the prompt. Note this lives on
the algorithm, not the data: the same chat rows could be used differently by
another algorithm.
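One way to picture "masking the loss to the assistant turns" is as a per-token weight of 1 for assistant tokens and 0 for everything else. The sketch below is an illustration under loud assumptions: whitespace splitting stands in for the real tokenizer, chat-template special tokens are ignored, and averaging the loss over supervised tokens is a common convention that the SDK may or may not follow. The function names are hypothetical.

```python
from typing import List, Tuple


def build_loss_mask(messages: List[dict]) -> Tuple[List[str], List[int]]:
    """Return (tokens, weights); weight 1 marks tokens that contribute to the loss."""
    tokens: List[str] = []
    weights: List[int] = []
    for msg in messages:
        piece = msg["content"].split()  # stand-in for a real tokenizer
        tokens.extend(piece)
        weights.extend([1 if msg["role"] == "assistant" else 0] * len(piece))
    return tokens, weights


def masked_mean(values: List[float], weights: List[int]) -> float:
    """Average values over positions with weight 1 (assumed normalization)."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total if total else 0.0


messages = [
    {"role": "system", "content": "You select the single best tool for the user's query."},
    {"role": "user", "content": "Query: edit a slack message I sent earlier"},
    {"role": "assistant", "content": "<answer>SLACK_UPDATES_A_SLACK_MESSAGE</answer>"},
]
tokens, weights = build_loss_mask(messages)
supervised = [t for t, w in zip(tokens, weights) if w == 1]
print(supervised)  # ['<answer>SLACK_UPDATES_A_SLACK_MESSAGE</answer>']
```

The prompt tokens still appear in the model's context; they just carry zero weight, so the gradient comes only from predicting the answer.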
4. Benchmarks & validation - score the model during training
To know if training is working, you score the model against a held-out
benchmark while it trains. A benchmark is a directory containing a
tasks.jsonl - each task has an instruction (the full prompt) and a
verifier that says whether an answer is correct:
{"task_id": "val_0", "instruction": "Query: schedule a meeting on my calendar\nReply with the tool slug inside <answer></answer>.", "verifier": {"kind": "in_process", "fn_name": "contains", "expected": "<answer>GOOGLECALENDAR_CREATE_EVENT</answer>"}}
You attach it under metadata.benchmark and give it a run_every so it's
scored in-loop, repeatedly, every N steps:
metadata:
  benchmark:
    - name: val
      path: examples/sft_walkthrough/data/val   # the DIRECTORY (holds tasks.jsonl)
      run_every: 1                              # score every step. Use 50 / 100 for longer runs.
      metrics: [pass@1]                         # fraction of tasks the model gets right
      split: val                                # tags the metrics as validation (vs a `test` benchmark)
      max_tokens: 64
Two independent knobs to understand:
- run_every controls when it's scored. run_every: N scores the benchmark in-loop every N steps during training; omit run_every and it's scored once, after training. (This is true of any entry - a test benchmark can have a run_every and run in-loop too.)
- split (val / test) is just a label - it namespaces the metrics so different benchmarks' scores stay apart in the logs; it does not decide when an entry runs.
pass@1 here = the fraction of tasks whose answer the verifier accepts (contains the right <answer>SLUG</answer>).
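The scoring described above can be sketched without the SDK. The functions below are hypothetical stand-ins: contains mirrors the verifier's fn_name as a plain substring check, pass_at_1 computes the accepted fraction, and scored_steps assumes 1-based steps with scoring whenever the step is a multiple of run_every (the SDK's exact schedule is not documented here). Task val_1 and both model answers are invented for illustration.

```python
from typing import Dict, List


def contains(answer: str, expected: str) -> bool:
    """Accept the answer if the expected string appears anywhere in it."""
    return expected in answer


def pass_at_1(tasks: List[dict], answers: Dict[str, str]) -> float:
    """Fraction of tasks whose single sampled answer the verifier accepts."""
    if not tasks:
        return 0.0
    passed = sum(
        contains(answers.get(task["task_id"], ""), task["verifier"]["expected"])
        for task in tasks
    )
    return passed / len(tasks)


def scored_steps(max_steps: int, run_every: int) -> List[int]:
    """Steps at which an in-loop benchmark would run (assumed 1-based schedule)."""
    return [s for s in range(1, max_steps + 1) if s % run_every == 0]


tasks = [
    {"task_id": "val_0",
     "verifier": {"expected": "<answer>GOOGLECALENDAR_CREATE_EVENT</answer>"}},
    {"task_id": "val_1",  # invented second task
     "verifier": {"expected": "<answer>OUTLOOK_CREATE_CONTACT</answer>"}},
]
answers = {  # invented model outputs
    "val_0": "Sure. <answer>GOOGLECALENDAR_CREATE_EVENT</answer>",
    "val_1": "<answer>SLACK_UPDATES_A_SLACK_MESSAGE</answer>",
}
print(pass_at_1(tasks, answers))  # 0.5
print(scored_steps(200, 50))      # [50, 100, 150, 200]
```

A substring verifier tolerates extra text around the tag, which is forgiving early in training when the model has not yet learned to emit the tag alone.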
5. Local logging - where the metrics show up
Logging is a callback on the experiment. local_logger prints a per-step
line and writes the metrics to disk:
callbacks:
  - kind: local_logger
    params:
      print_every: 1
It writes to <output_dir>/<run_name>/, which the runner sets to
examples/sft_walkthrough/outputs/sft_tool_selection/:
| File | What's in it |
|---|---|
| metrics.jsonl | one row per event - {step, split, metrics}. Train steps are tagged split: "train"; the in-loop validation scores are tagged split: "val" (so you can tell them apart). |
| predictions/val.jsonl | the model's actual answer for each val task each time it's scored |
| summary.md | final status + the per-eval metric lines, written at the end |
A typical metrics.jsonl looks like the two rows below. Each row is {step, split, metrics}; training steps are tagged
split: "train" and the in-loop validation scores are tagged split: "val", so
both live in one file:
{"step": 1, "split": "train", "metrics": {"train_mean_nll": ..., "optim/lr": ...}}
{"step": 1, "split": "val", "metrics": {"val/val/pass@1": ...}}
Training loss and validation score sit side by side, both recorded in-loop.
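Since both splits share one file, reading it back is mostly a matter of grouping rows by the split tag. The sketch below assumes only the {step, split, metrics} row shape described above; group_by_split is a hypothetical helper, and the numeric values in the sample lines are placeholders invented for illustration, not results from a real run.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_split(lines: Iterable[str]) -> Dict[str, List[Tuple[int, dict]]]:
    """Group metrics.jsonl rows into {split: [(step, metrics), ...]}."""
    by_split: Dict[str, List[Tuple[int, dict]]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        by_split[row["split"]].append((row["step"], row["metrics"]))
    return dict(by_split)


# Placeholder values; a real file carries whatever the run produced.
sample = [
    '{"step": 1, "split": "train", "metrics": {"train_mean_nll": 2.5, "optim/lr": 0.0001}}',
    '{"step": 1, "split": "val", "metrics": {"val/val/pass@1": 0.0}}',
    '{"step": 2, "split": "train", "metrics": {"train_mean_nll": 2.0, "optim/lr": 0.0001}}',
    '{"step": 2, "split": "val", "metrics": {"val/val/pass@1": 0.5}}',
]
grouped = group_by_split(sample)
for step, metrics in grouped["val"]:
    print(step, metrics["val/val/pass@1"])
```

To use it on a real run, pass an open handle to the metrics.jsonl file in the run's output directory instead of the sample list.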
6. Run it
Everything above is the file examples/sft_walkthrough/config.yaml. The runner
just loads it and calls the SDK:
# offline - checks every kind/params block, no GPU/network/cost:
evsys validate examples/sft_walkthrough/config.yaml --deep
# real hosted training (needs a Tinker key, charges a small amount):
export TINKER_API_KEY=...
python examples/sft_walkthrough/run.py
On completion it prints:
Status: completed
Artifacts: ['run_dir', 'checkpoint-final', 'state-final', ...]
Conclusion: 1/1 arms completed.
Logs: examples/sft_walkthrough/outputs - metrics.jsonl, predictions/, summary.md, experiment.md
The runner uses Experiment(cfg).run() (not the bare run_experiment) so the
local_logger callback fires and writes the files above - including
experiment.md with the hypothesis and conclusion.
You get a LoRA adapter that emits the tool slug, plus the metrics.jsonl /
predictions/ / summary.md from step 5. Bump max_steps and grow the val set
for a run that actually moves the needle.