EvSys
Concepts

Data

Raw sources → ordered transforms → standardized typed rows that carry only data.

The data surface standardizes anything into a few typed shapes, then hands those to the algorithm. Tokenization and supervision live below this boundary - the typed rows carry only data; the algorithm decides what's a target token.

  • You define: a DataConfig (source + transforms), and a custom Transform when the built-ins don't fit. You pick which typed format the algorithm consumes.
  • The SDK handles: loading, pull/cache-by-id with lineage (Workspace), running transforms in order, and the strict parse_rows conversion.
ClassImplementable?Contract
DataConfigauthor in YAMLsource + transforms[]
Transformyes (@register_transform)__call__(rows) -> rows + Config
ChatMessagesRow / PromptExample / HarborTaskchoose shapedata only - no supervision encoded
DataStorerarelyread_jsonl / write_jsonl / read_json / write_json / exists / list
Workspace / MaterializedDatasetno (SDK)pull / cache / lineage

The dataset-format rule

Data formats hold only data. The algorithm decides which tokens are targets - the same ChatMessagesRow can be used for SFT or as an RL prompt.

Next: Algorithms · data types reference.