Data stores
The read/write layer - where the SDK reads inputs from and writes outputs to.
A data store is the SDK's abstract file layer: every time an algorithm reads
a dataset or writes rendered rows / predictions, it goes through a DataStore.
The default is your local filesystem, so you rarely think about it - you make
your own when data lives somewhere else (a database, object storage, an
in-process cache for tests).
The contract
The contract is evsys_sdk.protocols.DataStore (a typing.Protocol, so you
satisfy it by implementing the methods - no subclassing). An implementation
declares one ClassVar and six methods:
- `name: ClassVar[str]` - the registry key, the string you put in YAML as `kind`. (Data stores have no `Config` ClassVar requirement in the protocol, but the built-ins carry one - see below.)
- `read_jsonl(self, path: str) -> list[dict[str, Any]]` - read a JSONL file at `path` and return its rows as a list of dicts, one dict per line. This is the primary way datasets are loaded.
- `write_jsonl(self, path: str, rows: Iterable[dict[str, Any]]) -> None` - write an iterable of dicts to `path`, one JSON object per line. Used for rendered training data, predictions, and other row-shaped outputs. Returns nothing.
- `read_json(self, path: str) -> Any` - read a single JSON document at `path` and return the parsed value (any JSON type - dict, list, scalar).
- `write_json(self, path: str, value: Any) -> None` - serialize `value` to JSON and write it to `path`. Used for manifests, summaries, config snapshots.
- `exists(self, path: str) -> bool` - return `True` if something is present at `path`, `False` otherwise. Callers use this to skip work or guard reads.
- `list(self, prefix: str) -> list[str]` - return the paths under `prefix`. For the local store, a directory prefix is walked recursively and a glob pattern is expanded; paths come back relative to the store root.
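
To make the "no subclassing" point concrete, here is a minimal sketch. DataStoreLike is a stand-in that copies the signatures listed above; it is not the SDK's own evsys_sdk.protocols.DataStore. DictStore and load_or_empty are hypothetical names. The runtime_checkable decorator is added here only so that isinstance can show the match; this page does not say whether the SDK's protocol is runtime-checkable.

```python
import json
from typing import Any, ClassVar, Iterable, Protocol, runtime_checkable


@runtime_checkable
class DataStoreLike(Protocol):
    """Stand-in with the signatures listed above; not the SDK's own class."""

    name: ClassVar[str]

    def read_jsonl(self, path: str) -> list[dict[str, Any]]: ...
    def write_jsonl(self, path: str, rows: Iterable[dict[str, Any]]) -> None: ...
    def read_json(self, path: str) -> Any: ...
    def write_json(self, path: str, value: Any) -> None: ...
    def exists(self, path: str) -> bool: ...
    def list(self, prefix: str) -> list[str]: ...


class DictStore:  # hypothetical; note there is no base class
    name: ClassVar[str] = "dict"

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}  # path -> serialized text

    def _read(self, path: str) -> str:
        if path not in self._docs:
            raise FileNotFoundError(path)
        return self._docs[path]

    def read_jsonl(self, path: str) -> list[dict[str, Any]]:
        return [json.loads(line) for line in self._read(path).splitlines() if line.strip()]

    def write_jsonl(self, path: str, rows: Iterable[dict[str, Any]]) -> None:
        self._docs[path] = "".join(json.dumps(row) + "\n" for row in rows)

    def read_json(self, path: str) -> Any:
        return json.loads(self._read(path))

    def write_json(self, path: str, value: Any) -> None:
        self._docs[path] = json.dumps(value)

    def exists(self, path: str) -> bool:
        return path in self._docs

    def list(self, prefix: str) -> list[str]:
        return sorted(p for p in self._docs if p.startswith(prefix))


def load_or_empty(store: DataStoreLike, path: str) -> list[dict[str, Any]]:
    """Guard a read with exists() instead of catching the error."""
    return store.read_jsonl(path) if store.exists(path) else []


store = DictStore()
store.write_jsonl("runs/preds.jsonl", [{"id": 1, "label": "a"}, {"id": 2, "label": "b"}])
rows = load_or_empty(store, "runs/preds.jsonl")
missing = load_or_empty(store, "runs/none.jsonl")
print(isinstance(store, DataStoreLike), len(rows), len(missing))
```

A static type checker makes the same judgment without the decorator: any object with these members is accepted wherever the protocol is expected.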
Use a built-in
The local filesystem store is the default; you almost never name it explicitly. When you do, it looks like this:

```yaml
data_store:
  kind: local
  params:
    root: ./data  # relative paths resolve against this; default "."
```

| Built-in | What it does / where it writes |
|---|---|
| local | LocalDataStore (src/evsys_sdk/data_stores/local.py). Reads/writes JSONL and JSON on the filesystem, no network. Relative paths resolve against root (default "."); absolute paths pass through. write_* create parent dirs; list walks a directory recursively or expands a glob, returning paths relative to root. |
| in_memory | InMemoryDataStore (src/evsys_sdk/data_stores/in_memory.py). Keeps JSONL and JSON in two dicts keyed by path - nothing touches disk. read_* raise FileNotFoundError for unknown paths. For tests. |
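
The local store's path rules can be sketched with pathlib. This illustrates the behavior described in the table and is not the code in local.py; resolve, write_text, and list_paths are hypothetical helper names, and the sketch assumes every listed file lives under root.

```python
import glob
import tempfile
from pathlib import Path


def resolve(root: str, path: str) -> Path:
    """Relative paths resolve against root; absolute paths pass through."""
    p = Path(path)
    return p if p.is_absolute() else Path(root) / p


def write_text(root: str, path: str, text: str) -> None:
    """Create missing parent directories before writing."""
    dest = resolve(root, path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(text, encoding="utf-8")


def list_paths(root: str, prefix: str) -> list[str]:
    """Walk a directory recursively, or expand a glob; return paths relative to root."""
    target = resolve(root, prefix)
    if target.is_dir():
        files = [f for f in target.rglob("*") if f.is_file()]
    else:
        matches = glob.glob(str(target), recursive=True)
        files = [Path(m) for m in matches if Path(m).is_file()]
    return sorted(f.relative_to(root).as_posix() for f in files)


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_root:
    write_text(demo_root, "runs/a/preds.jsonl", '{"id": 1}\n')
    write_text(demo_root, "runs/summary.json", "{}")
    print(list_paths(demo_root, "runs"))
    print(list_paths(demo_root, "runs/*/preds.jsonl"))
```

Returning paths relative to root means a result from list can be handed straight back to a read_* call on the same store.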
Create your own
Implement the six methods, carry name + a Config Pydantic model
(extra="forbid" so YAML typos fail loudly), and decorate with
@register_data_store("<name>"):

```python
import json
from typing import Any, ClassVar, Iterable

from pydantic import BaseModel, ConfigDict

from evsys_sdk.registry import register_data_store


class S3DataStoreConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")  # unknown YAML keys raise

    bucket: str
    prefix: str = ""


@register_data_store("s3")
class S3DataStore:
    name: ClassVar[str] = "s3"  # the YAML `kind`
    Config: ClassVar[type] = S3DataStoreConfig

    def __init__(self, *, bucket: str, prefix: str = "") -> None:
        import boto3  # imported lazily; only needed when this store is used

        self._s3 = boto3.client("s3")
        self.bucket = bucket
        self.prefix = prefix

    def read_jsonl(self, path: str) -> list[dict[str, Any]]:
        body = self._s3.get_object(Bucket=self.bucket, Key=self.prefix + path)["Body"].read()
        return [json.loads(line) for line in body.splitlines() if line.strip()]

    def write_jsonl(self, path: str, rows: Iterable[dict[str, Any]]) -> None:
        # One JSON object per line, each line newline-terminated.
        body = "".join(json.dumps(r) + "\n" for r in rows).encode()
        self._s3.put_object(Bucket=self.bucket, Key=self.prefix + path, Body=body)

    def read_json(self, path: str) -> Any:
        return json.loads(self._s3.get_object(Bucket=self.bucket, Key=self.prefix + path)["Body"].read())

    def write_json(self, path: str, value: Any) -> None:
        self._s3.put_object(Bucket=self.bucket, Key=self.prefix + path, Body=json.dumps(value).encode())

    def exists(self, path: str) -> bool:
        from botocore.exceptions import ClientError

        try:
            self._s3.head_object(Bucket=self.bucket, Key=self.prefix + path)
            return True
        except ClientError as err:
            # Only "not found" means absent; permission and other errors should surface.
            if err.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
                return False
            raise

    def list(self, prefix: str) -> list[str]:
        # Paginate: a single list_objects_v2 call returns at most 1000 keys.
        pages = self._s3.get_paginator("list_objects_v2").paginate(
            Bucket=self.bucket, Prefix=self.prefix + prefix
        )
        keys = (o["Key"] for page in pages for o in page.get("Contents", []))
        # Strip the store prefix so results can be passed back to read_*.
        return sorted(k[len(self.prefix):] for k in keys)
```

Then reference it by kind in YAML - no SDK edit:

```yaml
data_store:
  kind: s3
  params:
    bucket: my-research-bucket
    prefix: datasets/
```

Ship it in a package
To make your store importable from any project without copying code, expose it
as a Python entry point under the group evsys_sdk.data_stores in your
package's pyproject.toml:

```toml
[project.entry-points."evsys_sdk.data_stores"]
s3 = "my_pkg.stores:S3DataStore"
```

On import, evsys_sdk walks that group and imports the target, running its
@register_data_store decorator - your kind is available everywhere with no
fork.