data_processing
Turn :class:`~evsys_sdk.training.env.TrajectoryGroup`\s into
list[tinker.Datum] ready for IS-loss training.
Port of tinker_cookbook.rl.data_processing.compute_advantages +
assemble_training_data. Both are pure functions - no I/O, no
tinker_cookbook dependency.
The two-step pipeline:
- :func:`compute_advantages` - group-normalize rewards within each :class:`TrajectoryGroup` (subtract the group mean → reduces variance without bias).
- :func:`assemble_training_data` - flatten to a list[tinker.Datum] with loss_fn_inputs = target_tokens + per-position advantages (0 off-completion → masks the loss) + logprobs from the sampler (the "old" logprobs for the IS ratio). These are the only keys tinker's importance_sampling loss accepts.
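To illustrate why a zero advantage masks a position, here is a minimal sketch. It assumes the importance-sampling objective has the common form "sum over positions of -exp(new_logprob - old_logprob) * advantage". The helper name is_loss is hypothetical and is not part of tinker or this module; tinker's actual implementation is not shown on this page.

```python
import math


def is_loss(new_logprobs, old_logprobs, advantages):
    """Hypothetical stand-in for an importance-sampling loss.

    Assumed form: sum over positions of -ratio * advantage,
    where ratio = exp(new_logprob - old_logprob).
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)
        total += -ratio * adv
    return total


# Two prompt positions (advantage 0) followed by two completion positions.
old = [0.0, 0.0, -1.0, -2.0]
adv = [0.0, 0.0, 0.5, 0.5]

loss_a = is_loss([-5.0, -3.0, -1.0, -2.0], old, adv)
# Changing only the prompt-position logprobs leaves the loss unchanged,
# because those positions are multiplied by a zero advantage.
loss_b = is_loss([-9.0, -9.0, -1.0, -2.0], old, adv)
```

Under that assumed form, no separate mask or weights key is needed: the advantages vector carries both the learning signal and the mask.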
attribute logger = logging.getLogger(__name__)
attribute __all__ = ['DatumMetadata', 'assemble_training_data', 'compute_advantages', 'compute_trajectory_metrics']
func compute_advantages(trajectory_groups, *, normalize=True) -> list[list[float]]
Return advantages[group][traj] - per-trajectory scalar advantages.
normalize=True subtracts the group mean from each trajectory's reward
(standard variance-reduction baseline). When a group has only one
trajectory, advantage = reward.
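The behavior described above can be sketched on plain reward lists. This is an illustration rather than the module's code: group_advantages is a hypothetical name, each group is represented as a list of floats instead of a TrajectoryGroup (whose attributes are not documented here), and returning raw rewards when normalize=False is an assumption.

```python
from statistics import fmean


def group_advantages(group_rewards, *, normalize=True):
    """Hypothetical sketch: per-trajectory advantages from per-group rewards.

    group_rewards[g][t] is the reward of trajectory t in group g.
    """
    advantages = []
    for rewards in group_rewards:
        if normalize and len(rewards) > 1:
            # Subtract the group mean as a baseline.
            baseline = fmean(rewards)
            advantages.append([r - baseline for r in rewards])
        else:
            # Single-trajectory group (or normalize=False, by assumption):
            # the advantage is the raw reward.
            advantages.append(list(rewards))
    return advantages


example = group_advantages([[1.0, 0.0, 0.5], [2.0]])
```

The single-trajectory branch follows the rule stated above; subtracting the mean of a one-element group would otherwise give an advantage of zero.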
param trajectory_groups: list[TrajectoryGroup]
param normalize: bool = True
Returns: list[list[float]]
func assemble_training_data(trajectory_groups, advantages) -> tuple[list[tinker.Datum], list[DatumMetadata]]
Flatten (group, trajectory, turn) → (Datum, DatumMetadata) pairs.
One Datum per assistant turn - a single-turn trajectory yields one Datum, a multi-turn one yields one per turn, all carrying the same trajectory-level advantage. Each Datum carries:
- model_input: turn prompt + completion[:-1] (left-shifted).
- loss_fn_inputs["target_tokens"]: the shifted next-token targets.
- loss_fn_inputs["logprobs"]: sampler's per-position logprobs (zero on prompt positions). The "old" logprobs for the IS ratio.
- loss_fn_inputs["advantages"]: the trajectory's group-normalized reward on completion positions; 0 on prompt positions - which is what masks the loss to the completion (tinker's importance_sampling takes no mask/weights key, only target_tokens + logprobs + advantages).
Turns with no completion tokens are dropped silently: every advantage would be 0, so the IS loss would be a no-op on them.
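The per-turn layout can be sketched with plain lists. This is an assumption-laden illustration, not the module's _turn_to_datum: turn_to_fields is a hypothetical name, the turn is passed as separate token and logprob lists because the Turn attributes are not documented here, and the result is a plain dict rather than a tinker.Datum.

```python
def turn_to_fields(prompt_tokens, completion_tokens, completion_logprobs, advantage):
    """Hypothetical sketch of the per-turn field layout.

    Returns None when the turn has no completion tokens.
    """
    if not completion_tokens:
        return None
    if len(completion_logprobs) != len(completion_tokens):
        raise ValueError("need one sampler logprob per completion token")

    n_prompt = len(prompt_tokens)
    n_completion = len(completion_tokens)

    # Per-token arrays over the full sequence (prompt + completion).
    full_tokens = list(prompt_tokens) + list(completion_tokens)
    full_logprobs = [0.0] * n_prompt + list(completion_logprobs)
    full_advantages = [0.0] * n_prompt + [float(advantage)] * n_completion

    # Shift by one: position i of the input predicts token i + 1.
    return {
        "model_input": full_tokens[:-1],    # prompt + completion[:-1]
        "target_tokens": full_tokens[1:],   # next-token targets
        "logprobs": full_logprobs[1:],      # 0.0 where the target is a prompt token
        "advantages": full_advantages[1:],  # 0.0 where the target is a prompt token
    }


fields = turn_to_fields([10, 11, 12], [20, 21], [-0.3, -0.7], advantage=0.5)
```

Building full-length arrays first and slicing all of them with the same [1:] shift keeps the logprobs and advantages aligned with target_tokens.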
param trajectory_groups: list[TrajectoryGroup]
param advantages: list[list[float]]
Returns: tuple[list[tinker.tinker.Datum], list[evsys_sdk.training.data_processing.DatumMetadata]]
func _turn_to_datum(turn, *, advantage) -> tinker.Datum | None
param turn: Turn
param advantage: float
Returns: tinker.tinker.Datum | None
func compute_trajectory_metrics(groups) -> dict[str, float]
Roll up group-level reward stats into a flat metrics dict.
param groups: list[TrajectoryGroup]
Returns: dict[str, float]
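As a rough illustration of what a flat roll-up of reward stats might look like, the sketch below works on plain reward lists. The function name reward_metrics and every metric key in it are hypothetical; this page does not list the keys that compute_trajectory_metrics actually emits.

```python
from statistics import fmean, pstdev


def reward_metrics(group_rewards):
    """Hypothetical sketch: flatten per-group rewards into a metrics dict.

    group_rewards[g][t] is the reward of trajectory t in group g.
    All key names below are illustrative assumptions.
    """
    all_rewards = [r for rewards in group_rewards for r in rewards]
    if not all_rewards:
        return {}
    # Population std within each non-empty group; a group with zero spread
    # yields all-zero advantages after mean subtraction.
    group_stds = [pstdev(rewards) for rewards in group_rewards if rewards]
    return {
        "reward/mean": fmean(all_rewards),
        "reward/min": float(min(all_rewards)),
        "reward/max": float(max(all_rewards)),
        "reward/group_std_mean": fmean(group_stds),
        "num_groups": float(len(group_rewards)),
        "num_trajectories": float(len(all_rewards)),
    }


metrics = reward_metrics([[1.0, 0.0], [2.0]])
```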