model_eval

Model eval: generate completions over the eval set via an InferenceClient.

Each row in the eval set has three queries, and the model predicts a tool slug for each one. The pass@1, pass@3, and pass^3 metrics are computed in the .report module.
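The exact formulas in .report are not shown on this page. This sketch assumes a common reading: each row's three queries act as three attempts. Under that reading, pass@1 is the fraction of a row's queries answered correctly, pass@3 credits the row if any query is right, and pass^3 credits it only if all three are. With exactly three attempts, these values match the standard unbiased pass@k estimator for k = 1 and k = 3. The function names are hypothetical.

```python
from statistics import mean


def row_metrics(correct: list[bool]) -> dict[str, float]:
    """Per-row scores from the three per-query correctness flags (assumed definitions)."""
    if len(correct) != 3:
        raise ValueError(f"expected 3 queries per row, got {len(correct)}")
    c = sum(correct)
    return {
        "pass@1": c / 3,                    # expected accuracy of a single query
        "pass@3": 1.0 if c >= 1 else 0.0,   # at least one of the three queries is right
        "pass^3": 1.0 if c == 3 else 0.0,   # all three queries are right
    }


def aggregate(rows: list[list[bool]]) -> dict[str, float]:
    """Average each per-row metric over the whole eval set."""
    per_row = [row_metrics(r) for r in rows]
    return {key: mean(m[key] for m in per_row) for key in per_row[0]}
```

Because pass^3 requires every query in a row to succeed, it is always less than or equal to pass@1, which is always less than or equal to pass@3. The gap between them shows how consistently the model handles paraphrases of the same intent.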

Generation calls are also wrapped in the retry helper, because Tinker and other remote inference backends can occasionally raise transient errors that shouldn't kill the whole run.
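The retry helper itself is not documented here. The following sketch assumes simple exponential backoff. with_retries, its parameters, and the backoff schedule are all hypothetical and not this module's API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
) -> T:
    """Call fn, retrying on the given exception types with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Backoff schedule: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

In practice, you would narrow retry_on to the transport's transient error types, for example connection or timeout errors. That way, genuine bugs in prompt construction fail fast instead of being retried.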

attribute DEFAULT_SYSTEM

= 'You are a tool search engine. Match user queries to the correct API tool. Think step by step inside <think></think> tags, then give your answer inside <answer></answer> tags.'

attribute DEFAULT_SYSTEM_NO_THINK

= 'You are a tool search engine. Match user queries to the correct API tool. Give your answer inside <answer></answer> tags.'

ModelEvalResult

func extract_predicted_slug(text) -> str

Pull the predicted slug from <answer>...</answer> tags; if the tags are missing, fall back to the longest ALL_CAPS_TOKEN in the text.

param text: str

Returns

str
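This is a minimal sketch of the parsing rule described above, under some assumptions. It assumes a tool slug looks like GITHUB_CREATE_ISSUE, meaning upper-case words joined by underscores. The regexes are assumptions, as are taking the last <answer> block when several appear and returning an empty string when nothing matches. The module may behave differently, and extract_slug is a hypothetical name.

```python
import re

# Assumed patterns; the real module may define these differently.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# Upper-case token with at least one underscore, e.g. SLACK_SEND_MESSAGE.
_CAPS_TOKEN_RE = re.compile(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b")


def extract_slug(text: str) -> str:
    """Return the slug inside <answer> tags, else the longest ALL_CAPS token, else ''."""
    answers = _ANSWER_RE.findall(text)
    if answers:
        # Assumption: if the model restates its answer, the last block wins.
        return answers[-1].strip()
    tokens = _CAPS_TOKEN_RE.findall(text)
    return max(tokens, key=len) if tokens else ""
```

Requiring an underscore in the fallback pattern keeps short capitalized words such as "API" or "I" from being mistaken for a slug.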

func qwen_chat_prompt(*, query, toolkit='', expected_slug='') -> str

Qwen2/Qwen3 chat-template wrapper.

DEPRECATED: this hand-built form omits the <think> scaffold that the Qwen3.5 chat template injects automatically. It is retained for backwards compatibility with older eval runs. New code should use qwen3_chat_template_prompt, which renders the prompt through apply_chat_template and supports enable_thinking.

param query: str

param toolkit: str = ''

param expected_slug: str = ''

Returns

str
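To illustrate the deprecation note, here is a minimal sketch of a hand-built prompt in Qwen's ChatML layout. The <|im_start|>/<|im_end|> turn markers are Qwen's standard ChatML format. The helper names and the "Toolkit:/Query:" user-message layout are hypothetical, and expected_slug is left out because its role isn't documented here. The prompt stops at the bare assistant header, with no <think> scaffold after it.

```python
# Documented DEFAULT_SYSTEM value, copied from this module.
DEFAULT_SYSTEM = (
    "You are a tool search engine. Match user queries to the correct API tool. "
    "Think step by step inside <think></think> tags, then give your answer "
    "inside <answer></answer> tags."
)


def chatml_prompt(system: str, user: str) -> str:
    """Render one system turn and one user turn in ChatML, ending at the open assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def hand_built_prompt(query: str, toolkit: str = "") -> str:
    # Hypothetical user-message layout; the real qwen_chat_prompt may differ.
    user = f"Toolkit: {toolkit}\nQuery: {query}" if toolkit else f"Query: {query}"
    return chatml_prompt(DEFAULT_SYSTEM, user)
```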

func _get_qwen_tokenizer(model_name)

Cached lookup of a Hugging Face tokenizer used for chat-template rendering.

param model_name: str

Returns

The cached tokenizer (return type not annotated)

func qwen3_chat_template_prompt(*, query, toolkit='', expected_slug='', model_name='Qwen/Qwen3.5-4B', enable_thinking=True, system_prompt=None) -> str

Build the inference prompt via the model's official chat template.

For Qwen3-family models, this correctly emits the <think>-scaffold suffix that matches how the model was trained:

enable_thinking=True → prompt ends with <|im_start|>assistant\n<think>\n (the model continues from inside the think block).
enable_thinking=False → prompt ends with <|im_start|>assistant\n<think>\n\n</think>\n\n (the model emits <answer> directly).
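Written out as Python strings, the two documented suffixes look like this (each \n above is a real newline). The check_generation_suffix helper below is hypothetical, not part of this module. It is one way to sanity-check a rendered prompt before sending it to the model, for example after upgrading transformers or switching model checkpoints.

```python
# Suffixes as documented above for Qwen3-family chat templates.
THINK_SUFFIX = "<|im_start|>assistant\n<think>\n"
NO_THINK_SUFFIX = "<|im_start|>assistant\n<think>\n\n</think>\n\n"


def check_generation_suffix(prompt: str, enable_thinking: bool) -> None:
    """Raise ValueError if prompt does not end with the documented <think> scaffold."""
    expected = THINK_SUFFIX if enable_thinking else NO_THINK_SUFFIX
    if not prompt.endswith(expected):
        tail = prompt[-len(expected):]
        raise ValueError(f"expected prompt to end with {expected!r}, got {tail!r}")
```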

param query: str

param toolkit: str = ''

param expected_slug: str = ''

param model_name: str = 'Qwen/Qwen3.5-4B'

param enable_thinking: bool = True

param system_prompt: str | None = None

Returns

str
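The following sketch builds a prompt in the same spirit as _get_qwen_tokenizer and qwen3_chat_template_prompt. AutoTokenizer.from_pretrained and apply_chat_template are real transformers APIs. Extra keyword arguments to apply_chat_template are passed to the chat template, which is how enable_thinking reaches Qwen3-family templates. Some details are assumptions: the helper names, the "Toolkit:/Query:" user-message layout, and falling back to DEFAULT_SYSTEM_NO_THINK when thinking is disabled. expected_slug is left out because its role isn't documented here.

```python
from __future__ import annotations

from functools import lru_cache
from typing import Any

# Documented module constants, copied here so the sketch is self-contained.
DEFAULT_SYSTEM = (
    "You are a tool search engine. Match user queries to the correct API tool. "
    "Think step by step inside <think></think> tags, then give your answer "
    "inside <answer></answer> tags."
)
DEFAULT_SYSTEM_NO_THINK = (
    "You are a tool search engine. Match user queries to the correct API tool. "
    "Give your answer inside <answer></answer> tags."
)


@lru_cache(maxsize=None)
def get_tokenizer(model_name: str) -> Any:
    # Lazy import: transformers is only loaded when a prompt is actually rendered.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name)


def build_messages(
    query: str, toolkit: str = "", *, enable_thinking: bool = True, system_prompt: str | None = None
) -> list[dict[str, str]]:
    if system_prompt is None:
        # Assumption: pick the system prompt variant that matches the thinking mode.
        system_prompt = DEFAULT_SYSTEM if enable_thinking else DEFAULT_SYSTEM_NO_THINK
    # Hypothetical user-message layout.
    user = f"Toolkit: {toolkit}\nQuery: {query}" if toolkit else f"Query: {query}"
    return [{"role": "system", "content": system_prompt}, {"role": "user", "content": user}]


def render_prompt(
    query: str,
    toolkit: str = "",
    *,
    model_name: str = "Qwen/Qwen3.5-4B",
    enable_thinking: bool = True,
    system_prompt: str | None = None,
) -> str:
    tokenizer = get_tokenizer(model_name)  # downloads on first use, cached afterwards
    messages = build_messages(
        query, toolkit, enable_thinking=enable_thinking, system_prompt=system_prompt
    )
    # add_generation_prompt=True opens the assistant turn; the template then
    # appends the <think> scaffold selected by enable_thinking.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=enable_thinking
    )
```

Caching the tokenizer per model name means that rendering thousands of eval prompts pays the load cost only once.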

func _row_qkeys(eval_rows) -> list[tuple[int, int, str]]

param eval_rows: list[dict[str, Any]]

Returns

list[tuple[int, int, str]]

func run_model_eval(eval_rows, *, client, config=None, progress=True) -> ModelEvalResult

param eval_rows: list[dict[str, Any]]

param client: InferenceClient

param config: ModelEvalConfig | None = None

param progress: bool = True

Returns

evsys_sdk.eval.model_eval.ModelEvalResult
