model_eval
Model eval - generate completions over the eval set via an InferenceClient.
Each row in the eval set has 3 queries; the model predicts a tool slug for
each. Pass@1/pass@3/pass^3 are computed in :mod:`.report`.
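
The metric definitions live in the report module and are not shown on this page. As a rough illustration only, the sketch below assumes that pass@1 is the mean per-query accuracy, pass@3 is the fraction of rows where at least one of the three queries is correct, and pass^3 is the fraction of rows where all three are correct. The function name `pass_metrics_sketch` and the list-of-booleans input shape are hypothetical, not part of the SDK.

```python
from typing import Dict, Sequence


def pass_metrics_sketch(rows: Sequence[Sequence[bool]]) -> Dict[str, float]:
    """Aggregate per-query correctness flags into three row-level metrics.

    ASSUMPTION: these definitions are illustrative; the real ones are in the
    report module and may differ.
    """
    total_queries = sum(len(r) for r in rows)
    if not rows or total_queries == 0:
        return {"pass@1": 0.0, "pass@3": 0.0, "pass^3": 0.0}
    n_rows = len(rows)
    return {
        # Mean accuracy over every individual query.
        "pass@1": sum(sum(r) for r in rows) / total_queries,
        # A row counts if any of its queries was answered correctly.
        "pass@3": sum(any(r) for r in rows) / n_rows,
        # A row counts only if all of its queries were answered correctly.
        "pass^3": sum(all(r) for r in rows) / n_rows,
    }
```

Under these assumptions pass^3 can never exceed pass@3, which makes the pair a quick consistency check on a run.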
Generation calls are also wrapped in the retry helper, since Tinker / remote inference can occasionally raise transient errors that should not kill the whole run.
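
The retry helper itself is not documented on this page. The sketch below shows one common way such a wrapper could look, using exponential backoff. The name `call_with_retry`, its parameters, and the choice of retryable exception types are all assumptions made for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Call fn(), retrying on the given exception types with exponential backoff.

    ASSUMPTION: hypothetical helper, not the SDK's actual retry implementation.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next try.
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Exceptions outside `retry_on` propagate immediately, so genuine bugs still fail fast instead of being retried.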
attribute DEFAULT_SYSTEM = 'You are a tool search engine. Match user queries to the correct API tool. Think step by step inside <think></think> tags, then give your answer inside <answer></answer> tags.'

attribute DEFAULT_SYSTEM_NO_THINK = 'You are a tool search engine. Match user queries to the correct API tool. Give your answer inside <answer></answer> tags.'

func extract_predicted_slug(text) -> str

Pull the predicted slug from \<answer>...\</answer> tags;
fall back to the longest ALL_CAPS_TOKEN if tags are missing.
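
The extraction rule above can be sketched with two regular expressions. This is a minimal illustration, not the SDK's code: the function name `extract_slug_sketch`, the exact ALL_CAPS pattern, the choice of the last `<answer>` block when several are present, and the empty-string result when nothing matches are all assumptions.

```python
import re

# Non-greedy match so multiple <answer> blocks are captured separately.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# ASSUMPTION: a "caps token" is 2+ characters of A-Z, digits, or underscores,
# starting with a letter.
_CAPS_RE = re.compile(r"\b[A-Z][A-Z0-9_]+\b")


def extract_slug_sketch(text: str) -> str:
    """Return the slug inside <answer> tags, else the longest caps token."""
    answers = _ANSWER_RE.findall(text)
    if answers:
        # ASSUMPTION: prefer the last answer block in the completion.
        return answers[-1].strip()
    candidates = _CAPS_RE.findall(text)
    # max() keeps the first of equally long candidates; "" if there are none.
    return max(candidates, key=len) if candidates else ""
```

For example, a completion that never closes its tags but mentions `SLACK_SEND_MESSAGE` in passing would still resolve to that slug through the fallback.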
param text: str

Returns
str

func qwen_chat_prompt(*, query, toolkit='', expected_slug='') -> str

Qwen2/Qwen3 chat-template wrapper.

DEPRECATED hand-built form (missing the auto-injected \<think> scaffold
that Qwen3.5 adds via its chat template). Retained for backwards
compatibility with older eval runs. New code should use
:func:`qwen3_chat_template_prompt`, which round-trips through
apply_chat_template and supports enable_thinking.

param query: str
param toolkit: str = ''
param expected_slug: str = ''

Returns
str

func _get_qwen_tokenizer(model_name)

Cached lookup of a HF tokenizer for chat-template rendering.
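
A cached tokenizer lookup of this kind is often written with `functools.lru_cache`. The sketch below is one plausible shape, assuming the Hugging Face `transformers` package is installed when the function is actually called. The name `get_tokenizer_sketch` and the cache size are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=4)
def get_tokenizer_sketch(model_name: str):
    """Load a tokenizer once per model name and reuse it afterwards.

    ASSUMPTION: illustrative only; the SDK's private helper may differ.
    """
    # Imported lazily so importing this module does not require transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name)
```

Caching matters here because prompt rendering runs once per query, while loading a tokenizer from disk or the network is comparatively slow.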
param model_name: str

Returns
None

func qwen3_chat_template_prompt(*, query, toolkit='', expected_slug='', model_name='Qwen/Qwen3.5-4B', enable_thinking=True, system_prompt=None) -> str

Build the inference prompt via the model's official chat template.

For Qwen3-family models, this correctly emits the \<think>-scaffold
suffix matching how the model was trained:

- enable_thinking=True → prompt ends with \<|im_start|>assistant\n\<think>\n (model continues from inside the think block).
- enable_thinking=False → prompt ends with \<|im_start|>assistant\n\<think>\n\n\</think>\n\n (model emits \<answer> directly).
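
To make the two suffixes concrete, the sketch below hand-builds a ChatML-style prompt and appends the suffix described above. It is a hand-written approximation for illustration and does not call the tokenizer's chat template. The function name `render_prompt_sketch` and the way the system and user turns are laid out are assumptions.

```python
# Suffixes as described in the documentation above.
THINK_OPEN_SUFFIX = "<|im_start|>assistant\n<think>\n"
THINK_CLOSED_SUFFIX = "<|im_start|>assistant\n<think>\n\n</think>\n\n"


def render_prompt_sketch(system: str, user: str, *, enable_thinking: bool = True) -> str:
    """Approximate the rendered prompt for a single system + user exchange.

    ASSUMPTION: the real function delegates to the model's chat template,
    which may add details this sketch omits.
    """
    head = (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
    )
    # Open think block: the model continues reasoning inside <think>.
    # Closed (empty) think block: the model goes straight to its answer.
    return head + (THINK_OPEN_SUFFIX if enable_thinking else THINK_CLOSED_SUFFIX)
```

With `enable_thinking=False` the empty think block is already closed in the prompt, so the first generated tokens are expected to be the answer itself.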
param query: str
param toolkit: str = ''
param expected_slug: str = ''
param model_name: str = 'Qwen/Qwen3.5-4B'
param enable_thinking: bool = True
param system_prompt: str | None = None

Returns
str

func _row_qkeys(eval_rows) -> list[tuple[int, int, str]]

param eval_rows: list[dict[str, Any]]

Returns
list[tuple[int, int, str]]

func run_model_eval(eval_rows, *, client, config=None, progress=True) -> ModelEvalResult

param eval_rows: list[dict[str, Any]]
param client: InferenceClient
param config: ModelEvalConfig | None = None
param progress: bool = True

Returns
evsys_sdk.eval.model_eval.ModelEvalResult