EvSys

model_eval

Model evaluation: generate completions over the eval set via an InferenceClient.

Each row in the eval set has three queries, and the model predicts a tool slug for each one. The pass@1, pass@3, and pass^3 metrics are computed in :mod:`.report`.
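As a rough illustration of how three such metrics could be derived from per-query correctness flags, here is a minimal sketch. It assumes pass@1 is the mean per-query accuracy, pass@3 counts a row as passed when at least one of its three queries is correct, and pass^3 requires all three to be correct. These definitions and the name `summarize_rows` are assumptions made for illustration; the report module may define the metrics differently.

```python
from typing import Sequence


def summarize_rows(rows: Sequence[Sequence[bool]]) -> dict[str, float]:
    """Aggregate per-query correctness flags, one inner sequence per eval row.

    Hypothetical helper: the metric definitions below are assumptions.
    """
    n_rows = len(rows)
    n_queries = sum(len(r) for r in rows)
    if n_rows == 0 or n_queries == 0:
        return {"pass@1": 0.0, "pass@3": 0.0, "pass^3": 0.0}
    return {
        # Fraction of individual queries answered correctly.
        "pass@1": sum(sum(r) for r in rows) / n_queries,
        # Fraction of rows where at least one query was correct.
        "pass@3": sum(any(r) for r in rows) / n_rows,
        # Fraction of rows where every query was correct.
        "pass^3": sum(all(r) for r in rows) / n_rows,
    }


example = summarize_rows([
    (True, True, True),
    (True, False, False),
    (False, False, False),
    (True, True, False),
])
```

Under these assumed definitions pass^3 can never exceed pass@3, which makes it a stricter consistency measure than either of the other two.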

Generation calls are also wrapped in the retry helper, because Tinker or other remote inference backends can occasionally raise transient errors that should not kill the whole run.
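The retry helper itself is not documented on this page. The general pattern can be sketched as follows, assuming exponential backoff and a fixed tuple of retryable exception types. The name `with_retry`, its parameters, and the choice of `ConnectionError` and `TimeoutError` are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    fn: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying errors listed in retry_on with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")
```

Errors outside `retry_on` propagate immediately, so a genuine bug in prompt construction still fails fast instead of being retried.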

attribute DEFAULT_SYSTEM = 'You are a tool search engine. Match user queries to the correct API tool. Think step by step inside <think></think> tags, then give your answer inside <answer></answer> tags.'
attribute DEFAULT_SYSTEM_NO_THINK = 'You are a tool search engine. Match user queries to the correct API tool. Give your answer inside <answer></answer> tags.'
func extract_predicted_slug(text) -> str

Pull the predicted slug from \<answer>...\</answer> tags; fall back to the longest ALL_CAPS_TOKEN if tags are missing.

param text: str

Returns

str
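The extraction logic described for extract_predicted_slug can be approximated with two regular expressions. This is a sketch under stated assumptions, not the library's implementation: the name `extract_slug_sketch`, the exact token pattern, the use of the first \<answer> block, and the empty-string result when nothing matches are all guesses.

```python
import re

# First <answer>...</answer> block, allowing newlines inside it.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)
# Assumed shape of a slug: uppercase letters, digits, and underscores,
# at least two characters long, starting with a letter.
_CAPS_RE = re.compile(r"\b[A-Z][A-Z0-9_]+\b")


def extract_slug_sketch(text: str) -> str:
    """Return the tagged answer, else the longest ALL_CAPS token, else ''."""
    match = _ANSWER_RE.search(text)
    if match:
        return match.group(1).strip()
    candidates = _CAPS_RE.findall(text)
    # max() keeps the earliest candidate when several share the same length.
    return max(candidates, key=len) if candidates else ""
```

The fallback matters mostly for truncated completions, where the model names a slug in its reasoning but never closes the answer tags.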
func qwen_chat_prompt(*, query, toolkit='', expected_slug='') -> str

Qwen2/Qwen3 chat-template wrapper.

DEPRECATED: hand-built form that lacks the \<think> scaffold Qwen3.5 injects automatically via its chat template. Retained for backwards compatibility with older eval runs. New code should use :func:`qwen3_chat_template_prompt`, which round-trips through apply_chat_template and supports enable_thinking.

param query: str
param toolkit: str = ''
param expected_slug: str = ''

Returns

str
func _get_qwen_tokenizer(model_name)

Cached lookup of a HF tokenizer for chat-template rendering.

param model_name: str

Returns

None
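One common way to get a cached tokenizer lookup is `functools.lru_cache` around `AutoTokenizer.from_pretrained`. The sketch below assumes that approach; the name `get_tokenizer_cached`, the cache size, and the lazy import are illustrative choices, and the actual helper may cache differently.

```python
from functools import lru_cache
from typing import Any


@lru_cache(maxsize=8)
def get_tokenizer_cached(model_name: str) -> Any:
    """Load a Hugging Face tokenizer once per model name (hypothetical helper)."""
    # Imported lazily so that importing this module does not require
    # transformers unless a tokenizer is actually requested.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name)
```

Repeated calls with the same `model_name` then return the same tokenizer object, which avoids reloading the tokenizer files for every prompt in an eval run.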
func qwen3_chat_template_prompt(*, query, toolkit='', expected_slug='', model_name='Qwen/Qwen3.5-4B', enable_thinking=True, system_prompt=None) -> str

Build the inference prompt via the model's official chat template.

For Qwen3-family models, this correctly emits the \<think>-scaffold suffix matching how the model was trained:

  • enable_thinking=True → prompt ends with \<|im_start|>assistant\n\<think>\n (model continues from inside the think block).
  • enable_thinking=False → prompt ends with \<|im_start|>assistant\n\<think>\n\n\</think>\n\n (model emits \<answer> directly).

param query: str
param toolkit: str = ''
param expected_slug: str = ''
param model_name: str = 'Qwen/Qwen3.5-4B'
param enable_thinking: bool = True
param system_prompt: str | None = None

Returns

str
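The round trip through apply_chat_template described above can be sketched with any tokenizer object that exposes that method. This is a minimal sketch, not the function's actual body: `render_prompt` and `user_text` are hypothetical, and how the real function combines query and toolkit into the user message is not specified here.

```python
from typing import Any


def render_prompt(
    tokenizer: Any,
    *,
    system_prompt: str,
    user_text: str,
    enable_thinking: bool = True,
) -> str:
    """Render a system + user exchange and open the assistant turn."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return a string, not token ids
        add_generation_prompt=True,  # append the assistant header
        # Extra keyword arguments are forwarded to the chat template;
        # Qwen3-family templates read this flag to choose the suffix.
        enable_thinking=enable_thinking,
    )
```

Letting the template produce the suffix keeps the prompt aligned with the model's own formatting, whereas a hand-built string has to be updated whenever the template changes.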
func _row_qkeys(eval_rows) -> list[tuple[int, int, str]]

param eval_rows: list[dict[str, Any]]

Returns

list[tuple[int, int, str]]
func run_model_eval(eval_rows, *, client, config=None, progress=True) -> ModelEvalResult

param eval_rows: list[dict[str, Any]]
param client: InferenceClient
param config: ModelEvalConfig | None = None
param progress: bool = True

Returns

evsys_sdk.eval.model_eval.ModelEvalResult