Methodology
How each model is served, how hermes‑agent talks to it, which prompts we send, and how scores are assigned.
Serving
Every model runs in vLLM inside the same container image, with the same flags except for the model repo and tool-call parser:
```shell
docker run -d --runtime nvidia --gpus all --ipc=host \
  -e MODEL=<hf_repo> \
  -e PORT=8000 \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.85 \
  -e HF_HUB_OFFLINE=1 \
  -e TRANSFORMERS_OFFLINE=1 \
  -e VLLM_EXTRA_ARGS="--trust-remote-code \
    --served-model-name <name> \
    --enable-auto-tool-choice \
    --tool-call-parser <parser>" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8001:8000 \
  --name bench-vllm \
  avarok/vllm-dgx-spark:v14 serve
```
Hermes-agent config
The client points hermes at the Spark via a custom provider entry in
~/.hermes/config.yaml:
```yaml
model:
  default: <served_model_name>
  provider: custom
  base_url: http://dgx-spark:8001/v1
  api_key: EMPTY
```
Two evals, different purposes
We layer two harnesses. The smoke test runs in ~2 min per model and is meant to catch obvious brokenness (model refuses, model hallucinates, tool-call parser misconfigured). The TBLite benchmark is the real agent eval — Nous Research's own Terminal-Bench 2.0 subset, each task in a Docker sandbox.
Smoke-test probes
Five probes designed to stress different capabilities, each with a fixed turn budget:
- simple — literal-instruction following (2 turns)
- math — one-shot arithmetic (2 turns)
- reasoning — "all-but-9" sheep trick (2 turns)
- tool_ls — must call a shell tool to answer (4 turns)
- code — Python one-liner (2 turns)
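The five probes and their turn budgets can be captured as a small table. A sketch only: the dict shape and `tests` strings below are illustrative, and bench/probes.yaml remains the source of truth.

```python
# Sketch: the smoke-test probes and their fixed turn budgets.
# The dict layout is an assumption, not bench/probes.yaml's actual schema.
PROBES = {
    "simple":    {"budget": 2, "tests": "literal-instruction following"},
    "math":      {"budget": 2, "tests": "one-shot arithmetic"},
    "reasoning": {"budget": 2, "tests": "'all-but-9' sheep trick"},
    "tool_ls":   {"budget": 4, "tests": "must call a shell tool"},
    "code":      {"budget": 2, "tests": "Python one-liner"},
}

# Total turn budget across the whole smoke test:
print(sum(p["budget"] for p in PROBES.values()))  # 12
```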
Each response is hand-graded 0–5 against the rubric in bench/probes.yaml.
Grades live in site/src/lib/grading.ts so every change diffs in git.
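A minimal sketch of how the hand-assigned 0–5 grades might roll up into a per-model average (the probe names mirror the list above; the example scores and the function itself are invented, not what grading.ts does):

```python
# Sketch: aggregate hand-assigned 0-5 probe grades into one per-model number.
PROBES = ["simple", "math", "reasoning", "tool_ls", "code"]

def average_grade(grades: dict) -> float:
    missing = set(PROBES) - grades.keys()
    if missing:
        raise ValueError(f"ungraded probes: {sorted(missing)}")
    if any(not 0 <= g <= 5 for g in grades.values()):
        raise ValueError("grades must be in 0-5")
    return sum(grades[p] for p in PROBES) / len(PROBES)

example = {"simple": 5, "math": 4, "reasoning": 3, "tool_ls": 5, "code": 4}
print(average_grade(example))  # 4.2
```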
TBLite (the real deal)
NousResearch/openthoughts-tblite
is a 100-task calibrated subset of Terminal-Bench 2.0. It ships with hermes-agent — the
harness lives at
~/.hermes/hermes-agent/environments/benchmarks/tblite/tblite_env.py. Each
task spins up its own pre-built Docker image (`nousresearch/tblite-<task>:latest`),
the agent gets a 60-turn budget, and pass/fail comes from a task-specific grader
script. Categories: system-administration, scientific-computing, networking, coding,
data-processing, devops, etc.
Our pilot runs 20 tasks × 8-concurrent. Wall-clock varies wildly with model speed and per-model task timeout: Hermes‑4.3‑36B finishes in ~25 min, GLM‑4.5‑Air and Hermes‑4‑70B in ~90 min (2400 s task timeout), Nemotron‑3‑Super in ~2 h (3600 s budget for its long <think> traces), and MiniMax‑M2.7‑REAP (NVFP4) in ~2 h, most of which is 14 tasks hitting the 2400 s wall.
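As a rough sanity check on those numbers, the worst-case wall-clock for a run is just the number of concurrency waves times the per-task timeout. A sketch (the helper is ours, not part of the harness):

```python
import math

# Sketch: worst-case wall-clock if every task in every wave hits its timeout.
def worst_case_minutes(n_tasks: int, concurrency: int, timeout_s: int) -> float:
    waves = math.ceil(n_tasks / concurrency)
    return waves * timeout_s / 60

# 20 tasks at 8-concurrent with a 2400 s timeout: 3 waves x 40 min = 120 min,
# consistent with a ~2 h run where most tasks hit the wall.
print(worst_case_minutes(20, 8, 2400))  # 120.0
```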
We classify every task's outcome into one of four buckets:
- pass
- grader_fail — the agent completed, but the task-specific grader scored it wrong
- timeout — the per-model budget fired
- no_tool_calls — the agent never emitted a valid tool call (parser mismatch, safety refusal, or, in the MiniMax-M2.7 NVFP4 case, literal null bytes from an NVFP4 kernel deadlock)

This distinction matters: a 0/20 from grader_fail (Kimi-Linear) is a very different signal than a 0/20 from no_tool_calls (MiniMax NVFP4).
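The bucketing logic is simple enough to sketch. The record fields (`tool_calls`, `timed_out`, `passed`) are assumptions about what the harness logs, not its actual schema:

```python
# Sketch: fold one raw task record into one of the four outcome buckets.
def classify(record: dict) -> str:
    if record.get("tool_calls", 0) == 0:
        return "no_tool_calls"  # parser mismatch, refusal, or dead output
    if record.get("timed_out"):
        return "timeout"        # per-model budget fired
    if record.get("passed"):
        return "pass"
    return "grader_fail"        # agent finished, grader scored it wrong

print(classify({"tool_calls": 12, "timed_out": False, "passed": True}))  # pass
print(classify({"tool_calls": 0}))                                       # no_tool_calls
```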
Invocation (from the repo root):
```shell
./bench/run_tblite.sh <model_key> [n_tasks]
```
Config differences from the stock local_vllm.yaml:
- server_type is set to openai, not vllm (stock vLLM exposes /v1/chat/completions, not Atropos's custom /generate)
- the base URL gets the /v1 suffix
- the tokenizer is resolved from the model's actual HF repo rather than the served-model-name alias
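Expressed as a dict patch, the three overrides look roughly like this. The field names are assumptions about the harness config schema, not the real local_vllm.yaml keys:

```python
import copy

# Sketch: stock settings plus the three per-run overrides described above.
STOCK = {
    "server_type": "vllm",
    "base_url": "http://dgx-spark:8001",
    "tokenizer": "<served_model_name>",
}

def patched(hf_repo: str) -> dict:
    cfg = copy.deepcopy(STOCK)
    cfg["server_type"] = "openai"  # stock vLLM speaks /v1/chat/completions
    cfg["base_url"] = cfg["base_url"].rstrip("/") + "/v1"
    cfg["tokenizer"] = hf_repo     # the real HF repo, not the served alias
    return cfg

print(patched("NousResearch/Hermes-4-70B")["base_url"])  # http://dgx-spark:8001/v1
```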
What this isn't
This isn't BFCL V4 or τ²-bench — both are on the roadmap. For now TBLite is our strongest signal because it's Nous's own harness, shipped and used inside hermes-agent's development loop. See the roadmap for a full gap analysis of what TBLite-alone does not measure (format brittleness, multi-turn dialogue, cost, reliability under re-runs, contamination, long-context degradation, safety).