Methodology
How each model is served, how hermes‑agent talks to it, which prompts we send, and how scores are assigned.
Serving
Every model runs in vLLM inside the same container image, with the same flags except for the model repo and tool-call parser:
```shell
docker run -d --runtime nvidia --gpus all --ipc=host \
  -e MODEL=<hf_repo> \
  -e PORT=8000 \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.85 \
  -e HF_HUB_OFFLINE=1 \
  -e TRANSFORMERS_OFFLINE=1 \
  -e VLLM_EXTRA_ARGS="--trust-remote-code \
    --served-model-name <name> \
    --enable-auto-tool-choice \
    --tool-call-parser <parser>" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8001:8000 \
  --name bench-vllm \
  avarok/vllm-dgx-spark:v14 serve
```
Hermes-agent config
The client points hermes at the Spark via a custom provider entry in
~/.hermes/config.yaml:
```yaml
model:
  default: <served_model_name>
  provider: custom
  base_url: http://dgx-spark:8001/v1
  api_key: EMPTY
```
Two evals, different purposes
We layer two harnesses. The smoke test runs in ~2 min per model and is meant to catch obvious brokenness (model refuses, model hallucinates, tool-call parser misconfigured). The TBLite benchmark is the real agent eval — Nous Research's own Terminal-Bench 2.0 subset, each task in a Docker sandbox.
Smoke-test probes
Five probes designed to stress different capabilities, each with a fixed turn budget:
- simple — literal-instruction following (2 turns)
- math — one-shot arithmetic (2 turns)
- reasoning — "all-but-9" sheep trick (2 turns)
- tool_ls — must call a shell tool to answer (4 turns)
- code — Python one-liner (2 turns)
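The five probes and their turn budgets can be captured as a small table. A sketch only: the dict shape and `tests` strings below are illustrative, and bench/probes.yaml remains the source of truth.

```python
# Sketch: the smoke-test probes and their fixed turn budgets.
# The dict layout is an assumption, not bench/probes.yaml's actual schema.
PROBES = {
    "simple":    {"budget": 2, "tests": "literal-instruction following"},
    "math":      {"budget": 2, "tests": "one-shot arithmetic"},
    "reasoning": {"budget": 2, "tests": "'all-but-9' sheep trick"},
    "tool_ls":   {"budget": 4, "tests": "must call a shell tool"},
    "code":      {"budget": 2, "tests": "Python one-liner"},
}

# Total turn budget across the whole smoke test:
print(sum(p["budget"] for p in PROBES.values()))  # 12
```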
Each response is hand-graded 0–5 against the rubric in bench/probes.yaml.
Grades live in site/src/lib/grading.ts so every change diffs in git.
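A minimal sketch of how the hand-assigned 0–5 grades might roll up into a per-model average (the probe names mirror the list above; the example scores and the function itself are invented, not what grading.ts does):

```python
# Sketch: aggregate hand-assigned 0-5 probe grades into one per-model number.
PROBES = ["simple", "math", "reasoning", "tool_ls", "code"]

def average_grade(grades: dict) -> float:
    missing = set(PROBES) - grades.keys()
    if missing:
        raise ValueError(f"ungraded probes: {sorted(missing)}")
    if any(not 0 <= g <= 5 for g in grades.values()):
        raise ValueError("grades must be in 0-5")
    return sum(grades[p] for p in PROBES) / len(PROBES)

example = {"simple": 5, "math": 4, "reasoning": 3, "tool_ls": 5, "code": 4}
print(average_grade(example))  # 4.2
```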
TBLite (the real deal)
NousResearch/openthoughts-tblite
is a 100-task calibrated subset of Terminal-Bench 2.0. It ships with hermes-agent — the
harness lives at
~/.hermes/hermes-agent/environments/benchmarks/tblite/tblite_env.py. Each
task spins up its own pre-built Docker image (`nousresearch/tblite-<task>:latest`),
the agent gets a 60-turn budget, and pass/fail comes from a task-specific grader
script. Categories: system-administration, scientific-computing, networking, coding,
data-processing, devops, etc.
Our pilot runs 20 tasks × 8-concurrent. Wall-clock varies wildly with model speed and per-model task timeout: Hermes‑4.3‑36B finishes in ~25 min, GLM‑4.5‑Air and Hermes‑4‑70B in ~90 min (2400 s task timeout), Nemotron‑3‑Super in ~2 h (3600 s budget for its long <think> traces), and MiniMax‑M2.7‑REAP (NVFP4) in ~2 h, most of which is 14 tasks hitting the 2400 s wall.
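As a rough sanity check on those numbers, the worst-case wall-clock for a run is just the number of concurrency waves times the per-task timeout. A sketch (the helper is ours, not part of the harness):

```python
import math

# Sketch: worst-case wall-clock if every task in every wave hits its timeout.
def worst_case_minutes(n_tasks: int, concurrency: int, timeout_s: int) -> float:
    waves = math.ceil(n_tasks / concurrency)
    return waves * timeout_s / 60

# 20 tasks at 8-concurrent with a 2400 s timeout: 3 waves x 40 min = 120 min,
# consistent with a ~2 h run where most tasks hit the wall.
print(worst_case_minutes(20, 8, 2400))  # 120.0
```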
We classify every task's outcome into one of four buckets:
- pass
- grader_fail — the agent completed, but the task-specific grader scored it wrong
- timeout — the per-model budget fired
- no_tool_calls — the agent never emitted a valid tool call (parser mismatch, safety refusal, or, in the MiniMax-M2.7 NVFP4 case, literal null bytes from an NVFP4 kernel deadlock)

This distinction matters: a 0/20 from grader_fail (Kimi-Linear) is a very different signal than a 0/20 from no_tool_calls (MiniMax NVFP4).
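The bucketing logic is simple enough to sketch. The record fields (`tool_calls`, `timed_out`, `passed`) are assumptions about what the harness logs, not its actual schema:

```python
# Sketch: fold one raw task record into one of the four outcome buckets.
def classify(record: dict) -> str:
    if record.get("tool_calls", 0) == 0:
        return "no_tool_calls"  # parser mismatch, refusal, or dead output
    if record.get("timed_out"):
        return "timeout"        # per-model budget fired
    if record.get("passed"):
        return "pass"
    return "grader_fail"        # agent finished, grader scored it wrong

print(classify({"tool_calls": 12, "timed_out": False, "passed": True}))  # pass
print(classify({"tool_calls": 0}))                                       # no_tool_calls
```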
Invocation (from the repo root):
```shell
./bench/run_tblite.sh <model_key> [n_tasks]
```
Config differences from the stock local_vllm.yaml:
- server_type is set to openai, not vllm (stock vLLM exposes /v1/chat/completions, not Atropos's custom /generate)
- the base URL gets the /v1 suffix
- the tokenizer is resolved from the model's actual HF repo rather than the served-model-name alias
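Expressed as a dict patch, the three overrides look roughly like this. The field names are assumptions about the harness config schema, not the real local_vllm.yaml keys:

```python
import copy

# Sketch: stock settings plus the three per-run overrides described above.
STOCK = {
    "server_type": "vllm",
    "base_url": "http://dgx-spark:8001",
    "tokenizer": "<served_model_name>",
}

def patched(hf_repo: str) -> dict:
    cfg = copy.deepcopy(STOCK)
    cfg["server_type"] = "openai"  # stock vLLM speaks /v1/chat/completions
    cfg["base_url"] = cfg["base_url"].rstrip("/") + "/v1"
    cfg["tokenizer"] = hf_repo     # the real HF repo, not the served alias
    return cfg

print(patched("NousResearch/Hermes-4-70B")["base_url"])  # http://dgx-spark:8001/v1
```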
What this isn't
This isn't BFCL V4 or τ²-bench — both are on the roadmap. For now TBLite is our strongest signal because it's Nous's own harness, shipped and used inside hermes-agent's development loop. See the roadmap for a full gap analysis of what TBLite-alone does not measure (format brittleness, multi-turn dialogue, cost, reliability under re-runs, contamination, long-context degradation, safety).