a benchmark, in the open

The best local model for hermes‑agent on a DGX Spark.

An honest, reproducible head‑to‑head of open‑weight models running on NVIDIA's GB10 via vLLM, driving Nous Research's hermes‑agent CLI. Measured on Nous's own 20‑task TBLite pilot (a calibrated subset of Terminal‑Bench 2.0) plus five smoke probes for the obviously broken cases.

smoke sweep

Qwen3‑Next‑80B and Nemotron‑3‑Super both landed 5/5 on instruction‑following, arithmetic, reasoning, tool‑use, and a tiny coding probe.

tblite pilot

Terminal‑Bench 2.0 is hard for open weights below frontier: apart from qwen3.6‑27b at 11/20, every headline run in the candidate list sits at 5/20 (25%) or below. Most non‑passes are grader fails; timing out on budget dominates only for a few models (minimax‑m2.7, nemotron‑3‑super).

harness note

The MiniMax‑M2.7 cudagraph fix turned a null‑byte kernel deadlock into a real run — see the health panel below for everything else worth re‑running before calling these final scores.

Standings

Every run that finished, ranked by TBLite pass rate, with serving throughput.

# model tblite pass outcome breakdown gen tok/s ttft
See every run · all concurrency variants
pass grader fail (wrong answer) timeout (per-model budget) no tool calls (parser / refusal)

Candidates, by usability

Three tiers: models that pass at least one TBLite task (top), models that ran but didn't pass any (below), and models that couldn't be benchmarked at all (parked).
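
The three-tier rule above can be written down in a few lines. This is a minimal sketch, not the site's actual code: the `Run` type, the `tier` function, and the use of `None` for "could not be benchmarked" are all assumptions made for illustration, and the last entry is a hypothetical placeholder rather than a real model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    model: str
    # Number of TBLite tasks passed; None means no TBLite run exists (assumed encoding).
    tblite_passes: Optional[int]


def tier(run: Run) -> str:
    """Map a run to one of the three usability tiers described in the text."""
    if run.tblite_passes is None:
        return "parked"  # couldn't be benchmarked at all
    return "top" if run.tblite_passes >= 1 else "below"


runs = [
    Run("qwen3.6-27b", 11),
    Run("hermes-4-70b", 2),
    Run("kimi-linear", 0),
    Run("example-unbenchmarkable", None),  # hypothetical placeholder, not a real entry
]
tiers = {r.model: tier(r) for r in runs}
```

Under this rule a single passing task is enough to reach the top tier, which is why hermes-4-70b at 2/20 sits with qwen3.6-27b at 11/20.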

qwen3.6-27b

Qwen/Qwen3.6-27B-FP8
FP8 28 GB 27B active · 27B total parser · qwen3_coder

Qwen team's flagship dense 27 B (April 2026) — Gated DeltaNet hybrid with Gated Attention every 4 blocks, native 262 K ctx (1 M via YaRN). Marketed as approaching Claude Opus 4.5 on agentic coding while running on a single consumer-class node. Official Qwen FP8 build with block-128 quant; same `qwen3_coder` tool-call parser our other Qwen3.x runs use.

tblite 11/20 · gen t/s 3.7 · ttft 3.24s
avg score: 0.0/5 · 48.3s total
full run → smoke probes

devstral-small-2

mistralai/Devstral-Small-2-24B-Instruct-2512
BF16 52 GB 24B active · 24B total parser · mistral

Mistral's Devstral-Small-2 (24 B dense, Apache-2.0). Co-designed with All-Hands for OpenHands agentic loops; Aider community's top local pick for multi-file refactors in 2026.

tblite 5/20 · gen t/s 3.3 · ttft 0.99s
avg score: 0.0/5 · 10.9s total
full run → smoke probes

glm-4.5-air

Firworks/GLM-4.5-Air-nvfp4
NVFP4 58 GB 12B active · 106B total parser · glm45

Zhipu AI's GLM-4.5-Air. 106 B total / 12 B active MoE — community favorite for coding-agent loops and Claude-Code-style work. First model in our set with a beefy active-parameter budget.

tblite 5/20 · gen t/s 0.0 · ttft (blank)
avg score: 0.0/5 · 123.2s total
full run → smoke probes

qwen3-coder-next-fp8

Qwen/Qwen3-Coder-Next-FP8
FP8 80 GB 3B active · 80B total parser · qwen3_coder

Qwen's official FP8 build of Qwen3-Coder-Next. A/B partner to the NVFP4 variant — FP8 kernels are better-tested on SM120 than NVFP4.

tblite 5/20 · gen t/s 4.0 · ttft 7.70s
avg score: 0.0/5 · 129.1s total
full run → smoke probes

qwen3-next

nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4
NVFP4 45 GB 3B active · 80B total parser · hermes

Qwen3-Next MoE, 3 B active / 80 B total. Community default for reliable tool-calling; NVIDIA's NVFP4 quant is Blackwell-native.

tblite 5/20 · gen t/s 0.0 · ttft (blank)
avg score: 5.0/5 · 97.9s total
full run → smoke probes

minimax-m2.7-mjpansa

MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16
W4A16 92 GB 10B active · 172B total parser · minimax_m2

Same 172 B / 10 B-active REAP prune as saricles's working variant, but AutoRound W4A16 quantisation instead of NVFP4. The 7.5 GB savings is exactly enough to cross the hermes-agent 64K ctx floor (toy probes blocked on saricles because we had to cut ctx to 32K). Different kernel path than NVFP4 — a cleaner test of "is the M2.7 REAP quality good, separate from Blackwell-native quant choices?"

tblite 4/20 · gen t/s 28.5 · ttft 9.30s
avg score: 0.0/5 · 118.3s total
full run → smoke probes

qwen3-coder-next

GadflyII/Qwen3-Coder-Next-NVFP4
NVFP4 47 GB 3B active · 80B total parser · qwen3_coder

Qwen3-Coder-Next: 80 B / 3 B-active coding-agent MoE. #1 community-cited open coding agent; "DGX Spark practitioner favorite" per the NVIDIA forum. Distinct from qwen3-next-80b (instruct); this one is coder-tuned with the `qwen3_coder` tool-call format.

tblite 4/20 · gen t/s 2.8 · ttft 8.61s
avg score: 0.0/5 · 116.0s total
full run → smoke probes

hermes-4.3

Firworks/Hermes-4.3-36B-nvfp4
NVFP4 21 GB 36B active · 36B total parser · hermes

Nous Research's Hermes-4.3, built on ByteDance Seed-OSS-36B. Dense 36B with hybrid <think>/<tool_call> training. Paired natively with vLLM's `hermes` parser.

tblite 3/20 · gen t/s 0.0 · ttft 0.00s
avg score: 2.8/5 · 61.6s total
full run → smoke probes

hermes-4-70b

NousResearch/Hermes-4-70B-FP8
FP8 68 GB 70B active · 70B total parser · hermes

Nous Research's Hermes-4 flagship, dense 70 B on Llama-3.1 base. The exact model vLLM's `hermes` tool parser was written for — zero parser mismatch. Primary baseline for "how well does a Nous-native model drive their own agent CLI?"

tblite 2/20 · gen t/s 2.6 · ttft 0.99s
avg score: 0.0/5 · 198.8s total
full run → smoke probes

nemotron-3-super

nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
NVFP4 75 GB 12B active · 120B total parser · qwen3_coder

NVIDIA's Nemotron-3-Super, 120 B total / 12 B active MoE (Mamba2 + attn hybrid). Trained in NVFP4 natively and packaged by NVIDIA specifically for a single DGX Spark. Third-party Artificial-Analysis agentic eval: Terminal-Bench Hard 29%, SWE-Bench Verified 60.5, PinchBench 85.6%.

tblite 2/20 · gen t/s 21.9 · ttft 32.04s
avg score: 5.0/5 · 110.5s total
full run → smoke probes

Harness health

Auto‑generated from the outcome classifier — models worth re‑running with a bumped task_timeout_s, model/parser mismatches worth chasing, and genuine capacity ceilings where no amount of budget will help.

harness health
5 try a longer budget · 2 capacity ceiling
  1. hermes-4-70b · c=1 Passes a few, but loses tasks to timeouts

    1/20 pass, 8 timeout. A longer budget would likely lift the score: the model is competent, just slow.

    action Bump task_timeout_s for hermes-4-70b-fp8 before declaring a final ranking.

  2. hermes-4-70b Passes a few, but loses tasks to timeouts

    2/20 pass, 6 timeout. A longer budget would likely lift the score: the model is competent, just slow.

    action Bump task_timeout_s for hermes-4-70b-fp8 before declaring a final ranking.

  3. kimi-linear Finishes but gets it wrong

    20/20 grader fails and no timeouts: the model returns answers in time, just not correct ones. More budget won't help.

    action Probably a capacity ceiling (active-parameter count, context degradation, or tool-use training). Try a bigger sibling before writing the model off.

  4. kimi-linear Finishes but gets it wrong

    20/20 grader fails and no timeouts: the model returns answers in time, just not correct ones. More budget won't help.

    action Probably a capacity ceiling (active-parameter count, context degradation, or tool-use training). Try a bigger sibling before writing the model off.

  5. minimax-m2.7-mjpansa Timeout-heavy — worth retrying with a longer budget

    12/20 tasks hit task_timeout_s before the grader could run. At 28.5 gen t/s and 9.3s mean TTFT, the per-task budget is a tight fit.

    action Bump task_timeout_s 1.5–2× (edit minimax-m2.7-mjpansa in bench/models.yaml) and rerun. Only 4 of 20 were actually graded wrong — there is runway.

  6. minimax-m2.7-reap-saricles Timeout-heavy — worth retrying with a longer budget

    14/20 tasks hit task_timeout_s before the grader could run.

    action Bump task_timeout_s 1.5–2× (edit minimax-m2.7-reap-saricles in bench/models.yaml) and rerun. Only 6 of 20 were actually graded wrong — there is runway.

  7. nemotron-3-super Timeout-heavy — worth retrying with a longer budget

    10/20 tasks hit task_timeout_s before the grader could run. At 21.9 gen t/s and 32.0s mean TTFT, the per-task budget is a tight fit.

    action Bump task_timeout_s 1.5–2× (edit nemotron-3-super-120b in bench/models.yaml) and rerun. Only 8 of 20 were actually graded wrong — there is runway.
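
The panel above is described as auto-generated from an outcome classifier. A rule of that general shape might look like the sketch below. The function names and every threshold here are assumptions chosen for illustration; they are not the site's real rules and do not reproduce the panel for every run. Only the 1.5–2× bump and the 1200 s task budget come from the page itself.

```python
def health_flag(passes: int, grader_fails: int, timeouts: int, total: int = 20,
                timeout_heavy: float = 0.5, timeout_some: float = 0.25,
                wrong_heavy: float = 0.9) -> str:
    """Classify one run from its outcome counts. All thresholds are illustrative."""
    if timeouts / total >= timeout_heavy:
        return "timeout-heavy: retry with a longer budget"
    if timeouts == 0 and grader_fails / total >= wrong_heavy:
        return "capacity ceiling: more budget won't help"
    if passes > 0 and timeouts / total >= timeout_some:
        return "passes a few, loses tasks to timeout: try a longer budget"
    return "ok"


def bumped_budget(task_timeout_s: int) -> tuple[int, int]:
    """Return the 1.5x and 2x budgets suggested in the panel's action lines."""
    return int(task_timeout_s * 1.5), task_timeout_s * 2


# Counts taken from the panel entries: (passes, grader fails, timeouts).
flags = {
    "minimax-m2.7-mjpansa": health_flag(4, 4, 12),
    "nemotron-3-super": health_flag(2, 8, 10),
    "kimi-linear": health_flag(0, 20, 0),
    "hermes-4-70b": health_flag(2, 11, 6),
}
low, high = bumped_budget(1200)
```

With a 1200 s budget, the suggested rerun range works out to 1800–2400 s per task.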

Why each model failed

Every non‑pass falls into one of three buckets: timeout (task budget ran out), grader fail (agent finished but got the answer wrong), or no tool calls (agent never emitted a parsable tool call). This single chart tells the full story.

[Stacked bar chart, one bar per run. Segment counts are listed left to right in legend order: pass, grader fail, timeout, no tool calls. Zero-count segments are not drawn, so a row with fewer than four numbers does not say which bucket is missing.]

run                               segments
qwen3.6-27b                       11  8  1
qwen3.6-27b · c=2                 10  9  1
qwen3.6-27b · c=1                 8  11  1
minimax-m2.7-mjpansa · c=1        7  11  1  1
qwen3-coder-next · c=2            7  11  1  1
glm-4.5-air · c=1                 6  11  2  1
nemotron-3-super · c=1            6  11  3
qwen3-coder-next-fp8 · c=2        6  10  3  1
qwen3-coder-next · c=1            6  9  4  1
devstral-small-2                  5  7  6  2
glm-4.5-air                       5  10  4  1
qwen3-coder-next-fp8 · c=1        5  12  2  1
qwen3-coder-next-fp8              5  7  6  2
qwen3-next                        5  11  3  1
devstral-small-2 · c=1            4  12  2  2
glm-4.5-air · c=2                 4  10  5  1
minimax-m2.7-mjpansa · c=2        4  11  4  1
minimax-m2.7-mjpansa              4  4  12
qwen3-coder-next                  4  6  9  1
qwen3-next · c=1                  4  15  1
hermes-4.3                        3  16  1
qwen3-next · c=2                  3  15  1  1
hermes-4-70b                      2  11  6  1
hermes-4.3 · c=1                  2  16  2
nemotron-3-super                  2  8  10
hermes-4-70b · c=1                1  10  8  1
kimi-linear · c=1                 20
kimi-linear                       20
minimax-m2.7-reap-saricles        6  14
Outcome breakdown per 20-task TBLite pilot. Amber "timeout" means task_timeout_s (1200 s) hit before the grader could run; those aren't "the model got it wrong", they're "the model ran out of time." Grey "no tool calls" means the model never produced a parsable tool call (usually a parser / output-format issue).
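
To make the four buckets concrete, here is one way a single task result could be sorted into them and tallied. It is a sketch under stated assumptions: the field names (`timed_out`, `tool_calls`, `grader_passed`) and the precedence order (timeout first, then missing tool calls, then the grader verdict) are hypothetical, not taken from the harness.

```python
from collections import Counter


def classify(timed_out: bool, tool_calls: int, grader_passed: bool) -> str:
    """Assign one task to a bucket. Precedence order is an assumption."""
    if timed_out:
        return "timeout"        # budget ran out before the grader could run
    if tool_calls == 0:
        return "no tool calls"  # nothing parsable was ever emitted
    return "pass" if grader_passed else "grader fail"


# Hypothetical task records: (timed_out, tool_calls, grader_passed).
tasks = [
    (False, 7, True),
    (False, 3, False),
    (True, 12, False),
    (False, 0, False),
    (False, 5, False),
]
breakdown = Counter(classify(*t) for t in tasks)
```

Checking the timeout first matches the caption's point that a timed-out task was never graded, so it should not be counted as a wrong answer.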

Size vs score

Active‑parameter count vs TBLite pass rate, total params on a log axis. Working hypothesis: models with ≥ 10 B active survive hermes' ~12 K‑token system prompt; below that, things break in interesting ways (see the kimi‑linear entries under Harness health).

[Bubble chart: total params on a log x-axis (ticks 16B, 32B, 64B, 128B, 256B) against score on the y-axis (0% to 100%). Plotted models: devstral-small-2, glm-4.5-air, hermes-4-70b, hermes-4.3, kimi-linear, minimax-m2.7-mjpansa, minimax-m2.7-reap-saricles, nemotron-3-super, qwen3-coder-next-fp8, qwen3-coder-next, qwen3-next, qwen3.6-27b.]
Bubble size ≈ active parameters. Y-axis is TBLite pass-rate where available, else smoke-test average (those points in blue). The bet: models clustered in the top half have enough active capacity to survive hermes-agent's context.
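
One rough way to probe the ≥ 10 B‑active hypothesis is to split the candidate cards at that threshold and compare mean passes per group. The sketch below uses only the active-parameter counts and TBLite scores printed on the ten cards above; the helper name `split_by_active` and the choice of a simple mean are assumptions, and kimi-linear is left out because its active-parameter count is not listed on this page.

```python
from statistics import mean

# (model, active params in billions, TBLite passes out of 20), copied from the cards.
cards = [
    ("qwen3.6-27b", 27, 11),
    ("devstral-small-2", 24, 5),
    ("glm-4.5-air", 12, 5),
    ("qwen3-coder-next-fp8", 3, 5),
    ("qwen3-next", 3, 5),
    ("minimax-m2.7-mjpansa", 10, 4),
    ("qwen3-coder-next", 3, 4),
    ("hermes-4.3", 36, 3),
    ("hermes-4-70b", 70, 2),
    ("nemotron-3-super", 12, 2),
]


def split_by_active(rows, threshold_b=10):
    """Return (passes at or above the threshold, passes below it)."""
    at_or_above = [p for _, active, p in rows if active >= threshold_b]
    below = [p for _, active, p in rows if active < threshold_b]
    return at_or_above, below


at_or_above, below = split_by_active(cards)
summary = {
    ">=10B active": (len(at_or_above), round(mean(at_or_above), 2)),
    "<10B active": (len(below), round(mean(below), 2)),
}
```

With ten cards and only three models under the threshold, this is a sanity check on the working hypothesis rather than evidence for or against it.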

Per‑category TBLite

Where each model wins and loses. Task categories — sysadmin, scientific‑computing, networking, coding, data‑processing, devops — spread skill very unevenly.


Smoke‑probe wall‑clock

Heatmap of how long each smoke probe took. The leftmost cell per model is usually the hottest: that's the one‑time Triton JIT compile of the vLLM kernels (kimi‑linear's math probe, at 240.1s, is one exception).

model                         simple    math      reasoning  tool_ls   code
devstral-small-2              2.2s      2.1s      2.1s       2.2s      2.2s
glm-4.5-air                   38.1s     7.5s      10.9s      49.8s     17.0s
hermes-4-70b                  58.0s     52.6s     32.5s      31.7s     24.1s
hermes-4.3                    38.8s     3.2s      3.3s       11.2s     5.2s
kimi-linear                   111.4s    240.1s    22.4s      13.7s     27.5s
minimax-m2.7-mjpansa          87.9s     9.6s      5.2s       9.4s      6.3s
minimax-m2.7-reap-saricles    2.5s      2.7s      2.4s       2.4s      2.4s
nemotron-3-super              50.4s     14.1s     14.1s      20.8s     11.1s
qwen3-coder-next-fp8          65.7s     7.7s      25.9s      21.3s     8.6s
qwen3-coder-next              69.0s     8.4s      7.7s       22.6s     8.4s
qwen3-next                    62.1s     7.0s      7.2s       14.1s     7.4s
qwen3.6-27b                   17.3s     5.8s      5.9s       11.9s     7.3s
Smoke-probe wall-clock. The leftmost column of each model absorbs the Triton JIT-compile penalty on the first request (~30–50 s on cold kernels). Later probes drop to the real per-probe cost.
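
The cold-start effect described in the caption can be estimated from the heatmap numbers by subtracting a typical warm probe time from the first probe. This is a rough sketch: using the median of the remaining four probes as the warm baseline is an assumption, and the helper name `cold_start_penalty` is made up for illustration.

```python
from statistics import median

# Wall-clock seconds per probe, in heatmap order: simple, math, reasoning, tool_ls, code.
probe_seconds = {
    "qwen3-next": [62.1, 7.0, 7.2, 14.1, 7.4],
    "devstral-small-2": [2.2, 2.1, 2.1, 2.2, 2.2],
    "kimi-linear": [111.4, 240.1, 22.4, 13.7, 27.5],
}


def cold_start_penalty(times: list[float]) -> float:
    """First-probe time minus the median of the later probes (assumed warm baseline)."""
    first, rest = times[0], times[1:]
    return first - median(rest)


penalties = {model: cold_start_penalty(t) for model, t in probe_seconds.items()}
```

The median is used instead of the mean so that a single slow later probe, such as kimi-linear's 240.1 s math probe, does not distort the warm baseline.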

Full matrix

Per‑probe smoke wall‑clock and the hand‑graded smoke average (0‑5, the Toy avg column) alongside the TBLite column. Click a model to open its full run.

Model                         Size · quant    TBLite   pass · wall-clock   simple     math       reasoning  tool_ls    code       Toy avg
qwen3.6-27b                   27B · FP8       55%      11/20 · 3917s       17.34s     5.83s      5.88s      11.92s     7.3s       0.0
devstral-small-2              24B · BF16      25%      5/20 · 4800s        2.23s      2.15s      2.12s      2.24s      2.18s      0.0
glm-4.5-air                   106B · NVFP4    25%      5/20 · 5922s        38.11s     7.47s      10.88s     49.8s      16.98s     0.0
qwen3-coder-next-fp8          80B · FP8       25%      5/20 · 4446s        65.67s     7.7s       25.85s     21.29s     8.57s      0.0
qwen3-next                    80B · NVFP4     25%      5/20 · 3324s        62.1s      7.03s      7.2s       14.09s     7.44s      5.0
minimax-m2.7-mjpansa          172B · W4A16    20%      4/20 · 5587s        87.88s     9.6s       5.15s      9.41s      6.31s      0.0
qwen3-coder-next              80B · NVFP4     20%      4/20 · 5103s        69s        8.41s      7.67s      22.56s     8.37s      0.0
hermes-4.3                    36B · NVFP4     15%      3/20 · 2549s        38.79s     3.18s      3.31s      11.16s     5.2s       2.8
hermes-4-70b                  70B · FP8       10%      2/20 · 4800s        57.96s     52.64s     32.48s     31.69s     24.08s     0.0
nemotron-3-super              120B · NVFP4    10%      2/20 · 13752s       50.39s     14.09s     14.1s      20.83s     11.08s     5.0
kimi-linear                   48B · NVFP4     0%       0/20 · 824s         111.39s    240.11s    22.37s     13.73s     27.47s     0.0
minimax-m2.7-reap-saricles    172B · NVFP4    0%       0/20 · 5227s        2.55s      2.66s      2.44s      2.43s      2.37s      0.0

Hardware

NVIDIA DGX Spark — GB10 Grace Blackwell superchip, 119 GB unified LPDDR5X, SM120 (compute cap 12.1). Every model runs in vLLM under the community avarok/vllm-dgx-spark:v14 image because the upstream vllm/vllm-openai:nightly-aarch64 image currently fails the FlashInfer FP4 GEMM probe on SM120. The client running hermes‑agent is a separate box talking to the Spark over LAN.

Three more machines are planned: a Dell Pro Max T2 (RTX PRO 6000 Blackwell, 96 GB), a second DGX Spark (for 2× tensor / pipeline parallel), and a Mac Studio M3 Ultra (512 GB, MLX). See the roadmap for per‑machine model picks and the community models page for the broader list of what people are running with hermes‑agent and peer agentic CLIs.