a benchmark, in the open

The best local model
for hermes‑agent
on a DGX Spark.

An honest, reproducible head‑to‑head of open‑weight models running on NVIDIA's GB10 via vLLM, driving Nous Research's hermes‑agent CLI. Measured on Nous's own 20‑task TBLite pilot (a calibrated subset of Terminal‑Bench 2.0) plus five smoke probes for the obviously broken cases.

leader · 20-task TBLite pilot · 8 models measured · edition № 2026.04
The winner (so far)

nemotron-3-super

NVIDIA's Nemotron-3-Super, 120 B total / 12 B active MoE (Mamba2 + attn hybrid).

tblite pass rate: 4/20 (20%)
gen tok/s: 5.6
mean ttft: —
peak conc.: 8
run length: 68m
parameters: 12B active / 120B total
quant: NVFP4
Read the full run
smoke sweep

Qwen3‑Next‑80B and Nemotron‑3‑Super both landed 5/5 on instruction‑following, arithmetic, reasoning, tool‑use, and a tiny coding probe.

tblite pilot

The 20% pass‑rate ceiling reflects how hard Terminal‑Bench 2.0 is for open weights below frontier — the whole leaderboard is under 4/20. Timing out on budget, not hallucinating, is the dominant failure mode.
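The budget math behind that failure mode is easy to sketch. A minimal back-of-the-envelope, using the 1200 s per-task budget quoted in the outcome chart and the measured serving figures from the standings; the function is illustrative, not harness code:

```python
# Rough per-task token ceiling: (budget - time-to-first-token) * generation rate.
# The 1200 s budget is the pilot's task_timeout_s; the serving numbers are the
# measured figures from the standings table.
def token_ceiling(budget_s: float, ttft_s: float, gen_tok_s: float) -> int:
    """Upper bound on tokens a model can emit within one task budget."""
    return round((budget_s - ttft_s) * gen_tok_s)

BUDGET_S = 1200.0

# nemotron-3-super: 5.6 tok/s (mean TTFT not recorded; take 0 for an upper bound)
nemotron_tokens = token_ceiling(BUDGET_S, 0.0, 5.6)    # ≈ 6,720 tokens per task
# qwen3-coder-next-fp8: 33.5 tok/s, 6.2 s mean TTFT
qwen_fp8_tokens = token_ceiling(BUDGET_S, 6.2, 33.5)   # ≈ 39,992 tokens per task
```

At 5.6 tok/s the whole budget buys under 7 K generated tokens, a handful of agent turns against a ~12 K-token system prompt, which squares with timeouts dominating the slower models' score sheets.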

harness note

The MiniMax‑M2.7 cudagraph fix turned a null‑byte kernel deadlock into a real run — see the health panel below for everything else worth re‑running before calling these final scores.

Standings

Every run that finished, ranked by TBLite pass rate, with serving throughput.

#   model                        tblite pass   gen tok/s   ttft
01  nemotron-3-super             4/20          5.6         —
02  glm-4.5-air                  3/20          —           —
03  qwen3-coder-next-fp8         3/20          33.5        6.2s
04  hermes-4-70b                 2/20          —           —
05  hermes-4.3                   2/20          —           —
06  qwen3-next                   2/20          —           —
07  devstral-small-2             2/20          37.1        1.0s
08  qwen3-coder-next             2/20          36.5        6.8s
09  kimi-linear                  0/20          —           —
10  minimax-m2.7-reap-saricles   0/20          —           —

Outcome legend: pass · grader fail (wrong answer) · timeout (per-model budget) · no tool calls (parser / refusal)

Candidates, by usability

Three tiers: models that pass at least one TBLite task (top), models that ran but didn't pass any (below), and models that couldn't be benchmarked at all (parked).

nemotron-3-super

nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
NVFP4 · 75 GB · 12B active / 120B total · parser: qwen3_coder

NVIDIA's Nemotron-3-Super, 120 B total / 12 B active MoE (Mamba2 + attn hybrid). Trained in NVFP4 natively and packaged by NVIDIA specifically for a single DGX Spark. Third-party Artificial Analysis agentic evals: Terminal-Bench Hard 29%, SWE-Bench Verified 60.5%, PinchBench 85.6%.

tblite 4/20 · gen t/s 5.6 · ttft —
avg score: 5.0/5 · 137.6s total
full run → smoke probes

glm-4.5-air

Firworks/GLM-4.5-Air-nvfp4
NVFP4 · 58 GB · 12B active / 106B total · parser: glm45

Zhipu AI's GLM-4.5-Air. 106 B total / 12 B active MoE — community favorite for coding-agent loops and Claude-Code-style work. First model in our set with a beefy active-parameter budget.

tblite 3/20
avg score: 0.0/5 · 127.2s total
full run → smoke probes

qwen3-coder-next-fp8

Qwen/Qwen3-Coder-Next-FP8
FP8 · 80 GB · 3B active / 80B total · parser: qwen3_coder

Qwen's official FP8 build of Qwen3-Coder-Next. A/B partner to the NVFP4 variant — FP8 kernels are better-tested on SM120 than NVFP4.

tblite 3/20 · gen t/s 33.5 · ttft 6.19s
avg score: 0.0/5 · 115.1s total
full run → smoke probes

devstral-small-2

mistralai/Devstral-Small-2-24B-Instruct-2512
BF16 · 52 GB · 24B active / 24B total · parser: mistral

Mistral's Devstral-Small-2 (24 B dense, Apache-2.0). Co-designed with All-Hands for OpenHands agentic loops; Aider community's top local pick for multi-file refactors in 2026.

tblite 2/20 · gen t/s 37.1 · ttft 1.01s
avg score: 0.0/5 · 13.4s total
full run → smoke probes

hermes-4-70b

NousResearch/Hermes-4-70B-FP8
FP8 · 68 GB · 70B active / 70B total · parser: hermes

Nous Research's Hermes-4 flagship, dense 70 B on Llama-3.1 base. The exact model vLLM's `hermes` tool parser was written for — zero parser mismatch. Primary baseline for "how well does a Nous-native model drive their own agent CLI?"

tblite 2/20
avg score: 0.0/5 · 215.9s total
full run → smoke probes

hermes-4.3

Firworks/Hermes-4.3-36B-nvfp4
NVFP4 · 21 GB · 36B active / 36B total · parser: hermes

Nous Research's Hermes-4.3, built on ByteDance Seed-OSS-36B. Dense 36B with hybrid <think>/<tool_call> training. Paired natively with vLLM's `hermes` parser.

tblite 2/20
avg score: 2.8/5 · 80.9s total
full run → smoke probes

qwen3-coder-next

GadflyII/Qwen3-Coder-Next-NVFP4
NVFP4 · 47 GB · 3B active / 80B total · parser: qwen3_coder

Qwen3-Coder-Next: 80 B / 3 B-active coding-agent MoE. #1 community-cited open coding agent; "DGX Spark practitioner favorite" per the NVIDIA forum. Distinct from qwen3-next-80b (instruct); this one is coder-tuned with the `qwen3_coder` tool-call format.

tblite 2/20 · gen t/s 36.5 · ttft 6.83s
avg score: 0.0/5 · 122.3s total
full run → smoke probes

qwen3-next

nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4
NVFP4 · 45 GB · 3B active / 80B total · parser: hermes

Qwen3-Next MoE, 3 B active / 80 B total. Community default for reliable tool-calling; NVIDIA's NVFP4 quant is Blackwell-native.

tblite 2/20
avg score: 5.0/5 · 108.1s total
full run → smoke probes

Harness health

Auto‑generated from the outcome classifier — models worth re‑running with a bumped task_timeout_s, model/parser mismatches worth chasing, and genuine capacity ceilings where no amount of budget will help.

harness health
9 × "try a longer budget" · 1 × "capacity ceiling"
  1. devstral-small-2 Timeout-heavy — worth retrying with a longer budget

    12/20 tasks hit task_timeout_s before the grader could run. At 37.1 gen t/s and 1.0s mean TTFT, the per-task budget is a tight fit.

    action Bump task_timeout_s 1.5–2× (edit devstral-small-2 in bench/models.yaml) and rerun. Only 5 of 20 were actually graded wrong — there is runway.

  2. glm-4.5-air Timeout-heavy — worth retrying with a longer budget

    11/20 tasks hit task_timeout_s before the grader could run.

    action Bump task_timeout_s 1.5–2× (edit glm-4.5-air in bench/models.yaml) and rerun. Only 5 of 20 were actually graded wrong — there is runway.

  3. hermes-4-70b Passes a few, but dropping tasks to timeout

    2/20 pass, 6 timeout. A longer budget would likely lift the score — the model is competent, just slow.

    action Bump task_timeout_s for hermes-4-70b-fp8 before declaring a final ranking.

  4. hermes-4.3 Passes a few, but dropping tasks to timeout

    2/20 pass, 7 timeout. A longer budget would likely lift the score — the model is competent, just slow.

    action Bump task_timeout_s for hermes-4.3-36b before declaring a final ranking.

  5. kimi-linear Finishes but gets it wrong

    20/20 grader fails and zero timeouts — the model is returning answers in time, just not correct ones. More budget won't help.

    action Probably a capacity ceiling (active-parameter count, context degradation, or tool-use training). Try a bigger sibling before writing the model off.

  6. minimax-m2.7-reap-saricles Timeout-heavy — worth retrying with a longer budget

    14/20 tasks hit task_timeout_s before the grader could run.

    action Bump task_timeout_s 1.5–2× (edit minimax-m2.7-reap-saricles in bench/models.yaml) and rerun. Only 6 of 20 were actually graded wrong — there is runway.

  7. nemotron-3-super Timeout-heavy — worth retrying with a longer budget

    15/20 tasks hit task_timeout_s before the grader could run. At 5.6 gen t/s (mean TTFT not recorded), the per-task budget is a tight fit.

    action Bump task_timeout_s 1.5–2× (edit nemotron-3-super-120b in bench/models.yaml) and rerun. Only 1 of 20 was actually graded wrong — there is runway.

  8. qwen3-coder-next-fp8 Timeout-heavy — worth retrying with a longer budget

    13/20 tasks hit task_timeout_s before the grader could run. At 33.5 gen t/s and 6.2s mean TTFT, the per-task budget is a tight fit.

    action Bump task_timeout_s 1.5–2× (edit qwen3-coder-next-fp8 in bench/models.yaml) and rerun. Only 4 of 20 were actually graded wrong — there is runway.

  9. qwen3-coder-next Timeout-heavy — worth retrying with a longer budget

    15/20 tasks hit task_timeout_s before the grader could run. At 36.5 gen t/s and 6.8s mean TTFT, the per-task budget is a tight fit.

    action Bump task_timeout_s 1.5–2× (edit qwen3-coder-next in bench/models.yaml) and rerun. Only 3 of 20 were actually graded wrong — there is runway.

  10. qwen3-next Timeout-heavy — worth retrying with a longer budget

    10/20 tasks hit task_timeout_s before the grader could run.

    action Bump task_timeout_s 1.5–2× (edit qwen3-next-80b in bench/models.yaml) and rerun. Only 7 of 20 were actually graded wrong — there is runway.
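The retry action is the same one-line config change in every case. A sketch of what that edit might look like — the field name `task_timeout_s` comes from the classifier output, but the surrounding entry shape in bench/models.yaml is an assumption, not the file's actual schema:

```yaml
# bench/models.yaml — illustrative entry shape; only task_timeout_s is
# named by the harness, the other keys are assumptions.
models:
  devstral-small-2:
    repo: mistralai/Devstral-Small-2-24B-Instruct-2512
    parser: mistral
    task_timeout_s: 2400   # was 1200; bumped 2x per the harness-health advice
```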

Why each model failed

Every non‑pass falls into one of three buckets — timeout (task budget ran out), grader fail (agent finished but got the answer wrong), or no tool calls (agent never emitted a parsable tool call). One chart tells the whole story.

model                        pass   grader fail   timeout   no tool calls
nemotron-3-super             4      1             15        0
glm-4.5-air                  3      5             11        1
qwen3-coder-next-fp8         3      4             13        0
devstral-small-2             2      5             12        1
hermes-4-70b                 2      11            6         1
hermes-4.3                   2      11            7         0
qwen3-coder-next             2      3             15        0
qwen3-next                   2      7             10        1
kimi-linear                  0      20            0         0
minimax-m2.7-reap-saricles   0      6             14        0
Outcome breakdown per 20-task TBLite pilot. Amber "timeout" means the task_timeout (1200 s) hit before the grader could run — those aren't "the model got it wrong", they're "the model ran out of time." Grey "no tool calls" means the model never produced a parsable tool call (usually a parser / output-format issue).
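The triage above can be sketched as a tiny outcome classifier. A minimal sketch, assuming each task run reduces to three observable facts (grader verdict, budget expiry, whether any parsable tool call was seen); the type and function names are hypothetical, not the harness's actual code:

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    # Hypothetical reduced record of one TBLite task run.
    graded_pass: bool        # grader ran and accepted the answer
    hit_timeout: bool        # task_timeout_s expired before the grader ran
    emitted_tool_call: bool  # at least one parsable tool call was seen

def classify(run: TaskRun) -> str:
    """Map a run onto the four buckets used in the outcome chart."""
    if run.graded_pass:
        return "pass"
    if not run.emitted_tool_call:
        return "no tool calls"   # parser mismatch or refusal
    if run.hit_timeout:
        return "timeout"
    return "grader fail"         # finished in time, wrong answer
```

The ordering encodes the caption's point: a timeout is checked before a grader fail, so "ran out of time" is never mislabeled as "got it wrong".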

Size vs score

Active‑parameter count vs TBLite pass rate, total params on a log axis. Working hypothesis: models with ≥ 10 B active survive hermes' ~12 K‑token system prompt; below that, things break in interesting ways (see Kimi‑Linear, op. cit.).

[Scatter plot: total params (log axis, 16B–256B) vs score (0–100%); one bubble per model: devstral-small-2, glm-4.5-air, hermes-4-70b, hermes-4.3, kimi-linear, minimax-m2.7-reap-saricles, nemotron-3-super, qwen3-coder-next-fp8, qwen3-coder-next, qwen3-next.]
Bubble size ≈ active parameters. Y-axis is TBLite pass-rate where available, else smoke-test average (those points in blue). The bet: models clustered in the top half have enough active capacity to survive hermes-agent's context.
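The ≥ 10 B-active hypothesis can be sanity-checked directly from the published numbers. A quick sketch, reading active-parameter counts and TBLite passes off the model cards above (kimi-linear and minimax-m2.7 omitted because their cards don't list an active-parameter count):

```python
# (active params in B, TBLite passes out of 20), from the model cards.
runs = {
    "nemotron-3-super":     (12, 4),
    "glm-4.5-air":          (12, 3),
    "qwen3-coder-next-fp8": (3, 3),
    "devstral-small-2":     (24, 2),
    "hermes-4-70b":         (70, 2),
    "hermes-4.3":           (36, 2),
    "qwen3-coder-next":     (3, 2),
    "qwen3-next":           (3, 2),
}

def mean_passes(min_active: float, max_active: float) -> float:
    """Mean TBLite passes for models whose active params fall in the range."""
    scores = [p for a, p in runs.values() if min_active <= a < max_active]
    return sum(scores) / len(scores)

big = mean_passes(10, float("inf"))   # >= 10B active: 2.6 passes on average
small = mean_passes(0, 10)            # <  10B active: ~2.33 passes on average
```

2.6 vs ~2.33 is a weak edge, not a clean break — at n = 20 tasks per model, the hypothesis stays a working hypothesis.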

Per‑category TBLite

Where each model wins and loses. Task categories — sysadmin, scientific‑computing, networking, coding, data‑processing, devops — spread skill very unevenly.

Task categories: backend · backend engineering · bash scripting · build and dependency management · cp · data engineering · debugging · general · machine learning · scientific computing · security · software engineering · system administration

[Per-category TBLite pass-rate chart for nemotron-3-super-120b, glm-4.5-air, qwen3-coder-next-fp8, devstral-small-2, hermes-4-70b-fp8, hermes-4.3-36b, qwen3-coder-next, qwen3-next-80b, kimi-linear, and minimax-m2.7-reap-saricles.]

Smoke‑probe wall‑clock

Heatmap of how long each smoke probe took. The leftmost cell per model is usually the hottest — that's the one‑time Triton JIT compile of the vLLM kernels.

model                        simple   math     reasoning   tool_ls   code
devstral-small-2             2.4s     2.7s     3.0s        2.4s      3.0s
glm-4.5-air                  43.1s    10.4s    10.3s       47.6s     15.8s
hermes-4-70b                 59.5s    56.4s    28.1s       19.4s     52.6s
hermes-4.3                   42.9s    3.2s     3.1s        5.1s      26.6s
kimi-linear                  30.5s    10.0s    11.4s       14.6s     8.1s
minimax-m2.7-reap-saricles   2.5s     2.7s     2.4s        2.4s      2.4s
nemotron-3-super             51.3s    16.0s    16.7s       37.8s     15.8s
qwen3-coder-next-fp8         64.1s    14.1s    7.5s        21.3s     8.2s
qwen3-coder-next             66.7s    11.3s    10.8s       24.7s     8.8s
qwen3-next                   64.6s    10.2s    7.0s        18.8s     7.5s
Smoke-probe wall-clock. The leftmost column of each model absorbs the Triton JIT-compile penalty on the first request (~30–50 s on cold kernels). Later probes drop to the real per-probe cost.

Full matrix

Smoke‑probe scores (0‑5, hand‑graded) alongside the TBLite column. Click a model to open its full run.

Model                        params · quant   TBLite                simple   math     reasoning   tool_ls   code     toy avg
nemotron-3-super             120B · NVFP4     20% (4/20 · 10800s)   51.28s   15.99s   16.67s      37.83s    15.79s   5.0
glm-4.5-air                  106B · NVFP4     15% (3/20 · 5338s)    43.07s   10.37s   10.31s      47.65s    15.84s   0.0
qwen3-coder-next-fp8         80B · FP8        15% (3/20 · 2774s)    64.13s   14.10s   7.45s       21.25s    8.19s    0.0
devstral-small-2             24B · BF16       10% (2/20 · 2786s)    2.37s    2.68s    2.97s       2.40s     2.96s    0.0
hermes-4-70b                 70B · FP8        10% (2/20 · 4800s)    59.46s   56.37s   28.14s      19.35s    52.56s   0.0
hermes-4.3                   36B · NVFP4      10% (2/20 · 2400s)    42.86s   3.18s    3.14s       5.14s     26.56s   2.8
qwen3-coder-next             80B · NVFP4      10% (2/20 · 3059s)    66.66s   11.32s   10.82s      24.70s    8.84s    0.0
qwen3-next                   80B · NVFP4      10% (2/20 · 2770s)    64.57s   10.22s   7.04s       18.76s    7.53s    5.0
kimi-linear                  48B · NVFP4      0% (0/20 · 824s)      30.50s   9.96s    11.41s      14.62s    8.10s    0.0
minimax-m2.7-reap-saricles   172B · NVFP4     0% (0/20 · 5227s)     2.55s    2.66s    2.44s       2.43s     2.37s    0.0

Hardware

NVIDIA DGX Spark — GB10 Grace Blackwell superchip, 119 GB unified LPDDR5X, SM120 (compute cap 12.1). Every model runs in vLLM under the community avarok/vllm-dgx-spark:v14 image because the upstream vllm/vllm-openai:nightly-aarch64 image currently fails the FlashInfer FP4 GEMM probe on SM120. The client running hermes‑agent is a separate box talking to the Spark over LAN.
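For reproduction, the serving side reduces to a vllm serve invocation inside that image. A sketch for the Hermes baseline — `--enable-auto-tool-choice` and `--tool-call-parser` are standard vLLM OpenAI-server flags, but the context length, memory fraction, and port values below are illustrative placeholders, not the exact settings of these runs:

```shell
# Inside avarok/vllm-dgx-spark:v14 on the Spark.
# Values for --max-model-len, --gpu-memory-utilization, and --port are assumed.
vllm serve NousResearch/Hermes-4-70B-FP8 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

The `qwen3_coder`, `glm45`, and `mistral` entries in the model cards slot into the same `--tool-call-parser` flag.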

Three more machines are planned: a Dell Pro Max T2 (RTX PRO 6000 Blackwell, 96 GB), a second DGX Spark (for 2× tensor / pipeline parallel), and a Mac Studio M3 Ultra (512 GB, MLX). See the roadmap for per‑machine model picks and the community models page for the broader list of what people are running with hermes‑agent and peer agentic CLIs.