Hermes × Spark Shootout

a benchmark, in the open


leader · 20-task TBLite pilot · 8 models measured · edition № 2026.04

The winner (so far)

nemotron-3-super

NVIDIA's Nemotron-3-Super, 120 B total / 12 B active MoE (Mamba2 + attn hybrid).

4/20

20%

tblite pass rate

gen tok/s: 5.6
mean ttft: —
peak conc.: 8
run length: 68m
parameters: 12B/120B
quant: NVFP4

Read the full run →

smoke sweep

Qwen3‑Next‑80B and Nemotron‑3‑Super both landed 5/5 on instruction‑following, arithmetic, reasoning, tool‑use, and a tiny coding probe.

tblite pilot

The 20% pass‑rate ceiling reflects how hard Terminal‑Bench 2.0 is for open weights below the frontier: no model on the leaderboard scores above 4/20. The dominant failure mode is running out of task budget, not hallucinating.

harness note

The MiniMax‑M2.7 cudagraph fix turned a null‑byte kernel deadlock into a real run — see the health panel below for everything else worth re‑running before calling these final scores.

Standings

Every run that finished, ranked by TBLite pass rate, with serving throughput.

#	model	tblite pass	gen tok/s	ttft
01	nemotron-3-super	4/20	5.6	—
02	glm-4.5-air	3/20	—	—
03	qwen3-coder-next-fp8	3/20	33.5	6.2s
04	hermes-4-70b	2/20	—	—
05	hermes-4.3	2/20	—	—
06	qwen3-next	2/20	—	—
07	devstral-small-2	2/20	37.1	1.0s
08	qwen3-coder-next	2/20	36.5	6.8s
09	kimi-linear	0/20	—	—
10	minimax-m2.7-reap-saricles	0/20	—	—
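The ranking rule stated above (pass rate first, throughput shown alongside) can be sketched in a few lines. This is an illustration only: the `Run` class and `rank` function are hypothetical names, not the harness's code, and the page does not say how ties are broken, so the sketch simply keeps input order for equal pass rates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    """One finished TBLite run (hypothetical record shape)."""
    model: str
    passed: int
    total: int = 20
    gen_tok_s: Optional[float] = None  # None where the page shows a dash

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def rank(runs: List[Run]) -> List[Run]:
    # Highest pass rate first. sorted() is stable, so runs with equal
    # pass rates keep their input order (tie-break is an assumption).
    return sorted(runs, key=lambda r: r.pass_rate, reverse=True)


runs = [
    Run("devstral-small-2", 2, gen_tok_s=37.1),
    Run("nemotron-3-super", 4, gen_tok_s=5.6),
    Run("kimi-linear", 0),
    Run("glm-4.5-air", 3),
]

for place, run in enumerate(rank(runs), start=1):
    speed = "n/a" if run.gen_tok_s is None else f"{run.gen_tok_s:.1f}"
    print(f"{place:02d} {run.model:<20} {run.passed}/{run.total} {speed}")
```

Throughput is carried along for display but deliberately plays no part in the ordering, which matches the table: nemotron-3-super leads at 5.6 gen tok/s.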

Candidates, by usability

Three tiers: models that pass at least one TBLite task (top), models that ran but didn't pass any (below), and models that couldn't be benchmarked at all (parked).

nemotron-3-super

nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

NVFP4 · 75 GB · 12B active · 120B total · parser: qwen3_coder

NVIDIA's Nemotron-3-Super, 120 B total / 12 B active MoE (Mamba2 + attn hybrid). Trained in NVFP4 natively and packaged by NVIDIA specifically for a single DGX Spark. Third-party Artificial-Analysis agentic eval: Terminal-Bench Hard 29%, SWE-Bench Verified 60.5, PinchBench 85.6%.

tblite

4/20

gen t/s

5.6

ttft

—

avg score: 5.0/5 137.6s total

full run → smoke probes

glm-4.5-air

Firworks/GLM-4.5-Air-nvfp4

NVFP4 · 58 GB · 12B active · 106B total · parser: glm45

Zhipu AI's GLM-4.5-Air. 106 B total / 12 B active MoE — community favorite for coding-agent loops and Claude-Code-style work. First model in our set with a beefy active-parameter budget.

tblite

3/20

avg score: 0.0/5 127.2s total

full run → smoke probes

qwen3-coder-next-fp8

Qwen/Qwen3-Coder-Next-FP8

FP8 · 80 GB · 3B active · 80B total · parser: qwen3_coder

Qwen's official FP8 build of Qwen3-Coder-Next. A/B partner to the NVFP4 variant — FP8 kernels are better-tested on SM120 than NVFP4.

tblite

3/20

gen t/s

33.5

ttft

6.19s

avg score: 0.0/5 115.1s total

full run → smoke probes

devstral-small-2

mistralai/Devstral-Small-2-24B-Instruct-2512

BF16 · 52 GB · 24B active · 24B total · parser: mistral

Mistral's Devstral-Small-2 (24 B dense, Apache-2.0). Co-designed with All-Hands for OpenHands agentic loops; Aider community's top local pick for multi-file refactors in 2026.

tblite

2/20

gen t/s

37.1

ttft

1.01s

avg score: 0.0/5 13.4s total

full run → smoke probes

hermes-4-70b

NousResearch/Hermes-4-70B-FP8

FP8 · 68 GB · 70B active · 70B total · parser: hermes

Nous Research's Hermes-4 flagship, dense 70 B on Llama-3.1 base. The exact model vLLM's `hermes` tool parser was written for — zero parser mismatch. Primary baseline for "how well does a Nous-native model drive their own agent CLI?"

tblite

2/20

avg score: 0.0/5 215.9s total

full run → smoke probes

hermes-4.3

Firworks/Hermes-4.3-36B-nvfp4

NVFP4 · 21 GB · 36B active · 36B total · parser: hermes

Nous Research's Hermes-4.3, built on ByteDance Seed-OSS-36B. Dense 36B with hybrid <think>/<tool_call> training. Paired natively with vLLM's `hermes` parser.

tblite

2/20

avg score: 2.8/5 80.9s total

full run → smoke probes

qwen3-coder-next

GadflyII/Qwen3-Coder-Next-NVFP4

NVFP4 · 47 GB · 3B active · 80B total · parser: qwen3_coder

Qwen3-Coder-Next: 80 B / 3 B-active coding-agent MoE. #1 community-cited open coding agent; "DGX Spark practitioner favorite" per the NVIDIA forum. Distinct from qwen3-next-80b (instruct); this one is coder-tuned with the `qwen3_coder` tool-call format.

tblite

2/20

gen t/s

36.5

ttft

6.83s

avg score: 0.0/5 122.3s total

full run → smoke probes

qwen3-next

nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4

NVFP4 · 45 GB · 3B active · 80B total · parser: hermes

Qwen3-Next MoE, 3 B active / 80 B total. Community default for reliable tool-calling; NVIDIA's NVFP4 quant is Blackwell-native.

tblite

2/20

avg score: 5.0/5 108.1s total

full run → smoke probes

kimi-linear

20/20 wrong answer

3B/48B · NVFP4 · parser pythonic

minimax-m2.7-reap-saricles

14/20 ran out of time

10B/172B · NVFP4 · parser minimax_m2

status	model	active/total	quant	size	note
ARCH	glm-4.7-flash	3B/30B	NVFP4	20 GB	vLLM image's Transformers doesn't know this architecture yet
ARCH	qwen3.5-35b-a3b	3B/35B	BF16	72 GB	vLLM image's Transformers doesn't know this architecture yet
OVERSIZE	minimax-m2.1	10B/115B	NVFP4	122 GB	weights exceed single-Spark unified memory
SLOW	minimax-m2.7-reap-saricles	10B/172B	NVFP4	99 GB	runs cleanly but times out on single Spark
DOWNLOADING	gemma-4-31b	31B/31B	NVFP4	16 GB	weights in flight to the Spark cache
DOWNLOADING	hermes-4-14b	14B/14B	FP8	14 GB	weights in flight to the Spark cache
DOWNLOADING	minimax-m2.7-mjpansa	10B/172B	W4A16	92 GB	weights in flight to the Spark cache
DOWNLOADING	minimax-m2.7-reap-dervig-139b	10B/139B	NVFP4	80 GB	weights in flight to the Spark cache
DOWNLOADING	openhands-lm-32b	32B/32B	BF16	64 GB	weights in flight to the Spark cache
DOWNLOADING	qwen3-coder-30b-a3b	3B/30B	FP8	30 GB	weights in flight to the Spark cache
DOWNLOADING	seed-oss-36b	36B/36B	FP8	36 GB	weights in flight to the Spark cache

Harness health

Auto‑generated from the outcome classifier — models worth re‑running with a bumped task_timeout_s, model/parser mismatches worth chasing, and genuine capacity ceilings where no amount of budget will help.

9 try a longer budget · 1 capacity ceiling

⟳

devstral-small-2 Timeout-heavy — worth retrying with a longer budget

12/20 tasks hit task_timeout_s before the grader could run. At 37.1 gen t/s and 1.0s mean TTFT, the per-task budget is a tight fit.

action Bump task_timeout_s 1.5–2× (edit devstral-small-2 in bench/models.yaml) and rerun. Only 5 of 20 were actually graded wrong — there is runway.
⟳

glm-4.5-air Timeout-heavy — worth retrying with a longer budget

11/20 tasks hit task_timeout_s before the grader could run.

action Bump task_timeout_s 1.5–2× (edit glm-4.5-air in bench/models.yaml) and rerun. Only 5 of 20 were actually graded wrong — there is runway.
⟳

hermes-4-70b Passes a few, but dropping tasks to timeout

2/20 pass, 6 timeout. A longer budget would likely lift the score — the model is competent, just slow.

action Bump task_timeout_s for hermes-4-70b-fp8 before declaring a final ranking.
⟳

hermes-4.3 Passes a few, but dropping tasks to timeout

2/20 pass, 7 timeout. A longer budget would likely lift the score — the model is competent, just slow.

action Bump task_timeout_s for hermes-4.3-36b before declaring a final ranking.
≠

kimi-linear Finishes but gets it wrong

20/20 grader fails and 0 timeouts: the model is returning answers in time, just not correct ones. More budget won't help.

action Probably a capacity ceiling (active-parameter count, context degradation, or tool-use training). Try a bigger sibling before writing the model off.
⟳

minimax-m2.7-reap-saricles Timeout-heavy — worth retrying with a longer budget

14/20 tasks hit task_timeout_s before the grader could run.

action Bump task_timeout_s 1.5–2× (edit minimax-m2.7-reap-saricles in bench/models.yaml) and rerun. Only 6 of 20 were actually graded wrong — there is runway.
⟳

nemotron-3-super Timeout-heavy — worth retrying with a longer budget

15/20 tasks hit task_timeout_s before the grader could run. At 5.6 gen t/s (mean TTFT not recorded), the per-task budget is a tight fit.

action Bump task_timeout_s 1.5–2× (edit nemotron-3-super-120b in bench/models.yaml) and rerun. Only 1 of 20 was actually graded wrong, so there is runway.
⟳

qwen3-coder-next-fp8 Timeout-heavy — worth retrying with a longer budget

13/20 tasks hit task_timeout_s before the grader could run. At 33.5 gen t/s and 6.2s mean TTFT, the per-task budget is a tight fit.

action Bump task_timeout_s 1.5–2× (edit qwen3-coder-next-fp8 in bench/models.yaml) and rerun. Only 4 of 20 were actually graded wrong — there is runway.
⟳

qwen3-coder-next Timeout-heavy — worth retrying with a longer budget

15/20 tasks hit task_timeout_s before the grader could run. At 36.5 gen t/s and 6.8s mean TTFT, the per-task budget is a tight fit.

action Bump task_timeout_s 1.5–2× (edit qwen3-coder-next in bench/models.yaml) and rerun. Only 3 of 20 were actually graded wrong — there is runway.
⟳

qwen3-next Timeout-heavy — worth retrying with a longer budget

10/20 tasks hit task_timeout_s before the grader could run.

action Bump task_timeout_s 1.5–2× (edit qwen3-next-80b in bench/models.yaml) and rerun. Only 7 of 20 were actually graded wrong — there is runway.
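The triage behind these cards can be approximated as a small rule set. The sketch below is a guess at the shape of such a classifier, not the harness's actual code: the function names are hypothetical, and the "half the tasks timed out" threshold is an assumption chosen only because it reproduces the labels shown above (10/20 timeouts reads as timeout-heavy, 6 or 7 does not).

```python
def triage(passed: int, timeouts: int, total: int = 20) -> str:
    """Return a harness-health label for one model's TBLite run."""
    if passed == 0 and timeouts == 0:
        # Every task finished in time and none passed: more budget won't help.
        return "capacity ceiling"
    if timeouts * 2 >= total:
        # Assumed threshold: at least half the tasks ran out of time.
        return "timeout-heavy"
    if passed > 0 and timeouts > 0:
        return "passes a few, drops tasks to timeout"
    return "no action"


def bumped_budget(task_timeout_s: int, factor: float = 1.5) -> int:
    """Scale the per-task budget by the suggested 1.5-2x factor."""
    if not 1.5 <= factor <= 2.0:
        raise ValueError("the panel suggests a factor between 1.5 and 2")
    return int(task_timeout_s * factor)


# (passed, timeouts) pairs taken from the cards above.
print(triage(0, 0))    # kimi-linear
print(triage(2, 12))   # devstral-small-2
print(triage(2, 6))    # hermes-4-70b
print(bumped_budget(1200), bumped_budget(1200, 2.0))
```

With the 1200 s budget quoted in the outcome chart, the suggested bump lands between 1800 s and 2400 s per task.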

Why each model failed

Every non‑pass falls into one of three buckets: timeout (task budget ran out), grader fail (agent finished but got the answer wrong), or no tool calls (agent never emitted a parsable tool call). This single chart tells the full story.

Outcome breakdown per 20-task TBLite pilot. Amber "timeout" means task_timeout_s (1200 s) hit before the grader could run; those aren't "the model got it wrong", they're "the model ran out of time." Grey "no tool calls" means the model never produced a parsable tool call (usually a parser / output-format issue).
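The bucketing described here can be sketched as follows. The record fields and the precedence order (timeout checked first, then missing tool calls, then the grader verdict) are assumptions for illustration; the page does not document how the outcome classifier resolves overlapping cases. The sample records are synthetic and built only to reproduce nemotron-3-super's reported split of 4 passes, 15 timeouts, and 1 grader fail.

```python
from collections import Counter
from typing import NamedTuple


class TaskResult(NamedTuple):
    """Hypothetical per-task record emitted by the harness."""
    timed_out: bool
    tool_calls: int       # number of parsable tool calls the agent emitted
    grader_passed: bool


def classify(result: TaskResult) -> str:
    # Assumed precedence: a timeout wins, because the grader never ran.
    if result.timed_out:
        return "timeout"
    if result.tool_calls == 0:
        return "no tool calls"
    return "pass" if result.grader_passed else "grader fail"


# Synthetic records matching the reported nemotron-3-super counts.
records = (
    [TaskResult(True, 3, False)] * 15
    + [TaskResult(False, 5, True)] * 4
    + [TaskResult(False, 5, False)] * 1
)

breakdown = Counter(classify(r) for r in records)
print(dict(breakdown))
```

Checking the timeout first keeps "ran out of time" separate from "got it wrong", which is the distinction the chart is built around.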

Size vs score

Active‑parameter count vs TBLite pass rate, total params on a log axis. Working hypothesis: models with ≥ 10 B active survive hermes' ~12 K‑token system prompt; below that, things break in interesting ways (see Kimi‑Linear, op. cit.).

Bubble size ≈ active parameters. Y-axis is TBLite pass-rate where available, else smoke-test average (those points in blue). The bet: models clustered in the top half have enough active capacity to survive hermes-agent's context.

Per‑category TBLite

Where each model wins and loses. Task categories — sysadmin, scientific‑computing, networking, coding, data‑processing, devops — spread skill very unevenly.

Per-category TBLite pass rate. Series: nemotron-3-super-120b, glm-4.5-air, qwen3-coder-next-fp8, devstral-small-2, hermes-4-70b-fp8, hermes-4.3-36b, qwen3-coder-next, qwen3-next-80b, kimi-linear, minimax-m2.7-reap-saricles.

Smoke‑probe wall‑clock

Heatmap of how long each smoke probe took. The leftmost cell per model is usually the hottest: that's the one‑time Triton JIT compile of the vLLM kernels.

Smoke-probe wall-clock. The leftmost column of each model absorbs the Triton JIT-compile penalty on the first request (~30–50 s on cold kernels). Later probes drop to the real per-probe cost.
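One rough way to separate the cold-start cost from the steady-state probe cost is to compare the first probe against the median of the rest. This is a back-of-the-envelope estimate, not how the page computes anything; `warmup_penalty` is a hypothetical helper, and the timings are copied from the full matrix (simple, math, reasoning, tool_ls, code).

```python
from statistics import median
from typing import Sequence


def warmup_penalty(probe_seconds: Sequence[float]) -> float:
    """Estimate cold-start cost: first probe minus the median of later probes."""
    if len(probe_seconds) < 2:
        raise ValueError("need the first probe plus at least one later probe")
    first, rest = probe_seconds[0], probe_seconds[1:]
    return first - median(rest)


# Wall-clock seconds per probe, in column order, from the full matrix.
timings = {
    "nemotron-3-super": [51.28, 15.99, 16.67, 37.83, 15.79],
    "qwen3-next": [64.57, 10.22, 7.04, 18.76, 7.53],
    "devstral-small-2": [2.37, 2.68, 2.97, 2.4, 2.96],
}

for model, seconds in timings.items():
    print(f"{model:<18} warm-up ≈ {warmup_penalty(seconds):6.2f}s")
```

For nemotron-3-super the estimate is roughly 35 s, inside the ~30–50 s range quoted above. For devstral-small-2 it comes out slightly negative, meaning that run shows no visible cold-start cost. The median is used rather than the mean so that one slow later probe, such as tool_ls, does not skew the baseline.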

Full matrix

Per‑probe smoke wall‑clock times and the hand‑graded smoke average (0‑5, "Toy avg") alongside the TBLite column. Click a model to open its full run.

Model	TBLite	simple	math	reasoning	tool_ls	code	Toy avg
nemotron-3-super 120B · NVFP4 · smoke	20% 4/20 · 10800s → tasks	51.28s	15.99s	16.67s	37.83s	15.79s	5.0
glm-4.5-air 106B · NVFP4 · smoke	15% 3/20 · 5338s → tasks	43.07s	10.37s	10.31s	47.65s	15.84s	0.0
qwen3-coder-next-fp8 80B · FP8 · smoke	15% 3/20 · 2774s → tasks	64.13s	14.1s	7.45s	21.25s	8.19s	0.0
devstral-small-2 24B · BF16 · smoke	10% 2/20 · 2786s → tasks	2.37s	2.68s	2.97s	2.4s	2.96s	0.0
hermes-4-70b 70B · FP8 · smoke	10% 2/20 · 4800s → tasks	59.46s	56.37s	28.14s	19.35s	52.56s	0.0
hermes-4.3 36B · NVFP4 · smoke	10% 2/20 · 2400s → tasks	42.86s	3.18s	3.14s	5.14s	26.56s	2.8
qwen3-coder-next 80B · NVFP4 · smoke	10% 2/20 · 3059s → tasks	66.66s	11.32s	10.82s	24.7s	8.84s	0.0
qwen3-next 80B · NVFP4 · smoke	10% 2/20 · 2770s → tasks	64.57s	10.22s	7.04s	18.76s	7.53s	5.0
kimi-linear 48B · NVFP4 · smoke	0% 0/20 · 824s → tasks	30.5s	9.96s	11.41s	14.62s	8.1s	0.0
minimax-m2.7-reap-saricles 172B · NVFP4 · smoke	0% 0/20 · 5227s → tasks	2.55s	2.66s	2.44s	2.43s	2.37s	0.0

Hardware

NVIDIA DGX Spark — GB10 Grace Blackwell superchip, 119 GB unified LPDDR5X, SM120 (compute cap 12.1). Every model runs in vLLM under the community avarok/vllm-dgx-spark:v14 image because the upstream vllm/vllm-openai:nightly-aarch64 image currently fails the FlashInfer FP4 GEMM probe on SM120. The client running hermes‑agent is a separate box talking to the Spark over LAN.

Three more machines are planned: a Dell Pro Max T2 (RTX PRO 6000 Blackwell, 96 GB), a second DGX Spark (for 2× tensor / pipeline parallel), and a Mac Studio M3 Ultra (512 GB, MLX). See the roadmap for per‑machine model picks and the community models page for the broader list of what people are running with hermes‑agent and peer agentic CLIs.

The best local model for hermes‑agent on a DGX Spark.
