Candidates, by usability
Three tiers: models that passed at least one TBLite task (top), models
that ran but passed none (middle), and models that couldn't be
benchmarked at all (parked).
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
NVFP4 · 75 GB · 12B active / 120B total · parser: qwen3_coder
NVIDIA's Nemotron-3-Super, a 120B-total / 12B-active MoE (hybrid Mamba-2 + attention).
Trained natively in NVFP4 and packaged by NVIDIA specifically for a single DGX Spark.
Third-party agentic evals from Artificial Analysis: Terminal-Bench Hard 29%,
SWE-Bench Verified 60.5, PinchBench 85.6%.
avg score: 5.0/5 · 137.6s total
Firworks/GLM-4.5-Air-nvfp4
NVFP4 · 58 GB · 12B active / 106B total · parser: glm45
Zhipu AI's GLM-4.5-Air, a 106B-total / 12B-active MoE and a community favorite for
coding-agent loops and Claude-Code-style work. The first model in our set with a
beefy active-parameter budget.
avg score: 0.0/5 · 127.2s total
Qwen/Qwen3-Coder-Next-FP8
FP8 · 80 GB · 3B active / 80B total · parser: qwen3_coder
Qwen's official FP8 build of Qwen3-Coder-Next and the A/B partner to the NVFP4
variant: FP8 kernels are better tested on SM120 than NVFP4 kernels.
avg score: 0.0/5 · 115.1s total
mistralai/Devstral-Small-2-24B-Instruct-2512
BF16 · 52 GB · 24B active / 24B total · parser: mistral
Mistral's Devstral-Small-2 (24B dense, Apache-2.0). Co-designed with All-Hands
for OpenHands agentic loops; the Aider community's top local pick for multi-file
refactors in 2026.
avg score: 0.0/5 · 13.4s total
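The GB column tracks the quant format roughly linearly: BF16 stores 2 bytes per weight, FP8 one byte, NVFP4 about half a byte. A minimal sketch of that arithmetic (the 1.15 overhead factor is our assumption covering higher-precision embeddings, quant scales, and padding, not a measured number):

```python
# Rough weight-footprint estimate per quant format. Ignores KV cache and
# activation memory, which come on top at serve time.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

def weight_gib(total_params_billions: float, fmt: str, overhead: float = 1.15) -> float:
    """Estimated checkpoint size in GiB; `overhead` is an assumed fudge factor."""
    raw_bytes = total_params_billions * 1e9 * BYTES_PER_PARAM[fmt]
    return raw_bytes * overhead / 2**30

print(round(weight_gib(24, "BF16")))    # close to the 52 GB listed for Devstral
print(round(weight_gib(120, "NVFP4")))  # under the 75 GB listed for Nemotron
```

The estimate lands near the table's BF16 entry but low for some NVFP4 checkpoints, which keep more tensors in higher precision.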
NousResearch/Hermes-4-70B-FP8
FP8 · 68 GB · 70B active / 70B total · parser: hermes
Nous Research's Hermes-4 flagship, a dense 70B on a Llama-3.1 base. The exact model
vLLM's `hermes` tool parser was written for, so zero parser mismatch. Our primary
baseline for "how well does a Nous-native model drive their own agent CLI?"
avg score: 0.0/5 · 215.9s total
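Parser pairing is set at serve time in vLLM. A sketch for this entry, using vLLM's standard tool-calling flags (the context length is an arbitrary pick, not a recommendation):

```shell
# Serve the FP8 checkpoint with vLLM's hermes tool-call parser.
# --enable-auto-tool-choice lets the model emit tool calls unprompted;
# --max-model-len is an assumption here, tune it to your KV-cache budget.
vllm serve NousResearch/Hermes-4-70B-FP8 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 32768
```

The other entries swap only the model id and the `--tool-call-parser` value (`qwen3_coder`, `glm45`, `mistral`).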
Firworks/Hermes-4.3-36B-nvfp4
NVFP4 · 21 GB · 36B active / 36B total · parser: hermes
Nous Research's Hermes-4.3, built on ByteDance's Seed-OSS-36B. A dense 36B with hybrid
<think>/<tool_call> training, paired natively with vLLM's `hermes` parser.
avg score: 2.8/5 · 80.9s total
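The `hermes` format both Nous entries rely on wraps a JSON object in `<tool_call>` tags. A minimal sketch of what a parser has to extract (the regex and helper name are ours for illustration, not vLLM's implementation):

```python
import json
import re

# Hermes-style tool calls: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
HERMES_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str) -> list[dict]:
    """Return every tool call embedded in a model reply as a parsed dict."""
    return [json.loads(match) for match in HERMES_CALL.findall(text)]

reply = '<tool_call>{"name": "read_file", "arguments": {"path": "main.py"}}</tool_call>'
print(parse_tool_calls(reply))
```

A "parser mismatch" in this table means the model emits some other convention (e.g. the XML-ish `qwen3_coder` style), so a pattern like this finds nothing and the agent loop stalls.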
GadflyII/Qwen3-Coder-Next-NVFP4
NVFP4 · 47 GB · 3B active / 80B total · parser: qwen3_coder
Qwen3-Coder-Next: an 80B-total / 3B-active coding-agent MoE. The #1 community-cited
open coding agent and a "DGX Spark practitioner favorite" per the NVIDIA forum.
Distinct from qwen3-next-80b (instruct); this one is coder-tuned and uses the
`qwen3_coder` tool-call format.
avg score: 0.0/5 · 122.3s total
nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4
NVFP4 · 45 GB · 3B active / 80B total · parser: hermes
Qwen3-Next MoE, 3B active / 80B total. The community default for reliable tool-calling;
NVIDIA's NVFP4 quant is Blackwell-native.
avg score: 5.0/5 · 108.1s total