What the community actually runs

A comprehensive index of open-weight models people mention using with hermes-agent and peer agentic CLIs (Cline, Roo-Code, Aider, OpenHands, opencode, Goose, mini-SWE-agent, Qwen-Code). Every row is sourced. For each entry we note where it fits on our own hardware roadmap — already benchmarked, planned, or out-of-reach. Compiled 2026-04-22 from two parallel research passes across GitHub issues, the NVIDIA DGX Spark forum, HuggingFace discussions, r/LocalLLaMA, Hacker News, Unsloth/vLLM/SGLang recipe docs, Latent.Space's top-local-models roundup, and agentic-CLI documentation.

Status legend

Our status call for each model: does it fit on our current or planned hardware, and at what phase of testing does it sit?

benchmarked: We've run the TBLite 20-task pilot against this model; the result link goes to transcripts.
planned: Has a slot on the hardware roadmap; will run once the machine joins the rig.
candidate: Hasn't made our shortlist yet but fits our existing hardware; worth adding.
oversized: Does not fit any current or planned machine we control (single 119 GB Spark, 96 GB Dell, 238 GB 2× Spark, 512 GB Mac).
known-broken: Community reports say agentic tool calling fails reliably, or the model architecture lacks tool support.
retired: Superseded by later versions and rarely cited in 2026 sources.
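The oversized / planned calls above mostly reduce to arithmetic: checkpoint size versus machine memory, plus headroom for KV cache and runtime overhead. A minimal sketch, assuming a flat ~20% overhead (the helper and the headroom figure are our assumptions, not from any cited source; quantized checkpoints carry scale factors and mixed-precision layers, so use the shipped checkpoint size rather than params × bits/8):

```python
# Rough memory-fit check behind the oversized / planned statuses.
# Hypothetical helper: compares a checkpoint's on-disk size against our
# machines, with ~20% headroom (our assumption) for KV cache and runtime
# overhead. Treat results as a first cut, not a guarantee.

MACHINES_GB = {
    "single Spark": 119,
    "Dell Blackwell": 96,
    "2x Spark": 238,
    "Mac": 512,
}

def fits(checkpoint_gb: float, headroom: float = 1.2) -> list[str]:
    """Machines with room for the checkpoint plus headroom."""
    return [m for m, gb in MACHINES_GB.items() if gb >= checkpoint_gb * headroom]

# MiniMax-M2.7-NVFP4 ships at ~131 GB: only 2x Spark or the Mac clear it.
print(fits(131))  # ['2x Spark', 'Mac']
```

By this rule the 122 GB MiniMax-M2.1 NVFP4 build also lands on 2× Spark, which matches its row below.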

What the community converges on (and where it disagrees)

Convergence

Disagreement

Rising vs falling

Rising: GLM-4.7 / GLM-5.1 (biggest mover in Q1 2026), MiniMax-M2.5 / M2.7, Qwen3-Coder-Next / Qwen3.5-35B-A3B, Seed-OSS-36B (real vLLM parser upstreamed), Hermes-4.3 (50× agentic-trace post-training expansion), and Apriel-Nemotron-15B-Thinker.

Falling: Llama 3.x as an agent driver (still the Hermes-agent README default, but with almost no 2026 anecdotes), self-hosted DeepSeek-R1 / V3 (known-broken tool calling), Codestral / Magistral (displaced by Devstral for agents), Yi-Coder / Yi-Lightning (absent from 2026 discussion), and Granite-Code (appears only in aggregator lists).

Nous Research

Model | Size | Status | Community framing | Source
NousResearch/Hermes-4-405B-FP8 | dense 405B · FP8 | planned 2× Spark | Nous flagship; native <tool_call> tags, built-in vLLM/SGLang hermes parser. | HF card
NousResearch/Hermes-4-70B-FP8 | dense 70B · FP8 | 2/20 | "Stays in character as agent past step 20" per community. | HF card
NousResearch/Hermes-4.3-36B | dense 36B (Seed-OSS base) · NVFP4 | 2/20 | JSON-schema-conditioned; 5M-sample agent-trace training (~50× expansion); Psyche decentralized-trained. | review
NousResearch/Hermes-4-35B-A3B | 35B MoE / 3B active | candidate | Fits a 4090 at Q4_K_M per community write-ups; 128K ctx; trained on agentic traces. Would be a useful small-MoE companion to dense Hermes-4.3. | blog
NousResearch/Hermes-4-14B | dense 14B | candidate | Smaller dense for budget-tier coverage. | hermes-agent repo
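The Hermes rows lean on native <tool_call> tags plus vLLM/SGLang's hermes parser. As a rough illustration of what that serving-side parsing step does — the model emits JSON inside tags, and the parser splits it out of the assistant text — here is a simplified sketch (not vLLM's actual implementation):

```python
import json
import re

# Simplified illustration of Hermes-style tool-call extraction. The model
# emits JSON inside <tool_call>...</tool_call> tags; the serving-side
# parser separates those from plain assistant content. Sketch only, not
# vLLM's real "hermes" parser.

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def split_tool_calls(text: str) -> tuple[str, list[dict]]:
    """Return (plain content, parsed tool calls) from a raw completion."""
    calls = [json.loads(m) for m in TOOL_CALL_RE.findall(text)]
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls

raw = (
    "Let me check the file first.\n"
    '<tool_call>{"name": "read_file", "arguments": {"path": "src/main.py"}}</tool_call>'
)
content, calls = split_tool_calls(raw)
print(content)           # Let me check the file first.
print(calls[0]["name"])  # read_file
```

In production the parser also has to handle streaming and malformed JSON; this sketch only covers the happy path.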

Qwen

Model | Size | Status | Community framing | Source
Qwen/Qwen3-Coder-480B-A35B-Instruct | 480B MoE / 35B active · NVFP4/FP8 | planned 2× Spark + Mac MLX | Frontier OSS agentic coder; 61.8% Aider Polyglot; 256K→1M ctx. Highest-ROI add per the NVIDIA forum. | repo, build.nvidia
Qwen/Qwen3-Coder-Next | 80B MoE / 3B active · FP8 native | candidate (highest priority) | DGX Spark practitioner favorite; ~43 t/s on Spark FP8 per the NV forum. 66.2% Aider, 71.3% SWE-Bench Verified. Not yet in our set — gap! | NV forum
Qwen/Qwen3-Next-80B-A3B-Instruct (NVIDIA NVFP4 build) | 80B MoE / 3B active | 2/20 | Instruct sibling of Coder-Next; swept our 5/5 smoke test. | HF card
Qwen/Qwen3.5-35B-A3B | 35B MoE / 3B active · MXFP4 | candidate | "DGX Spark darling — 70 t/s with vLLM 0.17 MXFP4 patches (TP=2)." | NV forum
Qwen/Qwen3.6-35B-A3B | 35B MoE / 3B active | candidate | Incremental refresh aimed at "real-world agent reliability"; FP8 has landed on DGX Spark. | NV forum
Qwen/Qwen3.5-122B-A10B | 122B MoE / 10B active · NVFP4 | planned (fits single Spark) | High-quality reasoning; community benchmark at ~42 t/s. | SPARK recipe
Qwen/Qwen3-Coder-30B-A3B-Instruct | 30B MoE / 3B active | candidate | "Most-recommended local default for Cline/Roo/Qwen-Code; 46 t/s on M3 Ultra 96 GB." | willitrunai
Qwen/Qwen3-30B-A3B-Instruct | 30B MoE / 3B active | candidate | Hermes-agent issue #523: "best overall local." | issue
Qwen/Qwen3-32B | dense 32B | candidate | Hermes-agent issue #523: "maximum-quality single-GPU dense." | issue
Qwen/Qwen3-8B-Instruct | dense 8B | candidate | Budget/small-GPU pick; solid tool calling. | issue
Qwen/Qwen2.5-Coder-32B-Instruct | dense 32B | superseded | "Still the baseline in RooCode local-eval; 73.7 on Aider." Mostly displaced by the Qwen3-Coder family. | RooCode-Local-Eval

Kimi / Moonshot

Model | Size | Status | Community framing | Source
Firworks/Kimi-Linear-48B-A3B-Instruct-nvfp4 | 48B MoE / 3B active · NVFP4 | 0/20 | Kimi Delta Attention (KDA) hybrid-linear. Architectural novelty; limited capacity for long agent prompts. | our test
moonshotai/Kimi-K2-Thinking | 1T MoE / 32B active | planned Mac MLX | SOTA open on τ²-Bench (87.4); Cline/Roo/Kilo compatibility noted by Moonshot. | Unsloth
moonshotai/Kimi-K2.5 | 1T MoE / 32B active · MLX 3.6-bit (~470 GB) | planned Mac MLX | 76.8% SWE-Bench Verified. Top open weight per the April 2026 OpenClaw vote. | Unsloth, OpenClaw
moonshotai/Kimi-K2.6-Code-Preview | 1T MoE / 32B active | planned Mac MLX | 1T hybrid-thinking coder, 256K ctx; Hermes-Agent day-0 support; "catching Opus 4.6." | Latent.Space
moonshotai/Kimi-K2-Instruct | 1T MoE / 32B active | oversized | "Near-100% tool-call accuracy." | blog

Z.AI / GLM

Model | Size | Status | Community framing | Source
Firworks/GLM-4.5-Air-nvfp4 | 106B MoE / 12B active · NVFP4 | 3/20 | Tool-use and browsing optimized; community Cline/Roo pick. | Cline docs
zai-org/GLM-4.6 | 355B MoE | oversized for single Spark | Open coding SOTA predecessor to 4.7; often cited vs Sonnet-4/GPT-5. | blog
zai-org/GLM-4.7 | 350B MoE | oversized | SWE-Bench 73.8%; Terminal-Bench 2.0 41% (+16.5 over 4.6); "thinking-before-acting" for Claude Code. | HF
GadflyII/GLM-4.7-Flash-NVFP4 | 30B MoE / 3.6B active · NVFP4 | planned Dell Blackwell | Purpose-built 24 GB-class agentic model; 200K ctx; "best agent programming model I've used" (thin anonymous cite). | Unsloth
zai-org/GLM-5 / GLM-5.1 | 744B MoE / 40B active | oversized | #1 on SWE-Bench Pro (58.4); leads the Vellum agentic composite at 55.0. | analysis

MiniMax

Model | Size | Status | Community framing | Source
MiniMaxAI/MiniMax-M2 | 230B MoE / 10B active | oversized for single Spark | "King of open-source LLMs for agentic tool calling" (VentureBeat). | VentureBeat
Tengyunw/MiniMax-M2.1-NVFP4 | 115B MoE / 10B active · 122 GB NVFP4 | planned 2× Spark | Interleaved <think> between tool calls; M2-family "agentic king." | deployment guide
nvidia/MiniMax-M2.5-NVFP4 | 230B MoE / 10B active · NVFP4 | planned Dell / 2× Spark | SWE-Bench Verified 80.2%; 20–25 t/s on Blackwell/M4 Ultra; 2nd on the Vellum agentic composite. | guide
lukealonso/MiniMax-M2.7-NVFP4 (full) | 230B MoE / 10B active · 131 GB NVFP4 | planned 2× Spark | Self-evolving agent training; 56.22% SWE-Bench Pro, 57% Terminal-Bench 2. | MarkTechPost
dervig/m51Lab-MiniMax-M2.7-REAP-139B | 139B MoE / 10B active · NVFP4 | null-byte output | NVFP4 MoE kernel deadlock on GB10: every turn returned literal null bytes. cudagraph_mode:none fixes it on the saricles variant. | our test
saricles/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10 | 172B MoE / 10B active · NVFP4 | 0/20 | Single-Spark-ready REAP variant. Runs clean with cudagraph_mode:none, but 14/20 tasks timed out at ~42 t/s with ~3-minute turns. Needs 2× Spark for realistic agentic use. | our test
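The GB10 kernel deadlock shows up in transcripts as turns of literal NUL bytes, which is easy to screen for before counting a run as a real attempt. A tiny guard along these lines (hypothetical helper; the threshold is our choice, based on the symptom we observed):

```python
# Screen transcripts for the GB10 NVFP4 MoE failure mode seen on the
# MiniMax REAP variants: turns coming back as literal NUL bytes instead
# of text. Hypothetical helper; the 50% threshold is an arbitrary choice.

def looks_null_poisoned(turn: str, threshold: float = 0.5) -> bool:
    """True if the turn is empty or mostly NUL characters."""
    if not turn:
        return True
    nul_count = sum(1 for ch in turn if ch == "\x00")
    return nul_count / len(turn) >= threshold

print(looks_null_poisoned("\x00" * 128))           # True  -> kernel-bug symptom
print(looks_null_poisoned("Let me run the tests"))  # False -> normal turn
```

On the serving side, the fix noted in the table was disabling CUDA graphs (cudagraph_mode:none) for the saricles variant.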

DeepSeek

Model | Size | Status | Community framing | Source
nvidia/DeepSeek-V3.2-NVFP4 | 685B MoE / 37B active · ~170 GB NVFP4 | planned 2× Spark | First OSS model with thinking integrated into tool use; DSA attention. | vLLM recipe
deepseek-ai/DeepSeek-V3.2-Exp | 685B MoE / 37B active | broken tool calling | Cline issue tracker full of tool-call failures; parser leaks XML into content. | vllm #36654, cline #8365
deepseek-ai/DeepSeek-V3.1 | 685B MoE / 37B active | oversized | Tool-use jump vs V3-0324; strong code-agent bench numbers on paper. | review
deepseek-ai/DeepSeek-R1 | 671B MoE (or 32B distill) | broken for agents | "You did not use a tool" loops even on 4× H100; "does not natively support tool calling." | cline #1828
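The V3.2-Exp failure mode (the parser leaking tool-call XML into the content channel) suggests a defensive client-side scrub, so the agent loop doesn't echo markup back at the model. A sketch with generic stand-in tags (real leaks vary by parser and chat template; nothing here is DeepSeek's actual tool-call format):

```python
import re

# Defensive scrub for the leak mode in the DeepSeek rows: serving-side
# parsers spilling tool-call markup into assistant content. The tag names
# below are generic stand-ins, not any model's real format.

LEAKED_MARKUP_RE = re.compile(r"</?tool_call[^>]*>|</?function[^>]*>")

def scrub_leaked_markup(content: str) -> str:
    """Remove leaked tool-call tags from assistant content."""
    return LEAKED_MARKUP_RE.sub("", content).strip()

leaky = "I will read the file now. <tool_call> <function=read_file>"
print(scrub_leaked_markup(leaky))  # I will read the file now.
```

A scrub like this hides the symptom but not the cause; the lost tool call still has to be recovered or retried upstream.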

NVIDIA

Model | Size | Status | Community framing | Source
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | 120B MoE / 12B active · NVFP4+FP8 mixed | 4/20 (current winner) | Hybrid Mamba-Transformer MoE; agentic-reasoning post-training; native NVFP4 from NVIDIA for a single Spark. | NV blog
nvidia/Nemotron-3-Nano-30B-A3B | 30B MoE / 3B active | candidate | Explicitly "built for agentic/tool use"; 1M context; strong BFCL. | blog
nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 | dense 49B | candidate | RAG + tool-calling post-training; reasoning toggle; single-GPU. | HF card
ServiceNow-AI/Apriel-Nemotron-15b-Thinker | dense 15B | candidate | ServiceNow + NVIDIA enterprise reasoning; Q1 2026 niche breakout. | HF card

OpenAI (open-weight)

Model | Size | Status | Community framing | Source
openai/gpt-oss-120b | 120B MoE / 5.1B active · MXFP4 | planned Dell / 2× Spark | ≈ o4-mini; 80 t/s on DGX Spark at TP=2; llama.cpp + crush compatibility. | OpenAI intro, llama.cpp discussion
openai/gpt-oss-20b | 20B MoE · MXFP4 | candidate | ≈ o3-mini; works in Cline + Roo Code on 16 GB RAM. | Cline+Roo writeup

Meta Llama

Model | Size | Status | Community framing | Source
meta-llama/Llama-4-Scout | 109B MoE / 17B active | planned Dell | Agentic workflows + 10M ctx; tool calling via Ollama v0.8. | guide
meta-llama/Llama-4-Maverick | 402B MoE | oversized for single Spark | Meta flagship chat + code. | model card
meta-llama/Llama-3.1-70B-Instruct | dense 70B | community moved on | Canonical Hermes-Agent vLLM quickstart example; now only rarely mentioned in 2026 practitioner anecdotes. | Hermes docs
meta-llama/Llama-3.3-70B-Instruct | dense 70B | community moved on | Still a fallback baseline in Hermes/vLLM examples; displaced by Qwen/GLM/Kimi for agentic work. | Hermes docs

Mistral

Model | Size | Status | Community framing | Source
mistralai/Devstral-2 | dense 123B | planned Dell / 2× Spark | Co-developed with All Hands AI for OpenHands; SWE-Bench 72.2%; 256K ctx. | InfoQ
mistralai/Devstral-Small-24B | dense 24B | candidate | The Aider community's top pick for local agentic coding in 2026. | blog
mistralai/Codestral-25.12 | dense 22B | FIM, not agents | Best Ollama FIM/autocomplete; no native tool use. Displaced by Devstral for agentic work. | guide
mistralai/Magistral-Small | dense | tool calling disabled | Official GGUF ships without tool calling enabled. | HN

Google Gemma

Model | Size | Status | Community framing | Source
google/gemma-4-26b-it | dense 26B | candidate | Tool-calling success jumped from 6.6% to 86.4% vs Gemma-3; Codex CLI compatibility. | HN
google/gemma-4-26b-a4b | 26B MoE / 4B active | candidate | Appears in DGX Spark LoRA/vLLM discussions. | NV forum
google/gemma-3 | various | 6.6% tool-calling | Actively cited as a negative example — a cautionary tale. | Analytics Vidhya

Xiaomi MiMo

Model | Size | Status | Community framing | Source
XiaomiMiMo/MiMo-V2-Flash | 309B MoE / 15B active | planned 2× Spark / Mac MLX | #1 SWE-Bench Verified (73.4%); 150 t/s; agentic-tuned. | repo
XiaomiMiMo/MiMo-V2-Pro | 1T | oversized | Xiaomi agentic flagship; 78% SWE-Bench. | site
XiaomiMiMo/MiMo | (size unlisted) | candidate | Surprise entry: listed as a first-class built-in provider in the Hermes Agent docs. | Hermes docs

Long tail (less-cited but real)

Model | Size | Status | Community framing | Source
inclusionAI/Ling-2.6-flash | 104B MoE / 7.4B active · 256K ctx | candidate | 340 t/s; BFCL-V4 67.04; "economic agent" with ~7× fewer output tokens. | blog
meituan-longcat/LongCat-Flash-Chat | 560B MoE / 18–31B active | planned 2× Spark / Mac MLX | Meituan: "exceptional strengths in agentic tasks"; 128K ctx. | HF
meituan-longcat/LongCat-Flash-Lite | 68.5B MoE / 3B active · 256K ctx | candidate | Prosumer-friendly LongCat entry; Hermes Agent has a longcat parser ready. | HF
meituan-longcat/LongCat-Flash-Thinking | 560B MoE | oversized | Reasoning variant; STEM/coding/agent focus. | arxiv
ByteDance-Seed/Seed-OSS-36B-Instruct | dense 36B · 512K ctx | candidate | Real integration: vLLM ships the seed_oss tool-call parser upstream. Base for Hermes-4.3. | vLLM recipe
stepfun-ai/Step-3.5-Flash | 196B MoE / 11B active | candidate | StepFun's "frontier agentic" pitch; niche but real. | repo
tencent/Hunyuan-A13B-Instruct | 80B MoE / 13B active | candidate | CoT reasoning; coding routed to specialists. | repo
all-hands/openhands-lm-32b | dense 32B | candidate | 37.2% SWE-Bench Verified; domain-fine-tuned for OpenHands agentic SWE. | deployment
microsoft/Phi-4-reasoning-plus | dense 14B | candidate | Small reasoner rivaling bigger models; agentic tool calling. | Unsloth
microsoft/phi-4-mini | small | candidate | SLM with tool calling added; edge-device class. | SLM comparison
ibm-granite/granite-4.0-* | Nano / Micro / Small | candidate | Hybrid Mamba2 + transformer; enterprise tool-calling praise in an AMA. | IBM tutorial
ibm-granite/granite-code | various | aggregator-only | Falling; brand-only recognition in 2026. | blog
CohereForAI/c4ai-command-r-v01 | dense 35B | candidate | Top-5 agentic local model in one aggregator list. | PCBuildAdvisor
liquid-ai/LFM2-24B-A2B | 24B MoE / 2B active | candidate | ~390 ms/tool-call (speed-optimized); caveat: only 26% success on 3–6-step chains. | issue #523
allenai/OLMo-2-* | 7B / 13B | candidate | Fully open; on par with Llama-3.1 on English academic benchmarks. | aggregator
infly/OpenCoder-8B-Instruct | dense 8B | candidate | Fully reproducible coder family; 2.5T-token training. | site
01-ai/Yi-Coder / Yi-Lightning | various | absent in 2026 | Essentially absent from 2026 agentic-CLI discussion. | repo
internlm/internlm3-8b-instruct | dense 8B | candidate | General reasoning; thin citation (mostly self-claims). | repo
baidu/ERNIE-4.5 | various | candidate | Mentioned in multimodal + agent threads. | benchmark roundup

Gaps in our current test set

Based on mention frequency across the surveyed threads, the clearest gaps — models we should be running but aren't — are:

  1. Qwen3-Coder-Next 80B-A3B — the single most-cited open coding agent, explicitly the DGX Spark practitioner pick, ~43 t/s FP8 on a single Spark. Distinct from Qwen3-Next-80B-A3B-Instruct (which we have). highest priority
  2. GLM-4.7-Flash 30B-A3B — purpose-built 24 GB-class agentic model; fits Dell Blackwell comfortably; our existing GLM-4.5-Air leader ought to compare against it.
  3. MiMo-V2-Flash 309B/15A — claimed #1 SWE-Bench Verified among open weights (73.4%); fits 2× Spark.
  4. Devstral 2 / Devstral-Small-24B — co-designed with All Hands for OpenHands; strong tool-use pedigree. The Small variant fits a single Spark.
  5. Seed-OSS-36B — Hermes-4.3's base model; vLLM ships seed_oss parser upstream (real integration signal).
  6. Gemma-4 26B — the tool-calling reliability leap makes it a must-include lightweight baseline, even if the community hasn't internalized it yet.
  7. Nemotron-3-Nano-30B-A3B — same lineage as our current TBLite leader (Nemotron-3-Super), but 4× smaller — useful to measure how much of Super's 4/20 came from Nemotron training vs sheer parameter count.
  8. OpenHands-LM-32B — domain-fine-tuned for agent loops; valuable comparison to general drivers.
  9. LongCat-Flash-Lite 68.5B/A3B — 256K ctx, longcat parser already upstream in Hermes Agent, very little community benchmarking yet. Potential surprise.

Research passes: see hermes-agent issue #523, the NV DGX Spark coding thread, Latent.Space's April 2026 top local models, the Cline docs, RooCode-Local-Evaluation, and the HF model cards + vLLM/SGLang recipes linked on each row. Everything here is cross-referenced across the two independent passes, with thin citations flagged. This list is a living snapshot; the field is moving fast enough that monthly re-passes are probably warranted.