A comprehensive index of open-weight models people mention using with
hermes-agent and peer agentic CLIs (Cline, Roo-Code, Aider, OpenHands,
opencode, Goose, mini-SWE-agent, Qwen-Code). Every row is sourced. For each
entry we note where it fits on our own hardware roadmap — already
benchmarked, planned, or out-of-reach. Compiled 2026-04-22 from two parallel
research passes across GitHub issues, the NVIDIA DGX Spark forum, HuggingFace
discussions, r/LocalLLaMA, Hacker News, Unsloth/vLLM/SGLang recipe docs,
Latent.Space's top-local-models roundup, and agentic-CLI documentation.
Status legend
Where each model stands for us: does it fit on our current or planned
hardware, and at what phase of testing is it?
benchmarked
We've run the TBLite 20-task pilot against this model; the result link points to the transcripts.
planned
Has a slot on the hardware roadmap. Will run once the machine joins the rig.
candidate
Hasn't made our shortlist yet but fits our existing hardware — worth adding.
oversized
Does not fit any current or planned machine we control (single 119 GB Spark, 96 GB Dell, 238 GB 2× Spark, 512 GB Mac).
known-broken
Community reports say agentic tool-calling reliably fails, or the model architecture lacks tool support.
retired
Superseded by later versions and rarely cited in 2026 sources.
What the community converges on (and where it disagrees)
Convergence
Qwen3-Coder family is the default. Cline, Roo-Code, Aider,
opencode, Goose, Hermes Agent, and Qwen-Code docs all recommend
Qwen3-Coder-30B-A3B or Qwen3-Coder-Next first for local use. Qwen3-Coder-Next
specifically is the DGX Spark practitioner pick per the April 2026 NVIDIA
forum thread.
MoE with low active-param count (the A3B pattern) is the dominant
architecture for local agentic: Qwen3-Coder-30B-A3B,
Qwen3-Coder-Next-80B-A3B, GLM-4.7-Flash-30B-A3B, Nemotron-3-Nano-30B-A3B,
LongCat-Flash-Lite-68B-A3B, Qwen3.5-35B-A3B, Gemma-4-26B-A4B. Strongest
single trend in the corpus.
vLLM is the production default. Nearly every Hermes-agent,
OpenHands, Goose, and Roo-Code thread assumes vLLM or SGLang. Ollama is
treated as "easy mode" for smaller models; llama.cpp for Apple Silicon
fallback.
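All of these stacks expose the same OpenAI-compatible chat-completions API, which is what the agentic CLIs actually talk to. As a hedged sketch (the model name and the `run_shell` tool below are illustrative assumptions, not taken from the surveyed threads), this is roughly the payload an agentic CLI posts to a vLLM or SGLang endpoint:

```python
import json

# Illustrative payload for an OpenAI-compatible /v1/chat/completions endpoint
# as served by vLLM or SGLang. Model name and tool schema are assumptions.
payload = {
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",  # whatever the server loaded
    "messages": [
        {"role": "user", "content": "List the files in the repo root."},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "run_shell",  # hypothetical tool exposed by the CLI
                "description": "Run a shell command and return stdout.",
                "parameters": {
                    "type": "object",
                    "properties": {"command": {"type": "string"}},
                    "required": ["command"],
                },
            },
        },
    ],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}

print(json.dumps(payload, indent=2))
```

Because Ollama and llama.cpp also expose OpenAI-compatible endpoints, swapping serving stacks under an agentic CLI is largely a matter of changing the base URL.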
Tool-call parsers matter more than raw model score.
Multiple Cline issues (#1828, #8130, #8365) attribute DeepSeek brokenness
to parser mismatch, not model quality. We've seen the same pattern:
wrong tool parser = 0/20 regardless of model capability.
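To make that failure mode concrete, here is a minimal sketch of what a Hermes-style tool-call parser does (vLLM selects one at serve time, e.g. `--tool-call-parser hermes` together with `--enable-auto-tool-choice`). The completion string and tool name below are invented for illustration; a parser expecting a different tag or JSON shape would extract zero calls from the same text, which is exactly the 0/20 signature:

```python
import json
import re

# Hermes-family models wrap each call in <tool_call>...</tool_call> tags
# containing one JSON object. Sample completion invented for illustration.
completion = (
    "I'll check the file first.\n"
    '<tool_call>{"name": "read_file", "arguments": {"path": "src/main.py"}}</tool_call>'
)

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Extract the JSON payload of every <tool_call> block, in order."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Parser/model mismatch usually surfaces here: the tags exist
            # but the body isn't the JSON shape this parser expects.
            continue
    return calls

print(parse_hermes_tool_calls(completion))
# prints [{'name': 'read_file', 'arguments': {'path': 'src/main.py'}}]
```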
Disagreement
GLM-4.7 vs Qwen3-Coder-Next. Z.AI benchmarks claim GLM-4.7
leads on SWE-Bench Multilingual + Terminal-Bench; Qwen blog + DGX Spark
practitioners prefer Qwen3-Coder-Next for "experience quality." Both camps
cite their own benchmarks.
DeepSeek: great model, broken agent. DeepSeek-V3.2 shows up
on leaderboards but Cline's issue tracker is full of tool-calling failures.
Community split between "parser will improve" and "move on to GLM/Qwen."
Gemma-4 reception. Google + HN call it a breakthrough
(86.4% tool calling vs Gemma-3's 6.6%); r/LocalLLaMA hasn't internalized it
yet as of April 2026 — mostly Google/Android-blog evangelism, thin
practitioner adoption.
Hermes-4.3 as driver vs Hermes-Agent as runtime. Some users
treat Hermes-4.3 as the natural driver; the Hermes Agent docs themselves
recommend Kimi-K2.6, MiniMax-M2.5, or GLM-4.7 as the driver, treating the
"Hermes" brand as the agent runtime only.
Rising vs falling
Rising: GLM-4.7 / GLM-5.1 (biggest mover in Q1 2026),
MiniMax-M2.5 / M2.7, Qwen3-Coder-Next / Qwen3.5-35B-A3B, Seed-OSS-36B
(real vLLM parser upstreamed), Hermes-4.3 (50× agentic-trace post-training
expansion), Apriel-Nemotron-15B-Thinker.
Falling: Llama 3.x as agent driver (still the
Hermes-agent README default, almost no 2026 anecdotes), DeepSeek-R1 / V3
self-hosted (known-broken tool-calling), Codestral / Magistral (displaced by
Devstral for agents), Yi-Coder / Yi-Lightning (absent from 2026 discussion),
Granite-Code (appears only in aggregator lists).
Nous Research
Model | Size | Status | Community framing | Source
NousResearch/Hermes-4-405B-FP8 | dense 405B · FP8 | planned (2× Spark) | Nous flagship; native <tool_call> tags, built-in vLLM/SGLang hermes parser. |
Based on mention frequency across the surveyed threads, the clearest gaps — models
we should be running but aren't — are:
Qwen3-Coder-Next 80B-A3B — the single most-cited open coding
agent, explicitly the DGX Spark practitioner pick, ~43 t/s FP8 on a single
Spark. Distinct from Qwen3-Next-80B-A3B-Instruct (which we have). Highest
priority.
GLM-4.7-Flash 30B-A3B — purpose-built 24 GB-class agentic
model; fits Dell Blackwell comfortably; our existing GLM-4.5-Air leader
ought to compare against it.
MiMo-V2-Flash 309B/15A — claimed #1 SWE-Bench Verified
among open weights (73.4%); fits 2× Spark.
Devstral 2 / Devstral-Small-24B — co-designed with All
Hands for OpenHands; strong tool-use pedigree. Small variant fits a single
Spark.
Gemma-4 26B — the tool-calling reliability leap makes it a
must-include lightweight baseline, even if the community hasn't
internalized it yet.
Nemotron-3-Nano-30B-A3B — same lineage as our current
TBLite leader (Nemotron-3-Super), but 4× smaller — useful to measure how
much of Super's 4/20 came from Nemotron training vs sheer parameter count.
OpenHands-LM-32B — domain-fine-tuned for agent loops;
valuable comparison to general drivers.
LongCat-Flash-Lite 68.5B/A3B — 256K ctx, longcat
parser already upstream in Hermes Agent, very little community
benchmarking yet. Potential surprise.