Roadmap
Where this bench goes next: three more machines behind an LLM gateway, a wider set of benchmarks beyond TBLite, and a queue-style orchestrator that keeps every piece of hardware hot. Everything below is research in progress — links, dates, and claims traced back to primary sources so we can revisit as the landscape churns. Compiled 2026-04-21.
New hardware & models
We're about to get access to three more boxes behind an LLM gateway. Each has a distinct capability profile, so the goal isn't "run everything everywhere" — it's to choose a handful of models per machine that exercise the hardware's strengths and give us a genuinely different lineage to compare against the Spark set.
Pro Max Tower T2 · 96 GB VRAM
Dell Pro Max T2 with a single RTX PRO 6000 Blackwell Workstation Edition: 96 GB GDDR7, 1.6 TB/s, SM120, 600 W. Intel Core Ultra 9 285K host. Not dual-GPU — one very fast card with real HBM-class bandwidth.
Top adds to the benchmark
| Model | Quant | Parser | Why |
|---|---|---|---|
| GLM-4.7-Flash | NVFP4, ~45 GB | glm47 + glm45 reasoning | New SOTA on TB2 (41%) & SWE-Multilingual (66.7%). Direct upgrade path from our GLM-4.5-Air Spark run. |
| Qwen3-Coder-Next / Qwen3.6-35B-A3B | FP8 (safer on SM120), ~35–45 GB | qwen3_coder | Best open agentic coder in the 30–80B class. FP8 path is rock-solid on PRO 6000. |
| Hermes-4-70B (re-run) | FP8, 68 GB | hermes | Already in our Spark set, but GDDR7 bandwidth makes the dense 70B snappy. Pure dense reference. |
| MiniMax-M2.5-NVFP4 | NVFP4, ~58 GB | minimax_m2 | NVIDIA-packaged NVFP4 (best-tested kernel) + A/B partner for our saricles M2.7 REAP. |
| gpt-oss-120B | MXFP4, ~60 GB | hermes or pythonic | Most widely-deployed OSS reference point in the ~100B class. vLLM 0.17+ Marlin kernel. |
Gotchas: SM120 NVFP4 kernels still maturing (vLLM #35519, CUTLASS #3096). For MLA models (DeepSeek-V3.2, GLM-5) vLLM's MLA path is still Hopper-only on SM120 — use SGLang. Pin driver 580.126.20 + CUDA 13; avoid 590.x (CUDA-graph deadlocks).
Paired DGX Sparks · ~238 GB aggregate
Two GB10 nodes over ConnectX-7 200 Gb/s RoCE. PCIe 5.0 x4 per NIC caps real throughput at ~24 GB/s at MTU 9000. NVIDIA markets this pairing up to 400 B parameters. Community caveat: `--tensor-parallel-size 2` is still fragile — prefer pipeline-parallel unless you use eugr's Docker kit exactly.
What 2× Spark unlocks that 1× cannot
| Model | Quant | Parser | Why |
|---|---|---|---|
| Qwen3-Coder-480B | NVFP4, ~240 GB | qwen3_coder | Frontier OSS agentic coder; Qwen claims Claude-Sonnet-4-class on agentic coding. Highest-ROI add. |
| MiniMax-M2.7 full (non-REAP) | NVFP4, ~131 GB | minimax_m2 | A/B against saricles REAP-172B to measure distillation loss directly. |
| DeepSeek-V3.2 NVFP4 | NVFP4, ~170 GB | deepseek_v3_1 | First OSS model with thinking integrated into tool-use; different training lineage. Caveat: flash_mla + NVFP4 KV bugs still open. |
| MiniMax-M2.1 | NVFP4, 122 GB | minimax_m2 | Listed as oversized in our single-Spark catalog; now it runs. |
Winning recipe per the NV MiniMax-M2.7 thread: `--kv-cache-dtype fp8 --attention-backend flashinfer -tp 2 --distributed-executor-backend ray --max-model-len 196608 --load-format fastsafetensors`, with `VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass` in the environment. The community consistently reports that AWQ-4bit actually beats NVFP4 on generation t/s for M2.7 on 2× Spark — worth A/B'ing.
Mac Studio M3 Ultra · 512 GB
80-core GPU, 512 GB unified memory, 819 GB/s. Apple pulled the 512 GB SKU in March 2026 — the current store tops out at 256 GB, so this is a legacy config. MLX format is its own universe: 3/4/6/8-bit integer quants, not NVFP4/FP8. We talk to it via cubist38/mlx-openai-server (richer parser set than stock `mlx_lm.server`) or the newer vllm-mlx.
The only box that fits these
| Model | Quant | Parser | Why |
|---|---|---|---|
| Kimi-K2.6 | MLX 3.6-bit, ~470 GB | kimi_k2 (⚠ parser missing in cubist) | Flagship 1 T/32 B active MoE. Day-0 MLX release. SOTA on TB2 (66.7) & SWE-Bench Pro (58.6). Only this machine can run it locally. |
| Qwen3-Coder-480B MLX | MLX 6-bit, ~270 GB | qwen3_coder | Cross-arch comparison point against the 2×-Spark NVFP4 run. |
| DeepSeek-V3.2 (self-convert if missing) | MLX 4-bit, ~380 GB | deepseek_v3_1 | Third lineage alongside Kimi/Qwen on the Mac. |
| Qwen3-235B-A22B | MLX 4-bit, ~272 GB | qwen3_moe | Community baseline — Awni Hannun clocks ~24 t/s on M3 Ultra. |
| Qwen3.5-122B-A10B MLX 8-bit | MLX 8-bit, ~134 GB | qwen3_coder | High-quality reasoning reference, ~42 t/s community. |
The real Mac story isn't generation speed — it's prefill. Hermes' fat system prompt plus long tool-call traces balloon TTFT on MLX in a way they don't on Blackwell NVFP4 (Newport critique, echoed on MacStories). TBLite wall-clock will expose this honestly — that's a feature. Expect DeepSeek-V3 ~20 t/s, Qwen3-235B ~24 t/s, Kimi-K2-Thinking MLX-4.25bit tight at 512 GB.
One underrated candidate across all three: LongCat-Flash (uses the `longcat` parser in hermes-agent) — a genuinely different training lineage. Worth one slot once we shake out which of the above actually run clean.
Benchmarks in the wild — what we're missing
TBLite is a terminal-loop signal. That's one axis of a roughly seven-axis space: format brittleness, multi-turn dialogue, cost, reliability, contamination, long-context, and safety. Here's what else is out there and where our harness has blind spots.
The landscape
| Benchmark | Maintainer | What it measures |
|---|---|---|
| Terminal-Bench 2.0 | Laude Institute | 89 Docker-sandboxed terminal tasks across SWE, security, bio, gaming |
| TBLite | OpenThoughts + Nous | 100-task calibrated subset; what we run today |
| SWE-bench Verified | Princeton | 500 human-verified GitHub issues; patch applies and tests pass |
| SWE-rebench | Nebius | Live, monthly-refreshed SWE-bench (contamination-resistant) |
| SWE-bench Pro | Scale AI | 1,865 tasks from proprietary startup codebases |
| τ / τ² / τ³-bench | Sierra Research | Simulated customer-service tool-agent-user dialogues |
| BFCL v4 | Berkeley (Gorilla) | AST-matched function calling + agentic web-search/memory/format |
| LiveCodeBench | UC Berkeley | Contamination-free competitive programming (date-windowed) |
| GAIA / GAIA2 | Meta + HF | Real-world assistant Q's needing browsing + tools + multimodal |
| OSWorld-Verified | XLang Lab | 369 GUI tasks across desktop apps/OSes |
| Aider Polyglot | Aider | 225 Exercism problems × 6 languages; tests edit-format fidelity |
| MLE-Bench | OpenAI | 75 Kaggle competitions end-to-end |
| HAL | Princeton PLI | Meta-harness — re-runs many benchmarks with cost/reliability/safety axes |
| AA Intelligence Index v4 | Artificial Analysis | Weighted aggregate including τ²-Telecom & TB-Hard |
What TBLite alone does not measure
- Latency / cost-per-task. We record pass/fail but not tokens-in, tokens-out, wall-clock, or $. HAL and AA surface this and it materially changes rankings.
- Multi-turn dialogue with a simulated user. TBLite prompts are static. τ-bench and BFCL-v4-multi-turn catch "agent talks fine on turn 1, collapses on turn 6."
- Tool-call format brittleness. BFCL-v4's format-sensitivity section directly measures the failure mode we hypothesize for Kimi-Linear (pythonic vs kimi_k2 parser) — TBLite just returns pass/fail, not the layer at fault.
- Contamination. TBLite is public. SWE-rebench or LiveCodeBench (date-windowed) triangulate whether scores reflect capability or memorization.
- Long-context degradation. No task in TBLite exercises > 32 k tokens. AA-LCR and RULER do.
- Variance. We run 1×. On a 100-task bench at 60% pass rate, single-run std dev is ~±5 pp. HAL publishes seeded variance by default.
- Partial credit. TBLite's pytest graders are pass/fail; "9/10 files correct" scores 0 — can't tell "almost solved" from "lost."
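The variance bullet's ±5 pp figure follows directly from the binomial standard error; a quick sanity check:

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """Binomial standard error of an observed pass rate p over n independent tasks."""
    return math.sqrt(p * (1 - p) / n)

# A single run of a 100-task bench at a 60% pass rate:
print(f"±{pass_rate_stderr(0.60, 100) * 100:.1f} pp")  # ±4.9 pp
```

Two models 5 pp apart on a single run are barely distinguishable from noise; seeded re-runs (as HAL publishes by default) are the fix.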
Easy wins (run against the same vLLM endpoints)
- BFCL v4. pip install, points at OpenAI-compatible base URL. ~1 hour setup, <30 min per model. Catches tool-call brittleness directly. repo
- τ²-bench retail + airline. Python runner, config-driven, OpenAI-compatible. Adds multi-turn user-sim signal. ~2–3 h setup. repo
- Aider Polyglot. `aider --benchmark`. 225 Exercism problems × 6 languages. ~1 h setup. repo
- Full Terminal-Bench 2.0. Same Harbor harness we already run; switch the dataset arg. Effectively free. Gets us an apples-to-apples comparison with the public leaderboard.
- SWE-bench Verified via mini-SWE-agent. Close to our Hermes loop; the most-cited OSS number. ~half-day setup, long evaluation.
- HAL wrap. Wrap hermes-agent once as a HAL agent and get reliability, cost, and 11 benchmarks (SWE-bench, Cybench, CORE-bench, USACO, AgentHarm, GAIA…) for marginal effort. repo
- Log trajectories + post-hoc classify failures. Not a new benchmark — just classify raw TBLite tool-call streams into (malformed JSON, wrong args, grader mismatch, timeout, hallucinated tool). Single highest-leverage change for understanding why models fail.
Skipped for now (cost not yet justified): OSWorld (needs VM fleet), MLE-Bench (slow Kaggle datasets), GAIA2 (needs browser-use stack), WebArena (self-hosted site fleet).
Field notes — how people actually run these boxes
Dell Pro Max T2 · RTX PRO 6000 Blackwell
- Stack: vLLM works for most dense/MoE, but stock `vllm/vllm-openai:latest` still mis-selects the SM120 NVFP4-MoE backend (#33416). Community uses `voipmonitor/llm-pytorch-blackwell:nightly` or builds with FlashInfer SM120 patches.
- SGLang is mandatory for MLA models (DeepSeek-V3.2, GLM-5). Flags: `SGLANG_ENABLE_SPEC_V2=True`, `SGLANG_ENABLE_JIT_DEEPGEMM=0`, `--moe-runner-backend cutlass`, `--kv-cache-dtype bf16` (FP8 KV corrupts output on SM120).
- What people have actually run: Qwen3-Coder-30B AWQ-4bit, Qwen3-Coder-Next FP8, Hermes-4-70B FP8 (13–26 concurrent users), Llama-3.3-Nemotron-Super-49B FP8 (Akamai: 3,030 TPS aggregate @ bs=100), GLM-4.5-Air via SGLang, MiniMax-M2 REAP-136B NVFP4, Gemma-4-31B NVFP4 (1.63× H100 FP8).
- Known pitfalls: CUTLASS grouped-GEMM garbage output on NVFP4 MoE without FlashInfer patches (#3096); MiniMax-M2.5 NVFP4 illegal memory access (#35566); Qwen3-32B FP8 55-s TTFT in some configs (#27649).
Two DGX Sparks paired
- Use eugr's kit — explicitly endorsed on the NV forum (eugr, bkrabach, mark-ramsey-ri, NV playbook).
- Pipeline-parallel > tensor-parallel. TP=2 across Sparks is still fragile (forum thread) — if one node drops, the other GPU pins at 100% forever. PP=2 is the reliable path.
- Numbers on MiniMax-M2.7 230B NVFP4 across 2× Spark: 3,146 t/s prefill, 25.7 t/s gen @ 2 k ctx · 11.3 t/s gen @ 131 k ctx (NV recipe thread). Tool-call bench: 30/30 on ToolCall15, beats M2.5's 27/30. AWQ-4bit hits 39.4 t/s gen — faster than NVFP4 at generation.
- Pitfalls: firmware-induced sudden shutdowns under sustained load (mitigate with `nvidia-smi -lgc`); NVFP4 upstream support lags and FlashInfer load errors are common; adding a switch between the two Sparks raises latency — a direct IB cable is preferred.
Mac Studio M3 Ultra 512 GB · MLX
- Server choice matters. Stock `mlx_lm.server` has limited parser support. cubist38/mlx-openai-server adds `qwen3`, `qwen3_coder`, `qwen3_moe`, `qwen3_vl`, `glm4_moe`, `minimax_m2`, `harmony` (gpt-oss), plus message converters for `nemotron3_nano` and `longcat_flash_lite`. No Kimi-K2 parser yet (issue #174) — would need a contribution or a text-based fallback.
- Community-measured speeds: DeepSeek-V3 4-bit ~20 t/s (Awni Hannun); DeepSeek-R1 4-bit ~17–18 t/s; Qwen3-235B 4-bit ~24 t/s @ 272 GB (lmstudio card); Qwen3-Coder-480B 6-bit runs; Kimi-K2-Thinking 4.25-bit — tight at 512 GB.
- Prefill is the bottleneck — long agentic traces hurt more than on Blackwell. Billy Newport argues it "misses the mark"; MacStories and Creative Strategies disagree. TBLite wall-clock will arbitrate.
- Pitfalls: MLX quant conversions lag GGUF by days-to-weeks; rely on the `lmstudio-community` and `mlx-community` HF orgs. For Kimi-K2.5 at 1 T you need distributed MLX across two Macs (EXO guide). Llama-3.3-405B does not exist — 405 B only ships under Llama 3.1.
Cross-cutting
- All harnesses we care about (terminal-bench, tau-bench, SWE-bench, BFCL) are OpenAI-API clients — they just point at whatever vLLM / SGLang / MLX OpenAI shim is up, no platform-specific bench forks needed.
- Apples-to-apples: run Hermes-4-70B FP8 on Dell and 2× Spark (PP=2) to isolate interconnect overhead; then MiniMax-M2.7 NVFP4 (2× Spark) vs MLX-4bit (Mac) for same-model cross-arch.
- Pin driver 580.126.20 + CUDA 13 on both NVIDIA platforms to avoid 590.x CUDA-graph deadlocks.
- EXO Labs publishes Spark+Mac combined benchmarks if we ever want heterogeneous pipelines.
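What "OpenAI-API client" means concretely: every harness above ultimately POSTs the same payload shape to `<base_url>/chat/completions`. A stdlib-only sketch — host and model names are placeholders for whatever the gateway exposes:

```python
import json
from urllib.request import Request

def chat_request(base_url: str, model: str, prompt: str) -> Request:
    """Build the OpenAI-style chat request every harness here ultimately sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer unused"},  # local endpoints ignore the key
    )

req = chat_request("http://spark-1:8000/v1", "hermes-4-70b-fp8", "echo ok")
# urllib.request.urlopen(req) would hit vLLM, SGLang, or the MLX shim identically.
```

The only per-platform difference is which server answers at `base_url` — which is why no bench forks are needed.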
Queue-aware, machine-aware orchestrator
Three machines × ten models × five benchmarks is too many `ps -ef | grep` loops. We want to enqueue everything once and have it run repeatably in parallel across hardware, batching benches per model load where that saves warmup cost.
Design at a glance
- SQLite at `bench/orch/state.db` (WAL). Four entities: `machine`, `model`, `bench`, `job`. `models.yaml` stays the human-edited catalog; a sync step reconciles it into SQLite. A new `orch.toml` holds the machine catalog.
- One central scheduler process, 2-second tick, `BEGIN IMMEDIATE` transactions for atomic job-to-machine assignment. No Redis, no Postgres.
- One lightweight worker daemon per machine. Reason: MLX is native (no Docker), driver resets and CUDA-graph deadlocks are invisible from a central SSH poller, and per-box supervision is cleaner than shelling around. Workers heartbeat every 15 s.
- TBLite stays on the scheduler host (spawning Docker sandboxes). Workers only do `ensure_model_loaded` + an endpoint healthcheck. This is the key split that our current `pipeline_model.sh` conflates.
- Placement via a predicate on `model.needs_kernels ∩ machine.kernels` + `model.min_vram_gb`. MLX models only match `kind = mac_mlx`; NVFP4-SM120 matches Spark or Dell; `spark_tp2` jobs atomically claim both Sparks.
- Bench groups. A shared `group_id` UUID on jobs means "keep this model hot and drain the group." `orch enqueue-group hermes-4-70b tblite,bfcl,tau2 --priority 80` warms once, runs three benches back-to-back. Preemptible between jobs, not mid-job. Optional `max_hot_minutes` bound.
- Crash detection via heartbeat freshness, not sample-count. > 90 s stale → `needs_retry`, lease released, resume from existing `samples.jsonl` (pass `--env.task_filter` minus completed tasks). Never `rm -rf` the out_dir on retry.
- No more `docker rm -f` footgun. Sandbox cleanup matches by `label=orch.job=<id>`, so an aborted run can't wipe another job's live sandboxes.
- Live surface keeps our existing `progress.json` + `live_turns/*.ndjson` (Astro reads them unchanged), plus a new `queue_state.json` at 2 s cadence for the queue strip. `live_progress.sh` retires.
Stack choice: roll our own
Evaluated: Airflow / Prefect / Dagster (DAG-centric, assume ephemeral workers, bad fit for machine affinity + exclusive-resource semantics); Hatchet (nice but needs Postgres + its own server); rq / procrastinate (no machine-affinity primitives); Ray (actors + resource labels fit, but GCS is finicky across heterogeneous aarch64/x86/ARM64-Mac, and we lose "just SSH and docker run" simplicity). Verdict: ~600–800 LoC of Python 3.11 + stdlib + httpx + pyyaml + sqlite3. The problem — 3 machines × 10 models × 5 benches — is small enough that a one-file scheduler is cheaper to operate.
Migration path
- M1 — SQLite + CLI over existing shell. `orch enqueue <model> tblite` writes a job row; a trivial loop picks queued jobs one at a time and `subprocess`-execs today's `pipeline_model.sh`. No workers, no machines table. Kills the `ps -ef | grep` waiter scripts immediately. ~200 LoC.
- M2 — machines table + placement predicates, still single-machine. Verify by enqueueing an MLX-only model and confirming it stays queued.
- M3 — first real worker on the Dell Blackwell. Split `launch_vllm.sh` into the worker's `ensure_model_loaded`. Two machines in parallel. `live_progress.sh` retires.
- M4 — bench groups. Add `group_id`, `orch enqueue-group`, affinity ordering. Prove multi-bench-per-load with BFCL or τ² as the second bench row.
- M5 — MLX worker on Mac, second Spark. `spark_tp2` kind claims both Sparks atomically.
Explicit non-goals
- Not a CI system — no webhooks, no PR gates, no artifact registry.
- Not Kubernetes — three machines behind SSH.
- Not multi-tenant — one operator, trust boundary is SSH keys.
- Not a cost optimizer — we pack for speed, not $.
- Not a benchmark authoring platform — each bench is a config + a Python output parser; we don't unify BFCL's vs τ²'s vs Aider's metric schemas.
- No HA, no distributed SQLite, no Raft.
Full design and source trail: see vLLM, eugr's 2× Spark kit, cubist38/mlx-openai-server, HAL, and the benchmark repos linked above. This page will be updated in place as the gateway comes online and each machine joins the rig.