Roadmap
Where this bench goes next: three more machines behind an LLM gateway, a wider set of benchmarks beyond TBLite, and a queue-style orchestrator that keeps every piece of hardware hot. Everything below is research in progress — links, dates, and claims traced back to primary sources so we can revisit as the landscape churns. Compiled 2026-04-21.
New hardware & models
We're about to get access to three more boxes behind an LLM gateway. Each has a distinct capability profile, so the goal isn't "run everything everywhere" — it's to choose a handful of models per machine that exercise the hardware's strengths and give us a genuinely different lineage to compare against the Spark set.
Pro Max Tower T2 · 96 GB VRAM
Dell Pro Max T2 with a single RTX PRO 6000 Blackwell Workstation Edition: 96 GB GDDR7, 1.6 TB/s, SM120, 600 W. Intel Core Ultra 9 285K host. Not dual-GPU — one very fast card with real HBM-class bandwidth.
Top adds to the benchmark
| Model | Quant | Parser | Why |
|---|---|---|---|
| GLM-4.7-Flash | NVFP4, ~45 GB | glm47 + glm45 reasoning | New SOTA on TB2 (41%) & SWE-Multilingual (66.7%). Direct upgrade path from our GLM-4.5-Air Spark run. |
| Qwen3-Coder-Next / Qwen3.6-35B-A3B | FP8 (safer on SM120), ~35–45 GB | qwen3_coder | Best open agentic coder in the 30–80B class. FP8 path is rock-solid on PRO 6000. |
| Hermes-4-70B (re-run) | FP8, 68 GB | hermes | Already in our Spark set, but GDDR7 bandwidth makes the dense 70B snappy. Pure dense reference. |
| MiniMax-M2.5-NVFP4 | NVFP4, ~58 GB | minimax_m2 | NVIDIA-packaged NVFP4 (best-tested kernel) + A/B partner for our saricles M2.7 REAP. |
| gpt-oss-120B | MXFP4, ~60 GB | hermes or pythonic | Most widely-deployed OSS reference point in the ~100B class. vLLM 0.17+ Marlin kernel. |
Gotchas: SM120 NVFP4 kernels still maturing (vLLM #35519, CUTLASS #3096). For MLA models (DeepSeek-V3.2, GLM-5) vLLM's MLA path is still Hopper-only on SM120 — use SGLang. Pin driver 580.126.20 + CUDA 13; avoid 590.x (CUDA-graph deadlocks).
Paired DGX Sparks · ~238 GB aggregate
Two GB10 nodes over ConnectX-7 200 Gb/s RoCE. PCIe 5.0 x4 per NIC caps real throughput at ~24 GB/s at MTU 9000. NVIDIA markets this pairing up to 400 B parameters. Community caveat: `--tensor-parallel-size 2` is still fragile — prefer pipeline-parallel unless you use eugr's Docker kit exactly.
What 2× Spark unlocks that 1× cannot
| Model | Quant | Parser | Why |
|---|---|---|---|
| Qwen3-Coder-480B | NVFP4, ~240 GB | qwen3_coder | Frontier OSS agentic coder; Qwen claims Claude-Sonnet-4-class on agentic coding. Highest-ROI add. |
| MiniMax-M2.7 full (non-REAP) | NVFP4, ~131 GB | minimax_m2 | A/B against saricles REAP-172B to measure distillation loss directly. |
| DeepSeek-V3.2 NVFP4 | NVFP4, ~170 GB | deepseek_v3_1 | First OSS model with thinking integrated into tool-use; different training lineage. Caveat: flash_mla + NVFP4 KV bugs still open. |
| MiniMax-M2.1 | NVFP4, 122 GB | minimax_m2 | Listed as oversized in our single-Spark catalog; now it runs. |
Winning recipe per the NV MiniMax-M2.7 thread: `--kv-cache-dtype fp8 --attention-backend flashinfer -tp 2 --distributed-executor-backend ray --max-model-len 196608 --load-format fastsafetensors`, with `VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass` in the environment. The community consistently reports that AWQ-4bit actually beats NVFP4 on generation t/s for M2.7 on 2× Spark — worth A/B'ing.
Mac Studio M3 Ultra · 512 GB
80-core GPU, 512 GB unified memory, 819 GB/s. Apple pulled the 512 GB SKU in March 2026 — the current store tops out at 256 GB, so this is a legacy config. MLX format is its own universe: 3/4/6/8-bit integer quants, not NVFP4/FP8. We talk to it via cubist38/mlx-openai-server (richer parser set than stock `mlx_lm.server`) or the newer vllm-mlx.
The only box that fits these
| Model | Quant | Parser | Why |
|---|---|---|---|
| Kimi-K2.6 | MLX 3.6-bit, ~470 GB | kimi_k2 (⚠ parser missing in cubist) | Flagship 1 T/32 B active MoE. Day-0 MLX release. SOTA on TB2 (66.7) & SWE-Bench Pro (58.6). Only this machine can run it locally. |
| Qwen3-Coder-480B MLX | MLX 6-bit, ~270 GB | qwen3_coder | Cross-arch comparison point against the 2×-Spark NVFP4 run. |
| DeepSeek-V3.2 (self-convert if missing) | MLX 4-bit, ~380 GB | deepseek_v3_1 | Third lineage alongside Kimi/Qwen on the Mac. |
| Qwen3-235B-A22B | MLX 4-bit, ~272 GB | qwen3_moe | Community baseline — Awni Hannun clocks ~24 t/s on M3 Ultra. |
| Qwen3.5-122B-A10B MLX 8-bit | MLX 8-bit, ~134 GB | qwen3_coder | High-quality reasoning reference, ~42 t/s community. |
The real Mac story isn't generation speed — it's prefill. Hermes' fat system prompt plus long tool-call traces balloon TTFT on MLX in a way they don't on Blackwell NVFP4 (Newport critique, echoed on MacStories). TBLite wall-clock will expose this honestly — that's a feature. Expect DeepSeek-V3 ~20 t/s, Qwen3-235B ~24 t/s, Kimi-K2-Thinking MLX-4.25bit tight at 512 GB.
One underrated candidate across all three: LongCat-Flash (uses the `longcat` parser in hermes-agent) — a genuinely different training lineage. Worth one slot once we shake out which of the above actually run clean.
Benchmarks in the wild — what we're missing
TBLite is a terminal-loop signal. That's one axis of a roughly seven-axis space: format brittleness, multi-turn dialogue, cost, reliability, contamination, long-context, and safety. Here's what else is out there and where our harness has blind spots.
The landscape
| Benchmark | Maintainer | What it measures |
|---|---|---|
| Terminal-Bench 2.0 | Laude Institute | 89 Docker-sandboxed terminal tasks across SWE, security, bio, gaming |
| TBLite | OpenThoughts + Nous | 100-task calibrated subset; what we run today |
| SWE-bench Verified | Princeton | 500 human-verified GitHub issues; patch applies and tests pass |
| SWE-rebench | Nebius | Live, monthly-refreshed SWE-bench (contamination-resistant) |
| SWE-bench Pro | Scale AI | 1,865 tasks from proprietary startup codebases |
| τ / τ² / τ³-bench | Sierra Research | Simulated customer-service tool-agent-user dialogues |
| BFCL v4 | Berkeley (Gorilla) | AST-matched function calling + agentic web-search/memory/format |
| LiveCodeBench | UC Berkeley | Contamination-free competitive programming (date-windowed) |
| GAIA / GAIA2 | Meta + HF | Real-world assistant Q's needing browsing + tools + multimodal |
| OSWorld-Verified | XLang Lab | 369 GUI tasks across desktop apps/OSes |
| Aider Polyglot | Aider | 225 Exercism problems × 6 languages; tests edit-format fidelity |
| MLE-Bench | OpenAI | 75 Kaggle competitions end-to-end |
| HAL | Princeton PLI | Meta-harness — re-runs many benchmarks with cost/reliability/safety axes |
| AA Intelligence Index v4 | Artificial Analysis | Weighted aggregate including τ²-Telecom & TB-Hard |
What TBLite alone does not measure
- Latency / cost-per-task. We record pass/fail but not tokens-in, tokens-out, wall-clock, or $. HAL and AA surface this and it materially changes rankings.
- Multi-turn dialogue with a simulated user. TBLite prompts are static. τ-bench and BFCL-v4-multi-turn catch "agent talks fine on turn 1, collapses on turn 6."
- Tool-call format brittleness. BFCL-v4's format-sensitivity section directly measures the failure mode we hypothesize for Kimi-Linear (pythonic vs kimi_k2 parser) — TBLite just returns pass/fail, not the layer at fault.
- Contamination. TBLite is public. SWE-rebench or LiveCodeBench (date-windowed) triangulate whether scores reflect capability or memorization.
- Long-context degradation. No task in TBLite exercises > 32 k tokens. AA-LCR and RULER do.
- Variance. We run 1×. On a 100-task bench at 60% pass rate, single-run std dev is ~±5 pp. HAL publishes seeded variance by default.
- Partial credit. TBLite's pytest graders are pass/fail; "9/10 files correct" scores 0 — can't tell "almost solved" from "lost."
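The variance bullet's ±5 pp figure follows directly from the binomial standard error; a quick sanity check:

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """Binomial standard error of an observed pass rate p over n independent tasks."""
    return math.sqrt(p * (1 - p) / n)

# A single run of a 100-task bench at a 60% pass rate:
print(f"±{pass_rate_stderr(0.60, 100) * 100:.1f} pp")  # ±4.9 pp
```

Two models 5 pp apart on a single run are barely distinguishable from noise; seeded re-runs (as HAL publishes by default) are the fix.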
Easy wins (run against the same vLLM endpoints)
- BFCL v4. pip install, points at OpenAI-compatible base URL. ~1 hour setup, <30 min per model. Catches tool-call brittleness directly. repo
- τ²-bench retail + airline. Python runner, config-driven, OpenAI-compatible. Adds multi-turn user-sim signal. ~2–3 h setup. repo
- Aider Polyglot. `aider --benchmark`. 225 Exercism problems × 6 languages. ~1 h setup. repo
- Full Terminal-Bench 2.0. Same Harbor harness we already run; switch the dataset arg. Effectively free. Gets us an apples-to-apples comparison with the public leaderboard.
- SWE-bench Verified via mini-SWE-agent. Close to our Hermes loop; the most-cited OSS number. ~half-day setup, long evaluation.
- HAL wrap. Wrap hermes-agent once as a HAL agent and get reliability, cost, and 11 benchmarks (SWE-bench, Cybench, CORE-bench, USACO, AgentHarm, GAIA…) for marginal effort. repo
- Log trajectories + post-hoc classify failures. Not a new benchmark — just classify raw TBLite tool-call streams into (malformed JSON, wrong args, grader mismatch, timeout, hallucinated tool). Single highest-leverage change for understanding why models fail.
Skipped for now (cost not yet justified): OSWorld (needs VM fleet), MLE-Bench (slow Kaggle datasets), GAIA2 (needs browser-use stack), WebArena (self-hosted site fleet).
Field notes — how people actually run these boxes
Dell Pro Max T2 · RTX PRO 6000 Blackwell
- Stack: vLLM works for most dense/MoE, but stock `vllm/vllm-openai:latest` still mis-selects the SM120 NVFP4-MoE backend (#33416). Community uses `voipmonitor/llm-pytorch-blackwell:nightly` or builds with FlashInfer SM120 patches.
- SGLang is mandatory for MLA models (DeepSeek-V3.2, GLM-5). Flags: `SGLANG_ENABLE_SPEC_V2=True`, `SGLANG_ENABLE_JIT_DEEPGEMM=0`, `--moe-runner-backend cutlass`, `--kv-cache-dtype bf16` (FP8 KV corrupts output on SM120).
- What people have actually run: Qwen3-Coder-30B AWQ-4bit, Qwen3-Coder-Next FP8, Hermes-4-70B FP8 (13–26 concurrent users), Llama-3.3-Nemotron-Super-49B FP8 (Akamai: 3,030 TPS aggregate @ bs=100), GLM-4.5-Air via SGLang, MiniMax-M2 REAP-136B NVFP4, Gemma-4-31B NVFP4 (1.63× H100 FP8).
- Known pitfalls: CUTLASS grouped-GEMM garbage output on NVFP4 MoE without FlashInfer patches (#3096); MiniMax-M2.5 NVFP4 illegal memory access (#35566); Qwen3-32B FP8 55-s TTFT in some configs (#27649).
Two DGX Sparks paired
- Use eugr's kit — explicitly endorsed on the NV forum (eugr, bkrabach, mark-ramsey-ri, NV playbook).
- Pipeline-parallel > tensor-parallel. TP=2 across Sparks is still fragile (forum thread) — if one node drops, the other GPU pins at 100% forever. PP=2 is the reliable path.
- Numbers on MiniMax-M2.7 230B NVFP4 across 2× Spark: 3,146 t/s prefill, 25.7 t/s gen @ 2 k ctx · 11.3 t/s gen @ 131 k ctx (NV recipe thread). Tool-call bench: 30/30 on ToolCall15, beats M2.5's 27/30. AWQ-4bit hits 39.4 t/s gen — faster than NVFP4 at generation.
- Pitfalls: firmware-induced sudden shutdowns under sustained load (mitigate with `nvidia-smi -lgc`); NVFP4 upstream support lags and FlashInfer load errors are common; adding a switch between the two Sparks raises latency — a direct IB cable is preferred.
Mac Studio M3 Ultra 512 GB · MLX
- Server choice matters. Stock `mlx_lm.server` has limited parser support. cubist38/mlx-openai-server adds `qwen3`, `qwen3_coder`, `qwen3_moe`, `qwen3_vl`, `glm4_moe`, `minimax_m2`, `harmony` (gpt-oss), plus message converters for `nemotron3_nano` and `longcat_flash_lite`. No Kimi-K2 parser yet (issue #174) — would need a contribution or a text-based fallback.
- Community-measured speeds: DeepSeek-V3 4-bit ~20 t/s (Awni Hannun); DeepSeek-R1 4-bit ~17–18 t/s; Qwen3-235B 4-bit ~24 t/s @ 272 GB (lmstudio card); Qwen3-Coder-480B 6-bit runs; Kimi-K2-Thinking 4.25-bit — tight at 512 GB.
- Prefill is the bottleneck — long agentic traces hurt more than on Blackwell. Billy Newport argues it "misses the mark"; MacStories and Creative Strategies disagree. TBLite wall-clock will arbitrate.
- Pitfalls: MLX quant conversions lag GGUF by days-to-weeks; rely on the `lmstudio-community` and `mlx-community` HF orgs. For Kimi-K2.5 at 1 T you need distributed MLX across two Macs (EXO guide). Llama-3.3-405B does not exist — 405 B only ships under Llama 3.1.
Cross-cutting
- All harnesses we care about (terminal-bench, tau-bench, SWE-bench, BFCL) are OpenAI-API clients — they just point at whatever vLLM / SGLang / MLX OpenAI shim is up, no platform-specific bench forks needed.
- Apples-to-apples: run Hermes-4-70B FP8 on Dell and 2× Spark (PP=2) to isolate interconnect overhead; then MiniMax-M2.7 NVFP4 (2× Spark) vs MLX-4bit (Mac) for same-model cross-arch.
- Pin driver 580.126.20 + CUDA 13 on both NVIDIA platforms to avoid 590.x CUDA-graph deadlocks.
- EXO Labs publishes Spark+Mac combined benchmarks if we ever want heterogeneous pipelines.
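What "OpenAI-API client" means concretely: every harness above ultimately POSTs the same payload shape to `<base_url>/chat/completions`. A stdlib-only sketch — host and model names are placeholders for whatever the gateway exposes:

```python
import json
from urllib.request import Request

def chat_request(base_url: str, model: str, prompt: str) -> Request:
    """Build the OpenAI-style chat request every harness here ultimately sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer unused"},  # local endpoints ignore the key
    )

req = chat_request("http://spark-1:8000/v1", "hermes-4-70b-fp8", "echo ok")
# urllib.request.urlopen(req) would hit vLLM, SGLang, or the MLX shim identically.
```

The only per-platform difference is which server answers at `base_url` — which is why no bench forks are needed.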
Queue-aware, machine-aware orchestrator
Three machines × ten models × five benchmarks is too many `ps -ef | grep` loops. We want to enqueue everything once and have it run repeatably in parallel across hardware, batching benches per model load where that saves warmup cost.
Design at a glance
- SQLite at `bench/orch/state.db` (WAL). Four entities: `machine`, `model`, `bench`, `job`. `models.yaml` stays the human-edited catalog; a sync step reconciles it into SQLite. A new `orch.toml` holds the machine catalog.
- One central scheduler process, 2-second tick, `BEGIN IMMEDIATE` transactions for atomic job-to-machine assignment. No Redis, no Postgres.
- One lightweight worker daemon per machine. Reason: MLX is native (no Docker), driver resets and CUDA-graph deadlocks are invisible from a central SSH poller, and per-box supervision is cleaner than shelling around. Workers heartbeat every 15 s.
- TBLite stays on the scheduler host (spawning Docker sandboxes). Workers only do `ensure_model_loaded` + an endpoint healthcheck. This is the key split that our current `pipeline_model.sh` conflates.
- Placement via a predicate on `model.needs_kernels ∩ machine.kernels` + `model.min_vram_gb`. MLX models only match `kind = mac_mlx`; NVFP4-SM120 matches Spark or Dell; `spark_tp2` jobs atomically claim both Sparks.
- Bench groups. A shared `group_id` UUID on jobs means "keep this model hot and drain the group." `orch enqueue-group hermes-4-70b tblite,bfcl,tau2 --priority 80` warms once, runs three benches back-to-back. Preemptible between jobs, not mid-job. Optional `max_hot_minutes` bound.
- Crash detection via heartbeat freshness, not sample-count. > 90 s stale → `needs_retry`, lease released, resume from existing `samples.jsonl` (pass `--env.task_filter` minus completed tasks). Never `rm -rf` the out_dir on retry.
- No more `docker rm -f` footgun. Sandbox cleanup matches by `label=orch.job=<id>`, so an aborted run can't wipe another job's live sandboxes.
- Live surface keeps our existing `progress.json` + `live_turns/*.ndjson` (Astro reads them unchanged), plus a new `queue_state.json` at 2 s cadence for the queue strip. `live_progress.sh` retires.
Stack choice: roll our own
Evaluated: Airflow / Prefect / Dagster (DAG-centric, assume ephemeral workers, bad fit for machine affinity + exclusive-resource semantics); Hatchet (nice but needs Postgres + its own server); rq / procrastinate (no machine-affinity primitives); Ray (actors + resource labels fit, but GCS is finicky across heterogeneous aarch64/x86/ARM64-Mac, and we lose "just SSH and docker run" simplicity). Verdict: ~600–800 LoC of Python 3.11 + stdlib + httpx + pyyaml + sqlite3. The problem — 3 machines × 10 models × 5 benches — is small enough that a one-file scheduler is cheaper to operate.
Migration path
- M1 — SQLite + CLI over existing shell. `orch enqueue <model> tblite` writes a job row; a trivial loop picks queued jobs one at a time and `subprocess`-execs today's `pipeline_model.sh`. No workers, no machines table. Kills the `ps -ef | grep` waiter scripts immediately. ~200 LoC.
- M2 — machines table + placement predicates, still single-machine. Verify by enqueueing an MLX-only model and confirming it stays queued.
- M3 — first real worker on the Dell Blackwell. Split `launch_vllm.sh` into the worker's `ensure_model_loaded`. Two machines in parallel. `live_progress.sh` retires.
- M4 — bench groups. Add `group_id`, `orch enqueue-group`, affinity ordering. Prove multi-bench-per-load with BFCL or τ² as the second bench row.
- M5 — MLX worker on Mac, second Spark. `spark_tp2` kind claims both Sparks atomically.
Explicit non-goals
- Not a CI system — no webhooks, no PR gates, no artifact registry.
- Not Kubernetes — three machines behind SSH.
- Not multi-tenant — one operator, trust boundary is SSH keys.
- Not a cost optimizer — we pack for speed, not $.
- Not a benchmark authoring platform — each bench is a config + a Python output parser; we don't unify BFCL's vs τ²'s vs Aider's metric schemas.
- No HA, no distributed SQLite, no Raft.
Full design and source trail: see vLLM, eugr's 2× Spark kit, cubist38/mlx-openai-server, HAL, and the benchmark repos linked above. This page will be updated in place as the gateway comes online and each machine joins the rig.