Roadmap

Where this bench goes next: three more machines behind an LLM gateway, a wider set of benchmarks beyond TBLite, and a queue-style orchestrator that keeps every piece of hardware hot. Everything below is research in progress, with links, dates, and claims traced back to primary sources so we can revisit as the landscape churns. Compiled 2026-04-21.

New hardware & models

We're about to get access to three more boxes behind an LLM gateway. Each has a distinct capability profile, so the goal isn't "run everything everywhere" — it's to choose a handful of models per machine that exercise the hardware's strengths and give us a genuinely different lineage to compare against the Spark set.

Dell

Pro Max Tower T2 · 96 GB VRAM

Dell Pro Max T2 with a single RTX PRO 6000 Blackwell Workstation Edition: 96 GB GDDR7, 1.6 TB/s, SM120, 600 W. Intel Core Ultra 9 285K host. Not dual-GPU — one very fast card with real HBM-class bandwidth.

Top adds to the benchmark

| Model | Quant | Parser | Why |
| --- | --- | --- | --- |
| GLM-4.7-Flash | NVFP4, ~45 GB | glm47 + glm45 reasoning | New SOTA on TB2 (41%) & SWE-Multilingual (66.7%). Direct upgrade path from our GLM-4.5-Air Spark run. |
| Qwen3-Coder-Next / Qwen3.6-35B-A3B | FP8 (safer on SM120), ~35–45 GB | qwen3_coder | Best open agentic coder in the 30–80B class. FP8 path is rock-solid on PRO 6000. |
| Hermes-4-70B (re-run) | FP8, 68 GB | hermes | Already in our Spark set, but GDDR7 bandwidth makes the dense 70B snappy. Pure dense reference. |
| MiniMax-M2.5-NVFP4 | NVFP4, ~58 GB | minimax_m2 | NVIDIA-packaged NVFP4 (best-tested kernel) + A/B partner for our saricles M2.7 REAP. |
| gpt-oss-120B | MXFP4, ~60 GB | hermes or pythonic | Most widely-deployed OSS reference point in the ~100B class. vLLM 0.17+ Marlin kernel. |

Gotchas: SM120 NVFP4 kernels are still maturing (vLLM #35519, CUTLASS #3096). For MLA models (DeepSeek-V3.2, GLM-5), vLLM's MLA path is still Hopper-only on SM120 — use SGLang. Pin driver 580.126.20 + CUDA 13; avoid 590.x (CUDA-graph deadlocks).
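The driver pin is cheap to enforce before every launch. A sketch of a preflight guard — `check_driver` is a hypothetical helper of ours, and the version string would normally come from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`:

```shell
# Preflight guard for the SM120 boxes: refuse to launch on a 590.x driver
# (CUDA-graph deadlocks noted above), accept the pinned 580.x line, warn
# on anything else. `check_driver` is an illustrative helper, not part of
# any existing tooling; feed it the nvidia-smi driver_version output.
check_driver() {
  case "$1" in
    590.*) echo "FAIL: driver $1 is on the 590.x line (CUDA-graph deadlocks)"; return 1 ;;
    580.*) echo "OK: driver $1 matches the pinned 580.x line"; return 0 ;;
    *)     echo "WARN: driver $1 is untested here"; return 0 ;;
  esac
}

check_driver "580.126.20"
```

Wiring this into the launch scripts means a bad driver upgrade fails loudly at enqueue time rather than as a mid-benchmark hang.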

2× Spark

Paired DGX Sparks · ~238 GB aggregate

Two GB10 nodes over ConnectX-7 200 Gb/s RoCE. PCIe 5.0 x4 per NIC caps real throughput at ~24 GB/s at MTU 9000. NVIDIA markets this pairing as handling models up to 400B parameters. Community caveat: `--tensor-parallel-size 2` is still fragile — prefer pipeline parallelism unless you use eugr's Docker kit exactly.

What 2× Spark unlocks that 1× cannot

| Model | Quant | Parser | Why |
| --- | --- | --- | --- |
| Qwen3-Coder-480B | NVFP4, ~240 GB | qwen3_coder | Frontier OSS agentic coder; Qwen claims Claude-Sonnet-4-class on agentic coding. Highest-ROI add. |
| MiniMax-M2.7 full (non-REAP) | NVFP4, ~131 GB | minimax_m2 | A/B against saricles REAP-172B to measure distillation loss directly. |
| DeepSeek-V3.2 NVFP4 | NVFP4, ~170 GB | deepseek_v3_1 | First OSS model with thinking integrated into tool use; different training lineage. Caveat: flash_mla + NVFP4 KV bugs still open. |
| MiniMax-M2.1 | NVFP4, 122 GB | minimax_m2 | Listed as oversized in our single-Spark catalog; now it runs. |

Winning recipe, per the NV MiniMax-M2.7 thread: `--kv-cache-dtype fp8 --attention-backend flashinfer -tp 2 --distributed-executor-backend ray --max-model-len 196608 --load-format fastsafetensors`, with `VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass` in the environment. The community consistently reports that AWQ 4-bit actually beats NVFP4 on generation t/s for M2.7 on 2× Spark — worth A/B'ing.
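Assembled into a single launch script, the recipe looks roughly like this — a sketch only: `MODEL_PATH` and the port are placeholders of ours, the flags are as reported in the thread, and none of it is verified on our pair yet:

```shell
# 2x Spark launch for MiniMax-M2.7 NVFP4, per the thread's recipe.
# Assumes Ray is already started across both nodes; run on the head node.
# MODEL_PATH and --port are placeholders, everything else is the quoted recipe.
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass

vllm serve "$MODEL_PATH" \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  -tp 2 \
  --distributed-executor-backend ray \
  --max-model-len 196608 \
  --load-format fastsafetensors \
  --port 8000
```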

Mac

Mac Studio M3 Ultra · 512 GB

80-core GPU, 512 GB unified, 819 GB/s. Apple pulled the 512 GB SKU in March 2026 — current store tops at 256 GB, so this is a legacy config. MLX format is its own universe: 3/4/6/8-bit integer quants, not NVFP4/FP8. We talk to it via cubist38/mlx-openai-server (richer parser set than stock mlx_lm.server) or the newer vllm-mlx.

The only box that fits these

| Model | Quant | Parser | Why |
| --- | --- | --- | --- |
| Kimi-K2.6 | MLX 3.6-bit, ~470 GB | kimi_k2 (⚠ parser missing in cubist) | Flagship 1T/32B-active MoE. Day-0 MLX release. SOTA on TB2 (66.7) & SWE-Bench Pro (58.6). Only this machine can run it locally. |
| Qwen3-Coder-480B MLX | MLX 6-bit, ~270 GB | qwen3_coder | Cross-arch comparison point against the 2×-Spark NVFP4 run. |
| DeepSeek-V3.2 (self-convert if missing) | MLX 4-bit, ~380 GB | deepseek_v3_1 | Third lineage alongside Kimi/Qwen on the Mac. |
| Qwen3-235B-A22B | MLX 4-bit, ~272 GB | qwen3_moe | Community baseline — Awni Hannun clocks ~24 t/s on M3 Ultra. |
| Qwen3.5-122B-A10B | MLX 8-bit, ~134 GB | qwen3_coder | High-quality reasoning reference, ~42 t/s community. |

The real Mac story isn't generation speed — it's prefill. Hermes' fat system prompt plus long tool-call traces balloon TTFT on MLX in a way they don't on Blackwell NVFP4 (Newport critique, echoed on MacStories). TBLite wall-clock will expose this honestly — that's a feature. Expect DeepSeek-V3 at ~20 t/s, Qwen3-235B at ~24 t/s, and Kimi-K2-Thinking (MLX 4.25-bit) a tight fit at 512 GB.
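To make the prefill penalty visible per box, it's enough to log token arrival times from the streaming endpoint and report TTFT separately from decode rate. A minimal sketch — the `(arrival_time, token)` pairs would come from an OpenAI-compatible streaming client with time zero at request send; here they're synthetic:

```python
def stream_stats(events):
    """Split a token stream into TTFT and steady-state decode rate.

    `events` is an iterable of (arrival_time_seconds, token) pairs,
    with time 0 taken as the moment the request was sent. Returns
    (ttft_s, decode_tokens_per_s). Decode rate is measured from the
    first token onward, so prefill cost never inflates it.
    """
    times = [t for t, _ in events]
    if not times:
        raise ValueError("empty stream")
    ttft = times[0]
    if len(times) < 2 or times[-1] == ttft:
        return ttft, 0.0
    decode_tps = (len(times) - 1) / (times[-1] - ttft)
    return ttft, decode_tps

# Synthetic trace: 8 s prefill (the MLX failure mode), then 25 t/s decode.
events = [(8.0 + i * 0.04, f"tok{i}") for i in range(100)]
ttft, tps = stream_stats(events)
print(f"TTFT {ttft:.1f}s, decode {tps:.0f} t/s")
```

Averaging a single tokens-per-second number over the whole request would hide exactly the TTFT blowup this section is about; keeping the two numbers separate is the point.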

One underrated candidate across all three: LongCat-Flash (uses longcat parser in hermes-agent) — a genuinely different training lineage. Worth one slot once we shake out which of the above actually run clean.

Benchmarks in the wild — what we're missing

TBLite is a terminal-loop signal. That's one axis of a roughly seven-axis space: format brittleness, multi-turn dialogue, cost, reliability, contamination, long-context, and safety. Here's what else is out there and where our harness has blind spots.

The landscape

| Benchmark | Maintainer | What it measures |
| --- | --- | --- |
| Terminal-Bench 2.0 | Laude Institute | 89 Docker-sandboxed terminal tasks across SWE, security, bio, gaming |
| TBLite | OpenThoughts + Nous | 100-task calibrated subset; what we run today |
| SWE-bench Verified | Princeton | 500 human-verified GitHub issues; patch applies and tests pass |
| SWE-rebench | Nebius | Live, monthly-refreshed SWE-bench (contamination-resistant) |
| SWE-bench Pro | Scale AI | 1,865 tasks from proprietary startup codebases |
| τ / τ² / τ³-bench | Sierra Research | Simulated customer-service tool-agent-user dialogues |
| BFCL v4 | Berkeley (Gorilla) | AST-matched function calling + agentic web-search/memory/format |
| LiveCodeBench | UC Berkeley | Contamination-free competitive programming (date-windowed) |
| GAIA / GAIA2 | Meta + HF | Real-world assistant questions needing browsing + tools + multimodal |
| OSWorld-Verified | XLang Lab | 369 GUI tasks across desktop apps/OSes |
| Aider Polyglot | Aider | 225 Exercism problems × 6 languages; tests edit-format fidelity |
| MLE-Bench | OpenAI | 75 Kaggle competitions end-to-end |
| HAL | Princeton PLI | Meta-harness — re-runs many benchmarks with cost/reliability/safety axes |
| AA Intelligence Index v4 | Artificial Analysis | Weighted aggregate including τ²-Telecom & TB-Hard |

What TBLite alone does not measure

Easy wins (run against the same vLLM endpoints)

  1. BFCL v4. pip-installable; point it at an OpenAI-compatible base URL. ~1 hour setup, <30 min per model. Catches tool-call brittleness directly. repo
  2. τ²-bench retail + airline. Python runner, config-driven, OpenAI-compatible. Adds multi-turn user-sim signal. ~2–3 h setup. repo
  3. Aider Polyglot. `aider --benchmark`. 225 Exercism problems × 6 languages. ~1 h setup. repo
  4. Full Terminal-Bench 2.0. Same Harbor harness we already run; just switch the dataset arg. Effectively free, and gets us apples-to-apples numbers against the public leaderboard.
  5. SWE-bench Verified via mini-SWE-agent. Close to our Hermes loop; the most-cited OSS number. ~half-day setup, long evaluation.
  6. HAL wrap. Wrap hermes-agent once as a HAL agent and get reliability, cost, and 11 benchmarks (SWE-bench, Cybench, CORE-bench, USACO, AgentHarm, GAIA…) for marginal effort. repo
  7. Log trajectories + post-hoc classify failures. Not a new benchmark — just classify raw TBLite tool-call streams into (malformed JSON, wrong args, grader mismatch, timeout, hallucinated tool). Single highest-leverage change for understanding why models fail.
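Item 7 needs only a small classifier over the logged tool-call strings. A sketch, assuming each trajectory step carries the raw call text, the declared tool schemas, and timeout/grader flags — the field names (`raw_call`, `timed_out`, `agent_passed`, `grader_passed`) and the tool table are illustrative, not the actual TBLite trajectory schema:

```python
import json

# Failure buckets from the note above. The tool registry here stands in
# for whatever schemas the harness actually declares.
TOOLS = {"bash": {"command"}, "write_file": {"path", "content"}}

def classify(step, tools=TOOLS):
    """Map one trajectory step to a failure bucket (or 'ok')."""
    if step.get("timed_out"):
        return "timeout"
    try:
        call = json.loads(step["raw_call"])
    except (json.JSONDecodeError, TypeError):
        return "malformed_json"
    if call.get("name") not in tools:
        return "hallucinated_tool"
    # Only flags unknown argument names; missing required args would need
    # a fuller schema than this sketch carries.
    if set(call.get("arguments", {})) - tools[call["name"]]:
        return "wrong_args"
    # Agent believed it succeeded but the grader disagreed.
    if step.get("agent_passed") and not step.get("grader_passed"):
        return "grader_mismatch"
    return "ok"

print(classify({"raw_call": '{"name": "rm_rf", "arguments": {}}'}))  # hallucinated_tool
```

Run over every logged trajectory, a `collections.Counter` of these buckets per model turns "score 41%" into "fails mostly on malformed JSON", which is the actionable part.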

Skipped for now (cost not yet justified): OSWorld (needs VM fleet), MLE-Bench (slow Kaggle datasets), GAIA2 (needs browser-use stack), WebArena (self-hosted site fleet).

Field notes — how people actually run these boxes

Dell Pro Max T2 · RTX PRO 6000 Blackwell

Two DGX Sparks paired

Mac Studio M3 Ultra 512 GB · MLX

Cross-cutting

Queue-aware, machine-aware orchestrator

Three machines × ten models × five benchmarks is too many `ps -ef | grep` loops. We want to enqueue everything once and have it run repeatably in parallel across hardware, batching benches per model load where that saves warmup cost.

Design at a glance

Stack choice: roll our own

Evaluated: Airflow / Prefect / Dagster (DAG-centric, assume ephemeral workers, bad fit for machine affinity + exclusive-resource semantics); Hatchet (nice but needs Postgres + its own server); rq / procrastinate (no machine-affinity primitives); Ray (actors + resource labels fit, but GCS is finicky across heterogeneous aarch64/x86/ARM64-Mac, and we lose "just SSH and docker run" simplicity). Verdict: ~600–800 LoC of Python 3.11 + stdlib + httpx + pyyaml + sqlite3. The problem — 3 machines × 10 models × 5 benches — is small enough that a one-file scheduler is cheaper to operate.

Migration path

  1. M1 — SQLite + CLI over existing shell. `orch enqueue <model> tblite` writes a job row; a trivial loop picks queued jobs one at a time and subprocess-execs today's `pipeline_model.sh`. No workers, no machines table. Kills the `ps -ef | grep` waiter scripts immediately. ~200 LoC.
  2. M2 — machines table + placement predicates, still single-machine. Verify by enqueueing an MLX-only model and confirming it stays queued.
  3. M3 — first real worker on the Dell Blackwell. Split `launch_vllm.sh` into the worker's `ensure_model_loaded`. Two machines in parallel. `live_progress.sh` retires.
  4. M4 — bench groups. Add `group_id`, `orch enqueue-group`, affinity ordering. Prove multi-bench-per-load with BFCL or τ² as the second bench row.
  5. M5 — MLX worker on Mac, second Spark. `spark_tp2` kind claims both Sparks atomically.
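The M1 core is small enough to show whole — a sketch of the job table plus enqueue and claim, assuming jobs are (model, bench) pairs; table and column names are ours for illustration, and the real loop would subprocess-exec `pipeline_model.sh` for each claimed job:

```python
import sqlite3

# M1 sketch: one SQLite table, enqueue + claim. Single-process, so the
# select-then-update claim needs no locking tricks yet.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
  id     INTEGER PRIMARY KEY,
  model  TEXT NOT NULL,
  bench  TEXT NOT NULL,
  state  TEXT NOT NULL DEFAULT 'queued'  -- queued | running | done | failed
)"""

def connect(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db

def enqueue(db, model, bench):
    cur = db.execute("INSERT INTO jobs (model, bench) VALUES (?, ?)", (model, bench))
    db.commit()
    return cur.lastrowid

def claim_next(db):
    """Flip the oldest queued job to running; None when the queue is idle."""
    row = db.execute(
        "SELECT id, model, bench FROM jobs WHERE state = 'queued' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row:
        db.execute("UPDATE jobs SET state = 'running' WHERE id = ?", (row[0],))
        db.commit()
    return row

db = connect()
enqueue(db, "glm-4.7-flash", "tblite")
enqueue(db, "hermes-4-70b", "bfcl")
print(claim_next(db))  # (1, 'glm-4.7-flash', 'tblite')
```

M2's placement predicates then become one more `WHERE` clause on `claim_next`, which is the argument for SQLite over a real queue broker at this scale.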

Explicit non-goals

Full design and source trail: see vLLM, eugr's 2× Spark kit, cubist38/mlx-openai-server, HAL, and the benchmark repos linked above. This page will be updated in place as the gateway comes online and each machine joins the rig.