War stories
Things that weren't in any model card, docker hub README, or quick-start guide. Write-up so we (and anyone else benchmarking open agent-driver models on a single DGX Spark) don't hit them cold.
1. The Qwen3.6-27B image gauntlet
Qwen3.6-27B-FP8 shipped on 2026-04-22. Our benchmark rig was two days behind, and we learned exactly how fragmented the tooling is when a new Blackwell-native model drops.
The model needs two things at once that no prebuilt vLLM image has together:
- Transformers ≥ 5.x: Qwen3.6 re-uses the internal qwen3_5 model_type for its Gated DeltaNet hybrid arch. Transformers 4.57 doesn't know it and rejects the checkpoint at AutoConfig.from_pretrained.
- FP8 Cutlass W8A8 kernels compiled for sm_121: DGX Spark is GB10 / SM 12.1 (not SM 12.0). vLLM's W8A8 kernel guards with enable_sm120_only in csrc/cutlass_extensions/common.hpp, which rejects CUDA_ARCH == 1210 and trips a trap instruction during the KV-cache memory probe.
Five images, and each one fails on one of those two requirements:
| Image | Transformers | SM121 FP8 kernels | Failure |
|---|---|---|---|
| avarok/vllm-dgx-spark:v14 | 4.57.5 | yes | ValueError: model_type qwen3_5 not recognized |
| avarok/vllm-dgx-spark:latest | 4.57.6 | yes | same arch-not-recognized failure |
| nvcr.io/nvidia/vllm:26.03.post1-py3 | 4.57.5 | yes | same |
| vllm/vllm-openai:nightly-aarch64 | 5.5.4 | no (sm_120 only) | cutlass_gemm_caller.cuh:61 Error Internal |
| hellohal2064/vllm-dgx-spark-gb10:latest | 4.57.3 | yes | same as avarok |
There's no runtime env-var to force a Triton fallback for the FP8 path —
kernel selection happens at the C++ layer. The working combination existed
as a source recipe in
eugr/spark-vllm-docker,
which bundles a --tf5 build flag that installs Transformers 5.x
and compiles kernels with TORCH_CUDA_ARCH_LIST=12.1a. ~30 minutes
on the Spark, and out comes a working image (Transformers 5.6.2, vLLM 0.19.2rc1).
Forum poster Turrican confirmed the same recipe gets 19–21 tok/s on Qwen3.6
at single-request decode.
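The two requirements can be checked before a multi-gigabyte checkpoint is pulled. What follows is a minimal sketch, not part of our rig: the function name `preflight` and its `image_has_sm121_fp8` argument are hypothetical, and the caller has to know (from the image's build recipe) whether the FP8 kernels were compiled for sm_121. Inside a container, the first two inputs could plausibly come from `transformers.__version__` and `torch.cuda.get_device_capability()`.

```python
def preflight(transformers_version: str,
              device_capability: tuple[int, int],
              image_has_sm121_fp8: bool) -> list[str]:
    """Return a list of reasons this image cannot serve Qwen3.6-27B-FP8 on a Spark.

    All three inputs are supplied by the caller; nothing is probed here.
    """
    problems = []

    # Requirement 1: Transformers 5.x or newer, so that the qwen3_5
    # model_type is recognized when the config is loaded.
    major = int(transformers_version.split(".")[0])
    if major < 5:
        problems.append(
            f"transformers {transformers_version} < 5.x: "
            "model_type qwen3_5 will not be recognized"
        )

    # Requirement 2: on GB10 (SM 12.1) the FP8 W8A8 kernels must have been
    # compiled for sm_121; an sm_120-only build fails at runtime.
    if device_capability == (12, 1) and not image_has_sm121_fp8:
        problems.append("FP8 kernels not built for sm_121 (GB10 is SM 12.1)")

    return problems


if __name__ == "__main__":
    # Values taken from the table above.
    print(preflight("4.57.5", (12, 1), True))   # Transformers too old
    print(preflight("5.5.4", (12, 1), False))   # kernels built for sm_120 only
    print(preflight("5.6.2", (12, 1), True))    # the custom build
```

An empty list means both requirements are met; it says nothing about whether the rest of the image is healthy.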
Lesson: new Blackwell-native models will keep outrunning prebuilt images because the matrix of {kernel patches, Transformers version, CUDA toolchain} rarely converges in any one upstream. Budget 30 min for a custom build; don't assume avarok or nightly-aarch64 will "just work" for anything that landed in the last 90 days.
2. Your throughput metric is probably lying
Our first cut at a per-run vllm_run_summary.json computed
mean_generation_tps = total_generation_tokens / duration_s
over the whole eval run. That produced numbers like:
qwen3-coder-next gen_mean = 2.8 tok/s
qwen3-coder-next-fp8 gen_mean = 4.0 tok/s
devstral-small-2 gen_mean = 3.3 tok/s
We briefly concluded Qwen3-Coder-Next-NVFP4 had a perf bug (1.4× slower
than its FP8 sibling!) — until we looked at the raw Prometheus counter
stream. Diffing generation_tokens_total across adjacent samples
and filtering to windows where running>0 gave the real
in-flight throughput:
qwen3-coder-next gen_mean = 33.8 tok/s # not 2.8
qwen3-coder-next-fp8 gen_mean = 33.6 tok/s # not 4.0
devstral-small-2 gen_mean = 36.7 tok/s # not 3.3
qwen3.6-27b-fp8 gen_mean = 24.0 tok/s # not 22.9
The old divisor included idle time between tasks, server warmup, tear-down of one task sandbox before the next request arrived, and every second the TBLite harness spent in its own Python (sandbox setup, grader invocation, cleanup). For a 20-task run that was ~40% of the wall clock. The "NVFP4 is broken" bug never existed; the quants actually perform within 1% of each other.
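The difference between the two divisors is easy to see in code. This is a minimal sketch rather than our actual summary script: the `Sample` tuple and both function names are hypothetical, and it assumes each scrape yields a timestamp, the cumulative generation-token counter, and the running-requests gauge. It also assumes a window counts as in-flight when either endpoint shows running > 0, which is one reasonable reading of the filter described above.

```python
from typing import NamedTuple, Sequence


class Sample(NamedTuple):
    t: float           # scrape timestamp in seconds
    gen_tokens: float  # cumulative generation_tokens_total
    running: int       # number of requests running at scrape time


def wall_clock_tps(samples: Sequence[Sample]) -> float:
    """The misleading metric: total tokens over total wall time."""
    duration = samples[-1].t - samples[0].t
    if duration <= 0:
        return 0.0
    return (samples[-1].gen_tokens - samples[0].gen_tokens) / duration


def inflight_tps(samples: Sequence[Sample]) -> float:
    """Tokens per second over only the windows where the server was busy."""
    tokens = 0.0
    busy_s = 0.0
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.t - prev.t
        dtok = cur.gen_tokens - prev.gen_tokens
        if dt <= 0 or dtok < 0:
            continue  # out-of-order scrape or counter reset after a restart
        if prev.running > 0 or cur.running > 0:
            tokens += dtok
            busy_s += dt
    return tokens / busy_s if busy_s else 0.0


if __name__ == "__main__":
    # Synthetic trace: idle, two busy windows, one drain window, idle.
    trace = [
        Sample(0, 0, 0), Sample(10, 0, 0), Sample(20, 300, 1),
        Sample(30, 600, 1), Sample(40, 600, 0), Sample(50, 600, 0),
    ]
    print(wall_clock_tps(trace), inflight_tps(trace))
```

Skipping windows with a negative delta keeps a server restart from subtracting tokens from the total.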
Lesson: if you're computing throughput from cumulative counters, always divide by in-flight time, not wall time, and be skeptical of any tok/s number that sits well below a back-of-envelope memory-bandwidth estimate. At FP8, decoding one token at batch size 1 reads 27 B params × 1 byte = 27 GB of weights; against the Spark's roughly 273 GB/s of memory bandwidth that works out to ≈ 10 tok/s. Treat that as a sanity floor for a busy server: if your number is well under it, it's probably the metric that's broken, not the kernels.
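The back-of-envelope estimate in the lesson can be written down as one function. This is a sketch under stated assumptions: the helper name is hypothetical, the 273 GB/s figure is the DGX Spark's published memory bandwidth rather than something we measured, and the model ignores KV-cache reads, activations, and anything that decodes more than one token per weight pass.

```python
def bandwidth_bound_tps(params_billion: float,
                        bytes_per_param: float,
                        bandwidth_gb_s: float) -> float:
    """Tokens/s if every decoded token requires one full read of the weights."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params × bytes = GB
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # 27 B params at FP8 (1 byte each) on an assumed 273 GB/s.
    print(round(bandwidth_bound_tps(27, 1.0, 273.0), 1))
```

A summary number several times lower than this estimate is a cue to check the divisor before blaming the kernels.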
3. Timeouts are not grader_fails
For half our models, the posted rank reflects the wall-clock cap rather than actual capability. Before we bumped budgets, our first batch had models like nemotron-3-super-120b at 2/20 and minimax-m2.7-saricles at 0/20, with more than half of each run's tasks listed as timeout, not grader_fail. Those aren't comparable numbers. A grader_fail says "the model produced a wrong answer." A timeout says "the model never produced any answer, because wall-clock ran out."
Our rule of thumb, applied to every summary before we trust a rank:
- timeouts < grader_fails → legitimate. The model was given enough time; the score reflects capability. (e.g. kimi-linear 0/20, hermes-4.3-36b 3/20 — both decisive.)
- timeouts ≈ grader_fails → borderline. A budget bump might lift 1–2 tasks; rank order probably stable.
- timeouts > grader_fails → capped. The posted score is a floor on true ability, not a measurement. Rerun with a longer budget, or switch to a smaller/faster variant of the same family.
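The three cases above reduce to a small classifier. This is a sketch only: the function name is hypothetical, and the `tolerance` that decides what counts as "≈" is an assumption, since the rule of thumb does not pin a number on it.

```python
def classify_run(timeouts: int, grader_fails: int, tolerance: int = 1) -> str:
    """Label a run summary by comparing timeout and grader_fail counts.

    tolerance is an assumed threshold for "roughly equal", in tasks.
    """
    if abs(timeouts - grader_fails) <= tolerance:
        return "borderline"   # a budget bump might lift a task or two
    if timeouts < grader_fails:
        return "legitimate"   # the score reflects capability
    return "capped"           # the score is only a floor on true ability


if __name__ == "__main__":
    # Illustrative counts, not taken from our runs.
    for counts in [(2, 17), (9, 10), (12, 6)]:
        print(counts, classify_run(*counts))
```

On a 20-task run a tolerance of 1 or 2 seems reasonable; larger suites may want a proportional threshold instead.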
For the capped group we bumped task_timeout_s per model:
1200 → 2400 → 3600 → 5400. Nemotron-3-Super still timed out 10/20 at 5400 s
per task (90 min each). That's not a benchmark problem — it's a fact about
running a 120 B MoE on a single Spark. The model is genuinely top-tier; the
hardware just can't finish many agentic tool-call loops in under 90 min.
We label it "capacity-bound" rather than re-ranking it.
4. The 64K ctx floor that invalidates a run
hermes-agent refuses to initialize against a server whose advertised
context is below 64 K. TBLite has no such check. For MiniMax-M2.7-REAP-saricles,
we had to cut max_model_len to 32 K to leave KV headroom
(172 B MoE, 99 GB weights, concurrency 8). The smoke probes all came back
with "Failed to initialize agent: ctx window 32,768 below 64,000",
but TBLite ran anyway — and capped at 0/20, almost entirely from timeouts
on a server hermes-agent couldn't actually drive.
The score isn't meaningful. We keep the run on the page for completeness but label the model slow-on-single-spark and point at the MJPansa W4A16 variant (7 GB smaller weights → enough KV for the 64 K floor).
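A harness-side guard would have flagged this run before any task started. The sketch below assumes the server's /v1/models response carries a max_model_len field on each entry, as vLLM's OpenAI-compatible server does to the best of our knowledge; the function name is hypothetical and the 64,000 floor is taken from the hermes-agent error message quoted above. It takes the already-parsed JSON so it can be exercised without a live server.

```python
def ctx_floor_failures(models_payload: dict, floor_tokens: int = 64_000) -> list[str]:
    """Return one message per served model whose context is below the floor."""
    failures = []
    for card in models_payload.get("data", []):
        model_id = card.get("id", "<unknown>")
        ctx = card.get("max_model_len")  # assumed field name
        if ctx is None:
            failures.append(f"{model_id}: server did not report max_model_len")
        elif ctx < floor_tokens:
            failures.append(
                f"{model_id}: ctx window {ctx:,} below {floor_tokens:,}"
            )
    return failures


if __name__ == "__main__":
    payload = {"data": [{"id": "minimax-m2.7-reap", "max_model_len": 32768}]}
    for message in ctx_floor_failures(payload):
        print("refusing to start eval:", message)
```

Aborting on a non-empty result keeps a 0/20 that measures nothing out of the results table.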
5. Free perf flags we weren't using
Three community-known vLLM knobs that meaningfully move the needle on agentic loops, and that weren't in any image default we tried:
| Flag | Why it matters for agent-driving | Reported win |
|---|---|---|
| --enable-prefix-caching | TBLite tasks re-use the same system prompt + tool list across 20–40 turns. Without the prefix cache, every turn reprocesses that ~2 K-token prefix from scratch. On by default in some newer images; off in v0.16-era ones. | 1.3–2× on multi-turn eval wall clock; universal |
| --attention-backend FLASHINFER | Default attention backend on SM121-patched images is FLASH_ATTN; the FlashInfer path compiled with FLASHINFER_CUDA_ARCH_LIST=12.1a is faster for long prefill on the Gated DeltaNet / Qwen3-Next arches. | ~25 % on Qwen3-Coder-Next (forum HOW-TO: 42.7 t/s bs=1 vs our 33.8) |
| --speculative-config qwen3_next_mtp | Qwen3.6 and Qwen3-Next ship a Multi-Token Prediction head. With num_speculative_tokens=3, decoding drafts 3 tokens per forward pass and accepts the ones the base model would have picked. | ~1.9× at bs=1 (7.8 → 15.2 tok/s per forum); smaller at concurrency 8 |
We plumbed all three through as per-model YAML keys in
bench/models.yaml; prefix caching is default-on, the other two
are opt-in per model. This writeup captures the baseline (before-flags)
numbers so the A/B delta shows up cleanly in the next rerun cycle.
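For illustration, here is roughly how per-model keys could map onto server flags. The key names (`enable_prefix_caching`, `attention_backend`, `speculative`) are hypothetical and not necessarily those in bench/models.yaml; the flag names are the ones from the table, and the speculative settings are passed as a JSON string, which is the form vLLM's --speculative-config accepts as far as we know.

```python
import json


def vllm_perf_flags(model_cfg: dict) -> list[str]:
    """Translate one model's (hypothetical) YAML keys into vLLM CLI flags."""
    flags = []

    # Prefix caching is default-on; a model has to opt out explicitly.
    if model_cfg.get("enable_prefix_caching", True):
        flags.append("--enable-prefix-caching")

    # The attention backend and speculative decoding are opt-in per model.
    backend = model_cfg.get("attention_backend")
    if backend:
        flags += ["--attention-backend", backend]

    speculative = model_cfg.get("speculative")
    if speculative:
        flags += ["--speculative-config", json.dumps(speculative)]

    return flags


if __name__ == "__main__":
    cfg = {
        "attention_backend": "FLASHINFER",
        "speculative": {"method": "qwen3_next_mtp", "num_speculative_tokens": 3},
    }
    print(" ".join(vllm_perf_flags(cfg)))
```

Keeping the mapping in one function makes the before/after comparison a matter of toggling keys rather than editing launch scripts.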
6. Things we'd still like to try
- Expert parallelism across two Sparks for Nemotron-3-Super and MiniMax-M2.7. Might cut wall-clock in half without quality loss. Forum reports a working recipe; our rig isn't dual-node yet.
- Re-baseline every model after the perf-flag sweep. At least the qwen3.6, qwen3-next, and qwen3-coder-next entries should land meaningfully higher next round.
- Devstral FP8: BF16 at 24 B is memory-bound; Mistral ships an official FP8 build. (The Firworks/*-nvfp4 variant has known vLLM loader bugs per its own model card; avoid.)
- Per-task timeout derived from expected turns. A fixed 2400 s per task makes "fast model does 40 turns" look identical to "slow model that would do 12 turns given the chance." A dynamic budget (e.g. 15 s × expected_turns) would compare models more fairly.
Updates on this page correspond to issues we hit while running the shootout. If something here turned out to be wrong or has been fixed upstream — open an issue on the repo.