War stories

Things that weren't in any model card, Docker Hub README, or quick-start guide, written up so that we (and anyone else benchmarking open agent-driver models on a single DGX Spark) don't hit them cold.

1. The Qwen3.6-27B image gauntlet

Qwen3.6-27B-FP8 shipped on 2026-04-22. Our benchmark rig was two days behind, and we learned exactly how fragmented the tooling is when a new Blackwell-native model drops.

The model needs two things at once that no prebuilt vLLM image has together:

  1. Transformers ≥ 5.x: Qwen3.6 re-uses the internal qwen3_5 model_type for its Gated DeltaNet hybrid arch. Transformers 4.57 doesn't know it and rejects the checkpoint at AutoConfig.from_pretrained.
  2. FP8 Cutlass W8A8 kernels compiled for sm_121: DGX Spark is GB10 / SM 12.1 (not SM 12.0). vLLM's W8A8 kernels are guarded by enable_sm120_only in csrc/cutlass_extensions/common.hpp, which rejects __CUDA_ARCH__ == 1210 and trips a trap instruction during the KV-cache memory probe.

Five images, each missing one requirement or the other:

Image                                    Transformers  SM121 FP8 kernels  Failure
avarok/vllm-dgx-spark:v14                4.57.5        yes                ValueError: model_type qwen3_5 not recognized
avarok/vllm-dgx-spark:latest             4.57.6        yes                same arch-not-recognized failure
nvcr.io/nvidia/vllm:26.03.post1-py3      4.57.5        yes                same
vllm/vllm-openai:nightly-aarch64         5.5.4         no (sm_120 only)   cutlass_gemm_caller.cuh:61 Error Internal
hellohal2064/vllm-dgx-spark-gb10:latest  4.57.3        yes                same as avarok
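
The two requirements can be checked against an image's advertised versions before pulling it. A minimal sketch, assuming you already know each image's Transformers version and whether its FP8 kernels were built for sm_121 (the rows below are copied from the table; the function name is ours, not part of any tool):

```python
# Rows copied from the table above:
# (image, Transformers version, has FP8 kernels compiled for sm_121).
IMAGES = [
    ("avarok/vllm-dgx-spark:v14", "4.57.5", True),
    ("avarok/vllm-dgx-spark:latest", "4.57.6", True),
    ("nvcr.io/nvidia/vllm:26.03.post1-py3", "4.57.5", True),
    ("vllm/vllm-openai:nightly-aarch64", "5.5.4", False),
    ("hellohal2064/vllm-dgx-spark-gb10:latest", "4.57.3", True),
]


def unmet_requirements(transformers_version: str, has_sm121_fp8: bool) -> list[str]:
    """Return the requirements an image fails; an empty list means both are met."""
    unmet = []
    major = int(transformers_version.split(".")[0])
    if major < 5:
        unmet.append("Transformers < 5.x (qwen3_5 model_type not recognized)")
    if not has_sm121_fp8:
        unmet.append("no FP8 kernels compiled for sm_121")
    return unmet


if __name__ == "__main__":
    for image, tf_version, sm121 in IMAGES:
        print(f"{image}: {unmet_requirements(tf_version, sm121) or 'ok'}")
```

Every row in the table fails exactly one of the two checks, which is why no single prebuilt image loads the model.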

There's no runtime env-var to force a Triton fallback for the FP8 path — kernel selection happens at the C++ layer. The working combination existed as a source recipe in eugr/spark-vllm-docker, which bundles a --tf5 build flag that installs Transformers 5.x and compiles kernels with TORCH_CUDA_ARCH_LIST=12.1a. ~30 minutes on the Spark, and out comes a working image (Transformers 5.6.2, vLLM 0.19.2rc1). Forum poster Turrican confirmed the same recipe gets 19–21 tok/s on Qwen3.6 at single-request decode.

Lesson: new Blackwell-native models will keep outrunning prebuilt images because the matrix of {kernel patches, Transformers version, CUDA toolchain} rarely converges in any one upstream. Budget 30 min for a custom build; don't assume avarok or nightly-aarch64 will "just work" for anything that landed in the last 90 days.

2. Your throughput metric is probably lying

Our first cut at a per-run vllm_run_summary.json computed mean_generation_tps = total_generation_tokens / duration_s over the whole eval run. That produced numbers like:

qwen3-coder-next        gen_mean = 2.8 tok/s
qwen3-coder-next-fp8    gen_mean = 4.0 tok/s
devstral-small-2        gen_mean = 3.3 tok/s

We briefly concluded Qwen3-Coder-Next-NVFP4 had a perf bug (1.4× slower than its FP8 sibling!) — until we looked at the raw Prometheus counter stream. Diffing generation_tokens_total across adjacent samples and filtering to windows where running>0 gave the real in-flight throughput:

qwen3-coder-next        gen_mean = 33.8 tok/s     # not 2.8
qwen3-coder-next-fp8    gen_mean = 33.6 tok/s     # not 4.0
devstral-small-2        gen_mean = 36.7 tok/s     # not 3.3
qwen3.6-27b-fp8         gen_mean = 24.0 tok/s     # not 22.9

The old divisor included idle time between tasks, server warmup, the tear-down of one task sandbox before the next request arrived, and every second the TBLite harness spent in its own Python (sandbox setup, grader invocation, cleanup). For a 20-task run that was ~40% of the wall clock. The "NVFP4 is broken" bug never existed; the quants actually perform within 1% of each other.
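
The difference between the two divisors is easy to see in code. This is a minimal sketch, not our actual summary script: it assumes each scrape has been parsed into a (timestamp, cumulative generation-token counter, running-request gauge) triple, and it treats a window as in-flight when the gauge is above zero at the end of the window. The Sample type and field names are illustrative.

```python
from typing import NamedTuple, Sequence


class Sample(NamedTuple):
    t: float                        # scrape time in seconds
    generation_tokens_total: float  # cumulative counter at scrape time
    running: float                  # requests in flight at scrape time


def throughput(samples: Sequence[Sample]) -> tuple[float, float]:
    """Return (wall_clock_tps, in_flight_tps) for a series of counter scrapes."""
    if len(samples) < 2:
        return 0.0, 0.0
    total_tokens = 0.0
    busy_tokens = 0.0
    busy_seconds = 0.0
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.t - prev.t
        delta = cur.generation_tokens_total - prev.generation_tokens_total
        if dt <= 0 or delta < 0:
            continue  # out-of-order scrape or counter reset (server restart)
        total_tokens += delta
        if cur.running > 0:  # assumption: gauge at window end marks the window busy
            busy_tokens += delta
            busy_seconds += dt
    wall_seconds = samples[-1].t - samples[0].t
    wall_tps = total_tokens / wall_seconds if wall_seconds > 0 else 0.0
    in_flight_tps = busy_tokens / busy_seconds if busy_seconds > 0 else 0.0
    return wall_tps, in_flight_tps


if __name__ == "__main__":
    scrapes = [
        Sample(0, 0, 0), Sample(10, 0, 0),       # idle: harness setting up
        Sample(20, 300, 1), Sample(30, 600, 1),  # decoding
        Sample(40, 600, 0), Sample(50, 600, 0),  # idle: grader and cleanup
    ]
    print(throughput(scrapes))
```

On the synthetic scrapes above, the wall-clock figure is 12 tok/s while the in-flight figure is 30 tok/s, even though the server decoded at the same speed the whole time it had work.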

Lesson: if you're computing throughput from cumulative counters, always divide by in-flight time, not wall time, and be skeptical of any tok/s number that sits below the model's memory-bandwidth-bound decode rate. At FP8, 27 B params × 1 byte × 1 pass is 27 GB of weights read per decoded token; at the Spark's roughly 273 GB/s of memory bandwidth that is a ~10 tok/s floor at batch size 1. If your number is under that, it's probably the metric that's broken, not the kernels.
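
The sanity-check arithmetic fits in a few lines. A rough sketch, assuming a dense model whose full weights are read once per decoded token and ignoring KV-cache traffic; the 273 GB/s value is the published DGX Spark memory-bandwidth spec, taken here as an assumption rather than something we measured:

```python
SPARK_BANDWIDTH_GB_S = 273.0  # assumption: published spec figure, not measured on our rig


def bandwidth_bound_tps(params_billion: float, bytes_per_param: float,
                        bandwidth_gb_s: float = SPARK_BANDWIDTH_GB_S) -> float:
    """Batch-1 decode rate if every token requires one full read of the weights."""
    weight_gb = params_billion * bytes_per_param  # GB read per decoded token
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # 27 B params at FP8 (1 byte each)
    print(f"{bandwidth_bound_tps(27, 1.0):.1f} tok/s")
```

For MoE models the same estimate would use the active parameter count per token rather than the total, so the figure is only a rough guide there.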

3. Timeouts are not grader_fails

Half our models post rankings that reflect wall-clock caps rather than actual capability. Before we bumped budgets, our first batch had models like nemotron-3-super-120b at 2/20 and minimax-m2.7-saricles at 0/20, with more than half of each run's tasks listed as timeout, not grader_fail. Those aren't comparable numbers. A grader_fail says "the model produced a wrong answer." A timeout says "the model never produced any answer, because wall-clock ran out."

Our rule of thumb, applied to every summary before we trust a rank:

For the capped group we bumped task_timeout_s per model: 1200 → 2400 → 3600 → 5400. Nemotron-3-Super still timed out 10/20 at 5400 s per task (90 min each). That's not a benchmark problem — it's a fact about running a 120 B MoE on a single Spark. The model is genuinely top-tier; the hardware just can't finish many agentic tool-call loops in under 90 min. We label it "capacity-bound" rather than re-ranking it.
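
Separating the two failure kinds in a run summary is mostly bookkeeping. The sketch below is illustrative rather than our harness code: the outcome labels "pass", "timeout", and "grader_fail" and the more-than-half threshold are assumptions chosen to match the description above, and the ladder values are the task_timeout_s steps we used.

```python
from collections import Counter
from typing import Optional

TIMEOUT_LADDER_S = [1200, 2400, 3600, 5400]  # task_timeout_s steps from the text


def summarize(outcomes: list[str], timeout_share_limit: float = 0.5) -> dict:
    """Tally per-task outcomes and flag runs where timeouts dominate the score."""
    counts = Counter(outcomes)
    n = len(outcomes)
    timeout_share = counts["timeout"] / n if n else 0.0
    return {
        "passed": counts["pass"],
        "grader_fail": counts["grader_fail"],
        "timeout": counts["timeout"],
        "timeout_share": timeout_share,
        # Hypothetical flag: more than half the tasks never produced an answer.
        "timeout_dominated": timeout_share > timeout_share_limit,
    }


def next_timeout_s(current: int) -> Optional[int]:
    """Next rung on the ladder, or None once the ladder is exhausted."""
    higher = [t for t in TIMEOUT_LADDER_S if t > current]
    return higher[0] if higher else None


if __name__ == "__main__":
    run = ["pass"] * 2 + ["timeout"] * 12 + ["grader_fail"] * 6
    print(summarize(run), next_timeout_s(1200))
```

A run flagged this way is a candidate for a rerun at the next rung rather than a data point for the ranking.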

4. The 64K ctx floor that invalidates a run

hermes-agent refuses to initialize against a server whose advertised context is below 64 K. TBLite has no such check. For MiniMax-M2.7-REAP-saricles, we had to cut max_model_len to 32 K to leave KV headroom (172 B MoE, 99 GB weights, concurrency 8). The smoke probes all came back with "Failed to initialize agent: ctx window 32,768 below 64,000", but TBLite ran anyway — and capped at 0/20, almost entirely from timeouts on a server hermes-agent couldn't actually drive.

The score isn't meaningful. We keep the run on the page for completeness but label the model slow-on-single-spark and point at the MJPansa W4A16 variant (7 GB smaller weights → enough KV for the 64 K floor).
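
A guard on the harness side would have stopped this run before it burned 20 task budgets. A minimal sketch, assuming the server's model listing has already been fetched and parsed into a dict, and that each entry carries a max_model_len field (an assumption about the payload shape; adjust to whatever your server reports). The 64,000 floor is the value quoted in the hermes-agent error message.

```python
MIN_CONTEXT_TOKENS = 64_000  # floor quoted in the hermes-agent error message


def check_context_floor(models_payload: dict, floor: int = MIN_CONTEXT_TOKENS) -> int:
    """Return the advertised context length, or raise if it is below the floor."""
    entries = models_payload.get("data", [])
    if not entries:
        raise RuntimeError("server lists no models")
    ctx = entries[0].get("max_model_len")  # assumed field name
    if ctx is None:
        raise RuntimeError("server did not report max_model_len")
    if ctx < floor:
        raise RuntimeError(f"ctx window {ctx:,} below {floor:,}; refusing to start the run")
    return ctx


if __name__ == "__main__":
    payload = {"data": [{"id": "example-model", "max_model_len": 32768}]}
    try:
        check_context_floor(payload)
    except RuntimeError as err:
        print(err)
```

Failing fast here turns an uninterpretable 0/20 into an explicit "not run" entry.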

5. Free perf flags we weren't using

Three community-known vLLM knobs that meaningfully move the needle on agentic loops, and that weren't in any image default we tried:

Flag: --enable-prefix-caching
  Why it matters for agent-driving: TBLite tasks re-use the same system prompt + tool list across 20–40 turns. Without the prefix cache, every turn reprocesses that ~2 K-token prefix from scratch. On by default in some newer images; off in v0.16-era ones.
  Reported win: 1.3–2× on multi-turn eval wall clock; universal

Flag: --attention-backend FLASHINFER
  Why it matters for agent-driving: Default attention backend on SM121-patched images is FLASH_ATTN; the FlashInfer path compiled with FLASHINFER_CUDA_ARCH_LIST=12.1a is faster for long prefill on the Gated DeltaNet / Qwen3-Next arches.
  Reported win: ~25 % on Qwen3-Coder-Next (forum HOW-TO: 42.7 t/s bs=1 vs our 33.8)

Flag: --speculative-config qwen3_next_mtp
  Why it matters for agent-driving: Qwen3.6 and Qwen3-Next ship a Multi-Token Prediction head. With num_speculative_tokens=3, decoding drafts 3 tokens per forward pass and accepts the ones the base model would have picked.
  Reported win: ~1.9× at bs=1 (7.8 → 15.2 tok/s per forum); smaller at concurrency 8

We plumbed all three through as per-model YAML keys in bench/models.yaml; prefix caching is default-on, the other two are opt-in per model. This writeup captures the baseline (before-flags) numbers so the A/B delta shows up cleanly in the next rerun cycle.
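
Turning per-model keys into server arguments can be as small as the sketch below. The key names (enable_prefix_caching, attention_backend, speculative_config) are hypothetical stand-ins for whatever bench/models.yaml actually uses, and passing the speculative settings as a JSON object with method and num_speculative_tokens is an assumption; check vllm serve --help for the form your version expects.

```python
import json


def build_vllm_args(cfg: dict) -> list[str]:
    """Map one model's config entry (already loaded from YAML) to extra CLI args."""
    args: list[str] = []
    # Prefix caching is default-on; a model can opt out explicitly.
    if cfg.get("enable_prefix_caching", True):
        args.append("--enable-prefix-caching")
    # The other two flags are opt-in per model.
    backend = cfg.get("attention_backend")
    if backend:
        args += ["--attention-backend", backend]
    spec = cfg.get("speculative_config")
    if spec:
        args += ["--speculative-config", json.dumps(spec)]
    return args


if __name__ == "__main__":
    model_cfg = {
        "attention_backend": "FLASHINFER",
        "speculative_config": {"method": "qwen3_next_mtp", "num_speculative_tokens": 3},
    }
    print(build_vllm_args(model_cfg))
```

Keeping the flags as data makes the A/B rerun a config diff rather than a launch-script edit.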

6. Things we'd still like to try

Updates on this page correspond to issues we hit while running the shootout. If something here turned out to be wrong or has been fixed upstream — open an issue on the repo.