Reproduce this
Everything needed to re-run the shootout on your own DGX Spark (or any NVFP4-capable
Blackwell box). The workflow now runs through a small SQLite-backed orchestrator
(`bench/orch`), exposed as mise tasks.
Prereqs
- A DGX Spark (or GB10) reachable via SSH as `dgx-spark`, with Docker and the `avarok/vllm-dgx-spark:v14` image pulled.
- A client box with hermes-agent installed. `bench/run_tblite.sh` shells out to `~/.hermes/hermes-agent/environments/benchmarks/tblite/tblite_env.py`.
- mise on the client box (for running `mise run orch:*`); `python3.11+` and `pyyaml` for the orchestrator itself.
- HuggingFace cache on the Spark populated with the model weights you want to test.
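The last prereq is easy to spot-check: the Hugging Face hub cache stores each repo under a predictable directory name (`models--<org>--<name>` inside `hub/`). A tiny helper, not part of this repo, to compute that name:

```python
def hf_cache_dir(repo_id: str) -> str:
    """Directory name for a repo inside ~/.cache/huggingface/hub,
    following the standard hub cache layout."""
    return "models--" + repo_id.replace("/", "--")
```

So `GadflyII/Qwen3-Coder-Next-NVFP4` should show up on the Spark as `~/.cache/huggingface/hub/models--GadflyII--Qwen3-Coder-Next-NVFP4`; a quick `ssh dgx-spark 'ls ~/.cache/huggingface/hub'` confirms it.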
First-time setup
git clone … kimi-2.6-local && cd kimi-2.6-local
mise trust . # trust the mise.toml tasks file
mise tasks # confirm orch:* tasks are visible
# one-time install of atroposlib (ships TBLite) into the hermes-agent venv
cd ~/.hermes/hermes-agent && uv pip install --python venv/bin/python3 \
'atroposlib @ git+https://github.com/NousResearch/atropos.git'
# cache at least one model on the Spark (example: Qwen3-Coder-Next NVFP4)
ssh dgx-spark 'docker run -d --rm --name dl \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HF_HUB_ENABLE_HF_TRANSFER=1 python:3.12-slim \
sh -c "pip install -q huggingface_hub[hf_transfer] hf_transfer \
&& hf download GadflyII/Qwen3-Coder-Next-NVFP4"'
Run a benchmark (the new flow)
# 1. make sure the model is listed in bench/models.yaml
$EDITOR bench/models.yaml
# 2. enqueue it — priority higher = runs earlier
mise run orch:enqueue -- qwen3-coder-next --priority 80
mise run orch:enqueue -- qwen3.5-35b-a3b --priority 70
# 3. see what's queued
mise run orch:status
# 4. start the scheduler in the background
mise run orch:runbg
mise run orch:tail # follow the scheduler log
# 5. while it runs, watch individual models land live
open http://localhost:4321/tblite/live # (the dev site — `pnpm run dev` in site/)
# 6. stop the scheduler when you're done
mise run orch:stop
Re-run just one model that failed: `mise run orch:requeue -- 5`. Cancel
a queued job: `mise run orch:cancel -- 6`. Inspect one job row:
`mise run orch:show -- 3`.
What each piece does under the hood
`./bench/orch-cli` is a thin shell wrapper that runs
`python3 -m bench.orch.cli` from the repo root. `mise run orch:<name>`
just forwards to it with a consistent CWD. The scheduler itself is a single foreground
process: it picks the highest-priority queued job, shells out to the
existing `bench/pipeline_model.sh` (which does launch_vllm → smoke probes →
TBLite → summarize), captures full stdout+stderr to
`bench/orch/logs/job-NNNNN-<model>.log`, then marks the job
`complete` or `failed` in SQLite.
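The atomic claim in `db.py` isn't reproduced here, but the core idea can be sketched: take SQLite's write lock up front with `BEGIN IMMEDIATE`, so two scheduler processes can never claim the same row. The table and function names below are illustrative, not the repo's actual schema:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
  id       INTEGER PRIMARY KEY AUTOINCREMENT,
  model    TEXT    NOT NULL,
  priority INTEGER NOT NULL DEFAULT 50,
  status   TEXT    NOT NULL DEFAULT 'queued'  -- queued|running|complete|failed
);
"""

def connect(path=":memory:"):
    # isolation_level=None -> autocommit mode, so the explicit
    # BEGIN IMMEDIATE below is the only transaction boundary in play
    con = sqlite3.connect(path, isolation_level=None)
    con.executescript(SCHEMA)
    return con

def enqueue(con, model, priority=50):
    cur = con.execute(
        "INSERT INTO jobs (model, priority) VALUES (?, ?)", (model, priority))
    return cur.lastrowid

def claim_next(con):
    """Flip the highest-priority queued job to 'running', atomically.
    BEGIN IMMEDIATE grabs the write lock before the SELECT, so a second
    scheduler blocks instead of claiming the same row."""
    con.execute("BEGIN IMMEDIATE")
    try:
        row = con.execute(
            "SELECT id, model FROM jobs WHERE status = 'queued' "
            "ORDER BY priority DESC, id ASC LIMIT 1").fetchone()
        if row is not None:
            con.execute("UPDATE jobs SET status = 'running' WHERE id = ?",
                        (row[0],))
        con.execute("COMMIT")
        return row
    except Exception:
        con.execute("ROLLBACK")
        raise
```

With two jobs enqueued at priorities 80 and 70, successive `claim_next` calls return them in priority order, then `None` once the queue is drained.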
See the roadmap for why we rolled our own (~490 LOC) instead of Airflow / Prefect / Dagster / Hatchet / Ray, and what M2–M5 add (per-machine workers, bench groups, heartbeat-based crash detection, 2× Spark / Mac MLX support).
Skipping the orchestrator (direct-shell path, still works)
If you want to run one model end-to-end without the queue — same as before:
# warmup + smoke probes + tblite pilot, all in one shell
./bench/pipeline_model.sh qwen3-coder-next 20
# or just one phase
./bench/launch_vllm.sh qwen3-coder-next
./bench/run_model.sh qwen3-coder-next # → bench/results/<key>.json
./bench/run_tblite.sh qwen3-coder-next 20
# rebuild the site
cd site && pnpm install && pnpm run build
Layout
kimi-2.6-local/
├── mise.toml # orch:* tasks
├── bench/
│ ├── models.yaml # candidate catalog
│ ├── probes.yaml # smoke-test prompts + rubrics
│ ├── launch_vllm.sh # spin up one model on the Spark
│ ├── run_model.sh # run smoke probes
│ ├── run_tblite.sh # run the Nous TBLite harness
│ ├── pipeline_model.sh # launch + probes + tblite, end-to-end
│ ├── summarize_tblite.py # parse samples.jsonl → outcomes
│ ├── orch-cli # shell wrapper for the python CLI
│ ├── orch/
│ │ ├── schema.sql # jobs table
│ │ ├── db.py # sqlite CRUD + atomic claim_next
│ │ ├── scheduler.py # foreground loop
│ │ ├── cli.py # enqueue / status / list / cancel / requeue / run
│ │ ├── state.db # (auto-created)
│ │ └── logs/ # per-job stdout+stderr (auto-created)
│ ├── results/*.json # smoke-test results, one per model
│ └── tblite/<key>/ # TBLite metrics + per-task samples
└── site/ # this Astro site
├── src/lib/results.ts # imports smoke-test JSON
├── src/lib/tblite.ts # imports TBLite metrics.json
└── src/lib/grading.ts # hand-graded smoke-test scores
Adding a model
Append it to `bench/models.yaml` with `status: cached`, cache
the weights on the Spark, pick the right `tool_parser` (vLLM ships
`hermes`, `pythonic`, `qwen3_coder`,
`glm45` or `glm47`, `minimax_m2`,
`mistral`, `llama3_json`, `deepseek_v3_1`,
`xlam`, `seed_oss`, `longcat`, …), then
`mise run orch:enqueue -- KEY`. Grade the smoke-probe output and
add it to `site/src/lib/grading.ts`. The site picks everything up on
rebuild via Astro's glob-import of `bench/results/*.json`.
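A minimal catalog entry might look like the following; beyond `status` and `tool_parser` (both named above), the shape is a guess at the schema, so check an existing entry in `bench/models.yaml` first:

```yaml
# illustrative only: confirm the shape against an existing entry
qwen3-coder-next:
  status: cached
  tool_parser: qwen3_coder
```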
Invariants
- Only one `mise run orch:runbg` at a time; the scheduler is single-process in M1, and two would race on the Spark's `bench-vllm` container.
- Downloads are independent of the scheduler; they run in `<model>-dl` Docker containers on the Spark directly, not as orch jobs. Confirm with `ssh dgx-spark 'docker ps --format "{{.Names}} {{.Status}}"'`.
- If the scheduler dies mid-run, the job stays `running` in SQLite. Confirm no live process with `ps -ef | grep pipeline_model.sh`, then `mise run orch:requeue -- <id>` to reset it.
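The manual reset in that last invariant is just a guarded status flip in SQLite, sketched below against an illustrative schema (the real `db.py` API may differ):

```python
import sqlite3

def requeue_stale(con, job_id):
    """What orch:requeue effectively does after a scheduler crash: flip a
    job stuck in 'running' back to 'queued'. The status guard means a job
    that actually finished is left alone."""
    cur = con.execute(
        "UPDATE jobs SET status = 'queued' "
        "WHERE id = ? AND status = 'running'", (job_id,))
    con.commit()
    return cur.rowcount == 1  # True only if a stuck row was reset
```

The guard is why it is safe to requeue by id without first inspecting the row: calling it on a `complete` or `failed` job changes nothing and returns `False`.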