Reproduce this

Everything needed to re-run the shootout on your own DGX Spark (or any NVFP4-capable Blackwell box). The workflow now runs through a small SQLite-backed orchestrator (`bench/orch`), exposed as `mise` tasks.

Prereqs

First-time setup

git clone … kimi-2.6-local && cd kimi-2.6-local
mise trust .              # trust the mise.toml tasks file
mise tasks                # confirm orch:* tasks are visible

# one-time install of atroposlib (ships TBLite) into the hermes-agent venv
cd ~/.hermes/hermes-agent && uv pip install --python venv/bin/python3 \
  'atroposlib @ git+https://github.com/NousResearch/atropos.git'

# cache at least one model on the Spark (example: Qwen3-Coder-Next NVFP4)
ssh dgx-spark 'docker run -d --rm --name dl \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_HUB_ENABLE_HF_TRANSFER=1 python:3.12-slim \
  sh -c "pip install -q huggingface_hub[hf_transfer] hf_transfer \
         && hf download GadflyII/Qwen3-Coder-Next-NVFP4"'

Run a benchmark (the new flow)

# 1. make sure the model is listed in bench/models.yaml
$EDITOR bench/models.yaml

# 2. enqueue it (higher priority runs earlier)
mise run orch:enqueue -- qwen3-coder-next --priority 80
mise run orch:enqueue -- qwen3.5-35b-a3b  --priority 70

# 3. see what's queued
mise run orch:status

# 4. start the scheduler in the background
mise run orch:runbg
mise run orch:tail                  # follow the scheduler log

# 5. while it runs, watch individual models land live
open http://localhost:4321/tblite/live    # (the dev site — `pnpm run dev` in site/)

# 6. stop the scheduler when you're done
mise run orch:stop

Re-run one failed model by job id: `mise run orch:requeue -- 5`. Cancel a queued job: `mise run orch:cancel -- 6`. Inspect a single job row: `mise run orch:show -- 3`. (Job ids come from `mise run orch:status`.)

What each piece does under the hood

`./bench/orch-cli` is a thin shell wrapper that runs `python3 -m bench.orch.cli` from the repo root; `mise run orch:<name>` just forwards to it with a consistent CWD. The scheduler itself is a single foreground process: it picks the highest-priority queued job, shells out to the existing `bench/pipeline_model.sh` (which does launch_vllm → smoke probes → TBLite → summarize), captures full stdout+stderr to `bench/orch/logs/job-NNNNN-<model>.log`, then marks the job complete or failed in SQLite.
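The "picks the highest-priority queued job" step hinges on `db.py`'s atomic `claim_next`. A minimal sketch of how that can be done in one SQLite transaction is below; the table and column names are assumptions inferred from the description above, not the repo's actual schema, and `UPDATE … RETURNING` needs SQLite ≥ 3.35:

```python
import sqlite3

# Assumed shape of the jobs table in bench/orch/state.db (illustrative only).
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    model    TEXT NOT NULL,
    priority INTEGER NOT NULL DEFAULT 50,
    status   TEXT NOT NULL DEFAULT 'queued'  -- queued | running | done | failed
);
"""

def claim_next(conn: sqlite3.Connection):
    """Atomically flip the best queued job to 'running' and return its row."""
    with conn:  # single transaction: no two schedulers can claim the same job
        return conn.execute(
            "UPDATE jobs SET status = 'running' WHERE id = ("
            "  SELECT id FROM jobs WHERE status = 'queued'"
            "  ORDER BY priority DESC, id ASC LIMIT 1"
            ") RETURNING id, model, priority"
        ).fetchone()  # None when the queue is empty
```

Because claim and status flip happen in one statement, a second scheduler process (the M2 per-machine workers) would get the next job rather than a duplicate.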

See the roadmap for why we rolled our own (~490 LOC) instead of Airflow / Prefect / Dagster / Hatchet / Ray, and what M2–M5 add (per-machine workers, bench groups, heartbeat-based crash detection, 2× Spark / Mac MLX support).

Skipping the orchestrator (direct-shell path, still works)

If you want to run one model end-to-end without the queue — same as before:

# warmup + smoke probes + tblite pilot, all in one shell
./bench/pipeline_model.sh qwen3-coder-next 20

# or just one phase
./bench/launch_vllm.sh qwen3-coder-next
./bench/run_model.sh   qwen3-coder-next   # → bench/results/<key>.json
./bench/run_tblite.sh  qwen3-coder-next 20

# rebuild the site
cd site && pnpm install && pnpm run build

Layout

kimi-2.6-local/
├── mise.toml                   # orch:* tasks
├── bench/
│   ├── models.yaml             # candidate catalog
│   ├── probes.yaml             # smoke-test prompts + rubrics
│   ├── launch_vllm.sh          # spin up one model on the Spark
│   ├── run_model.sh            # run smoke probes
│   ├── run_tblite.sh           # run the Nous TBLite harness
│   ├── pipeline_model.sh       # launch + probes + tblite, end-to-end
│   ├── summarize_tblite.py     # parse samples.jsonl → outcomes
│   ├── orch-cli                # shell wrapper for the python CLI
│   ├── orch/
│   │   ├── schema.sql          # jobs table
│   │   ├── db.py               # sqlite CRUD + atomic claim_next
│   │   ├── scheduler.py        # foreground loop
│   │   ├── cli.py              # enqueue / status / list / cancel / requeue / run
│   │   ├── state.db            # (auto-created)
│   │   └── logs/               # per-job stdout+stderr (auto-created)
│   ├── results/*.json          # smoke-test results, one per model
│   └── tblite/<key>/           # TBLite metrics + per-task samples
└── site/                       # this Astro site
    ├── src/lib/results.ts      # imports smoke-test JSON
    ├── src/lib/tblite.ts       # imports TBLite metrics.json
    └── src/lib/grading.ts      # hand-graded smoke-test scores

Adding a model

Append it to `bench/models.yaml` with `status: cached`, cache the weights on the Spark, and pick the right `tool_parser` (vLLM ships `hermes`, `pythonic`, `qwen3_coder`, `glm45` or `glm47`, `minimax_m2`, `mistral`, `llama3_json`, `deepseek_v3_1`, `xlam`, `seed_oss`, `longcat`, …), then `mise run orch:enqueue -- KEY`. Grade the smoke-probe output and add it to `site/src/lib/grading.ts`. The site picks everything up on rebuild via Astro's glob import of `bench/results/*.json`.
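A catalog entry might look roughly like this — the field names here are an illustration inferred from the flags mentioned above, not the repo's actual schema, so check an existing entry in `bench/models.yaml` first:

```yaml
qwen3-coder-next:
  repo: GadflyII/Qwen3-Coder-Next-NVFP4   # HF repo cached on the Spark
  status: cached
  tool_parser: qwen3_coder                # must match vLLM's parser name
```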

Invariants