Qwen/Qwen3.6-27B-FP8 · FP8 · 27B/27B · parser qwen3_coder · smoke probes · live

qwen3.6-27b

Qwen team's flagship dense 27B model (April 2026): a Gated DeltaNet hybrid with Gated Attention every 4 blocks and a native 262K context (1M via YaRN). Marketed as approaching Claude Opus 4.5 on agentic coding while running on a single consumer-class node. This run uses the official Qwen FP8 build with block-128 quantization and the same `qwen3_coder` tool-call parser as our other Qwen3.x runs.

11/20 tasks passed · 55% pass rate

11 pass · 8 grader fail · 0 timeout · 1 no tool calls
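
The headline pass rate follows directly from the status tally. A minimal sketch, where the dictionary name `tally` is only illustrative and the counts are copied from the summary line above:

```python
# Status counts as reported in the summary line above.
tally = {"pass": 11, "grader fail": 8, "timeout": 0, "no tool calls": 1}

total = sum(tally.values())        # each task lands in exactly one bucket
pass_rate = tally["pass"] / total  # fraction of tasks that passed

print(f"{tally['pass']}/{total} tasks passed, {pass_rate:.0%} pass rate")
```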

vLLM metrics · 8414 samples over 1328m

Metric	Value
gen tokens/s	3.7
prompt tokens/s	
mean TTFT	3.24s
mean TPOT	—
peak concurrency	
peak KV cache	—
total gen tokens	292,352
total requests	368
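
As a rough consistency check on these figures, the reported generation rate can be compared with total generated tokens divided by the sampling window. This sketch assumes that `gen tokens/s` is a plain average over the full 1328-minute window, which the page does not state. The variable names are illustrative.

```python
# Figures copied from the metrics panel above.
total_gen_tokens = 292_352
window_minutes = 1328
total_requests = 368

window_seconds = window_minutes * 60

# Assumption: gen tokens/s is a simple average over the whole window.
avg_gen_tps = total_gen_tokens / window_seconds
tokens_per_request = total_gen_tokens / total_requests

print(f"average generation rate: {avg_gen_tps:.1f} tokens/s")
print(f"mean generated tokens per request: {tokens_per_request:.0f}")
```

Under that assumption the average comes out at about 3.7 tokens/s, in line with the reported value, with roughly 794 generated tokens per request.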

Task	Category	Status	Turns	Time	Tool calls	What the model ended with
api-endpoint-permission-canonicalizer	software-engineering	✓ pass	23	—	23	{"output": "---OUTPUT---\n/api/users GET admin ALLOW\n---END---", "exit_code": 0, "error": null}
basic-message-queue	backend	✓ pass	27	—	26	The solution is complete and all tests pass. Here's a summary of what was built: ## Solution Summary File: `/workdir/solution.py` - A FastAPI-based message queue service running on port 8000. ### API Endpoints \| Method \| Endpoint \| D…
battery-charging-optimization	general	✓ pass	11	—	10	The solution is complete and verified. Here's a summary: Solution: - 364 moves with 132,606 total units of energy transferred - Final state: Battery A = 159 units, Battery B = 357 units Key observations: - GCD(753, 357) = …
book-portfolio-analysis	data_engineering	✓ pass	9	—	10	Everything is complete. Here's a summary of what was accomplished: ## PHASE 1 - Database Setup ✅ - Parsed both JSON files (handling intentional syntax errors like missing commas, trailing commas, mixed quote styles) - Created SQLite databa…
bracket-sequence-restoration	general	✓ pass	39	—	38	The solution is complete and verified. Here's a summary of the algorithm: Algorithm: 1. Count fixed brackets: Determine `need_open` and `need_close` — how many `?` must become `(` and `)`. 2. Trivial uniqueness: If all `?` mu…
breast-cancer-mlflow	machine-learning	✓ pass	53	—	56	The complete ML classification pipeline is built and verified. Here's a summary of everything that was accomplished: ## Summary ### Step 1: Data Loading & Splitting - Loaded training data (227 samples, 31 columns) and test data (342 sampl…
broken-python	software-engineering	✓ pass	10	—	9	Everything is working now. Here's a summary: ## Problem The `pip` module files were missing from `/usr/local/lib/python3.13/site-packages/` — only the metadata directory (`pip-25.3.dist-info`) remained. This caused the `ModuleNotFoundError…
build-merkle-tree-cli-sha512	software-engineering	✓ pass	21	—	20	All functionality is working correctly. Here's a summary of what was built: ## Merkle CLI Tool (`/app/merkle_cli.py`) The tool implements a complete Merkle tree system with SHA-512 hashing: ### Commands - `scan <directory>` - Scans a…
build-system-task-ordering	build-and-dependency-management	✓ pass	4	—	4	{"output": "Traceback (most recent call last):\n File \"<string>\", line 14, in <module>\n assert result1 == '{\"order\":[\"all\",\"compile\",\"clean\"],\"cycles\":[]}', f'FAIL: {result1}'\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^…
chained-forensic-extraction_20260101_011957	security	✓ pass	17	—	17	All 5 stages of the forensic extraction are complete. Here's a summary: Stage 1 - Polyglot ZIP Detection: - First (decoy) ZIP at offset 80 - Second (real) ZIP at offset 336 Stage 2 - PNG Chunk Verification: - tEXt chunk fo…
competitive-programming-solver	CP	✓ pass	10	—	17	The solution is complete. The key insight is that the operations `x → x+k` and `x → \|x-k\|` preserve the value `min(x % k, k - x % k)`, which I call the "group" of an element. Two elements can transform into each other if and only if they be…
acl-permissions-inheritance	system-administration	✗ fail	3	—	4	{"output": "", "exit_code": 0, "error": null} grader output grader verdict: FAIL (exit 0) _access - s... FAILED ../tests/test_outputs.py::test_bob_can_create_and_alice_can_access - s... FAILED ../tests/test_outputs.py::test_subdirectory_inheritance - subprocess.C... FAILED ../tests/test_outputs.py::test_outside_users_denied_access - Assertion... FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - subproce... FAILED ../tests/test_outputs.py::test_execute_permissions_on_scripts - subpro... ============================== 9 failed in 0.47s ===============================
amuse-install	scientific-computing	✗ fail	7	—	12	{"output": "Makefile\n__init__.py\naarsethzare\nadaptb\nasterisk\nathena\nbhtree\nbonsai\nbonsai2\nbrutus\nbse\ncachedse\ncapreole\ndistributed\netics\nevtwin\nevtwin2sse\nfastkick\nfi\nflash\nfractalcluster\ngadget2\ngalactics\ngalaxia\nha… grader output grader verdict: FAIL (exit 0) 4c45f38c825f9549d2bc941de9fdc85 Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85 =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_test_script_unchanged FAILED ../tests/test_outputs.py::test_virtual_environment_created - Assertion... FAILED ../tests/test_outputs.py::test_test_script_executes - AssertionError: ... ========================= 2 failed, 1 passed in 0.03s ==========================
anomaly-detection-ranking	general	✗ fail	6	—	5	{"output": "total 80\ndrwxr-xr-x 2 root root 4096 Mar 4 09:08 .\ndrwxr-xr-x 1 root root 4096 Mar 4 09:08 ..\n-rw-rw-r-- 1 root root 216 Mar 4 09:05 both_test.csv\n-rw-rw-r-- 1 root root 250 Mar 4 09:05 both_train.csv\n-rw-rw-r-- 1… grader output grader verdict: FAIL (exit 0) ============ ____________________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Feedback: results.json not found Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.35s ===============================
application-debug	debugging	✗ fail	8	—	16	Both output files have been generated successfully. Here's a summary of the analysis: ## Analysis Results Files Analyzed: 5 log files in `/app/logs/` ### Key Findings \| Metric \| Value \| \|--------\|-------\| \| Total Log Entries \| 602 \|… grader output grader verdict: FAIL (exit 0) SED ../tests/test_outputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_percentage_calculations PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ... ========================= 1 failed, 12 passed in 0.14s =========================
auth_token_race_condition	backend-engineering	✗ fail	14	—	14	The fix is complete and verified. Here's a summary of the changes made to `/workspace/auth_gateway/token_service.py` in the `refresh_tokens` method: Problem: Multiple concurrent requests with the same refresh token could slip through v… grader output grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds
bandit-delayed-feedback	machine-learning	✗ fail	33	—	39	The solution is complete. All three output files are generated and verified: \| File \| Status \| \|------\|--------\| \| `/app/output/delay_model.json` \| ✅ Log-normal model, μ/σ/median/rate errors all ≈ 0% \| \| `/app/output/arm_statistics.csv` \| … grader output grader verdict: FAIL (exit 0) D ../tests/test_outputs.py::TestDelayModel::test_sigma_parameter_accuracy PASSED ../tests/test_outputs.py::TestDelayModel::test_median_delay_error_threshold PASSED ../tests/test_outputs.py::TestDelayModel::test_conversion_rate_error_threshold PASSED ../tests/test_outputs.py::TestArmStatistics::test_conversion_rate_rmse_threshold FAILED ../tests/test_outputs.py::TestRewardCalibration::test_calibration_error_threshold ========================= 1 failed, 19 passed in 1.48s =========================
bash-log-processor-fix	bash_scripting	✗ fail	21	—	23	All requirements are met. Here's a summary of the fixes made to `/workdir/data/process_logs.sh`: Bugs fixed: 1. No subdirectory recursion — Changed from `"$INPUT_DIR"/*.log` glob to `find "$INPUT_DIR" -name '*.log' -type f -print0`…
bloom-filter-cache-penetration-prevention	software-engineering	✗ fail	22	—	24	The implementation is complete. Here's a summary of what was built in `/app/main.py`: ### Bloom Filter Cache Penetration Prevention System Core Components: 1. Bloom Filter (`BloomFilter` class) - Custom implementation with: - … grader output grader verdict: FAIL (exit 0) AssertionError: Fa... FAILED ../tests/test_outputs.py::test_large_dataset_bloom_filter_performance FAILED ../tests/test_outputs.py::test_initial_data_loading - AssertionError: ... FAILED ../tests/test_outputs.py::test_all_source_types - assert False is True FAILED ../tests/test_outputs.py::test_bloom_filter_memory_efficiency - Assert... FAILED ../tests/test_outputs.py::test_redis_client_reuse - AssertionError: Fi... ================== 8 failed, 19 passed, 232 warnings in 7.48s ==================
california-housing-api	machine-learning	✗ fail	24	—	24	{"output": "", "exit_code": -1, "error": "This foreground command appears to start a long-lived server/watch process. Run it with background=true, verify readiness (health endpoint/log signal), then execute tests in a separate command.", "s… grader output grader verdict: FAIL (exit 0) t - Failed: Fa... FAILED ../tests/test_outputs.py::test_api_predict_batch_matches_single_predict FAILED ../tests/test_outputs.py::test_api_predict_batch_order_preserved - Fai... FAILED ../tests/test_outputs.py::test_api_prediction_matches_model - Failed: ... FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou... FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model ================== 14 failed, 16 passed, 2 warnings in 6.62s ===================
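
The task table is tab-separated, so it can be tallied with the standard `csv` module. The sketch below uses an abbreviated sample: three rows copied from the table with the free-text last column dropped. Quoting is disabled because the full last column contains raw JSON with double quotes. The names `sample`, `status_counts`, and `tool_calls` are illustrative.

```python
import csv
import io
from collections import Counter

# Abbreviated sample: three rows from the table, last column omitted.
sample = (
    "Task\tCategory\tStatus\tTurns\tTime\tTool calls\n"
    "broken-python\tsoftware-engineering\t✓ pass\t10\t—\t9\n"
    "amuse-install\tscientific-computing\t✗ fail\t7\t—\t12\n"
    "application-debug\tdebugging\t✗ fail\t8\t—\t16\n"
)

# QUOTE_NONE treats double quotes as ordinary characters.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

status_counts = Counter(row["Status"] for row in rows)
tool_calls = sum(int(row["Tool calls"]) for row in rows)

print(dict(status_counts))
print("tool calls in sample:", tool_calls)
```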