Qwen/Qwen3.6-27B-FP8 · FP8 · 27B/27B · parser qwen3_coder · smoke probes · live

qwen3.6-27b

Qwen team's flagship dense 27B model (April 2026): a Gated DeltaNet hybrid with Gated Attention every 4 blocks and a native 262K context (1M via YaRN). Marketed as approaching Claude Opus 4.5 on agentic coding while running on a single consumer-class node. This run uses the official Qwen FP8 build with block-128 quantization and the same `qwen3_coder` tool-call parser as our other Qwen3.x runs.

11/20 tasks passed · 55% pass rate

11 pass · 8 grader fail · 0 timeout · 1 no tool calls
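The headline pass rate follows directly from the outcome buckets. A minimal sketch of that arithmetic, where the dictionary keys are hypothetical names chosen for illustration and not the harness's actual field names:

```python
# Outcome counts copied from the summary line of this run.
# Key names are illustrative only; the real harness schema is not shown on this page.
outcomes = {"pass": 11, "grader_fail": 8, "timeout": 0, "no_tool_calls": 1}

total_tasks = sum(outcomes.values())
pass_rate = outcomes["pass"] / total_tasks

print(f"{outcomes['pass']}/{total_tasks} tasks passed · {pass_rate:.0%} pass rate")
```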
vLLM metrics · 8414 samples over 1328m
gen tokens/s: 3.7
prompt tokens/s: 67
mean TTFT: 3.24s
mean TPOT: (no value captured)
peak concurrency: 8
peak KV cache: (no value captured)
total gen tokens: 292,352
total requests: 368
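As a sanity check, the reported gen tokens/s is consistent with the totals, assuming "1328m" means 1328 minutes and that the rate is averaged over the whole sampling window (both are assumptions about how the dashboard computes it). The per-request figure is derived here and is not a number the dashboard reports:

```python
# Totals copied from the vLLM metrics of this run.
total_gen_tokens = 292_352
total_requests = 368
window_minutes = 1328  # assumption: "1328m" is minutes of wall-clock time

window_seconds = window_minutes * 60

# Average generation throughput over the whole window, idle time included.
gen_tokens_per_s = total_gen_tokens / window_seconds

# Derived figure (not reported by the dashboard): mean generated tokens per request.
mean_tokens_per_request = total_gen_tokens / total_requests

print(f"gen tokens/s ≈ {gen_tokens_per_s:.1f}")
print(f"mean gen tokens per request ≈ {mean_tokens_per_request:.0f}")
```

Because the window includes time when no request was in flight, this average can sit well below the per-request decode speed.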
Task · Category · Status · Turns · Time (not captured) · Tool calls · What the model ended with
api-endpoint-permission-canonicalizer · software-engineering · ✓ pass · 23 turns · 23 tool calls · {"output": "---OUTPUT---\n/api/users GET admin ALLOW\n---END---", "exit_code": 0, "error": null}
basic-message-queue · backend · ✓ pass · 27 turns · 26 tool calls · The solution is complete and all tests pass. Here's a summary of what was built: ## Solution Summary **File:** `/workdir/solution.py` - A FastAPI-based message queue service running on port 8000. ### API Endpoints | Method | Endpoint | D…
battery-charging-optimization · general · ✓ pass · 11 turns · 10 tool calls · The solution is complete and verified. Here's a summary: **Solution:** - **364 moves** with **132,606 total units of energy transferred** - Final state: Battery A = 159 units, Battery B = 357 units **Key observations:** - GCD(753, 357) = …
book-portfolio-analysis · data_engineering · ✓ pass · 9 turns · 10 tool calls · Everything is complete. Here's a summary of what was accomplished: ## PHASE 1 - Database Setup ✅ - Parsed both JSON files (handling intentional syntax errors like missing commas, trailing commas, mixed quote styles) - Created SQLite databa…
bracket-sequence-restoration · general · ✓ pass · 39 turns · 38 tool calls · The solution is complete and verified. Here's a summary of the algorithm: **Algorithm:** 1. **Count fixed brackets**: Determine `need_open` and `need_close` — how many `?` must become `(` and `)`. 2. **Trivial uniqueness**: If all `?` mu…
breast-cancer-mlflow · machine-learning · ✓ pass · 53 turns · 56 tool calls · The complete ML classification pipeline is built and verified. Here's a summary of everything that was accomplished: ## Summary ### Step 1: Data Loading & Splitting - Loaded training data (227 samples, 31 columns) and test data (342 sampl…
broken-python · software-engineering · ✓ pass · 10 turns · 9 tool calls · Everything is working now. Here's a summary: ## Problem The `pip` module files were missing from `/usr/local/lib/python3.13/site-packages/` — only the metadata directory (`pip-25.3.dist-info`) remained. This caused the `ModuleNotFoundError…
build-merkle-tree-cli-sha512 · software-engineering · ✓ pass · 21 turns · 20 tool calls · All functionality is working correctly. Here's a summary of what was built: ## Merkle CLI Tool (`/app/merkle_cli.py`) The tool implements a complete Merkle tree system with SHA-512 hashing: ### Commands - **`scan <directory>`** - Scans a…
build-system-task-ordering · build-and-dependency-management · ✓ pass · 4 turns · 4 tool calls · {"output": "Traceback (most recent call last):\n File \"<string>\", line 14, in <module>\n assert result1 == '{\"order\":[\"all\",\"compile\",\"clean\"],\"cycles\":[]}', f'FAIL: {result1}'\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^…
chained-forensic-extraction_20260101_011957 · security · ✓ pass · 17 turns · 17 tool calls · All 5 stages of the forensic extraction are complete. Here's a summary: **Stage 1 - Polyglot ZIP Detection:** - First (decoy) ZIP at offset **80** - Second (real) ZIP at offset **336** **Stage 2 - PNG Chunk Verification:** - tEXt chunk fo…
competitive-programming-solver · CP · ✓ pass · 10 turns · 17 tool calls · The solution is complete. The key insight is that the operations `x → x+k` and `x → |x-k|` preserve the value `min(x % k, k - x % k)`, which I call the "group" of an element. Two elements can transform into each other if and only if they be…
acl-permissions-inheritance · system-administration · ✗ fail · 3 turns · 4 tool calls · {"output": "", "exit_code": 0, "error": null}
grader output
grader verdict: FAIL (exit 0)
_access - s...
FAILED ../tests/test_outputs.py::test_bob_can_create_and_alice_can_access - s...
FAILED ../tests/test_outputs.py::test_subdirectory_inheritance - subprocess.C...
FAILED ../tests/test_outputs.py::test_outside_users_denied_access - Assertion...
FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - subproce...
FAILED ../tests/test_outputs.py::test_execute_permissions_on_scripts - subpro...
============================== 9 failed in 0.47s ===============================
amuse-install · scientific-computing · ✗ fail · 7 turns · 12 tool calls · {"output": "Makefile\n__init__.py\naarsethzare\nadaptb\nasterisk\nathena\nbhtree\nbonsai\nbonsai2\nbrutus\nbse\ncachedse\ncapreole\ndistributed\netics\nevtwin\nevtwin2sse\nfastkick\nfi\nflash\nfractalcluster\ngadget2\ngalactics\ngalaxia\nha…
grader output
grader verdict: FAIL (exit 0)
4c45f38c825f9549d2bc941de9fdc85
Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_test_script_unchanged
FAILED ../tests/test_outputs.py::test_virtual_environment_created - Assertion...
FAILED ../tests/test_outputs.py::test_test_script_executes - AssertionError: ...
========================= 2 failed, 1 passed in 0.03s ==========================
anomaly-detection-ranking · general · ✗ fail · 6 turns · 5 tool calls · {"output": "total 80\ndrwxr-xr-x 2 root root 4096 Mar 4 09:08 .\ndrwxr-xr-x 1 root root 4096 Mar 4 09:08 ..\n-rw-rw-r-- 1 root root 216 Mar 4 09:05 both_test.csv\n-rw-rw-r-- 1 root root 250 Mar 4 09:05 both_train.csv\n-rw-rw-r-- 1…
grader output
grader verdict: FAIL (exit 0)
============
____________________________ test_grader_validation ____________________________
----------------------------- Captured stdout call -----------------------------
Feedback: results.json not found
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.35s ===============================
application-debug · debugging · ✗ fail · 8 turns · 16 tool calls · Both output files have been generated successfully. Here's a summary of the analysis: ## Analysis Results **Files Analyzed:** 5 log files in `/app/logs/` ### Key Findings | Metric | Value | |--------|-------| | Total Log Entries | 602 |…
grader output
grader verdict: FAIL (exit 0)
SED ../tests/test_outputs.py::test_critical_failures_calculation
PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions
PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers
PASSED ../tests/test_outputs.py::test_percentage_calculations
PASSED ../tests/test_outputs.py::test_anti_cheat_validation
FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ...
========================= 1 failed, 12 passed in 0.14s =========================
auth_token_race_condition · backend-engineering · ✗ fail · 14 turns · 14 tool calls · The fix is complete and verified. Here's a summary of the changes made to `/workspace/auth_gateway/token_service.py` in the `refresh_tokens` method: **Problem:** Multiple concurrent requests with the same refresh token could slip through v…
grader output
grader verdict: FAIL (exit 1)
Starting test suite for concurrent token refresh...
Killing any existing uvicorn processes...
Waiting for port 8000 to be available...
Waiting for Redis to be ready...
/tests/test.sh: line 27: /workspace/start.sh: No such file or directory
Redis failed to start within 60 seconds
bandit-delayed-feedback · machine-learning · ✗ fail · 33 turns · 39 tool calls · The solution is complete. All three output files are generated and verified: | File | Status | |------|--------| | `/app/output/delay_model.json` | ✅ Log-normal model, μ/σ/median/rate errors all ≈ 0% | | `/app/output/arm_statistics.csv` | …
grader output
grader verdict: FAIL (exit 0)
D ../tests/test_outputs.py::TestDelayModel::test_sigma_parameter_accuracy
PASSED ../tests/test_outputs.py::TestDelayModel::test_median_delay_error_threshold
PASSED ../tests/test_outputs.py::TestDelayModel::test_conversion_rate_error_threshold
PASSED ../tests/test_outputs.py::TestArmStatistics::test_conversion_rate_rmse_threshold
FAILED ../tests/test_outputs.py::TestRewardCalibration::test_calibration_error_threshold
========================= 1 failed, 19 passed in 1.48s =========================
bash-log-processor-fix · bash_scripting · ✗ fail · 21 turns · 23 tool calls · All requirements are met. Here's a summary of the fixes made to `/workdir/data/process_logs.sh`: **Bugs fixed:** 1. **No subdirectory recursion** — Changed from `"$INPUT_DIR"/*.log` glob to `find "$INPUT_DIR" -name '*.log' -type f -print0`…
bloom-filter-cache-penetration-prevention · software-engineering · ✗ fail · 22 turns · 24 tool calls · The implementation is complete. Here's a summary of what was built in `/app/main.py`: ### Bloom Filter Cache Penetration Prevention System **Core Components:** 1. **Bloom Filter** (`BloomFilter` class) - Custom implementation with: - …
grader output
grader verdict: FAIL (exit 0)
AssertionError: Fa...
FAILED ../tests/test_outputs.py::test_large_dataset_bloom_filter_performance
FAILED ../tests/test_outputs.py::test_initial_data_loading - AssertionError: ...
FAILED ../tests/test_outputs.py::test_all_source_types - assert False is True
FAILED ../tests/test_outputs.py::test_bloom_filter_memory_efficiency - Assert...
FAILED ../tests/test_outputs.py::test_redis_client_reuse - AssertionError: Fi...
================== 8 failed, 19 passed, 232 warnings in 7.48s ==================
california-housing-api · machine-learning · ✗ fail · 24 turns · 24 tool calls · {"output": "", "exit_code": -1, "error": "This foreground command appears to start a long-lived server/watch process. Run it with background=true, verify readiness (health endpoint/log signal), then execute tests in a separate command.", "s…
grader output
grader verdict: FAIL (exit 0)
t - Failed: Fa...
FAILED ../tests/test_outputs.py::test_api_predict_batch_matches_single_predict
FAILED ../tests/test_outputs.py::test_api_predict_batch_order_preserved - Fai...
FAILED ../tests/test_outputs.py::test_api_prediction_matches_model - Failed: ...
FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou...
FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model
================== 14 failed, 16 passed, 2 warnings in 6.62s ===================
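The pass/fail status hides how close some failures were. A rough sketch of how the pytest summary lines in the grader output could be turned into per-task test pass fractions; the helper name `passed_fraction` and the `summaries` mapping are hypothetical and only cover the tasks whose final summary line is visible on this page:

```python
import re

# Final pytest summary lines copied from the grader output on this page.
summaries = {
    "acl-permissions-inheritance": "9 failed in 0.47s",
    "amuse-install": "2 failed, 1 passed in 0.03s",
    "application-debug": "1 failed, 12 passed in 0.14s",
    "bandit-delayed-feedback": "1 failed, 19 passed in 1.48s",
    "bloom-filter-cache-penetration-prevention": "8 failed, 19 passed, 232 warnings in 7.48s",
    "california-housing-api": "14 failed, 16 passed, 2 warnings in 6.62s",
}

# Matches "<count> failed" and "<count> passed"; warnings and timing are ignored.
COUNT = re.compile(r"(\d+) (failed|passed)")


def passed_fraction(summary: str) -> float:
    """Return passed / (passed + failed) for one pytest summary line."""
    counts = {kind: int(n) for n, kind in COUNT.findall(summary)}
    passed = counts.get("passed", 0)
    failed = counts.get("failed", 0)
    total = passed + failed
    return passed / total if total else 0.0


# Rank failed tasks from nearest miss to furthest.
ranked = sorted(summaries, key=lambda task: passed_fraction(summaries[task]), reverse=True)
for task in ranked:
    print(f"{task}: {passed_fraction(summaries[task]):.0%} of tests passed")
```

This ranking is only a reading aid: the grader still scores each task as a whole, so a task with one failing test counts as a fail.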