Qwen/Qwen3.6-27B-FP8 · FP8 · 27B/27B · parser qwen3_coder · smoke probes · live
qwen3.6-27b
The Qwen team's flagship dense 27B model (April 2026): a Gated DeltaNet hybrid with Gated Attention every 4 blocks and a native 262K-token context (1M via YaRN). Marketed as approaching Claude Opus 4.5 on agentic coding while running on a single consumer-class node. This run uses the official Qwen FP8 build with block-128 quantization and the same `qwen3_coder` tool-call parser as our other Qwen3.x runs.
11/20 tasks passed · 55% pass rate
11 pass · 8 grader fail · 0 timeout · 1 no tool calls
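The headline pass rate follows directly from the outcome counts above. A minimal Python sketch of that arithmetic, where the dictionary keys are illustrative labels rather than the harness's actual field names:

```python
# Outcome counts as reported on this page; the key names are illustrative only.
outcomes = {"pass": 11, "grader_fail": 8, "timeout": 0, "no_tool_calls": 1}

total = sum(outcomes.values())          # every task lands in exactly one bucket
pass_rate = outcomes["pass"] / total    # fraction of tasks the grader accepted

print(f"{outcomes['pass']}/{total} tasks passed, {pass_rate:.0%} pass rate")
```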
vLLM metrics · 8414 samples over 1328m

| Metric | Value |
|---|---|
| gen tokens/s | 3.7 |
| prompt tokens/s | 67 |
| mean TTFT | 3.24s |
| mean TPOT | — |
| peak concurrency | 8 |
| peak KV cache | — |
| total gen tokens | 292,352 |
| total requests | 368 |

| Task | Category | Status | Turns | Time | Tool calls | What the model ended with |
|---|---|---|---|---|---|---|
| api-endpoint-permission-canonicalizer | software-engineering | ✓ pass | 23 | — | 23 | {"output": "---OUTPUT---\n/api/users GET admin ALLOW\n---END---", "exit_code": 0, "error": null} |
| basic-message-queue | backend | ✓ pass | 27 | — | 26 | The solution is complete and all tests pass. Here's a summary of what was built: ## Solution Summary **File:** `/workdir/solution.py` - A FastAPI-based message queue service running on port 8000. ### API Endpoints \| Method \| Endpoint \| D… |
| battery-charging-optimization | general | ✓ pass | 11 | — | 10 | The solution is complete and verified. Here's a summary: **Solution:** - **364 moves** with **132,606 total units of energy transferred** - Final state: Battery A = 159 units, Battery B = 357 units **Key observations:** - GCD(753, 357) = … |
| book-portfolio-analysis | data_engineering | ✓ pass | 9 | — | 10 | Everything is complete. Here's a summary of what was accomplished: ## PHASE 1 - Database Setup ✅ - Parsed both JSON files (handling intentional syntax errors like missing commas, trailing commas, mixed quote styles) - Created SQLite databa… |
| bracket-sequence-restoration | general | ✓ pass | 39 | — | 38 | The solution is complete and verified. Here's a summary of the algorithm: **Algorithm:** 1. **Count fixed brackets**: Determine `need_open` and `need_close` — how many `?` must become `(` and `)`. 2. **Trivial uniqueness**: If all `?` mu… |
| breast-cancer-mlflow | machine-learning | ✓ pass | 53 | — | 56 | The complete ML classification pipeline is built and verified. Here's a summary of everything that was accomplished: ## Summary ### Step 1: Data Loading & Splitting - Loaded training data (227 samples, 31 columns) and test data (342 sampl… |
| broken-python | software-engineering | ✓ pass | 10 | — | 9 | Everything is working now. Here's a summary: ## Problem The `pip` module files were missing from `/usr/local/lib/python3.13/site-packages/` — only the metadata directory (`pip-25.3.dist-info`) remained. This caused the `ModuleNotFoundError… |
| build-merkle-tree-cli-sha512 | software-engineering | ✓ pass | 21 | — | 20 | All functionality is working correctly. Here's a summary of what was built: ## Merkle CLI Tool (`/app/merkle_cli.py`) The tool implements a complete Merkle tree system with SHA-512 hashing: ### Commands - **`scan <directory>`** - Scans a… |
| build-system-task-ordering | build-and-dependency-management | ✓ pass | 4 | — | 4 | {"output": "Traceback (most recent call last):\n File \"<string>\", line 14, in <module>\n assert result1 == '{\"order\":[\"all\",\"compile\",\"clean\"],\"cycles\":[]}', f'FAIL: {result1}'\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^… |
| chained-forensic-extraction_20260101_011957 | security | ✓ pass | 17 | — | 17 | All 5 stages of the forensic extraction are complete. Here's a summary: **Stage 1 - Polyglot ZIP Detection:** - First (decoy) ZIP at offset **80** - Second (real) ZIP at offset **336** **Stage 2 - PNG Chunk Verification:** - tEXt chunk fo… |
| competitive-programming-solver | CP | ✓ pass | 10 | — | 17 | The solution is complete. The key insight is that the operations `x → x+k` and `x → \|x-k\|` preserve the value `min(x % k, k - x % k)`, which I call the "group" of an element. Two elements can transform into each other if and only if they be… |
| acl-permissions-inheritance | system-administration | ✗ fail | 3 | — | 4 | {"output": "", "exit_code": 0, "error": null} Grader output: grader verdict: FAIL (exit 0) _access - s... FAILED ../tests/test_outputs.py::test_bob_can_create_and_alice_can_access - s... FAILED ../tests/test_outputs.py::test_subdirectory_inheritance - subprocess.C... FAILED ../tests/test_outputs.py::test_outside_users_denied_access - Assertion... FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - subproce... FAILED ../tests/test_outputs.py::test_execute_permissions_on_scripts - subpro... ============================== 9 failed in 0.47s =============================== |
| amuse-install | scientific-computing | ✗ fail | 7 | — | 12 | {"output": "Makefile\n__init__.py\naarsethzare\nadaptb\nasterisk\nathena\nbhtree\nbonsai\nbonsai2\nbrutus\nbse\ncachedse\ncapreole\ndistributed\netics\nevtwin\nevtwin2sse\nfastkick\nfi\nflash\nfractalcluster\ngadget2\ngalactics\ngalaxia\nha… Grader output: grader verdict: FAIL (exit 0) 4c45f38c825f9549d2bc941de9fdc85 Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85 =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_test_script_unchanged FAILED ../tests/test_outputs.py::test_virtual_environment_created - Assertion... FAILED ../tests/test_outputs.py::test_test_script_executes - AssertionError: ... ========================= 2 failed, 1 passed in 0.03s ========================== |
| anomaly-detection-ranking | general | ✗ fail | 6 | — | 5 | {"output": "total 80\ndrwxr-xr-x 2 root root 4096 Mar 4 09:08 .\ndrwxr-xr-x 1 root root 4096 Mar 4 09:08 ..\n-rw-rw-r-- 1 root root 216 Mar 4 09:05 both_test.csv\n-rw-rw-r-- 1 root root 250 Mar 4 09:05 both_train.csv\n-rw-rw-r-- 1… Grader output: grader verdict: FAIL (exit 0) ============ ____________________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Feedback: results.json not found Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.35s =============================== |
| application-debug | debugging | ✗ fail | 8 | — | 16 | Both output files have been generated successfully. Here's a summary of the analysis: ## Analysis Results **Files Analyzed:** 5 log files in `/app/logs/` ### Key Findings \| Metric \| Value \| \|--------\|-------\| \| Total Log Entries \| 602 \|… Grader output: grader verdict: FAIL (exit 0) SED ../tests/test_outputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_percentage_calculations PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ... ========================= 1 failed, 12 passed in 0.14s ========================= |
| auth_token_race_condition | backend-engineering | ✗ fail | 14 | — | 14 | The fix is complete and verified. Here's a summary of the changes made to `/workspace/auth_gateway/token_service.py` in the `refresh_tokens` method: **Problem:** Multiple concurrent requests with the same refresh token could slip through v… Grader output: grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds |
| bandit-delayed-feedback | machine-learning | ✗ fail | 33 | — | 39 | The solution is complete. All three output files are generated and verified: \| File \| Status \| \|------\|--------\| \| `/app/output/delay_model.json` \| ✅ Log-normal model, μ/σ/median/rate errors all ≈ 0% \| \| `/app/output/arm_statistics.csv` \| … Grader output: grader verdict: FAIL (exit 0) D ../tests/test_outputs.py::TestDelayModel::test_sigma_parameter_accuracy PASSED ../tests/test_outputs.py::TestDelayModel::test_median_delay_error_threshold PASSED ../tests/test_outputs.py::TestDelayModel::test_conversion_rate_error_threshold PASSED ../tests/test_outputs.py::TestArmStatistics::test_conversion_rate_rmse_threshold FAILED ../tests/test_outputs.py::TestRewardCalibration::test_calibration_error_threshold ========================= 1 failed, 19 passed in 1.48s ========================= |
| bash-log-processor-fix | bash_scripting | ✗ fail | 21 | — | 23 | All requirements are met. Here's a summary of the fixes made to `/workdir/data/process_logs.sh`: **Bugs fixed:** 1. **No subdirectory recursion** — Changed from `"$INPUT_DIR"/*.log` glob to `find "$INPUT_DIR" -name '*.log' -type f -print0`… |
| bloom-filter-cache-penetration-prevention | software-engineering | ✗ fail | 22 | — | 24 | The implementation is complete. Here's a summary of what was built in `/app/main.py`: ### Bloom Filter Cache Penetration Prevention System **Core Components:** 1. **Bloom Filter** (`BloomFilter` class) - Custom implementation with: - … Grader output: grader verdict: FAIL (exit 0) AssertionError: Fa... FAILED ../tests/test_outputs.py::test_large_dataset_bloom_filter_performance FAILED ../tests/test_outputs.py::test_initial_data_loading - AssertionError: ... FAILED ../tests/test_outputs.py::test_all_source_types - assert False is True FAILED ../tests/test_outputs.py::test_bloom_filter_memory_efficiency - Assert... FAILED ../tests/test_outputs.py::test_redis_client_reuse - AssertionError: Fi... ================== 8 failed, 19 passed, 232 warnings in 7.48s ================== |
| california-housing-api | machine-learning | ✗ fail | 24 | — | 24 | {"output": "", "exit_code": -1, "error": "This foreground command appears to start a long-lived server/watch process. Run it with background=true, verify readiness (health endpoint/log signal), then execute tests in a separate command.", "s… Grader output: grader verdict: FAIL (exit 0) t - Failed: Fa... FAILED ../tests/test_outputs.py::test_api_predict_batch_matches_single_predict FAILED ../tests/test_outputs.py::test_api_predict_batch_order_preserved - Fai... FAILED ../tests/test_outputs.py::test_api_prediction_matches_model - Failed: ... FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou... FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model ================== 14 failed, 16 passed, 2 warnings in 6.62s =================== |
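
The harness error quoted in the california-housing-api row asks for three steps: start the server in the background, verify readiness, then run tests as a separate command. A minimal Python sketch of that pattern is below. The helper names `wait_until_ready` and `run_tests_against_server`, the timeouts, and any commands or health URL passed to them are hypothetical and not taken from the run; the harness's own `background=true` mechanism may work differently.

```python
import subprocess
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll `url` until it answers with HTTP 2xx, or give up after `timeout_s` seconds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2.0) as resp:
                if 200 <= resp.status < 300:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not accepting connections yet, or a non-2xx reply; retry
        time.sleep(interval_s)
    return False


def run_tests_against_server(server_cmd: list[str], health_url: str, test_cmd: list[str]) -> int:
    """Start the server without blocking, wait for readiness, then run the tests."""
    server = subprocess.Popen(server_cmd)  # returns immediately; server keeps running
    try:
        if not wait_until_ready(health_url):
            raise RuntimeError("server did not become ready in time")
        return subprocess.run(test_cmd).returncode
    finally:
        server.terminate()  # always stop the background server
        server.wait()
```

Polling a health endpoint rather than sleeping for a fixed interval keeps the test step from racing a slow startup, which is the failure mode the harness message warns about.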