GadflyII/Qwen3-Coder-Next-NVFP4 · NVFP4 · 3B/80B · parser qwen3_coder · smoke probes · live
qwen3-coder-next
Qwen3-Coder-Next: 80B total / 3B active coding-agent MoE. #1 community-cited open coding agent; "DGX Spark practitioner favorite" per the NVIDIA forum. Distinct from qwen3-next-80b (instruct): this one is coder-tuned and uses the `qwen3_coder` tool-call format.
4/20 tasks passed · 20% pass rate
4 pass · 6 grader fail · 9 timeout · 1 no tool calls
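
The headline pass rate can be re-derived from the status breakdown. A minimal sketch, with the counts copied from the line above; the variable names are illustrative and not part of the benchmark harness:

```python
# Status counts copied from the breakdown above; names are illustrative only.
status_counts = {"pass": 4, "grader fail": 6, "timeout": 9, "no tool calls": 1}

# Every task lands in exactly one bucket, so the buckets sum to the task count.
total_tasks = sum(status_counts.values())
pass_rate = status_counts["pass"] / total_tasks

print(f"{status_counts['pass']}/{total_tasks} tasks passed · {pass_rate:.0%} pass rate")
```

The four buckets sum to the 20 tasks listed in the table, so the 20% figure counts timeouts and runs without tool calls as failures.
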
vLLM metrics · 4808 samples over 767m
gen tokens/s 2.8 · prompt tokens/s 143 · mean TTFT 8.61s · mean TPOT — · peak concurrency 8 · peak KV cache — · total gen tokens 127,941 · total requests 345

| Task | Category | Status | Turns | Time | Tool calls | What the model ended with |
|---|---|---|---|---|---|---|
| anomaly-detection-ranking | general | ✓ pass | 18 | — | 26 | The `results.json` file is complete and valid. Here's a summary of what was done: ## Summary I created a `results.json` file with predictions for all five sub-tasks: ### 1. Sensor Anomaly Detection - Used z-score based threshold on senso… |
| basic-message-queue | backend | ✓ pass | 60 | — | 60 | {"output": "", "exit_code": -1, "error": "This foreground command appears to start a long-lived server/watch process. Run it with background=true, verify readiness (health endpoint/log signal), then execute tests in a separate command.", "s… |
| bloom-filter-cache-penetration-prevention | software-engineering | ✓ pass | 60 | — | 60 | All processes are exiting. Let me try a different approach - run the server in a subprocess with Popen: |
| competitive-programming-solver | CP | ✓ pass | 8 | — | 16 | The solution is complete. The answers are: - testcase1: YES - testcase2: NO - testcase3: YES - testcase4: NO - testcase5: NO - testcase6: YES - testcase7: YES - testcase8: NO - testcase9: YES - testcase10: YES The solution has been writte… |
| acl-permissions-inheritance | system-administration | ✗ fail | 6 | — | 5 | The shared directory structure has been created with proper Linux ACLs. Here's a summary: **Directory /srv/shared:** - Owned by group `research` - Has setgid bit set (2770 = rwxrws---) - ACLs configured for: - `research` group: rwx (curr… Grader output: grader verdict: FAIL (exit 0) tests/test_outputs.py::test_bob_can_create_and_alice_can_access PASSED ../tests/test_outputs.py::test_subdirectory_inheritance PASSED ../tests/test_outputs.py::test_outside_users_denied_access PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio... ========================= 2 failed, 7 passed in 0.49s ========================== |
| amuse-install | scientific-computing | ✗ fail | 47 | — | 46 | Great! The installation is complete and working. Let me create a summary of what was done: ## Summary I successfully installed AMUSE and its dependencies. Here's what was done: 1. **Created a Python virtual environment** at `/app/amuse/.…` Grader output: grader verdict: FAIL (exit 0) nk files; falling back to full copy. This may lead to degraded performance. If the cache and target directories are on different filesystems, hardlinking may not be supported. If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. Installed 5 packages in 19ms + iniconfig==2.3.0 + packaging==26.1 + pluggy==1.6.0 + pygments==2.20.0 + pytest==8.4.1 error: Failed to spawn: `pytest` Caused by: Permission denied (os error 13) |
| api-endpoint-permission-canonicalizer | software-engineering | ⏱ timeout | 13 | 1799s | 13 | The endpoints are empty! Let me check why: |
| application-debug | debugging | ✗ fail | 11 | — | 15 | The analysis is complete. Both output files have been generated successfully: ## Summary of Analysis **Log Files Analyzed:** 4 files (app_bf26b2.log, app_e9u98n.log, app_ju3tyx.log, app_qad7wy.log) **Key Findings:** - **Total Exceptions:… Grader output: grader verdict: FAIL (exit 0) SED ../tests/test_outputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_percentage_calculations PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ... ========================= 1 failed, 12 passed in 0.11s ========================= |
| auth_token_race_condition | backend-engineering | ✗ fail | 9 | — | 8 | The fix has been successfully applied. Here's a summary of the changes: **Changes made to `/workspace/auth_gateway/token_service.py`:** 1. **Added lock acquisition after user validation** (lines 41-44): - After validating `user_id` and… Grader output: grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds |
| bandit-delayed-feedback | machine-learning | ⏱ timeout | 15 | 2271s | 20 | Let me verify all output files are correctly formatted: |
| bash-log-processor-fix | bash_scripting | ✗ fail | 32 | — | 31 | The script is working correctly. Let me provide a summary of the fixes made: ## Summary of Changes The original script had several issues: 1. **Only processed top-level files** - didn't handle subdirectories 2. **No line counting** - only… |
| battery-charging-optimization | general | ⏱ timeout | 6 | 823s | 6 | Wait, let me reconsider the problem. The BFS gave a solution with 364 moves and 132,606 energy, but this seems suboptimal. Let me think about whether there's a more direct approach. Actually, looking at the extended Euclidean algorithm res… |
| book-portfolio-analysis | data_engineering | ⏱ timeout | 32 | 2365s | 32 | The JSON structure looks fine. Let me try a different approach - let me use a JSON parser that's more lenient: |
| bracket-sequence-restoration | general | ⏱ timeout | 33 | 2375s | 33 | So `((?)` has length 4. Let me trace the DP: - s = `((?)` - dp[0] = 1 - i=0, s[0]='(': dp[1] = 1 - i=1, s[1]='(': dp[2] = 1 - i=2, s[2]='?': - from dp[2]=1: dp[1] += 1 (assign ')'), dp[3] += 1 (assign '(') - dp[1] = 1, dp[3] = 1 - i=3,… |
| breast-cancer-mlflow | machine-learning | ⏱ timeout | 9 | 1975s | 9 | I need to fix the model logging to handle Pipeline objects. Let me update the training script: |
| broken-python | software-engineering | ✗ fail | 60 | — | 60 | Excellent! There's a pip wheel in the ensurepip directory. Let's use that: Grader output: grader verdict: FAIL (exit 0) Traceback (most recent call last): File "/usr/local/bin/pip", line 3, in `<module>` from pip._internal.cli.main import main ModuleNotFoundError: No module named 'pip' /tests/test.sh: line 5: pytest: command not found |
| build-merkle-tree-cli-sha512 | software-engineering | ⏱ timeout | 27 | 2386s | 27 | Now let me test the updated algorithm: |
| build-system-task-ordering | build-and-dependency-management | ⏱ timeout | 9 | 2277s | 9 | I see the issue! The alias graph is empty because I'm only adding edges when `len(targets) >= 2`, but the Alias directive values are single targets (`'alpha'` and `'build'`), not multiple targets. Let me check the Alias directive handling:… |
| california-housing-api | machine-learning | ✗ fail | 60 | — | 60 | Let me kill all the processes and start fresh: Grader output: grader verdict: FAIL (exit 0) sertionError: ... FAILED ../tests/test_outputs.py::test_api_predict_batch_endpoint - AssertionE... FAILED ../tests/test_outputs.py::test_api_predict_batch_matches_single_predict FAILED ../tests/test_outputs.py::test_api_predict_batch_order_preserved - ass... FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou... FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model ================== 12 failed, 18 passed, 11 warnings in 5.26s ================== |
| chained-forensic-extraction_20260101_011957 | security | ⏱ timeout | 35 | 2265s | 35 | Excellent! Key 90 gives us `dead` as the XOR key (2 bytes: 0xDE 0xAD). But the decrypted text shows `<noise_0>` which suggests there's some noise in the data. Let me check if maybe the ROT13 decoded text needs to be padded or if there's an … |
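
The aggregate generation rate in the vLLM metrics can be cross-checked from the totals reported on this page. This is a rough sketch: it assumes the 767m sampling window is wall-clock time covering all 345 requests, and the variable names are illustrative rather than vLLM metric names.

```python
# Figures copied from the vLLM metrics on this page; names are illustrative only.
total_gen_tokens = 127_941  # "total gen tokens"
total_requests = 345        # "total requests"
window_minutes = 767        # "4808 samples over 767m"

# Average generation rate over the whole sampling window (idle time included).
gen_tokens_per_s = total_gen_tokens / (window_minutes * 60)

# Mean number of generated tokens per request.
mean_tokens_per_request = total_gen_tokens / total_requests

print(f"gen tokens/s ≈ {gen_tokens_per_s:.1f}")
print(f"mean gen tokens per request ≈ {mean_tokens_per_request:.0f}")
```

Under that assumption the window average comes to about 2.8 tokens/s, consistent with the reported gen tokens/s, and each request generated roughly 371 tokens on average.
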