GadflyII/Qwen3-Coder-Next-NVFP4 · NVFP4 · 3B/80B · parser qwen3_coder · smoke probes · live

qwen3-coder-next

Qwen3-Coder-Next: an 80B-parameter mixture-of-experts (MoE) coding-agent model with 3B active parameters. Described as the #1 community-cited open coding agent and as a "DGX Spark practitioner favorite" on the NVIDIA forum. Not to be confused with qwen3-next-80b (the instruct model): this one is coder-tuned and uses the `qwen3_coder` tool-call format.

4/20 tasks passed · 20% pass rate

4 pass · 6 grader fail · 9 timeout · 1 no tool calls
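As a quick consistency check, the status counts in the summary line can be summed and compared with the headline pass rate. This is a minimal sketch; the dictionary keys simply mirror the labels above and the variable names are illustrative.

```python
# Status counts copied from the summary line above.
counts = {"pass": 4, "grader fail": 6, "timeout": 9, "no tool calls": 1}

total_tasks = sum(counts.values())
pass_rate = counts["pass"] / total_tasks

print(f"{counts['pass']}/{total_tasks} tasks passed, {pass_rate:.0%} pass rate")
```

The four buckets add up to the 20 tasks in the headline, and 4 of 20 gives the stated 20%.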
vLLM metrics · 4808 samples over 767m
gen tokens/s: 2.8
prompt tokens/s: 143
mean TTFT: 8.61s
mean TPOT: (no value shown)
peak concurrency: 8
peak KV cache: (no value shown)
total gen tokens: 127,941
total requests: 345
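The throughput figure can be cross-checked against the totals in the same panel. The sketch below assumes that "gen tokens/s" is an average over the whole 767-minute sampling window; the page does not say so, but the numbers are consistent with that reading. Variable names are illustrative.

```python
# Values copied from the vLLM metrics panel above.
total_gen_tokens = 127_941
total_requests = 345
window_minutes = 767

# Assumption: the reported rate is averaged over the full sampling window.
window_seconds = window_minutes * 60
avg_gen_tokens_per_s = total_gen_tokens / window_seconds
avg_gen_tokens_per_request = total_gen_tokens / total_requests

print(f"{avg_gen_tokens_per_s:.1f} gen tokens/s")
print(f"{avg_gen_tokens_per_request:.1f} gen tokens per request")
```

Under that assumption the average comes out at about 2.8 generated tokens per second, matching the panel, with roughly 371 generated tokens per request.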
Task · Category · Status · Turns · Time · Tool calls · What the model ended with
anomaly-detection-ranking · general · ✓ pass · 18 turns · 26 tool calls · The `results.json` file is complete and valid. Here's a summary of what was done: ## Summary I created a `results.json` file with predictions for all five sub-tasks: ### 1. Sensor Anomaly Detection - Used z-score based threshold on senso…
basic-message-queue · backend · ✓ pass · 60 turns · 60 tool calls · {"output": "", "exit_code": -1, "error": "This foreground command appears to start a long-lived server/watch process. Run it with background=true, verify readiness (health endpoint/log signal), then execute tests in a separate command.", "s…
bloom-filter-cache-penetration-prevention · software-engineering · ✓ pass · 60 turns · 60 tool calls · All processes are exiting. Let me try a different approach - run the server in a subprocess with Popen:
competitive-programming-solver · CP · ✓ pass · 8 turns · 16 tool calls · The solution is complete. The answers are: - testcase1: YES - testcase2: NO - testcase3: YES - testcase4: NO - testcase5: NO - testcase6: YES - testcase7: YES - testcase8: NO - testcase9: YES - testcase10: YES The solution has been writte…
acl-permissions-inheritance · system-administration · ✗ fail · 6 turns · 5 tool calls · The shared directory structure has been created with proper Linux ACLs. Here's a summary: **Directory /srv/shared:** - Owned by group `research` - Has setgid bit set (2770 = rwxrws---) - ACLs configured for: - `research` group: rwx (curr…
grader output
grader verdict: FAIL (exit 0)
tests/test_outputs.py::test_bob_can_create_and_alice_can_access
PASSED ../tests/test_outputs.py::test_subdirectory_inheritance
PASSED ../tests/test_outputs.py::test_outside_users_denied_access
PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts
FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions
FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio...
========================= 2 failed, 7 passed in 0.49s ==========================
amuse-install · scientific-computing · ✗ fail · 47 turns · 46 tool calls · Great! The installation is complete and working. Let me create a summary of what was done: ## Summary I successfully installed AMUSE and its dependencies. Here's what was done: 1. **Created a Python virtual environment** at `/app/amuse/.…
grader output
grader verdict: FAIL (exit 0)
nk files; falling back to full copy. This may lead to degraded performance.
         If the cache and target directories are on different filesystems, hardlinking may not be supported.
         If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
Installed 5 packages in 19ms
 + iniconfig==2.3.0
 + packaging==26.1
 + pluggy==1.6.0
 + pygments==2.20.0
 + pytest==8.4.1
error: Failed to spawn: `pytest`
  Caused by: Permission denied (os error 13)
api-endpoint-permission-canonicalizer · software-engineering · ⏱ timeout · 13 turns · 1799s · 13 tool calls · The endpoints are empty! Let me check why:
application-debug · debugging · ✗ fail · 11 turns · 15 tool calls · The analysis is complete. Both output files have been generated successfully: ## Summary of Analysis **Log Files Analyzed:** 4 files (app_bf26b2.log, app_e9u98n.log, app_ju3tyx.log, app_qad7wy.log) **Key Findings:** - **Total Exceptions:…
grader output
grader verdict: FAIL (exit 0)
SED ../tests/test_outputs.py::test_critical_failures_calculation
PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions
PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers
PASSED ../tests/test_outputs.py::test_percentage_calculations
PASSED ../tests/test_outputs.py::test_anti_cheat_validation
FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ...
========================= 1 failed, 12 passed in 0.11s =========================
auth_token_race_condition · backend-engineering · ✗ fail · 9 turns · 8 tool calls · The fix has been successfully applied. Here's a summary of the changes: **Changes made to `/workspace/auth_gateway/token_service.py`:** 1. **Added lock acquisition after user validation** (lines 41-44): - After validating `user_id` and…
grader output
grader verdict: FAIL (exit 1)
Starting test suite for concurrent token refresh...
Killing any existing uvicorn processes...
Waiting for port 8000 to be available...
Waiting for Redis to be ready...
/tests/test.sh: line 27: /workspace/start.sh: No such file or directory
Redis failed to start within 60 seconds
bandit-delayed-feedback · machine-learning · ⏱ timeout · 15 turns · 2271s · 20 tool calls · Let me verify all output files are correctly formatted:
bash-log-processor-fix · bash_scripting · ✗ fail · 32 turns · 31 tool calls · The script is working correctly. Let me provide a summary of the fixes made: ## Summary of Changes The original script had several issues: 1. **Only processed top-level files** - didn't handle subdirectories 2. **No line counting** - only…
battery-charging-optimization · general · ⏱ timeout · 6 turns · 823s · 6 tool calls · Wait, let me reconsider the problem. The BFS gave a solution with 364 moves and 132,606 energy, but this seems suboptimal. Let me think about whether there's a more direct approach. Actually, looking at the extended Euclidean algorithm res…
book-portfolio-analysis · data_engineering · ⏱ timeout · 32 turns · 2365s · 32 tool calls · The JSON structure looks fine. Let me try a different approach - let me use a JSON parser that's more lenient:
bracket-sequence-restoration · general · ⏱ timeout · 33 turns · 2375s · 33 tool calls · So `((?)` has length 4. Let me trace the DP: - s = `((?)` - dp[0] = 1 - i=0, s[0]='(': dp[1] = 1 - i=1, s[1]='(': dp[2] = 1 - i=2, s[2]='?': - from dp[2]=1: dp[1] += 1 (assign ')'), dp[3] += 1 (assign '(') - dp[1] = 1, dp[3] = 1 - i=3,…
breast-cancer-mlflow · machine-learning · ⏱ timeout · 9 turns · 1975s · 9 tool calls · I need to fix the model logging to handle Pipeline objects. Let me update the training script:
broken-python · software-engineering · ✗ fail · 60 turns · 60 tool calls · Excellent! There's a pip wheel in the ensurepip directory. Let's use that:
grader output
grader verdict: FAIL (exit 0)
Traceback (most recent call last):
  File "/usr/local/bin/pip", line 3, in <module>
    from pip._internal.cli.main import main
ModuleNotFoundError: No module named 'pip'
/tests/test.sh: line 5: pytest: command not found
build-merkle-tree-cli-sha512 · software-engineering · ⏱ timeout · 27 turns · 2386s · 27 tool calls · Now let me test the updated algorithm:
build-system-task-ordering · build-and-dependency-management · ⏱ timeout · 9 turns · 2277s · 9 tool calls · I see the issue! The alias graph is empty because I'm only adding edges when `len(targets) >= 2`, but the Alias directive values are single targets (`'alpha'` and `'build'`), not multiple targets. Let me check the Alias directive handling:…
california-housing-api · machine-learning · ✗ fail · 60 turns · 60 tool calls · Let me kill all the processes and start fresh:
grader output
grader verdict: FAIL (exit 0)
sertionError: ...
FAILED ../tests/test_outputs.py::test_api_predict_batch_endpoint - AssertionE...
FAILED ../tests/test_outputs.py::test_api_predict_batch_matches_single_predict
FAILED ../tests/test_outputs.py::test_api_predict_batch_order_preserved - ass...
FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou...
FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model
================== 12 failed, 18 passed, 11 warnings in 5.26s ==================
chained-forensic-extraction_20260101_011957 · security · ⏱ timeout · 35 turns · 2265s · 35 tool calls · Excellent! Key 90 gives us `dead` as the XOR key (2 bytes: 0xDE 0xAD). But the decrypted text shows `<noise_0>` which suggests there's some noise in the data. Let me check if maybe the ROT13 decoded text needs to be padded or if there's an …
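Several rows above (basic-message-queue, bloom-filter-cache-penetration-prevention, california-housing-api) end with the model struggling to run a long-lived server, and the harness error in the basic-message-queue row spells out the expected pattern: start the server in the background, verify readiness, then run tests in a separate command. The sketch below illustrates that pattern in plain Python. It does not use the harness's own `background=true` mechanism, which is not shown on this page; `python -m http.server` stands in for the real server, and `free_port` and `wait_until_ready` are hypothetical helper names.

```python
import socket
import subprocess
import sys
import time


def free_port() -> int:
    """Ask the OS for a currently unused TCP port (illustrative helper)."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def wait_until_ready(port: int, timeout: float = 10.0) -> bool:
    """Poll until something accepts TCP connections on the port."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)  # not listening yet, retry shortly
    return False


port = free_port()

# Start the stand-in server without blocking the calling process.
server = subprocess.Popen(
    [sys.executable, "-m", "http.server", str(port), "--bind", "127.0.0.1"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
try:
    ready = wait_until_ready(port)
    # Tests would run here, as a separate command, only once `ready` is True.
finally:
    # Always stop the background process so it does not outlive the run.
    server.terminate()
    server.wait(timeout=5)
```

Polling a TCP connect is the simplest readiness signal; a real service would more likely expose a health endpoint or a log line, as the harness message suggests.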