GadflyII/Qwen3-Coder-Next-NVFP4 · NVFP4 · 3B/80B · parser qwen3_coder · smoke probes · live

qwen3-coder-next

Qwen3-Coder-Next: an 80B-parameter mixture-of-experts model (3B active) tuned for coding agents. The most community-cited open coding agent, and a "DGX Spark practitioner favorite" per the NVIDIA forum. Distinct from qwen3-next-80b (instruct): this one is coder-tuned and uses the `qwen3_coder` tool-call format.

4/20 tasks passed · 20% pass rate

4 pass · 6 grader fail · 9 timeout · 1 no tool calls
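The headline tallies can be cross-checked against the Status column of the per-task table on this page. The following is a minimal sketch with the statuses transcribed by hand; the name `STATUS_BY_TASK` is illustrative and not part of the benchmark harness.

```python
from collections import Counter

# Status per task, transcribed from the task table on this page.
# "fail" stands for the "✗ fail" marker, "timeout" for "⏱ timeout".
STATUS_BY_TASK = {
    "anomaly-detection-ranking": "pass",
    "basic-message-queue": "pass",
    "bloom-filter-cache-penetration-prevention": "pass",
    "competitive-programming-solver": "pass",
    "acl-permissions-inheritance": "fail",
    "amuse-install": "fail",
    "api-endpoint-permission-canonicalizer": "timeout",
    "application-debug": "fail",
    "auth_token_race_condition": "fail",
    "bandit-delayed-feedback": "timeout",
    "bash-log-processor-fix": "fail",
    "battery-charging-optimization": "timeout",
    "book-portfolio-analysis": "timeout",
    "bracket-sequence-restoration": "timeout",
    "breast-cancer-mlflow": "timeout",
    "broken-python": "fail",
    "build-merkle-tree-cli-sha512": "timeout",
    "build-system-task-ordering": "timeout",
    "california-housing-api": "fail",
    "chained-forensic-extraction_20260101_011957": "timeout",
}

counts = Counter(STATUS_BY_TASK.values())
total = len(STATUS_BY_TASK)
pass_rate = counts["pass"] / total

print(dict(counts))
print(f"{counts['pass']}/{total} passed, {pass_rate:.0%} pass rate")
```

The table shows seven ✗ fail rows, which matches the summary's 6 grader fail plus 1 no tool calls if both outcomes are rendered as ✗ fail (an assumption; the page does not say so explicitly).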

vLLM metrics · 4808 samples over 767m

Metric	Value
gen tokens/s	2.8
prompt tokens/s	143
mean TTFT	8.61s
mean TPOT	—
peak concurrency	
peak KV cache	—
total gen tokens	127,941
total requests	345
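The reported generation rate can be sanity-checked from the other figures. The sketch below assumes that "gen tokens/s" is an average over the whole 767-minute sampling window, which the page does not state; the variable names are illustrative.

```python
# Figures copied from the vLLM metrics block on this page.
total_gen_tokens = 127_941
total_requests = 345
samples = 4808
window_minutes = 767

window_seconds = window_minutes * 60

# Assumption: the headline rate is total tokens divided by wall-clock time.
gen_tokens_per_second = total_gen_tokens / window_seconds

# Derived figures that the page does not report directly.
gen_tokens_per_request = total_gen_tokens / total_requests
seconds_per_sample = window_seconds / samples

print(f"gen tokens/s       ~ {gen_tokens_per_second:.2f}")
print(f"gen tokens/request ~ {gen_tokens_per_request:.1f}")
print(f"sampling interval  ~ {seconds_per_sample:.1f} s")
```

Under that assumption the result rounds to the reported 2.8. That would make the figure a wall-clock average that includes idle time and tool execution between requests, not a per-request decode speed.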

Task	Category	Status	Turns	Time	Tool calls	What the model ended with
anomaly-detection-ranking	general	✓ pass	18	—	26	The `results.json` file is complete and valid. Here's a summary of what was done: ## Summary I created a `results.json` file with predictions for all five sub-tasks: ### 1. Sensor Anomaly Detection - Used z-score based threshold on senso…
basic-message-queue	backend	✓ pass	60	—	60	{"output": "", "exit_code": -1, "error": "This foreground command appears to start a long-lived server/watch process. Run it with background=true, verify readiness (health endpoint/log signal), then execute tests in a separate command.", "s…
bloom-filter-cache-penetration-prevention	software-engineering	✓ pass	60	—	60	All processes are exiting. Let me try a different approach - run the server in a subprocess with Popen:
competitive-programming-solver	CP	✓ pass	8	—	16	The solution is complete. The answers are: - testcase1: YES - testcase2: NO - testcase3: YES - testcase4: NO - testcase5: NO - testcase6: YES - testcase7: YES - testcase8: NO - testcase9: YES - testcase10: YES The solution has been writte…
acl-permissions-inheritance	system-administration	✗ fail	6	—	5	The shared directory structure has been created with proper Linux ACLs. Here's a summary: Directory /srv/shared: - Owned by group `research` - Has setgid bit set (2770 = rwxrws---) - ACLs configured for: - `research` group: rwx (curr… · grader output: grader verdict: FAIL (exit 0) tests/test_outputs.py::test_bob_can_create_and_alice_can_access PASSED ../tests/test_outputs.py::test_subdirectory_inheritance PASSED ../tests/test_outputs.py::test_outside_users_denied_access PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio... ========================= 2 failed, 7 passed in 0.49s ==========================
amuse-install	scientific-computing	✗ fail	47	—	46	Great! The installation is complete and working. Let me create a summary of what was done: ## Summary I successfully installed AMUSE and its dependencies. Here's what was done: 1. Created a Python virtual environment at `/app/amuse/.… · grader output: grader verdict: FAIL (exit 0) nk files; falling back to full copy. This may lead to degraded performance. If the cache and target directories are on different filesystems, hardlinking may not be supported. If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. Installed 5 packages in 19ms + iniconfig==2.3.0 + packaging==26.1 + pluggy==1.6.0 + pygments==2.20.0 + pytest==8.4.1 error: Failed to spawn: `pytest` Caused by: Permission denied (os error 13)
api-endpoint-permission-canonicalizer	software-engineering	⏱ timeout	13	1799s	13	The endpoints are empty! Let me check why:
application-debug	debugging	✗ fail	11	—	15	The analysis is complete. Both output files have been generated successfully: ## Summary of Analysis Log Files Analyzed: 4 files (app_bf26b2.log, app_e9u98n.log, app_ju3tyx.log, app_qad7wy.log) Key Findings: - **Total Exceptions:… · grader output: grader verdict: FAIL (exit 0) SED ../tests/test_outputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_percentage_calculations PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ... ========================= 1 failed, 12 passed in 0.11s =========================
auth_token_race_condition	backend-engineering	✗ fail	9	—	8	The fix has been successfully applied. Here's a summary of the changes: Changes made to `/workspace/auth_gateway/token_service.py`: 1. Added lock acquisition after user validation (lines 41-44): - After validating `user_id` and… · grader output: grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds
bandit-delayed-feedback	machine-learning	⏱ timeout	15	2271s	20	Let me verify all output files are correctly formatted:
bash-log-processor-fix	bash_scripting	✗ fail	32	—	31	The script is working correctly. Let me provide a summary of the fixes made: ## Summary of Changes The original script had several issues: 1. Only processed top-level files - didn't handle subdirectories 2. No line counting - only…
battery-charging-optimization	general	⏱ timeout	6	823s	6	Wait, let me reconsider the problem. The BFS gave a solution with 364 moves and 132,606 energy, but this seems suboptimal. Let me think about whether there's a more direct approach. Actually, looking at the extended Euclidean algorithm res…
book-portfolio-analysis	data_engineering	⏱ timeout	32	2365s	32	The JSON structure looks fine. Let me try a different approach - let me use a JSON parser that's more lenient:
bracket-sequence-restoration	general	⏱ timeout	33	2375s	33	So `((?)` has length 4. Let me trace the DP: - s = `((?)` - dp[0] = 1 - i=0, s[0]='(': dp[1] = 1 - i=1, s[1]='(': dp[2] = 1 - i=2, s[2]='?': - from dp[2]=1: dp[1] += 1 (assign ')'), dp[3] += 1 (assign '(') - dp[1] = 1, dp[3] = 1 - i=3,…
breast-cancer-mlflow	machine-learning	⏱ timeout	9	1975s	9	I need to fix the model logging to handle Pipeline objects. Let me update the training script:
broken-python	software-engineering	✗ fail	60	—	60	Excellent! There's a pip wheel in the ensurepip directory. Let's use that: · grader output: grader verdict: FAIL (exit 0) Traceback (most recent call last): File "/usr/local/bin/pip", line 3, in <module> from pip._internal.cli.main import main ModuleNotFoundError: No module named 'pip' /tests/test.sh: line 5: pytest: command not found
build-merkle-tree-cli-sha512	software-engineering	⏱ timeout	27	2386s	27	Now let me test the updated algorithm:
build-system-task-ordering	build-and-dependency-management	⏱ timeout	9	2277s	9	I see the issue! The alias graph is empty because I'm only adding edges when `len(targets) >= 2`, but the Alias directive values are single targets (`'alpha'` and `'build'`), not multiple targets. Let me check the Alias directive handling:…
california-housing-api	machine-learning	✗ fail	60	—	60	Let me kill all the processes and start fresh: · grader output: grader verdict: FAIL (exit 0) sertionError: ... FAILED ../tests/test_outputs.py::test_api_predict_batch_endpoint - AssertionE... FAILED ../tests/test_outputs.py::test_api_predict_batch_matches_single_predict FAILED ../tests/test_outputs.py::test_api_predict_batch_order_preserved - ass... FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou... FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model ================== 12 failed, 18 passed, 11 warnings in 5.26s ==================
chained-forensic-extraction_20260101_011957	security	⏱ timeout	35	2265s	35	Excellent! Key 90 gives us `dead` as the XOR key (2 bytes: 0xDE 0xAD). But the decrypted text shows `<noise_0>` which suggests there's some noise in the data. Let me check if maybe the ROT13 decoded text needs to be padded or if there's an …
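Several rows above (basic-message-queue, bloom-filter-cache-penetration-prevention, california-housing-api) ended while the model was still wrestling with long-lived server processes. The harness hint quoted in the basic-message-queue row recommends starting the server in the background, verifying readiness, and then running tests in a separate command. The sketch below shows one way that pattern could look in plain Python. It is not the harness's implementation: the helper names are hypothetical, and the standard-library `http.server` module stands in for whatever server a task actually requires.

```python
import socket
import subprocess
import sys
import time
import urllib.request


def free_port() -> int:
    """Ask the OS for an unused TCP port (small race: it is released before reuse)."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until something accepts TCP connections on host:port, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)
    return False


def run_with_server(server_cmd, host, port, check):
    """Start server_cmd in the background, wait for readiness, run check(), clean up."""
    proc = subprocess.Popen(
        server_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        if not wait_for_port(host, port):
            raise RuntimeError("server did not become ready in time")
        return check()
    finally:
        # Always stop the background process, even if the check raised.
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def fetch_status(url: str) -> int:
    """Return the HTTP status code of a GET request to url."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status


port = free_port()
# Stand-in server for illustration only.
cmd = [sys.executable, "-m", "http.server", str(port), "--bind", "127.0.0.1"]
status = run_with_server(
    cmd, "127.0.0.1", port, lambda: fetch_status(f"http://127.0.0.1:{port}/")
)
print("status:", status)
```

The `finally` block matters most here: it stops the background process whether or not the check succeeds, so a failed attempt does not leave a stale server holding the port for the next turn.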