Qwen/Qwen3-Coder-Next-FP8 · FP8 · 3B/80B · parser qwen3_coder · smoke probes · live

qwen3-coder-next-fp8

Qwen's official FP8 build of Qwen3-Coder-Next. A/B partner to the NVFP4 variant: FP8 kernels are better tested on SM120 than NVFP4 kernels.

3/20 tasks passed · 15% pass rate

3 pass · 4 grader fail · 13 timeout · 0 no tool calls
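As a quick consistency check, the headline pass rate can be recomputed from the status counts in the summary line. This is a minimal sketch; the dictionary keys and variable names are illustrative and not taken from the harness that produced this page.

```python
from collections import Counter

# Status counts copied from the summary line; key names are hypothetical.
status_counts = Counter(
    {"pass": 3, "grader_fail": 4, "timeout": 13, "no_tool_calls": 0}
)

# Every task ends in exactly one status, so the counts sum to the task total.
total_tasks = sum(status_counts.values())
pass_rate = 100 * status_counts["pass"] / total_tasks

print(f"{status_counts['pass']}/{total_tasks} tasks passed · {pass_rate:.0f}% pass rate")
```

The counts sum to the 20 tasks listed in the table, and 3 of 20 gives the reported 15%.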

vLLM metrics · 1321 samples over 48m

Metric	Value
gen tokens/s	33.5
prompt tokens/s	1801
mean TTFT	6.19s
mean TPOT	—
peak concurrency	
peak KV cache	—
total gen tokens	97,152
total requests	406
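The totals allow a rough cross-check of the averaged metrics. The sketch below assumes the 48m window is the full sampling period and that it is rounded to the minute; how the dashboard actually averages gen tokens/s is not stated, so a small difference from the reported 33.5 is expected. Variable names are illustrative.

```python
# Figures copied from the vLLM metrics; names are hypothetical.
total_gen_tokens = 97_152
total_requests = 406
samples = 1321
window_seconds = 48 * 60  # "over 48m", assumed to be rounded to the minute

# Whole-window generation throughput (total tokens over total time).
tokens_per_second = total_gen_tokens / window_seconds

# Average completion length per request.
tokens_per_request = total_gen_tokens / total_requests

# Implied interval between metric samples.
seconds_per_sample = window_seconds / samples

print(f"gen tokens/s over window: {tokens_per_second:.1f}")
print(f"gen tokens per request:   {tokens_per_request:.0f}")
print(f"sampling interval:        {seconds_per_sample:.2f}s")
```

Under these assumptions the whole-window rate comes out near 33.7 tokens/s, in line with the reported figure, with roughly 239 generated tokens per request.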

Task	Category	Status	Turns	Time	Tool calls	What the model ended with
amuse-install	scientific-computing	✓ pass	48	—	47	Perfect! Let me create a summary of what was done: ## Summary I have successfully set up the AMUSE framework with a Python virtual environment at `/app/amuse/.venv` and installed the necessary components to run the test script. Here's wha…
anomaly-detection-ranking	general	✓ pass	19	—	18	The results.json file has been successfully created with all required keys and proper structure: 1. sensor: 100 predictions (matching the 100 rows in sensor_test.csv) 2. network: 150 predictions (matching the 150 rows in network_te…
broken-python	software-engineering	✓ pass	15	—	14	Perfect! pip is now working correctly and we've successfully installed the requests package. ## Summary The issue was that pip was partially installed but the actual pip module files were missing from your Python installation. This can ha…
acl-permissions-inheritance	system-administration	✗ fail	10	—	9	The shared directory structure with proper Linux ACLs has been created successfully. Here's what was configured: Directory `/srv/shared`: - Owned by group `research` - Has the setgid bit set (permissions show as `drwxrwx---+`) - Base p… [grader output] grader verdict: FAIL (exit 0) tests/test_outputs.py::test_bob_can_create_and_alice_can_access PASSED ../tests/test_outputs.py::test_subdirectory_inheritance PASSED ../tests/test_outputs.py::test_outside_users_denied_access PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio... ========================= 2 failed, 7 passed in 0.54s ==========================
api-endpoint-permission-canonicalizer	software-engineering	⏱ timeout	8	1039s	8	I see the issue. Looking at the spec more carefully: 1. When `base /base/api` is specified, only endpoints starting with `/base/api` are included 2. When `exclude restricted` is specified, `restricted` role is excluded 3. But in my test ca…
application-debug	debugging	✗ fail	14	—	13	The analysis is complete. Here's a summary of the incident reports generated: ## Summary Log Analysis Results: - Total Log Files Analyzed: 4 - Total Exceptions: 149 - Critical Failures (ERROR+FATAL): 149 - **Most Common Ex… [grader output] grader verdict: FAIL (exit 0) SED ../tests/test_outputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_percentage_calculations PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ... ========================= 1 failed, 12 passed in 0.10s =========================
auth_token_race_condition	backend-engineering	✗ fail	6	—	5	The file has been created with the fix implemented. The key changes are: 1. Lock acquisition: After validating `user_id` and `username`, the code now acquires a lock using `await self.acquire_lock(f"refresh_lock:{refresh_token}")`. If … [grader output] grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds
bandit-delayed-feedback	machine-learning	⏱ timeout	8	937s	8	Now let me create the output directory and write a Python script to solve this problem:
bash-log-processor-fix	bash_scripting	⏱ timeout	36	1178s	50	The lock mechanism is working - only one process succeeded while the others were blocked. Now let me verify all the output files:
basic-message-queue	backend	⏱ timeout	46	1178s	46	I see the issue - the receive_message function is not properly handling the empty queue case. Let me check the full receive_message function:
battery-charging-optimization	general	⏱ timeout	1	20s	1	[tool_call] process({"action": "list"})
bloom-filter-cache-penetration-prevention	software-engineering	⏱ timeout	20	1142s	20	Let me run the test differently:
book-portfolio-analysis	data_engineering	⏱ timeout	9	1017s	10	Now let me run the script:
bracket-sequence-restoration	general	⏱ timeout	27	1190s	27	Perfect! The solution matches the expected output: - `()()` → YES (already valid RBS, no question marks) - `(??)` → NO (can be `(() )` or `()()`) - `(?` → YES (only `()` is valid) The solution is complete and efficient. Let me submit it:
breast-cancer-mlflow	machine-learning	⏱ timeout	39	1149s	39	Let me check if there's an issue with the app.py file:
build-merkle-tree-cli-sha512	software-engineering	⏱ timeout	18	1037s	18	Let me read the current state of the file and fix it properly:
build-system-task-ordering	build-and-dependency-management	⏱ timeout	4	1068s	4	I see several issues. Let me analyze: 1. Example 1: The expected output is `{"order":["all","compile","clean"],"cycles":[]}` but I'm getting all 5 targets. The issue is that `build`, `compile`, and `link` should all canonicalize to `co…
california-housing-api	machine-learning	⏱ timeout	50	1160s	50	Let me try a different approach - start the server using terminal with pty mode:
chained-forensic-extraction_20260101_011957	security	⏱ timeout	14	1178s	14	I see the issue now. The text extracted from the PNG is 416 characters, which is not valid Base64 (Base64 length should be multiple of 4). Let me check if there are extra characters or if I'm extracting the wrong part. Let me look at the ra…
competitive-programming-solver	CP	✗ fail	8	—	16	The solution has been written to `/workdir/sol.csv` in the required format. Here's a summary of the results: \| Testcase \| Answer \| \|----------\|--------\| \| testcase1 \| YES \| \| testcase2 \| NO \| \| testcase3 \| YES \| \| testcase4 \| NO \| \| testca… [grader output] grader verdict: FAIL (exit 0) stcase4 : CORRECT (NO); testcase5 : CORRECT (NO); testcase6 : CORRECT (YES); testcase7 : CORRECT (YES); testcase8 : INCORRECT - Expected: NO, Found: YES; testcase9 : CORRECT (YES); testcase10 : CORRECT (YES) Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.37s ===============================
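Several of the failing rows end with a pytest summary line such as `2 failed, 7 passed in 0.54s`. The sketch below shows one way to pull the outcome counts out of such a line for comparison across tasks. The function name and regular expression are illustrative assumptions, not part of the grader behind this page.

```python
import re

# Matches "<count> <outcome>" pairs in a pytest summary line (hypothetical helper).
SUMMARY_RE = re.compile(r"(\d+) (passed|failed|errors?|skipped)")


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Return outcome counts from a line like '2 failed, 7 passed in 0.54s'."""
    return {outcome: int(count) for count, outcome in SUMMARY_RE.findall(line)}


# Summary lines copied from the grader output in the table.
for line in (
    "2 failed, 7 passed in 0.54s",
    "1 failed, 12 passed in 0.10s",
    "1 passed in 0.37s",
):
    print(line, "->", parse_pytest_summary(line))
```

Note that pytest counts alone do not determine the verdict here: the competitive-programming-solver row reports `1 passed` yet is graded FAIL, since its score written to reward.txt is 0.0.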