Qwen/Qwen3-Coder-Next-FP8 · FP8 · 3B/80B · parser qwen3_coder · smoke probes · live

qwen3-coder-next-fp8

Qwen's official FP8 build of Qwen3-Coder-Next. It is the A/B partner to the NVFP4 variant; FP8 kernels are better tested on SM120 than NVFP4 kernels.

3/20 tasks passed · 15% pass rate

3 pass 4 grader fail 13 timeout 0 no tool calls
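
As a quick sanity check, the headline pass rate can be recomputed from the outcome counts on the line above. This is a minimal sketch; the dictionary keys are labels chosen here for illustration, not identifiers from the benchmark harness.

```python
# Outcome counts copied from the summary line above.
outcomes = {"pass": 3, "grader fail": 4, "timeout": 13, "no tool calls": 0}

total_tasks = sum(outcomes.values())
pass_rate = outcomes["pass"] / total_tasks
timeout_share = outcomes["timeout"] / total_tasks

print(f"{outcomes['pass']}/{total_tasks} passed ({pass_rate:.0%})")
print(f"timeouts: {timeout_share:.0%} of tasks")
```

The counts sum to the 20 tasks listed, and timeouts account for the majority of non-passing tasks.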
vLLM metrics · 1321 samples over 48m
gen tokens/s: 33.5
prompt tokens/s: 1801
mean TTFT: 6.19s
mean TPOT: (no value)
peak concurrency: 8
peak KV cache: (no value)
total gen tokens: 97,152
total requests: 406
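
The throughput figure can be roughly cross-checked against the totals. The sketch below assumes the "48m" window is rounded to the nearest minute and that the whole window counts toward generation; the dashboard's own gen tokens/s value may instead be an average over the 1321 samples, so a small difference is expected.

```python
# Values copied from the vLLM metrics panel above.
total_gen_tokens = 97_152
total_requests = 406
window_seconds = 48 * 60  # "48m" as displayed; assumed to be rounded

# Aggregate generation rate over the whole window.
approx_gen_tps = total_gen_tokens / window_seconds

# Average number of generated tokens per request.
tokens_per_request = total_gen_tokens / total_requests

print(f"approx gen tokens/s: {approx_gen_tps:.1f}")
print(f"gen tokens per request: {tokens_per_request:.0f}")
```

The aggregate rate comes out near the reported 33.5 gen tokens/s, which suggests the totals and the rate are at least consistent with each other.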
Task Category Status Turns Time Tool calls What the model ended with
amuse-install scientific-computing ✓ pass 48 47 Perfect! Let me create a summary of what was done: ## Summary I have successfully set up the AMUSE framework with a Python virtual environment at `/app/amuse/.venv` and installed the necessary components to run the test script. Here's wha…
anomaly-detection-ranking general ✓ pass 19 18 The results.json file has been successfully created with all required keys and proper structure: 1. **sensor**: 100 predictions (matching the 100 rows in sensor_test.csv) 2. **network**: 150 predictions (matching the 150 rows in network_te…
broken-python software-engineering ✓ pass 15 14 Perfect! pip is now working correctly and we've successfully installed the requests package. ## Summary The issue was that pip was partially installed but the actual pip module files were missing from your Python installation. This can ha…
acl-permissions-inheritance system-administration ✗ fail 10 9 The shared directory structure with proper Linux ACLs has been created successfully. Here's what was configured: **Directory `/srv/shared`:** - Owned by group `research` - Has the setgid bit set (permissions show as `drwxrwx---+`) - Base p…
grader output
grader verdict: FAIL (exit 0)
tests/test_outputs.py::test_bob_can_create_and_alice_can_access
PASSED ../tests/test_outputs.py::test_subdirectory_inheritance
PASSED ../tests/test_outputs.py::test_outside_users_denied_access
PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts
FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions
FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio...
========================= 2 failed, 7 passed in 0.54s ==========================
api-endpoint-permission-canonicalizer software-engineering ⏱ timeout 8 1039s 8 I see the issue. Looking at the spec more carefully: 1. When `base /base/api` is specified, only endpoints starting with `/base/api` are included 2. When `exclude restricted` is specified, `restricted` role is excluded 3. But in my test ca…
application-debug debugging ✗ fail 14 13 The analysis is complete. Here's a summary of the incident reports generated: ## Summary **Log Analysis Results:** - **Total Log Files Analyzed:** 4 - **Total Exceptions:** 149 - **Critical Failures (ERROR+FATAL):** 149 - **Most Common Ex…
grader output
grader verdict: FAIL (exit 0)
PASSED ../tests/test_outputs.py::test_critical_failures_calculation
PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions
PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers
PASSED ../tests/test_outputs.py::test_percentage_calculations
PASSED ../tests/test_outputs.py::test_anti_cheat_validation
FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ...
========================= 1 failed, 12 passed in 0.10s =========================
auth_token_race_condition backend-engineering ✗ fail 6 5 The file has been created with the fix implemented. The key changes are: 1. **Lock acquisition**: After validating `user_id` and `username`, the code now acquires a lock using `await self.acquire_lock(f"refresh_lock:{refresh_token}")`. If …
grader output
grader verdict: FAIL (exit 1)
Starting test suite for concurrent token refresh...
Killing any existing uvicorn processes...
Waiting for port 8000 to be available...
Waiting for Redis to be ready...
/tests/test.sh: line 27: /workspace/start.sh: No such file or directory
Redis failed to start within 60 seconds
bandit-delayed-feedback machine-learning ⏱ timeout 8 937s 8 Now let me create the output directory and write a Python script to solve this problem:
bash-log-processor-fix bash_scripting ⏱ timeout 36 1178s 50 The lock mechanism is working - only one process succeeded while the others were blocked. Now let me verify all the output files:
basic-message-queue backend ⏱ timeout 46 1178s 46 I see the issue - the receive_message function is not properly handling the empty queue case. Let me check the full receive_message function:
battery-charging-optimization general ⏱ timeout 1 20s 1 [tool_call] process({"action": "list"})
bloom-filter-cache-penetration-prevention software-engineering ⏱ timeout 20 1142s 20 Let me run the test differently:
book-portfolio-analysis data_engineering ⏱ timeout 9 1017s 10 Now let me run the script:
bracket-sequence-restoration general ⏱ timeout 27 1190s 27 Perfect! The solution matches the expected output: - `()()` → YES (already valid RBS, no question marks) - `(??)` → NO (can be `(() )` or `()()`) - `(?` → YES (only `()` is valid) The solution is complete and efficient. Let me submit it:
breast-cancer-mlflow machine-learning ⏱ timeout 39 1149s 39 Let me check if there's an issue with the app.py file:
build-merkle-tree-cli-sha512 software-engineering ⏱ timeout 18 1037s 18 Let me read the current state of the file and fix it properly:
build-system-task-ordering build-and-dependency-management ⏱ timeout 4 1068s 4 I see several issues. Let me analyze: 1. **Example 1**: The expected output is `{"order":["all","compile","clean"],"cycles":[]}` but I'm getting all 5 targets. The issue is that `build`, `compile`, and `link` should all canonicalize to `co…
california-housing-api machine-learning ⏱ timeout 50 1160s 50 Let me try a different approach - start the server using terminal with pty mode:
chained-forensic-extraction_20260101_011957 security ⏱ timeout 14 1178s 14 I see the issue now. The text extracted from the PNG is 416 characters, which is not valid Base64 (Base64 length should be multiple of 4). Let me check if there are extra characters or if I'm extracting the wrong part. Let me look at the ra…
competitive-programming-solver CP ✗ fail 8 16 The solution has been written to `/workdir/sol.csv` in the required format. Here's a summary of the results: | Testcase | Answer | |----------|--------| | testcase1 | YES | | testcase2 | NO | | testcase3 | YES | | testcase4 | NO | | testca…
grader output
grader verdict: FAIL (exit 0)
testcase4 : CORRECT (NO); testcase5 : CORRECT (NO); testcase6 : CORRECT (YES); testcase7 : CORRECT (YES); testcase8 : INCORRECT - Expected: NO, Found: YES; testcase9 : CORRECT (YES); testcase10 : CORRECT (YES)
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.37s ===============================
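
Several grader blocks above end with a pytest summary line, which makes it possible to see how close a failed task came to passing. The following is a minimal sketch for extracting those counts; `parse_pytest_summary` is a hypothetical helper written for this page, not part of the grading harness.

```python
import re

# Matches fragments such as "2 failed" or "7 passed" in a pytest summary line.
_COUNT = re.compile(r"(\d+) (passed|failed|skipped|errors?)\b")


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Return the outcome counts found in one pytest summary line."""
    return {label: int(number) for number, label in _COUNT.findall(line)}


# Summary lines copied from the grader output in the table above.
summaries = [
    "========================= 2 failed, 7 passed in 0.54s ==========================",
    "========================= 1 failed, 12 passed in 0.10s =========================",
    "============================== 1 passed in 0.37s ===============================",
]

for line in summaries:
    counts = parse_pytest_summary(line)
    total = sum(counts.values())
    print(f"{counts.get('passed', 0)}/{total} tests passed")
```

Test counts alone do not determine the verdict: in the last block the single pytest test passes, yet the grader records a score of 0.0 because one testcase answer was incorrect.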