Qwen/Qwen3-Coder-Next-FP8 · FP8 · 3B/80B · parser qwen3_coder · smoke probes · live
qwen3-coder-next-fp8
Qwen's official FP8 build of Qwen3-Coder-Next. It serves as the A/B partner to the NVFP4 variant, since FP8 kernels are better tested on SM120 than NVFP4 kernels.
3/20 tasks passed · 15% pass rate
3 pass · 4 grader fail · 13 timeout · 0 no tool calls
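
The headline pass rate follows directly from the status tallies above. A minimal sketch of that arithmetic, assuming all four buckets count toward the task total (the names `status_counts` and `pass_rate` are illustrative and not taken from the harness):

```python
# Status tallies copied from the summary line above.
status_counts = {"pass": 3, "grader fail": 4, "timeout": 13, "no tool calls": 0}


def pass_rate(counts: dict[str, int]) -> float:
    """Return the fraction of all recorded tasks that passed."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tasks recorded")
    return counts["pass"] / total


# Assumption: every bucket, including "no tool calls", is part of the total.
total_tasks = sum(status_counts.values())
summary = (
    f"{status_counts['pass']}/{total_tasks} tasks passed · "
    f"{pass_rate(status_counts):.0%} pass rate"
)
print(summary)
```

Under that assumption the buckets sum to the 20 tasks listed in the table, which matches the reported 3/20.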
vLLM metrics · 1321 samples over 48m

| Metric | Value |
|---|---|
| gen tokens/s | 33.5 |
| prompt tokens/s | 1801 |
| mean TTFT | 6.19s |
| mean TPOT | — |
| peak concurrency | 8 |
| peak KV cache | — |
| total gen tokens | 97,152 |
| total requests | 406 |

| Task | Category | Status | Turns | Time | Tool calls | What the model ended with |
|---|---|---|---|---|---|---|
| amuse-install | scientific-computing | ✓ pass | 48 | — | 47 | Perfect! Let me create a summary of what was done: ## Summary I have successfully set up the AMUSE framework with a Python virtual environment at `/app/amuse/.venv` and installed the necessary components to run the test script. Here's wha… |
| anomaly-detection-ranking | general | ✓ pass | 19 | — | 18 | The results.json file has been successfully created with all required keys and proper structure: 1. **sensor**: 100 predictions (matching the 100 rows in sensor_test.csv) 2. **network**: 150 predictions (matching the 150 rows in network_te… |
| broken-python | software-engineering | ✓ pass | 15 | — | 14 | Perfect! pip is now working correctly and we've successfully installed the requests package. ## Summary The issue was that pip was partially installed but the actual pip module files were missing from your Python installation. This can ha… |
| acl-permissions-inheritance | system-administration | ✗ fail | 10 | — | 9 | The shared directory structure with proper Linux ACLs has been created successfully. Here's what was configured: **Directory `/srv/shared`:** - Owned by group `research` - Has the setgid bit set (permissions show as `drwxrwx---+`) - Base p… Grader output: grader verdict: FAIL (exit 0) tests/test_outputs.py::test_bob_can_create_and_alice_can_access PASSED ../tests/test_outputs.py::test_subdirectory_inheritance PASSED ../tests/test_outputs.py::test_outside_users_denied_access PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio... ========================= 2 failed, 7 passed in 0.54s ========================== |
| api-endpoint-permission-canonicalizer | software-engineering | ⏱ timeout | 8 | 1039s | 8 | I see the issue. Looking at the spec more carefully: 1. When `base /base/api` is specified, only endpoints starting with `/base/api` are included 2. When `exclude restricted` is specified, `restricted` role is excluded 3. But in my test ca… |
| application-debug | debugging | ✗ fail | 14 | — | 13 | The analysis is complete. Here's a summary of the incident reports generated: ## Summary **Log Analysis Results:** - **Total Log Files Analyzed:** 4 - **Total Exceptions:** 149 - **Critical Failures (ERROR+FATAL):** 149 - **Most Common Ex… Grader output: grader verdict: FAIL (exit 0) SED ../tests/test_outputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_percentage_calculations PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ... ========================= 1 failed, 12 passed in 0.10s ========================= |
| auth_token_race_condition | backend-engineering | ✗ fail | 6 | — | 5 | The file has been created with the fix implemented. The key changes are: 1. **Lock acquisition**: After validating `user_id` and `username`, the code now acquires a lock using `await self.acquire_lock(f"refresh_lock:{refresh_token}")`. If … Grader output: grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds |
| bandit-delayed-feedback | machine-learning | ⏱ timeout | 8 | 937s | 8 | Now let me create the output directory and write a Python script to solve this problem: |
| bash-log-processor-fix | bash_scripting | ⏱ timeout | 36 | 1178s | 50 | The lock mechanism is working - only one process succeeded while the others were blocked. Now let me verify all the output files: |
| basic-message-queue | backend | ⏱ timeout | 46 | 1178s | 46 | I see the issue - the receive_message function is not properly handling the empty queue case. Let me check the full receive_message function: |
| battery-charging-optimization | general | ⏱ timeout | 1 | 20s | 1 | [tool_call] process({"action": "list"}) |
| bloom-filter-cache-penetration-prevention | software-engineering | ⏱ timeout | 20 | 1142s | 20 | Let me run the test differently: |
| book-portfolio-analysis | data_engineering | ⏱ timeout | 9 | 1017s | 10 | Now let me run the script: |
| bracket-sequence-restoration | general | ⏱ timeout | 27 | 1190s | 27 | Perfect! The solution matches the expected output: - `()()` → YES (already valid RBS, no question marks) - `(??)` → NO (can be `(() )` or `()()`) - `(?` → YES (only `()` is valid) The solution is complete and efficient. Let me submit it: |
| breast-cancer-mlflow | machine-learning | ⏱ timeout | 39 | 1149s | 39 | Let me check if there's an issue with the app.py file: |
| build-merkle-tree-cli-sha512 | software-engineering | ⏱ timeout | 18 | 1037s | 18 | Let me read the current state of the file and fix it properly: |
| build-system-task-ordering | build-and-dependency-management | ⏱ timeout | 4 | 1068s | 4 | I see several issues. Let me analyze: 1. **Example 1**: The expected output is `{"order":["all","compile","clean"],"cycles":[]}` but I'm getting all 5 targets. The issue is that `build`, `compile`, and `link` should all canonicalize to `co… |
| california-housing-api | machine-learning | ⏱ timeout | 50 | 1160s | 50 | Let me try a different approach - start the server using terminal with pty mode: |
| chained-forensic-extraction_20260101_011957 | security | ⏱ timeout | 14 | 1178s | 14 | I see the issue now. The text extracted from the PNG is 416 characters, which is not valid Base64 (Base64 length should be multiple of 4). Let me check if there are extra characters or if I'm extracting the wrong part. Let me look at the ra… |
| competitive-programming-solver | CP | ✗ fail | 8 | — | 16 | The solution has been written to `/workdir/sol.csv` in the required format. Here's a summary of the results (Testcase: Answer): testcase1: YES; testcase2: NO; testcase3: YES; testcase4: NO; testca… Grader output: grader verdict: FAIL (exit 0) stcase4 : CORRECT (NO); testcase5 : CORRECT (NO); testcase6 : CORRECT (YES); testcase7 : CORRECT (YES); testcase8 : INCORRECT - Expected: NO, Found: YES; testcase9 : CORRECT (YES); testcase10 : CORRECT (YES) Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.37s =============================== |