Qwen/Qwen3-Coder-Next-FP8 · FP8 · 3B/80B · parser qwen3_coder · smoke probes · live
qwen3-coder-next-fp8
Qwen's official FP8 build of Qwen3-Coder-Next. It serves as the A/B partner to the NVFP4 variant, since FP8 kernels are better tested on SM120 than NVFP4 kernels.
5/20 tasks passed · 25% pass rate
5 pass 7 grader fail 6 timeout 2 no tool calls
vLLM metrics · 4877 samples over 619m

| Metric | Value |
|---|---|
| gen tokens/s | 4.0 |
| prompt tokens/s | 217 |
| mean TTFT | 7.70s |
| mean TPOT | — |
| peak concurrency | 8 |
| peak KV cache | — |
| total gen tokens | 148,323 |
| total requests | 540 |

| Task | Category | Status | Turns | Time | Tool calls | What the model ended with |
|---|---|---|---|---|---|---|
| amuse-install | scientific-computing | ✓ pass | 39 | — | 38 | Perfect! The installation is complete and working. Let me create a summary of what was done: ## Summary I successfully created a Python virtual environment and installed AMUSE at `/app/amuse/.venv`. Here's what was done: 1. **Created the… |
| basic-message-queue | backend | ✓ pass | 60 | — | 60 | Let me try a different approach - I'll run the server with a shorter timeout and then test it: |
| broken-python | software-engineering | ✓ pass | 14 | — | 13 | Perfect! Your pip installation is now working correctly. The issue was that pip was only partially installed - only the metadata directory existed but not the actual pip package files. The get-pip.py script has properly installed pip 26.0.1… |
| build-merkle-tree-cli-sha512 | software-engineering | ✓ pass | 44 | — | 43 | The Merkle CLI tool is complete and working. Here's a summary of what was implemented: ## Summary I created `/app/merkle_cli.py` - a Python CLI tool that: ### Commands: 1. **`scan <directory>`** - Scans a directory, computes SHA-512 hash… |
| competitive-programming-solver | CP | ✓ pass | 19 | — | 18 | The solution is complete. The CSV file has been created at `/workdir/sol.csv` with the correct format: ``` testcase,answer testcase1,YES testcase2,NO testcase3,YES testcase4,NO testcase5,NO testcase6,YES testcase7,YES testcase8,NO testcase… |
| acl-permissions-inheritance | system-administration | ✗ fail | 11 | — | 10 | The shared directory structure has been created with proper Linux ACLs. Here's what was configured: **Directory `/srv/shared`:** - Owned by group `research` - Setgid bit set (shown by `s` in permissions) - Base permissions: `2770` (rwxrws-… Grader output: grader verdict: FAIL (exit 0) tests/test_outputs.py::test_bob_can_create_and_alice_can_access PASSED ../tests/test_outputs.py::test_subdirectory_inheritance PASSED ../tests/test_outputs.py::test_outside_users_denied_access PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio... ========================= 2 failed, 7 passed in 0.49s ========================== |
| anomaly-detection-ranking | general | ✗ fail | 20 | — | 19 | The `results.json` file has been successfully created at `/workdir/results.json` with all required predictions: **Summary of predictions:** - **Sensor**: 100 binary predictions with 2 anomalies detected (high z-score values) - **Network**:… |
| api-endpoint-permission-canonicalizer | software-engineering | ⏱ timeout | 39 | 2353s | 39 | Now only one endpoint is shown for the path with variable names, and it uses the first encountered variable name `{userId}`. Let me also verify the sorting is correct. The requirement says: "Path specificity ordering: More specific paths (… |
| application-debug | debugging | ✗ fail | 10 | — | 9 | The analysis is complete. Both output files have been generated successfully: **Summary of Analysis:** - **3 log files** analyzed from `/app/logs/` directory - **148 total exceptions** found across all log files - **148 critical failures**… Grader output: grader verdict: FAIL (exit 0) utputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ... FAILED ../tests/test_outputs.py::test_percentage_calculations - AssertionErro... ========================= 2 failed, 11 passed in 0.11s ========================= |
| auth_token_race_condition | backend-engineering | ✗ fail | 7 | — | 6 | The file has been created successfully with the race condition fix implemented. The key changes are: 1. **Lines 34-37**: After validating `user_id` and `username`, the code acquires a lock using `await self.acquire_lock(f"refresh_lock:{ref… Grader output: grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds |
| bandit-delayed-feedback | machine-learning | ✗ fail | 25 | — | 24 | The solution is complete and all parameters are correctly computed. Let me provide a summary of the solution: ## Summary The solution addresses the multi-armed bandit problem with delayed feedback: ### Task 1: Delay Distribution Modeling… Grader output: grader verdict: FAIL (exit 0) D ../tests/test_outputs.py::TestDelayModel::test_sigma_parameter_accuracy PASSED ../tests/test_outputs.py::TestDelayModel::test_median_delay_error_threshold PASSED ../tests/test_outputs.py::TestDelayModel::test_conversion_rate_error_threshold PASSED ../tests/test_outputs.py::TestArmStatistics::test_conversion_rate_rmse_threshold FAILED ../tests/test_outputs.py::TestRewardCalibration::test_calibration_error_threshold ========================= 1 failed, 19 passed in 1.70s ========================= |
| bash-log-processor-fix | bash_scripting | ✗ fail | 29 | — | 28 | The script is complete and working. Let me provide a summary of the fixes made: ## Summary of Fixes The original script had several issues: 1. **Only processed top-level files** - didn't use `find -r` to recurse into subdirectories 2. **N… |
| battery-charging-optimization | general | ⏱ timeout | 6 | 1809s | 6 | Good! Since gcd(753, 357) = 3 and 159 is divisible by 3, the goal is reachable. Now let me verify the solution and also try to find a shorter, more optimal solution. The current solution seems to be using a pattern that's not optimal. Let … |
| bloom-filter-cache-penetration-prevention | software-engineering | ⏱ timeout | 24 | 2338s | 24 | I see the issue - when the module is reloaded with a different REDIS_URL, it's not properly reinitializing. The database shows A000001 is not found. Let me trace through the logic more carefully: |
| book-portfolio-analysis | data_engineering | ⏱ timeout | 11 | 1629s | 12 | Wait, I notice there's an issue - "Eleanor Oliphant Is Completely Fine" has author "Gail Honeyman" but the author should be "Gail Honeyman" (I think there's a typo in the data). Let me verify the CSV and check all constraints more carefully… |
| bracket-sequence-restoration | general | ⏱ timeout | 20 | 1901s | 20 | I see some issues. Let me check these cases manually: 1. `??` - length 2, needs 1 open, 1 close. We have 0 open, 0 close, 2 question. needed_open = 1, needed_close = 1. - Position 0: '?' with balance=0, open_remaining=1, close_remaining… |
| breast-cancer-mlflow | machine-learning | ✗ fail | 60 | — | 60 | {"output": "Background process started", "session_id": "proc_e46b102f2626", "pid": null, "exit_code": 0, "error": null} Grader output: grader verdict: FAIL (exit 0) - FileNotFoun... FAILED ../tests/test_outputs.py::test_test_predictions_f1 - FileNotFoundError... FAILED ../tests/test_outputs.py::test_test_predictions_precision_recall - Fil... FAILED ../tests/test_outputs.py::test_val_test_performance_consistency - File... FAILED ../tests/test_outputs.py::test_class_prediction_distribution - FileNot... FAILED ../tests/test_outputs.py::test_model_reproducible_from_hyperparameters ======================== 19 failed, 15 passed in 8.83s ========================= |
| build-system-task-ordering | build-and-dependency-management | ⏱ timeout | 8 | 2215s | 8 | I see the issue! The canonical name for `build`, `compile`, and `link` is `build` because `build` < `compile` < `link` lexicographically. But the expected output says the canonical name should be `compile`. Wait, let me check: `build` vs `… |
| california-housing-api | machine-learning | ✗ fail | 60 | — | 60 | Let me try a different approach - start the server with a very long timeout and use the notify_on_complete flag: Grader output: grader verdict: FAIL (exit 0) t - Failed: Fa... FAILED ../tests/test_outputs.py::test_api_predict_batch_matches_single_predict FAILED ../tests/test_outputs.py::test_api_predict_batch_order_preserved - Fai... FAILED ../tests/test_outputs.py::test_api_prediction_matches_model - Failed: ... FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou... FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model ================== 14 failed, 16 passed, 2 warnings in 6.60s =================== |
| chained-forensic-extraction_20260101_011957 | security | ✗ fail | 28 | — | 27 | All validations passed. The solution is complete. Let me provide a summary: ## Summary I successfully completed all 5 stages of the forensic extraction challenge: ### Stage 1: Polyglot ZIP Detection - Found 2 ZIP signatures in the artifa… Grader output: grader verdict: FAIL (exit 0) es_ground_truth PASSED ../tests/test_outputs.py::test_chain_stage2_uses_stage1 PASSED ../tests/test_outputs.py::test_chain_stage3_uses_stage2 PASSED ../tests/test_outputs.py::test_chain_stage4_uses_stage3 PASSED ../tests/test_outputs.py::test_chain_stage5_uses_stage4 PASSED ../tests/test_outputs.py::test_anticheat_full_chain_independent FAILED ../tests/test_outputs.py::test_stage2_crc_independently_verified - str... ========================= 1 failed, 37 passed in 0.12s ========================= |
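
The headline figures can be cross-checked against the table and the vLLM counters. The following is a minimal sketch, not part of the benchmark harness: the statuses are transcribed by hand from the Status column, "619m" is assumed to mean 619 minutes of sampling, and all variable names are illustrative.

```python
from collections import Counter

# Statuses transcribed by hand from the Status column of the table (20 rows).
statuses = ["pass"] * 5 + ["fail"] * 9 + ["timeout"] * 6

counts = Counter(statuses)
total = sum(counts.values())
pass_rate = counts["pass"] / total

# Assumption: "619m" in the metrics header is 619 minutes of sampling.
total_gen_tokens = 148_323
window_seconds = 619 * 60
gen_tokens_per_s = total_gen_tokens / window_seconds

print(f"{counts['pass']}/{total} passed ({pass_rate:.0%})")
print(f"gen tokens/s ~ {gen_tokens_per_s:.1f}")
```

Under these assumptions the sketch reproduces the reported 5/20 (25%) pass rate and the 4.0 gen tokens/s figure. The nine ✗ fail rows equal the summary's 7 grader fail plus 2 no tool calls in total, but the table does not indicate which rows belong to which bucket.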