Qwen/Qwen3-Coder-Next-FP8 · FP8 · 3B/80B · parser qwen3_coder · smoke probes · live

qwen3-coder-next-fp8

Qwen's official FP8 build of Qwen3-Coder-Next. It serves as the A/B partner to the NVFP4 variant, since FP8 kernels are better tested on SM120 than NVFP4 kernels.

5/20 tasks passed · 25% pass rate

5 pass · 7 grader fail · 6 timeout · 2 no tool calls
vLLM metrics · 4877 samples over 619m
gen tokens/s: 4.0
prompt tokens/s: 217
mean TTFT: 7.70s
mean TPOT: (not shown)
peak concurrency: 8
peak KV cache: (not shown)
total gen tokens: 148,323
total requests: 540
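As a rough sanity check on the metrics above, the reported gen tokens/s can be reproduced from the totals. This is a minimal sketch that assumes the rate is averaged over the whole 619-minute sampling window, idle time included; the page does not state how the figure is actually computed, and the variable names are illustrative only.

```python
# Figures copied from the vLLM metrics panel.
total_gen_tokens = 148_323
total_requests = 540
window_minutes = 619

# Assumption: the rate is averaged over the full sampling window.
window_seconds = window_minutes * 60
gen_tokens_per_s = total_gen_tokens / window_seconds

# Average completion length per request, derived from the same totals.
tokens_per_request = total_gen_tokens / total_requests

print(f"gen tokens/s over the window: {gen_tokens_per_s:.1f}")
print(f"mean gen tokens per request: {tokens_per_request:.0f}")
```

Under that assumption the first figure comes out at 4.0, matching the panel, and the second at about 275 generated tokens per request.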
Task · Category · Status · Turns · Time · Tool calls · What the model ended with
amuse-install · scientific-computing · ✓ pass · 39 · - · 38 · Perfect! The installation is complete and working. Let me create a summary of what was done: ## Summary I successfully created a Python virtual environment and installed AMUSE at `/app/amuse/.venv`. Here's what was done: 1. **Created the…
basic-message-queue · backend · ✓ pass · 60 · - · 60 · Let me try a different approach - I'll run the server with a shorter timeout and then test it:
broken-python · software-engineering · ✓ pass · 14 · - · 13 · Perfect! Your pip installation is now working correctly. The issue was that pip was only partially installed - only the metadata directory existed but not the actual pip package files. The get-pip.py script has properly installed pip 26.0.1…
build-merkle-tree-cli-sha512 · software-engineering · ✓ pass · 44 · - · 43 · The Merkle CLI tool is complete and working. Here's a summary of what was implemented: ## Summary I created `/app/merkle_cli.py` - a Python CLI tool that: ### Commands: 1. **`scan <directory>`** - Scans a directory, computes SHA-512 hash…
competitive-programming-solver · CP · ✓ pass · 19 · - · 18 · The solution is complete. The CSV file has been created at `/workdir/sol.csv` with the correct format: ``` testcase,answer testcase1,YES testcase2,NO testcase3,YES testcase4,NO testcase5,NO testcase6,YES testcase7,YES testcase8,NO testcase…
acl-permissions-inheritance · system-administration · ✗ fail · 11 · - · 10 · The shared directory structure has been created with proper Linux ACLs. Here's what was configured: **Directory `/srv/shared`:** - Owned by group `research` - Setgid bit set (shown by `s` in permissions) - Base permissions: `2770` (rwxrws-…
grader output
grader verdict: FAIL (exit 0)
tests/test_outputs.py::test_bob_can_create_and_alice_can_access
PASSED ../tests/test_outputs.py::test_subdirectory_inheritance
PASSED ../tests/test_outputs.py::test_outside_users_denied_access
PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts
FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions
FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio...
========================= 2 failed, 7 passed in 0.49s ==========================
anomaly-detection-ranking · general · ✗ fail · 20 · - · 19 · The `results.json` file has been successfully created at `/workdir/results.json` with all required predictions: **Summary of predictions:** - **Sensor**: 100 binary predictions with 2 anomalies detected (high z-score values) - **Network**:…
api-endpoint-permission-canonicalizer · software-engineering · ⏱ timeout · 39 · 2353s · 39 · Now only one endpoint is shown for the path with variable names, and it uses the first encountered variable name `{userId}`. Let me also verify the sorting is correct. The requirement says: "Path specificity ordering: More specific paths (…
application-debug · debugging · ✗ fail · 10 · - · 9 · The analysis is complete. Both output files have been generated successfully: **Summary of Analysis:** - **3 log files** analyzed from `/app/logs/` directory - **148 total exceptions** found across all log files - **148 critical failures**…
grader output
grader verdict: FAIL (exit 0)
utputs.py::test_critical_failures_calculation
PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions
PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers
PASSED ../tests/test_outputs.py::test_anti_cheat_validation
FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ...
FAILED ../tests/test_outputs.py::test_percentage_calculations - AssertionErro...
========================= 2 failed, 11 passed in 0.11s =========================
auth_token_race_condition · backend-engineering · ✗ fail · 7 · - · 6 · The file has been created successfully with the race condition fix implemented. The key changes are: 1. **Lines 34-37**: After validating `user_id` and `username`, the code acquires a lock using `await self.acquire_lock(f"refresh_lock:{ref…
grader output
grader verdict: FAIL (exit 1)
Starting test suite for concurrent token refresh...
Killing any existing uvicorn processes...
Waiting for port 8000 to be available...
Waiting for Redis to be ready...
/tests/test.sh: line 27: /workspace/start.sh: No such file or directory
Redis failed to start within 60 seconds
bandit-delayed-feedback · machine-learning · ✗ fail · 25 · - · 24 · The solution is complete and all parameters are correctly computed. Let me provide a summary of the solution: ## Summary The solution addresses the multi-armed bandit problem with delayed feedback: ### Task 1: Delay Distribution Modeling…
grader output
grader verdict: FAIL (exit 0)
D ../tests/test_outputs.py::TestDelayModel::test_sigma_parameter_accuracy
PASSED ../tests/test_outputs.py::TestDelayModel::test_median_delay_error_threshold
PASSED ../tests/test_outputs.py::TestDelayModel::test_conversion_rate_error_threshold
PASSED ../tests/test_outputs.py::TestArmStatistics::test_conversion_rate_rmse_threshold
FAILED ../tests/test_outputs.py::TestRewardCalibration::test_calibration_error_threshold
========================= 1 failed, 19 passed in 1.70s =========================
bash-log-processor-fix · bash_scripting · ✗ fail · 29 · - · 28 · The script is complete and working. Let me provide a summary of the fixes made: ## Summary of Fixes The original script had several issues: 1. **Only processed top-level files** - didn't use `find -r` to recurse into subdirectories 2. **N…
battery-charging-optimization · general · ⏱ timeout · 6 · 1809s · 6 · Good! Since gcd(753, 357) = 3 and 159 is divisible by 3, the goal is reachable. Now let me verify the solution and also try to find a shorter, more optimal solution. The current solution seems to be using a pattern that's not optimal. Let …
bloom-filter-cache-penetration-prevention · software-engineering · ⏱ timeout · 24 · 2338s · 24 · I see the issue - when the module is reloaded with a different REDIS_URL, it's not properly reinitializing. The database shows A000001 is not found. Let me trace through the logic more carefully:
book-portfolio-analysis · data_engineering · ⏱ timeout · 11 · 1629s · 12 · Wait, I notice there's an issue - "Eleanor Oliphant Is Completely Fine" has author "Gail Honeyman" but the author should be "Gail Honeyman" (I think there's a typo in the data). Let me verify the CSV and check all constraints more carefully…
bracket-sequence-restoration · general · ⏱ timeout · 20 · 1901s · 20 · I see some issues. Let me check these cases manually: 1. `??` - length 2, needs 1 open, 1 close. We have 0 open, 0 close, 2 question. needed_open = 1, needed_close = 1. - Position 0: '?' with balance=0, open_remaining=1, close_remaining…
breast-cancer-mlflow · machine-learning · ✗ fail · 60 · - · 60 · {"output": "Background process started", "session_id": "proc_e46b102f2626", "pid": null, "exit_code": 0, "error": null}
grader output
grader verdict: FAIL (exit 0)
 - FileNotFoun...
FAILED ../tests/test_outputs.py::test_test_predictions_f1 - FileNotFoundError...
FAILED ../tests/test_outputs.py::test_test_predictions_precision_recall - Fil...
FAILED ../tests/test_outputs.py::test_val_test_performance_consistency - File...
FAILED ../tests/test_outputs.py::test_class_prediction_distribution - FileNot...
FAILED ../tests/test_outputs.py::test_model_reproducible_from_hyperparameters
======================== 19 failed, 15 passed in 8.83s =========================
build-system-task-ordering · build-and-dependency-management · ⏱ timeout · 8 · 2215s · 8 · I see the issue! The canonical name for `build`, `compile`, and `link` is `build` because `build` < `compile` < `link` lexicographically. But the expected output says the canonical name should be `compile`. Wait, let me check: `build` vs `…
california-housing-api · machine-learning · ✗ fail · 60 · - · 60 · Let me try a different approach - start the server with a very long timeout and use the notify_on_complete flag:
grader output
grader verdict: FAIL (exit 0)
t - Failed: Fa...
FAILED ../tests/test_outputs.py::test_api_predict_batch_matches_single_predict
FAILED ../tests/test_outputs.py::test_api_predict_batch_order_preserved - Fai...
FAILED ../tests/test_outputs.py::test_api_prediction_matches_model - Failed: ...
FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou...
FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model
================== 14 failed, 16 passed, 2 warnings in 6.60s ===================
chained-forensic-extraction_20260101_011957 · security · ✗ fail · 28 · - · 27 · All validations passed. The solution is complete. Let me provide a summary: ## Summary I successfully completed all 5 stages of the forensic extraction challenge: ### Stage 1: Polyglot ZIP Detection - Found 2 ZIP signatures in the artifa…
grader output
grader verdict: FAIL (exit 0)
es_ground_truth
PASSED ../tests/test_outputs.py::test_chain_stage2_uses_stage1
PASSED ../tests/test_outputs.py::test_chain_stage3_uses_stage2
PASSED ../tests/test_outputs.py::test_chain_stage4_uses_stage3
PASSED ../tests/test_outputs.py::test_chain_stage5_uses_stage4
PASSED ../tests/test_outputs.py::test_anticheat_full_chain_independent
FAILED ../tests/test_outputs.py::test_stage2_crc_independently_verified - str...
========================= 1 failed, 37 passed in 0.12s =========================
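The grader blocks above each end with a pytest summary line, which makes it possible to see how close a failed task came to passing. The following is a minimal sketch of extracting those counts; `parse_pytest_summary` is a hypothetical helper written for illustration, not part of the harness that produced this page, and the summary lines are copied from the grader output above.

```python
import re

# Matches count/outcome pairs such as "2 failed" or "16 passed" in a pytest summary line.
_COUNT = re.compile(r"(\d+) (failed|passed|warnings?|errors?|skipped)")


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Return outcome counts found in a pytest terminal summary line."""
    return {outcome: int(n) for n, outcome in _COUNT.findall(line)}


# Summary lines copied from the grader output blocks on this page.
summaries = {
    "acl-permissions-inheritance": "2 failed, 7 passed in 0.49s",
    "application-debug": "2 failed, 11 passed in 0.11s",
    "bandit-delayed-feedback": "1 failed, 19 passed in 1.70s",
    "breast-cancer-mlflow": "19 failed, 15 passed in 8.83s",
    "california-housing-api": "14 failed, 16 passed, 2 warnings in 6.60s",
    "chained-forensic-extraction_20260101_011957": "1 failed, 37 passed in 0.12s",
}

for task, line in summaries.items():
    counts = parse_pytest_summary(line)
    passed = counts.get("passed", 0)
    total = passed + counts.get("failed", 0)
    print(f"{task}: {passed}/{total} grader tests passed")
```

Read this way, bandit-delayed-feedback (19/20) and chained-forensic-extraction (37/38) each missed by a single grader test, while breast-cancer-mlflow and california-housing-api failed broadly, mostly on missing output files.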