nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · NVFP4 · 12B/120B · parser qwen3_coder · smoke probes · live
nemotron-3-super
NVIDIA's Nemotron-3-Super is a 120B-total / 12B-active MoE model (Mamba2 + attention hybrid). It was trained natively in NVFP4 and packaged by NVIDIA specifically for a single DGX Spark. Third-party Artificial Analysis agentic evals: Terminal-Bench Hard 29%, SWE-Bench Verified 60.5%, PinchBench 85.6%.
2/20 tasks passed · 10% pass rate
2 pass · 8 grader fail · 10 timeout · 0 no tool calls
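The summary counts can be re-derived from the Status column of the results table. The snippet below is a minimal sketch, assuming the statuses are transcribed by hand into a dictionary; the `STATUSES` name and the short status labels are illustrative, not part of the harness.

```python
from collections import Counter

# Status per task, transcribed from the Status column of the results table.
STATUSES = {
    "amuse-install": "pass",
    "broken-python": "pass",
    "acl-permissions-inheritance": "fail",
    "anomaly-detection-ranking": "timeout",
    "api-endpoint-permission-canonicalizer": "timeout",
    "application-debug": "fail",
    "auth_token_race_condition": "fail",
    "bandit-delayed-feedback": "timeout",
    "bash-log-processor-fix": "timeout",
    "basic-message-queue": "fail",
    "battery-charging-optimization": "timeout",
    "bloom-filter-cache-penetration-prevention": "timeout",
    "book-portfolio-analysis": "timeout",
    "bracket-sequence-restoration": "timeout",
    "breast-cancer-mlflow": "fail",
    "build-merkle-tree-cli-sha512": "timeout",
    "build-system-task-ordering": "fail",
    "california-housing-api": "fail",
    "chained-forensic-extraction_20260101_011957": "fail",
    "competitive-programming-solver": "timeout",
}

counts = Counter(STATUSES.values())
total = len(STATUSES)
pass_rate = counts["pass"] / total

print(f"{counts['pass']}/{total} tasks passed · {pass_rate:.0%} pass rate")
print(f"{counts['pass']} pass · {counts['fail']} grader fail · {counts['timeout']} timeout")
```

Run as written, this reproduces the 2/20 (10%) headline and the 2 / 8 / 10 split shown above.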
vLLM metrics · 9418 samples over 240m
| Metric | Value |
|---|---|
| gen tokens/s | 21.9 |
| prompt tokens/s | 141 |
| mean TTFT | 32.04s |
| mean TPOT | — |
| peak concurrency | 8 |
| peak KV cache | — |
| total gen tokens | 315,113 |
| total requests | 118 |

| Task | Category | Status | Turns | Time | Tool calls | What the model ended with |
|---|---|---|---|---|---|---|
| amuse-install | scientific-computing | ✓ pass | 28 | — | 27 | The task is complete. We have: 1. Created a Python virtual environment at `/app/amuse/.venv`. 2. Installed the AMUSE framework and the required packages (`amuse-bhtree` and `amuse-hermite`) so that the test script runs successfully. 3. Ver… |
| broken-python | software-engineering | ✓ pass | 26 | — | 25 | The issue was that your system had multiple Python versions (3.11 from the Debian repository and 3.13 installed locally in `/usr/local/bin`). The `pip3` command was pointing to the Python 3.13 interpreter, but the `pip` module was not insta… |
| acl-permissions-inheritance | system-administration | ✗ fail | 21 | — | 20 | The shared directory `/srv/shared` has been successfully created with the required permissions and ACLs: 1. **Directory ownership and setgid bit**: - Owned by group `research` - Setgid bit enabled (shown as `s` in group permissions) … Grader output: grader verdict: FAIL (exit 0) tests/test_outputs.py::test_bob_can_create_and_alice_can_access PASSED ../tests/test_outputs.py::test_subdirectory_inheritance PASSED ../tests/test_outputs.py::test_outside_users_denied_access PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio... ========================= 2 failed, 7 passed in 0.46s ========================== |
| anomaly-detection-ranking | general | ⏱ timeout | 24 | 1920s | 24 | [tool_call] terminal({"command": "echo 'Results.json file is ready at /workdir/results.json'"}) |
| api-endpoint-permission-canonicalizer | software-engineering | ⏱ timeout | 18 | 4827s | 18 | [tool_call] write_file({"path": "/app/canonicalize.py", "content": "#!/usr/bin/env python3\nimport sys\nimport re\n\ndef normalize_method(method):\n return method.upper()\n\ndef normalize_permission(perm):\n return perm.upper()\n\ndef… |
| application-debug | debugging | ✗ fail | 60 | — | 60 | {"output": "Task completed. The incident reports have been generated and are ready for review.", "exit_code": 0, "error": null} Grader output: grader verdict: FAIL (exit 0) SED ../tests/test_outputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_percentage_calculations PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ... ========================= 1 failed, 12 passed in 0.10s ========================= |
| auth_token_race_condition | backend-engineering | ✗ fail | 55 | — | 54 | The fix has been successfully applied to `/workspace/auth_gateway/token_service.py`. The `refresh_tokens` method now includes: 1. Lock acquisition after checking `user_id` and `username`: `lock_acquired = await self.acquire_lock(f"ref… Grader output: grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds |
| bandit-delayed-feedback | machine-learning | ⏱ timeout | 17 | 4327s | 17 | [tool_call] terminal({"command": "cat /app/data/arms_metadata.json"}) |
| bash-log-processor-fix | bash_scripting | ⏱ timeout | 12 | 5025s | 12 | [tool_call] terminal({"command": "cat /workdir/data/processed.log"}) |
| basic-message-queue | backend | ✗ fail | 60 | — | 60 | {"output": "Background process started", "session_id": "proc_b44ece8669c9", "pid": null, "exit_code": 0, "error": null} Grader output: grader verdict: FAIL (exit 0) ___ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Subscores: {'acceptance': 0.0} Feedback: Empty queue should return 204 Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 3.40s =============================== |
| battery-charging-optimization | general | ⏱ timeout | 12 | 5349s | 12 | [tool_call] write_file({"path": "/app/dijkstra_full.py", "content": "import heapq\n\nCAP_A = 753\nCAP_B = 357\nGOAL_A = 159\n\ndef dijkstra():\n INF = 10**9\n dist = [[INF]*(CAP_B+1) for _ in range(CAP_A+1)]\n prev = [[None]*(CAP_B… |
| bloom-filter-cache-penetration-prevention | software-engineering | ⏱ timeout | 48 | 5354s | 48 | [tool_call] terminal({"command": "cd /app && uvicorn main:app --host 0.0.0.0 --port 8000", "background": true}) |
| book-portfolio-analysis | data_engineering | ⏱ timeout | 38 | 5246s | 38 | [tool_call] write_file({"path": "/workdir/view_original.py", "content": "with open('/workdir/data/books.json', 'r') as f:\n lines = f.readlines()\n\nprint(\"First 30 lines:\")\nfor i in range(30):\n print(f\"{i+1:3}: {repr(lines[i])}\… |
| bracket-sequence-restoration | general | ⏱ timeout | — | — | 0 | no assistant output |
| breast-cancer-mlflow | machine-learning | ✗ fail | 60 | — | 60 | {"output": "Background process started", "session_id": "proc_cb25b8ecee30", "pid": null, "exit_code": 0, "error": null} Grader output: grader verdict: FAIL (exit 0) - FileNotFoun... FAILED ../tests/test_outputs.py::test_test_predictions_f1 - FileNotFoundError... FAILED ../tests/test_outputs.py::test_test_predictions_precision_recall - Fil... FAILED ../tests/test_outputs.py::test_val_test_performance_consistency - File... FAILED ../tests/test_outputs.py::test_class_prediction_distribution - FileNot... FAILED ../tests/test_outputs.py::test_model_reproducible_from_hyperparameters ======================== 17 failed, 17 passed in 8.34s ========================= |
| build-merkle-tree-cli-sha512 | software-engineering | ⏱ timeout | — | — | 0 | no assistant output |
| build-system-task-ordering | build-and-dependency-management | ✗ fail | 1 | — | 0 | `<tool_call> function=terminal <parameter=command> ls -la </parameter> </function> </tool_call>` Grader output: grader verdict: FAIL (exit 0) FoundError:... FAILED ../tests/test_outputs.py::test_08_empty_input - FileNotFoundError: [Er... FAILED ../tests/test_outputs.py::test_09_undefined_referenced_targets - FileN... FAILED ../tests/test_outputs.py::test_10_multiple_target_blocks - FileNotFoun... FAILED ../tests/test_outputs.py::test_11_directive_accumulation - FileNotFoun... FAILED ../tests/test_outputs.py::test_12_all_dependency_types - FileNotFoundE... ============================== 12 failed in 0.31s ============================== |
| california-housing-api | machine-learning | ✗ fail | 60 | — | 60 | {"output": "", "exit_code": -1, "error": "This foreground command appears to start a long-lived server/watch process. Run it with background=true, verify readiness (health endpoint/log signal), then execute tests in a separate command.", "s… Grader output: grader verdict: FAIL (exit 0) t - Failed: Fa... FAILED ../tests/test_outputs.py::test_api_predict_batch_matches_single_predict FAILED ../tests/test_outputs.py::test_api_predict_batch_order_preserved - Fai... FAILED ../tests/test_outputs.py::test_api_prediction_matches_model - Failed: ... FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou... FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model ================== 14 failed, 16 passed, 2 warnings in 6.80s =================== |
| chained-forensic-extraction_20260101_011957 | security | ✗ fail | 60 | — | 60 | {"output": "Forensic extraction completed successfully. The output file is ready at /app/output/extraction_chain.json", "exit_code": 0, "error": null} Grader output: grader verdict: FAIL (exit 0) PASSED ../../tests/test_outputs.py::test_chain_stage2_uses_stage1 PASSED ../../tests/test_outputs.py::test_chain_stage3_uses_stage2 PASSED ../../tests/test_outputs.py::test_chain_stage4_uses_stage3 PASSED ../../tests/test_outputs.py::test_chain_stage5_uses_stage4 PASSED ../../tests/test_outputs.py::test_anticheat_full_chain_independent FAILED ../../tests/test_outputs.py::test_stage2_crc_independently_verified - ... ========================= 1 failed, 37 passed in 0.12s ========================= |
| competitive-programming-solver | CP | ⏱ timeout | 15 | 5280s | 15 | [tool_call] terminal({"command": "cd /workdir/data"}) |
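
The vLLM metrics reported at the top of this page can be cross-checked against each other. The sketch below assumes that "gen tokens/s" is the mean generation rate over the whole 240-minute sampling window, which the page does not state. The function names are illustrative and are not part of vLLM or of the benchmark harness.

```python
# Figures copied from the vLLM metrics panel on this page.
TOTAL_GEN_TOKENS = 315_113
TOTAL_REQUESTS = 118
WINDOW_MINUTES = 240


def mean_gen_rate(total_tokens: int, window_minutes: float) -> float:
    """Average generated tokens per second across the sampling window."""
    return total_tokens / (window_minutes * 60)


def mean_tokens_per_request(total_tokens: int, total_requests: int) -> float:
    """Average number of generated tokens per request."""
    return total_tokens / total_requests


rate = mean_gen_rate(TOTAL_GEN_TOKENS, WINDOW_MINUTES)
per_request = mean_tokens_per_request(TOTAL_GEN_TOKENS, TOTAL_REQUESTS)

print(f"mean generation rate: {rate:.1f} tokens/s")
print(f"mean generated tokens per request: {per_request:.0f}")
```

Under that assumption, 315,113 tokens over 14,400 seconds rounds to the reported 21.9 tokens/s, so the headline rate appears to be a whole-window average, not a per-request decode speed.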