nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · NVFP4 · 12B/120B · parser qwen3_coder · smoke probes · live

nemotron-3-super

NVIDIA's Nemotron-3-Super is a mixture-of-experts model with 120B total / 12B active parameters and a hybrid Mamba2 + attention architecture. It was trained natively in NVFP4 and packaged by NVIDIA specifically for a single DGX Spark. Third-party agentic evals from Artificial Analysis: Terminal-Bench Hard 29%, SWE-Bench Verified 60.5%, PinchBench 85.6%.

2/20 tasks passed · 10% pass rate

2 pass · 8 grader fail · 10 timeout · 0 no tool calls
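The headline pass rate follows directly from the outcome counts in the summary line. A minimal sketch of that tally; the dictionary and variable names are illustrative, not part of the benchmark harness:

```python
# Outcome counts copied from the summary line of this run.
outcomes = {"pass": 2, "grader fail": 8, "timeout": 10, "no tool calls": 0}

# Every task lands in exactly one bucket, so the buckets sum to the task count.
total_tasks = sum(outcomes.values())
pass_rate = outcomes["pass"] / total_tasks

print(f"{outcomes['pass']}/{total_tasks} tasks passed · {pass_rate:.0%} pass rate")
# prints: 2/20 tasks passed · 10% pass rate
```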
vLLM metrics · 9418 samples over 240m
gen tokens/s: 21.9
prompt tokens/s: 141
mean TTFT: 32.04s
mean TPOT: (not reported)
peak concurrency: 8
peak KV cache: (not reported)
total gen tokens: 315,113
total requests: 118
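The reported generation throughput is consistent with dividing the total generated tokens by the length of the sampling window. A rough check, assuming the dashboard averages over the full 240-minute window (the page does not state how the figure is computed); the variable names are illustrative:

```python
# Figures copied from the vLLM metrics block of this run.
total_gen_tokens = 315_113
window_minutes = 240
total_requests = 118

# Assumption: throughput is averaged over the whole sampling window.
window_seconds = window_minutes * 60
gen_tokens_per_second = total_gen_tokens / window_seconds

# Average completion length per request, derived from the same totals.
mean_gen_tokens_per_request = total_gen_tokens / total_requests

print(f"gen tokens/s: {gen_tokens_per_second:.1f}")  # 21.9
print(f"mean gen tokens per request: {mean_gen_tokens_per_request:.0f}")  # 2670
```

The average of roughly 2,670 generated tokens per request is a derived figure, not one shown on the page.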
Task · Category · Status · Turns · Time · Tool calls · What the model ended with
amuse-install · scientific-computing · ✓ pass · 28 turns · 27 tool calls · The task is complete. We have: 1. Created a Python virtual environment at `/app/amuse/.venv`. 2. Installed the AMUSE framework and the required packages (`amuse-bhtree` and `amuse-hermite`) so that the test script runs successfully. 3. Ver…
broken-python · software-engineering · ✓ pass · 26 turns · 25 tool calls · The issue was that your system had multiple Python versions (3.11 from the Debian repository and 3.13 installed locally in `/usr/local/bin`). The `pip3` command was pointing to the Python 3.13 interpreter, but the `pip` module was not insta…
acl-permissions-inheritance · system-administration · ✗ fail · 21 turns · 20 tool calls · The shared directory `/srv/shared` has been successfully created with the required permissions and ACLs: 1. **Directory ownership and setgid bit**: - Owned by group `research` - Setgid bit enabled (shown as `s` in group permissions) …
grader output
grader verdict: FAIL (exit 0)
tests/test_outputs.py::test_bob_can_create_and_alice_can_access
PASSED ../tests/test_outputs.py::test_subdirectory_inheritance
PASSED ../tests/test_outputs.py::test_outside_users_denied_access
PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts
FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions
FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio...
========================= 2 failed, 7 passed in 0.46s ==========================
anomaly-detection-ranking · general · ⏱ timeout · 24 turns · 1920s · 24 tool calls · [tool_call] terminal({"command": "echo 'Results.json file is ready at /workdir/results.json'"})
api-endpoint-permission-canonicalizer · software-engineering · ⏱ timeout · 18 turns · 4827s · 18 tool calls · [tool_call] write_file({"path": "/app/canonicalize.py", "content": "#!/usr/bin/env python3\nimport sys\nimport re\n\ndef normalize_method(method):\n return method.upper()\n\ndef normalize_permission(perm):\n return perm.upper()\n\ndef…
application-debug · debugging · ✗ fail · 60 turns · 60 tool calls · {"output": "Task completed. The incident reports have been generated and are ready for review.", "exit_code": 0, "error": null}
grader output
grader verdict: FAIL (exit 0)
SED ../tests/test_outputs.py::test_critical_failures_calculation
PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions
PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers
PASSED ../tests/test_outputs.py::test_percentage_calculations
PASSED ../tests/test_outputs.py::test_anti_cheat_validation
FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ...
========================= 1 failed, 12 passed in 0.10s =========================
auth_token_race_condition · backend-engineering · ✗ fail · 55 turns · 54 tool calls · The fix has been successfully applied to `/workspace/auth_gateway/token_service.py`. The `refresh_tokens` method now includes: 1. Lock acquisition after checking `user_id` and `username`: `lock_acquired = await self.acquire_lock(f"ref…
grader output
grader verdict: FAIL (exit 1)
Starting test suite for concurrent token refresh...
Killing any existing uvicorn processes...
Waiting for port 8000 to be available...
Waiting for Redis to be ready...
/tests/test.sh: line 27: /workspace/start.sh: No such file or directory
Redis failed to start within 60 seconds
bandit-delayed-feedback · machine-learning · ⏱ timeout · 17 turns · 4327s · 17 tool calls · [tool_call] terminal({"command": "cat /app/data/arms_metadata.json"})
bash-log-processor-fix · bash_scripting · ⏱ timeout · 12 turns · 5025s · 12 tool calls · [tool_call] terminal({"command": "cat /workdir/data/processed.log"})
basic-message-queue · backend · ✗ fail · 60 turns · 60 tool calls · {"output": "Background process started", "session_id": "proc_b44ece8669c9", "pid": null, "exit_code": 0, "error": null}
grader output
grader verdict: FAIL (exit 0)
___ test_grader_validation ____________________________
----------------------------- Captured stdout call -----------------------------
Subscores: {'acceptance': 0.0}
Feedback: Empty queue should return 204
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 3.40s ===============================
battery-charging-optimization · general · ⏱ timeout · 12 turns · 5349s · 12 tool calls · [tool_call] write_file({"path": "/app/dijkstra_full.py", "content": "import heapq\n\nCAP_A = 753\nCAP_B = 357\nGOAL_A = 159\n\ndef dijkstra():\n INF = 10**9\n dist = [[INF]*(CAP_B+1) for _ in range(CAP_A+1)]\n prev = [[None]*(CAP_B…
bloom-filter-cache-penetration-prevention · software-engineering · ⏱ timeout · 48 turns · 5354s · 48 tool calls · [tool_call] terminal({"command": "cd /app && uvicorn main:app --host 0.0.0.0 --port 8000", "background": true})
book-portfolio-analysis · data_engineering · ⏱ timeout · 38 turns · 5246s · 38 tool calls · [tool_call] write_file({"path": "/workdir/view_original.py", "content": "with open('/workdir/data/books.json', 'r') as f:\n lines = f.readlines()\n\nprint(\"First 30 lines:\")\nfor i in range(30):\n print(f\"{i+1:3}: {repr(lines[i])}\…
bracket-sequence-restoration · general · ⏱ timeout · 0 · no assistant output
breast-cancer-mlflow · machine-learning · ✗ fail · 60 turns · 60 tool calls · {"output": "Background process started", "session_id": "proc_cb25b8ecee30", "pid": null, "exit_code": 0, "error": null}
grader output
grader verdict: FAIL (exit 0)
 - FileNotFoun...
FAILED ../tests/test_outputs.py::test_test_predictions_f1 - FileNotFoundError...
FAILED ../tests/test_outputs.py::test_test_predictions_precision_recall - Fil...
FAILED ../tests/test_outputs.py::test_val_test_performance_consistency - File...
FAILED ../tests/test_outputs.py::test_class_prediction_distribution - FileNot...
FAILED ../tests/test_outputs.py::test_model_reproducible_from_hyperparameters
======================== 17 failed, 17 passed in 8.34s =========================
build-merkle-tree-cli-sha512 · software-engineering · ⏱ timeout · 0 · no assistant output
build-system-task-ordering · build-and-dependency-management · ✗ fail · 1 turn · 0 tool calls · <tool_call> function=terminal <parameter=command> ls -la </parameter> </function> </tool_call>
grader output
grader verdict: FAIL (exit 0)
FoundError:...
FAILED ../tests/test_outputs.py::test_08_empty_input - FileNotFoundError: [Er...
FAILED ../tests/test_outputs.py::test_09_undefined_referenced_targets - FileN...
FAILED ../tests/test_outputs.py::test_10_multiple_target_blocks - FileNotFoun...
FAILED ../tests/test_outputs.py::test_11_directive_accumulation - FileNotFoun...
FAILED ../tests/test_outputs.py::test_12_all_dependency_types - FileNotFoundE...
============================== 12 failed in 0.31s ==============================
california-housing-api · machine-learning · ✗ fail · 60 turns · 60 tool calls · {"output": "", "exit_code": -1, "error": "This foreground command appears to start a long-lived server/watch process. Run it with background=true, verify readiness (health endpoint/log signal), then execute tests in a separate command.", "s…
grader output
grader verdict: FAIL (exit 0)
t - Failed: Fa...
FAILED ../tests/test_outputs.py::test_api_predict_batch_matches_single_predict
FAILED ../tests/test_outputs.py::test_api_predict_batch_order_preserved - Fai...
FAILED ../tests/test_outputs.py::test_api_prediction_matches_model - Failed: ...
FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou...
FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model
================== 14 failed, 16 passed, 2 warnings in 6.80s ===================
chained-forensic-extraction_20260101_011957 · security · ✗ fail · 60 turns · 60 tool calls · {"output": "Forensic extraction completed successfully. The output file is ready at /app/output/extraction_chain.json", "exit_code": 0, "error": null}
grader output
grader verdict: FAIL (exit 0)
PASSED ../../tests/test_outputs.py::test_chain_stage2_uses_stage1
PASSED ../../tests/test_outputs.py::test_chain_stage3_uses_stage2
PASSED ../../tests/test_outputs.py::test_chain_stage4_uses_stage3
PASSED ../../tests/test_outputs.py::test_chain_stage5_uses_stage4
PASSED ../../tests/test_outputs.py::test_anticheat_full_chain_independent
FAILED ../../tests/test_outputs.py::test_stage2_crc_independently_verified - ...
========================= 1 failed, 37 passed in 0.12s =========================
competitive-programming-solver · CP · ⏱ timeout · 15 turns · 5280s · 15 tool calls · [tool_call] terminal({"command": "cd /workdir/data"})
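
Several of the grader failures above are near misses, which the pytest summary lines make easy to quantify. A minimal sketch, assuming only the summary lines quoted in the grader output; `parse_pytest_summary` and `SUMMARY_COUNT` are hypothetical helper names, not part of the harness. The verdict itself is not derived this way: basic-message-queue reports 1 passed yet is graded FAIL on its score of 0.0.

```python
import re

# Matches the "N passed" / "N failed" counts in a pytest summary line.
SUMMARY_COUNT = re.compile(r"(\d+) (passed|failed)")


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Return passed/failed counts from one pytest summary line."""
    counts = {"passed": 0, "failed": 0}
    for number, outcome in SUMMARY_COUNT.findall(line):
        counts[outcome] = int(number)
    return counts


# Summary lines copied verbatim from the grader output of four failed tasks.
summaries = {
    "acl-permissions-inheritance": "========================= 2 failed, 7 passed in 0.46s ==========================",
    "application-debug": "========================= 1 failed, 12 passed in 0.10s =========================",
    "chained-forensic-extraction_20260101_011957": "========================= 1 failed, 37 passed in 0.12s =========================",
    "build-system-task-ordering": "============================== 12 failed in 0.31s ==============================",
}

for task, line in summaries.items():
    counts = parse_pytest_summary(line)
    total = counts["passed"] + counts["failed"]
    print(f"{task}: {counts['passed']}/{total} tests passed")
```

Read this way, chained-forensic-extraction fails on a single test out of 38, while build-system-task-ordering fails all 12.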