nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · NVFP4 · 12B/120B · parser qwen3_coder · smoke probes · live

nemotron-3-super

NVIDIA's Nemotron-3-Super is a 120B-total / 12B-active MoE model (Mamba2 + attention hybrid). It was trained natively in NVFP4 and packaged by NVIDIA specifically for a single DGX Spark. Third-party Artificial Analysis agentic evals: Terminal-Bench Hard 29%, SWE-Bench Verified 60.5%, PinchBench 85.6%.

2/20 tasks passed · 10% pass rate

2 pass · 8 grader fail · 10 timeout · 0 no tool calls

vLLM metrics · 9418 samples over 240m

Metric	Value
gen tokens/s	21.9
prompt tokens/s	141
mean TTFT	32.04s
mean TPOT	—
peak concurrency	
peak KV cache	—
total gen tokens	315,113
total requests	118
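As a quick sanity check on the figures above, the sketch below recomputes the generation rate from the totals. It assumes (this is not stated on the page) that the reported gen tokens/s is simply total generated tokens divided by the full 240-minute sampling window; the variable names are illustrative only.

```python
# Figures copied from the vLLM metrics panel above.
total_gen_tokens = 315_113
total_requests = 118
window_minutes = 240
reported_gen_tps = 21.9

# Assumption: the reported rate is averaged over the whole sampling window.
window_seconds = window_minutes * 60
derived_gen_tps = total_gen_tokens / window_seconds

# Mean number of generated tokens per request over the run.
tokens_per_request = total_gen_tokens / total_requests

print(f"derived gen tokens/s: {derived_gen_tps:.1f} (reported {reported_gen_tps})")
print(f"mean gen tokens per request: {tokens_per_request:.0f}")
```

Under that assumption the derived rate rounds to the reported 21.9 tokens/s, and each request generated roughly 2,670 tokens on average.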

Task	Category	Status	Turns	Time	Tool calls	What the model ended with
amuse-install	scientific-computing	✓ pass	28	—	27	The task is complete. We have: 1. Created a Python virtual environment at `/app/amuse/.venv`. 2. Installed the AMUSE framework and the required packages (`amuse-bhtree` and `amuse-hermite`) so that the test script runs successfully. 3. Ver…
broken-python	software-engineering	✓ pass	26	—	25	The issue was that your system had multiple Python versions (3.11 from the Debian repository and 3.13 installed locally in `/usr/local/bin`). The `pip3` command was pointing to the Python 3.13 interpreter, but the `pip` module was not insta…
acl-permissions-inheritance	system-administration	✗ fail	21	—	20	The shared directory `/srv/shared` has been successfully created with the required permissions and ACLs: 1. Directory ownership and setgid bit: - Owned by group `research` - Setgid bit enabled (shown as `s` in group permissions) … grader output grader verdict: FAIL (exit 0) tests/test_outputs.py::test_bob_can_create_and_alice_can_access PASSED ../tests/test_outputs.py::test_subdirectory_inheritance PASSED ../tests/test_outputs.py::test_outside_users_denied_access PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio... ========================= 2 failed, 7 passed in 0.46s ==========================
anomaly-detection-ranking	general	⏱ timeout	24	1920s	24	[tool_call] terminal({"command": "echo 'Results.json file is ready at /workdir/results.json'"})
api-endpoint-permission-canonicalizer	software-engineering	⏱ timeout	18	4827s	18	[tool_call] write_file({"path": "/app/canonicalize.py", "content": "#!/usr/bin/env python3\nimport sys\nimport re\n\ndef normalize_method(method):\n return method.upper()\n\ndef normalize_permission(perm):\n return perm.upper()\n\ndef…
application-debug	debugging	✗ fail	60	—	60	{"output": "Task completed. The incident reports have been generated and are ready for review.", "exit_code": 0, "error": null} grader output grader verdict: FAIL (exit 0) SED ../tests/test_outputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_percentage_calculations PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ... ========================= 1 failed, 12 passed in 0.10s =========================
auth_token_race_condition	backend-engineering	✗ fail	55	—	54	The fix has been successfully applied to `/workspace/auth_gateway/token_service.py`. The `refresh_tokens` method now includes: 1. Lock acquisition after checking `user_id` and `username`: `lock_acquired = await self.acquire_lock(f"ref… grader output grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds
bandit-delayed-feedback	machine-learning	⏱ timeout	17	4327s	17	[tool_call] terminal({"command": "cat /app/data/arms_metadata.json"})
bash-log-processor-fix	bash_scripting	⏱ timeout	12	5025s	12	[tool_call] terminal({"command": "cat /workdir/data/processed.log"})
basic-message-queue	backend	✗ fail	60	—	60	{"output": "Background process started", "session_id": "proc_b44ece8669c9", "pid": null, "exit_code": 0, "error": null} grader output grader verdict: FAIL (exit 0) ___ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Subscores: {'acceptance': 0.0} Feedback: Empty queue should return 204 Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 3.40s ===============================
battery-charging-optimization	general	⏱ timeout	12	5349s	12	[tool_call] write_file({"path": "/app/dijkstra_full.py", "content": "import heapq\n\nCAP_A = 753\nCAP_B = 357\nGOAL_A = 159\n\ndef dijkstra():\n INF = 10**9\n dist = [[INF]*(CAP_B+1) for _ in range(CAP_A+1)]\n prev = [[None]*(CAP_B…
bloom-filter-cache-penetration-prevention	software-engineering	⏱ timeout	48	5354s	48	[tool_call] terminal({"command": "cd /app && uvicorn main:app --host 0.0.0.0 --port 8000", "background": true})
book-portfolio-analysis	data_engineering	⏱ timeout	38	5246s	38	[tool_call] write_file({"path": "/workdir/view_original.py", "content": "with open('/workdir/data/books.json', 'r') as f:\n lines = f.readlines()\n\nprint(\"First 30 lines:\")\nfor i in range(30):\n print(f\"{i+1:3}: {repr(lines[i])}\…
bracket-sequence-restoration	general	⏱ timeout	—	—	0	no assistant output
breast-cancer-mlflow	machine-learning	✗ fail	60	—	60	{"output": "Background process started", "session_id": "proc_cb25b8ecee30", "pid": null, "exit_code": 0, "error": null} grader output grader verdict: FAIL (exit 0) - FileNotFoun... FAILED ../tests/test_outputs.py::test_test_predictions_f1 - FileNotFoundError... FAILED ../tests/test_outputs.py::test_test_predictions_precision_recall - Fil... FAILED ../tests/test_outputs.py::test_val_test_performance_consistency - File... FAILED ../tests/test_outputs.py::test_class_prediction_distribution - FileNot... FAILED ../tests/test_outputs.py::test_model_reproducible_from_hyperparameters ======================== 17 failed, 17 passed in 8.34s =========================
build-merkle-tree-cli-sha512	software-engineering	⏱ timeout	—	—	0	no assistant output
build-system-task-ordering	build-and-dependency-management	✗ fail	1	—	0	<tool_call> function=terminal <parameter=command> ls -la </parameter> </function> </tool_call> grader output grader verdict: FAIL (exit 0) FoundError:... FAILED ../tests/test_outputs.py::test_08_empty_input - FileNotFoundError: [Er... FAILED ../tests/test_outputs.py::test_09_undefined_referenced_targets - FileN... FAILED ../tests/test_outputs.py::test_10_multiple_target_blocks - FileNotFoun... FAILED ../tests/test_outputs.py::test_11_directive_accumulation - FileNotFoun... FAILED ../tests/test_outputs.py::test_12_all_dependency_types - FileNotFoundE... ============================== 12 failed in 0.31s ==============================
california-housing-api	machine-learning	✗ fail	60	—	60	{"output": "", "exit_code": -1, "error": "This foreground command appears to start a long-lived server/watch process. Run it with background=true, verify readiness (health endpoint/log signal), then execute tests in a separate command.", "s… grader output grader verdict: FAIL (exit 0) t - Failed: Fa... FAILED ../tests/test_outputs.py::test_api_predict_batch_matches_single_predict FAILED ../tests/test_outputs.py::test_api_predict_batch_order_preserved - Fai... FAILED ../tests/test_outputs.py::test_api_prediction_matches_model - Failed: ... FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou... FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model ================== 14 failed, 16 passed, 2 warnings in 6.80s ===================
chained-forensic-extraction_20260101_011957	security	✗ fail	60	—	60	{"output": "Forensic extraction completed successfully. The output file is ready at /app/output/extraction_chain.json", "exit_code": 0, "error": null} grader output grader verdict: FAIL (exit 0) PASSED ../../tests/test_outputs.py::test_chain_stage2_uses_stage1 PASSED ../../tests/test_outputs.py::test_chain_stage3_uses_stage2 PASSED ../../tests/test_outputs.py::test_chain_stage4_uses_stage3 PASSED ../../tests/test_outputs.py::test_chain_stage5_uses_stage4 PASSED ../../tests/test_outputs.py::test_anticheat_full_chain_independent FAILED ../../tests/test_outputs.py::test_stage2_crc_independently_verified - ... ========================= 1 failed, 37 passed in 0.12s =========================
competitive-programming-solver	CP	⏱ timeout	15	5280s	15	[tool_call] terminal({"command": "cd /workdir/data"})
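The headline summary can be cross-checked against the table. The following is a minimal sketch that tallies the Status column; the `RESULTS` list is transcribed by hand from the rows above, and its name and layout are illustrative rather than part of the benchmark harness.

```python
from collections import Counter

# (task, status) pairs transcribed from the results table above.
RESULTS = [
    ("amuse-install", "pass"),
    ("broken-python", "pass"),
    ("acl-permissions-inheritance", "fail"),
    ("anomaly-detection-ranking", "timeout"),
    ("api-endpoint-permission-canonicalizer", "timeout"),
    ("application-debug", "fail"),
    ("auth_token_race_condition", "fail"),
    ("bandit-delayed-feedback", "timeout"),
    ("bash-log-processor-fix", "timeout"),
    ("basic-message-queue", "fail"),
    ("battery-charging-optimization", "timeout"),
    ("bloom-filter-cache-penetration-prevention", "timeout"),
    ("book-portfolio-analysis", "timeout"),
    ("bracket-sequence-restoration", "timeout"),
    ("breast-cancer-mlflow", "fail"),
    ("build-merkle-tree-cli-sha512", "timeout"),
    ("build-system-task-ordering", "fail"),
    ("california-housing-api", "fail"),
    ("chained-forensic-extraction_20260101_011957", "fail"),
    ("competitive-programming-solver", "timeout"),
]

# Count how many tasks ended in each status.
counts = Counter(status for _, status in RESULTS)
total = len(RESULTS)
pass_rate = counts["pass"] / total

print(f"{counts['pass']}/{total} tasks passed ({pass_rate:.0%})")
print(f"{counts['fail']} grader fail, {counts['timeout']} timeout")
```

The tally agrees with the summary line: 2 passes, 8 grader failures, and 10 timeouts across 20 tasks.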