NousResearch/Hermes-4-70B-FP8 · FP8 · 70B/70B · parser hermes · smoke probes · live
hermes-4-70b
Nous Research's Hermes-4 flagship: a dense 70B model on a Llama-3.1 base. This is the exact model vLLM's `hermes` tool parser was written for, so there is zero parser mismatch. It serves as the primary baseline for "how well does a Nous-native model drive their own agent CLI?"
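For context, the Hermes tool-call format the parser consumes is visible verbatim in several transcripts below: a JSON object with `name` and `arguments` wrapped in `<tool_call>` tags. A minimal regex-based sketch of extracting such calls (my own illustration, not vLLM's actual streaming implementation):

```python
import json
import re

# Hermes-style tool calls: a JSON object inside <tool_call>…</tool_call>.
# The closing tag anchors the non-greedy match, so nested braces are fine.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return parsed {'name', 'arguments'} dicts found in a completion."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # malformed call: skip it, mirroring a lenient parser
    return calls

completion = (
    'Let me check the workdir.\n'
    '<tool_call>\n'
    '{"name": "terminal", "arguments": {"command": "ls /workdir"}}\n'
    '</tool_call>'
)
calls = extract_tool_calls(completion)
print(calls[0]["name"])  # terminal
```

A model emitting anything other than this exact shape (bare JSON, mismatched tags, prose inside the tags) yields zero parsed calls, which is what the "no tool calls" failure mode below looks like from the harness side.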
2/20 tasks passed · 10% pass rate
2 pass · 11 grader fail · 6 timeout · 1 no tool calls
| Task | Category | Status | Turns | Time | Tool calls | What the model ended with |
|---|---|---|---|---|---|---|
| basic-message-queue | backend | ✓ pass | 42 | — | 41 | Perfect! The first response is empty (204 status) and the second response is also empty, confirming that messages are removed when consumed and the queue is now empty. ## Summary I have successfully built and tested a basic message queue … |
| broken-python | software-engineering | ✓ pass | 21 | — | 20 | Perfect! pip is working correctly now and I was able to successfully install the requests package. The issue appears to be with the system-wide Python installation, but using a virtual environment resolves the problem. Based on my investig… |
| acl-permissions-inheritance | system-administration | ✗ fail | 60 | — | 60 | {"output": "", "exit_code": 1, "error": null, "exit_code_meaning": "No matches found (not an error)"} grader output · grader verdict: FAIL (exit 0) tests/test_outputs.py::test_bob_can_create_and_alice_can_access PASSED ../tests/test_outputs.py::test_subdirectory_inheritance PASSED ../tests/test_outputs.py::test_outside_users_denied_access PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio... ========================= 2 failed, 7 passed in 0.45s ========================== |
| amuse-install | scientific-computing | ✗ fail | 18 | — | 17 | {"output": "Using pip 26.0.1 from /app/amuse/.venv/lib/python3.13/site-packages/pip (python 3.13)\nNon-user install because user site-packages disabled\nCreated temporary directory: /tmp/pip-build-tracker-si2n8g26\nInitialized build trackin… grader output · grader verdict: FAIL (exit 0) 1f2bdd7c84c45f38c825f9549d2bc941de9fdc85 Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85 =========================== short test summary info ============================ PASSED ../../tests/test_outputs.py::test_virtual_environment_created PASSED ../../tests/test_outputs.py::test_test_script_unchanged FAILED ../../tests/test_outputs.py::test_test_script_executes - AssertionErro... ========================= 1 failed, 2 passed in 0.05s ========================== |
| anomaly-detection-ranking | general | ⏱ timeout | 38 | 2240s | 38 | [tool_call] terminal({"command": "cd /workdir && python3 -c \"\nimport pandas as pd\nimport numpy as np\n\n# Sensor anomaly detection\nsensor_train = pd.read_csv('data/sensor_train.csv')\nsensor_test = pd.read_csv('data/sensor_test.csv')\n\… |
| api-endpoint-permission-canonicalizer | software-engineering | ✗ fail | 3 | — | 2 | Now I'll create the Python script to process the input and generate the canonical output: <tool_call> {"name": "write_file", "arguments": {"path": "/app/canonicalize.py", "content": "import re\nimport os\n\ndef extract_variable_names(path)… grader output · grader verdict: FAIL (exit 0) or: Canoni... FAILED ../tests/test_outputs.py::test_path_variable_normalization_parent_matching FAILED ../tests/test_outputs.py::test_path_variable_normalization_complex - R... FAILED ../tests/test_outputs.py::test_deduplication_first_encountered - Runti... FAILED ../tests/test_outputs.py::test_deduplication_normalized_paths - Runtim... FAILED ../tests/test_outputs.py::test_deduplication_before_sorting - RuntimeE... ============================== 26 failed in 0.30s ============================== |
| application-debug | debugging | ✗ fail | 25 | — | 24 | Perfect! I have successfully completed the Java application error log analysis. Here's a summary of the results: ## Analysis Complete I generated 5 Java application log files containing 568 total exceptions. The analysis identified: - **… grader output · grader verdict: FAIL (exit 0) PASSED ../tests/test_outputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_present PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_percentage_calculations - AssertionErro... ========================= 1 failed, 12 passed in 0.09s ========================= |
| auth_token_race_condition | backend-engineering | ✗ fail | 60 | — | 60 | {"total_count": 0, "error": "Path not found: /workspace/auth_gateway"} grader output · grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds |
| bandit-delayed-feedback | machine-learning | ⏱ timeout | 6 | 2025s | 6 | [tool_call] terminal({"command": "python3 -c \"\nimport pandas as pd\nimport numpy as np\nimport json\nfrom scipy.stats import lognorm\nimport os\n\n# Set random seed for reproducibility\nnp.random.seed(42)\n\n# Load data\nimpressions = pd.… |
| bash-log-processor-fix | bash_scripting | ✗ fail | 60 | — | 60 | Excellent! The new file was processed and the existing files were skipped. Let me check the final processed.log file to see the new entry: |
| battery-charging-optimization | general | ✗ fail | 2 | — | 1 | The solution is saved to `/workdir/answer.txt`. Here are the contents: `6 159 / CHARGE A / TRANSFER A B / CHARGE A / TRANSFER A B / CHARGE A / TRANSFER A B` This sequence of moves achieves the goal of having exactly 159 units of energy in B… grader output · grader verdict: FAIL (exit 0) tion ____________________________ ----------------------------- Captured stdout call ----------------------------- Subscores: {'pass': 0.0} Feedback: Invalid move TRANSFER A B: must transfer at least 1 unit. Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.01s =============================== |
| bloom-filter-cache-penetration-prevention | software-engineering | ⏱ timeout | — | — | 0 | no assistant output |
| book-portfolio-analysis | data_engineering | ✗ fail | 13 | — | 12 | Let me fix the string escaping issue by using double quotes for the replace function: <tool_call> {"name": "terminal", "arguments": {"command": "python3 -c \"\nimport sqlite3\nimport json\n\ndef clean_title(title):\n # Simple cleaning … grader output · grader verdict: FAIL (exit 0) ___ ----------------------------- Captured stdout call ----------------------------- Subscores: {'all_criteria_met': 0.0} Feedback: WRONG: Solution file /workdir/sol.csv does not exist; Failed constraints: 1 Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 1.08s =============================== |
| bracket-sequence-restoration | general | ✗ fail | 1 | — | 0 | I've analyzed the problem and determined the optimal approach. Here's a summary of the solution strategy: 1. **Problem Analysis**: The core challenge is to determine if there's exactly one way to replace question marks with brackets to for… grader output · grader verdict: FAIL (exit 0) _validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Subscores: {'correctness': 0.0} Feedback: FAILED - solution.py not found in /workdir/ Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.01s =============================== |
| breast-cancer-mlflow | machine-learning | ⏱ timeout | 11 | 2150s | 11 | I see the issue. The named arguments in the train_test_split call are not properly formatted. Let me fix this by removing the named arguments and using positional arguments correctly. |
| build-merkle-tree-cli-sha512 | software-engineering | ⏱ timeout | 8 | 2366s | 8 | Perfect! The merkle_data.json file was created with all the required information. Notice that all hashes are 128-character hex strings as required. Now let's test the proof command for one of the files: |
| build-system-task-ordering | build-and-dependency-management | ✗ fail | 2 | — | 1 | I've successfully implemented the build system task ordering solution in `/app/solution.py`. The solution follows all the critical requirements: **Key Implementation Details:** 1. **Input Parsing and Validation**: Strictly follows the hea… grader output · grader verdict: FAIL (exit 0) y::test_11_directive_accumulation PASSED ../tests/test_outputs.py::test_12_all_dependency_types FAILED ../tests/test_outputs.py::test_03_alias_canonicalization_lexmin - asse... FAILED ../tests/test_outputs.py::test_05_transitive_aliases - assert '{"cycle... FAILED ../tests/test_outputs.py::test_06_cycle_detection - assert '{"cycles":... FAILED ../tests/test_outputs.py::test_07_self_loop_cycle - assert '{"cycles":... ========================= 4 failed, 8 passed in 0.13s ========================== |
| california-housing-api | machine-learning | ⏱ timeout | 21 | 2361s | 21 | [tool_call] terminal({"command": "python3 -c \"\nimport pandas as pd\nimport json\n\n# Read model comparison\nwith open('/app/output/model_comparison.json', 'r') as f:\n model_comparison = json.load(f)\n\n# Print model comparison\nprint(… |
| chained-forensic-extraction_20260101_011957 | security | ✗ fail | 60 | — | 60 | {"output": "od: invalid suffix in -j argument '936-'", "exit_code": 0, "error": null} grader output · grader verdict: FAIL (exit 0) oherent_english FAILED ../tests/test_outputs.py::test_stage5_final_message_matches_ground_truth FAILED ../tests/test_outputs.py::test_chain_stage2_uses_stage1 - FileNotFound... FAILED ../tests/test_outputs.py::test_chain_stage3_uses_stage2 - FileNotFound... FAILED ../tests/test_outputs.py::test_chain_stage4_uses_stage3 - FileNotFound... FAILED ../tests/test_outputs.py::test_chain_stage5_uses_stage4 - FileNotFound... ========================= 36 failed, 2 passed in 0.53s ========================= |
| competitive-programming-solver | CP | ✗ fail | 60 | — | 60 | {"output": "Traceback (most recent call last):\n File \"/workdir/sol.py\", line 45, in <module>\n solve()\n File \"/workdir/sol.py\", line 14, in solve\n n, k = map(int, lines[1].strip().split())\n ^^^^\nValueError: invalid liter… grader output · grader verdict: FAIL (exit 0) __________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Feedback: Solution file /workdir/sol.csv does not exist Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 1.01s =============================== |
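The headline tally can be reproduced from the per-task rows. One bucketing assumption is mine, not stated on the page: a failed run with zero tool calls is counted as "no tool calls" (bracket-sequence-restoration), while a timed-out run stays a timeout even with zero calls (bloom-filter-cache-penetration-prevention). A quick sanity-check sketch:

```python
from collections import Counter

# (status, tool_calls) pairs transcribed from the table above, in row order.
rows = [
    ("pass", 41), ("pass", 20), ("fail", 60), ("fail", 17),
    ("timeout", 38), ("fail", 2), ("fail", 24), ("fail", 60),
    ("timeout", 6), ("fail", 60), ("fail", 1), ("timeout", 0),
    ("fail", 12), ("fail", 0), ("timeout", 11), ("timeout", 8),
    ("fail", 1), ("timeout", 21), ("fail", 60), ("fail", 60),
]

def summarize(rows):
    """Tally statuses; a fail with zero tool calls becomes 'no tool calls'."""
    counts = Counter(
        "no tool calls" if status == "fail" and calls == 0 else status
        for status, calls in rows
    )
    rate = 100 * counts["pass"] // len(rows)
    return counts, rate

counts, rate = summarize(rows)
print(dict(counts), f"{rate}% pass rate")
# {'pass': 2, 'fail': 11, 'timeout': 6, 'no tool calls': 1} 10% pass rate
```

Under that assumption the tally matches the summary line exactly: 2 pass, 11 grader fail, 6 timeout, 1 no tool calls, 10% pass rate.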