NousResearch/Hermes-4-70B-FP8 · FP8 · 70B/70B · parser hermes · smoke probes · live

hermes-4-70b

Nous Research's Hermes-4 flagship, a dense 70B model on a Llama-3.1 base. It is the exact model vLLM's `hermes` tool parser was written for, so there is no parser mismatch. It serves as the primary baseline for "how well does a Nous-native model drive their own agent CLI?"
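Several rows in the table below end with a raw `<tool_call> {"name": ..., "arguments": ...}` block in the model's final text. As a rough illustration of what a parser for that format has to do, here is a minimal sketch. It is not vLLM's implementation: the function name `extract_tool_calls`, the regex, and the sample string are all hypothetical, and it assumes each call is wrapped in a matching `</tool_call>` closing tag.

```python
import json
import re

# Assumption: each tool call is a JSON object wrapped in <tool_call>...</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return (name, arguments) pairs for every well-formed tool call in text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            # Malformed JSON inside the tags: skip it rather than guess.
            continue
        if isinstance(payload, dict) and "name" in payload:
            calls.append((payload["name"], payload.get("arguments", {})))
    return calls


# Hypothetical sample, shaped like the model output quoted in the results table.
sample = (
    "Now I'll create the script: "
    '<tool_call> {"name": "write_file", "arguments": '
    '{"path": "/app/canonicalize.py", "content": "import re"}} </tool_call>'
)
print(extract_tool_calls(sample))
```

A block that is cut off before its closing tag, or whose JSON does not parse, yields no call under this sketch, which is one way a run can end with tool-call text in the final message but no executed tool.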

2/20 tasks passed · 10% pass rate

2 pass · 11 grader fail · 6 timeout · 1 no tool calls
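The headline figure follows directly from the status breakdown. A small sketch of the arithmetic, using the four counts reported above (the variable names are illustrative only):

```python
from collections import Counter

# Status counts as reported in the summary line.
counts = Counter({"pass": 2, "grader fail": 11, "timeout": 6, "no tool calls": 1})

total = sum(counts.values())          # 2 + 11 + 6 + 1 = 20 tasks
pass_rate = counts["pass"] / total    # 2 / 20 = 0.1

print(f"{counts['pass']}/{total} tasks passed, {pass_rate:.0%} pass rate")
# prints "2/20 tasks passed, 10% pass rate"
```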
Task · Category · Status · Turns · Time · Tool calls · What the model ended with
basic-message-queue backend ✓ pass 42 41 Perfect! The first response is empty (204 status) and the second response is also empty, confirming that messages are removed when consumed and the queue is now empty. ## Summary I have successfully built and tested a basic message queue …
broken-python software-engineering ✓ pass 21 20 Perfect! pip is working correctly now and I was able to successfully install the requests package. The issue appears to be with the system-wide Python installation, but using a virtual environment resolves the problem. Based on my investig…
acl-permissions-inheritance system-administration ✗ fail 60 60 {"output": "", "exit_code": 1, "error": null, "exit_code_meaning": "No matches found (not an error)"}
grader output
grader verdict: FAIL (exit 0)
tests/test_outputs.py::test_bob_can_create_and_alice_can_access
PASSED ../tests/test_outputs.py::test_subdirectory_inheritance
PASSED ../tests/test_outputs.py::test_outside_users_denied_access
PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts
FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions
FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio...
========================= 2 failed, 7 passed in 0.45s ==========================
amuse-install scientific-computing ✗ fail 18 17 {"output": "Using pip 26.0.1 from /app/amuse/.venv/lib/python3.13/site-packages/pip (python 3.13)\nNon-user install because user site-packages disabled\nCreated temporary directory: /tmp/pip-build-tracker-si2n8g26\nInitialized build trackin…
grader output
grader verdict: FAIL (exit 0)
1f2bdd7c84c45f38c825f9549d2bc941de9fdc85
Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85
=========================== short test summary info ============================
PASSED ../../tests/test_outputs.py::test_virtual_environment_created
PASSED ../../tests/test_outputs.py::test_test_script_unchanged
FAILED ../../tests/test_outputs.py::test_test_script_executes - AssertionErro...
========================= 1 failed, 2 passed in 0.05s ==========================
anomaly-detection-ranking general ⏱ timeout 38 2240s 38 [tool_call] terminal({"command": "cd /workdir && python3 -c \"\nimport pandas as pd\nimport numpy as np\n\n# Sensor anomaly detection\nsensor_train = pd.read_csv('data/sensor_train.csv')\nsensor_test = pd.read_csv('data/sensor_test.csv')\n\…
api-endpoint-permission-canonicalizer software-engineering ✗ fail 3 2 Now I'll create the Python script to process the input and generate the canonical output: <tool_call> {"name": "write_file", "arguments": {"path": "/app/canonicalize.py", "content": "import re\nimport os\n\ndef extract_variable_names(path)…
grader output
grader verdict: FAIL (exit 0)
or: Canoni...
FAILED ../tests/test_outputs.py::test_path_variable_normalization_parent_matching
FAILED ../tests/test_outputs.py::test_path_variable_normalization_complex - R...
FAILED ../tests/test_outputs.py::test_deduplication_first_encountered - Runti...
FAILED ../tests/test_outputs.py::test_deduplication_normalized_paths - Runtim...
FAILED ../tests/test_outputs.py::test_deduplication_before_sorting - RuntimeE...
============================== 26 failed in 0.30s ==============================
application-debug debugging ✗ fail 25 24 Perfect! I have successfully completed the Java application error log analysis. Here's a summary of the results: ## Analysis Complete I generated 5 Java application log files containing 568 total exceptions. The analysis identified: - **…
grader output
grader verdict: FAIL (exit 0)
PASSED ../tests/test_outputs.py::test_critical_failures_calculation
PASSED ../tests/test_outputs.py::test_stack_traces_present
PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions
PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers
PASSED ../tests/test_outputs.py::test_anti_cheat_validation
FAILED ../tests/test_outputs.py::test_percentage_calculations - AssertionErro...
========================= 1 failed, 12 passed in 0.09s =========================
auth_token_race_condition backend-engineering ✗ fail 60 60 {"total_count": 0, "error": "Path not found: /workspace/auth_gateway"}
grader output
grader verdict: FAIL (exit 1)
Starting test suite for concurrent token refresh...
Killing any existing uvicorn processes...
Waiting for port 8000 to be available...
Waiting for Redis to be ready...
/tests/test.sh: line 27: /workspace/start.sh: No such file or directory
Redis failed to start within 60 seconds
bandit-delayed-feedback machine-learning ⏱ timeout 6 2025s 6 [tool_call] terminal({"command": "python3 -c \"\nimport pandas as pd\nimport numpy as np\nimport json\nfrom scipy.stats import lognorm\nimport os\n\n# Set random seed for reproducibility\nnp.random.seed(42)\n\n# Load data\nimpressions = pd.…
bash-log-processor-fix bash_scripting ✗ fail 60 60 Excellent! The new file was processed and the existing files were skipped. Let me check the final processed.log file to see the new entry:
battery-charging-optimization general ✗ fail 2 1 The solution is saved to `/workdir/answer.txt`. Here are the contents: ```text 6 159 CHARGE A TRANSFER A B CHARGE A TRANSFER A B CHARGE A TRANSFER A B ``` This sequence of moves achieves the goal of having exactly 159 units of energy in B…
grader output
grader verdict: FAIL (exit 0)
tion ____________________________
----------------------------- Captured stdout call -----------------------------
Subscores: {'pass': 0.0}
Feedback: Invalid move TRANSFER A B: must transfer at least 1 unit.
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.01s ===============================
bloom-filter-cache-penetration-prevention software-engineering ⏱ timeout 0 no assistant output
book-portfolio-analysis data_engineering ✗ fail 13 12 Let me fix the string escaping issue by using double quotes for the replace function: <tool_call> {"name": "terminal", "arguments": {"command": "python3 -c \"\nimport sqlite3\nimport json\n\ndef clean_title(title):\n # Simple cleaning …
grader output
grader verdict: FAIL (exit 0)
___
----------------------------- Captured stdout call -----------------------------
Subscores: {'all_criteria_met': 0.0}
Feedback: WRONG: Solution file /workdir/sol.csv does not exist; Failed constraints: 1
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 1.08s ===============================
bracket-sequence-restoration general ✗ fail 1 0 I've analyzed the problem and determined the optimal approach. Here's a summary of the solution strategy: 1. **Problem Analysis**: The core challenge is to determine if there's exactly one way to replace question marks with brackets to for…
grader output
grader verdict: FAIL (exit 0)
_validation ____________________________
----------------------------- Captured stdout call -----------------------------
Subscores: {'correctness': 0.0}
Feedback: FAILED - solution.py not found in /workdir/
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.01s ===============================
breast-cancer-mlflow machine-learning ⏱ timeout 11 2150s 11 I see the issue. The named arguments in the train_test_split call are not properly formatted. Let me fix this by removing the named arguments and using positional arguments correctly.
build-merkle-tree-cli-sha512 software-engineering ⏱ timeout 8 2366s 8 Perfect! The merkle_data.json file was created with all the required information. Notice that all hashes are 128-character hex strings as required. Now let's test the proof command for one of the files:
build-system-task-ordering build-and-dependency-management ✗ fail 2 1 I've successfully implemented the build system task ordering solution in `/app/solution.py`. The solution follows all the critical requirements: **Key Implementation Details:** 1. **Input Parsing and Validation**: Strictly follows the hea…
grader output
grader verdict: FAIL (exit 0)
y::test_11_directive_accumulation
PASSED ../tests/test_outputs.py::test_12_all_dependency_types
FAILED ../tests/test_outputs.py::test_03_alias_canonicalization_lexmin - asse...
FAILED ../tests/test_outputs.py::test_05_transitive_aliases - assert '{"cycle...
FAILED ../tests/test_outputs.py::test_06_cycle_detection - assert '{"cycles":...
FAILED ../tests/test_outputs.py::test_07_self_loop_cycle - assert '{"cycles":...
========================= 4 failed, 8 passed in 0.13s ==========================
california-housing-api machine-learning ⏱ timeout 21 2361s 21 [tool_call] terminal({"command": "python3 -c \"\nimport pandas as pd\nimport json\n\n# Read model comparison\nwith open('/app/output/model_comparison.json', 'r') as f:\n model_comparison = json.load(f)\n\n# Print model comparison\nprint(…
chained-forensic-extraction_20260101_011957 security ✗ fail 60 60 {"output": "od: invalid suffix in -j argument '936-'", "exit_code": 0, "error": null}
grader output
grader verdict: FAIL (exit 0)
oherent_english
FAILED ../tests/test_outputs.py::test_stage5_final_message_matches_ground_truth
FAILED ../tests/test_outputs.py::test_chain_stage2_uses_stage1 - FileNotFound...
FAILED ../tests/test_outputs.py::test_chain_stage3_uses_stage2 - FileNotFound...
FAILED ../tests/test_outputs.py::test_chain_stage4_uses_stage3 - FileNotFound...
FAILED ../tests/test_outputs.py::test_chain_stage5_uses_stage4 - FileNotFound...
========================= 36 failed, 2 passed in 0.53s =========================
competitive-programming-solver CP ✗ fail 60 60 {"output": "Traceback (most recent call last):\n File \"/workdir/sol.py\", line 45, in <module>\n solve()\n File \"/workdir/sol.py\", line 14, in solve\n n, k = map(int, lines[1].strip().split())\n ^^^^\nValueError: invalid liter…
grader output
grader verdict: FAIL (exit 0)
__________________ test_grader_validation ____________________________
----------------------------- Captured stdout call -----------------------------
Feedback: Solution file /workdir/sol.csv does not exist
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 1.01s ===============================
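Most grader excerpts in the table end with a pytest summary line such as `2 failed, 7 passed in 0.45s`. To tally those per task, one option is a small regex-based helper. This is a sketch only: `parse_pytest_summary` is a hypothetical name, and it assumes the summary keeps the `<count> <label>` shape seen in the excerpts.

```python
import re

# Matches "<count> <label>" pairs such as "2 failed" or "7 passed".
COUNT_RE = re.compile(r"(\d+) (passed|failed|errors|error|skipped)")


def parse_pytest_summary(line):
    """Return a dict like {'failed': 2, 'passed': 7} from a pytest summary line."""
    return {label: int(count) for count, label in COUNT_RE.findall(line)}


# Summary lines copied from the grader output excerpts in the table.
print(parse_pytest_summary(
    "========================= 2 failed, 7 passed in 0.45s =========================="
))
print(parse_pytest_summary(
    "============================== 26 failed in 0.30s =============================="
))
```

Note that a passing pytest run is not the same as a passing task here: several rows show `1 passed` for `test_grader_validation` while the grader itself wrote `Score: 0.0`, so the verdict line and score should be read alongside the counts.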