mistralai/Devstral-Small-2-24B-Instruct-2512 · BF16 · 24B/24B · parser mistral · smoke probes · live

devstral-small-2

Mistral's Devstral-Small-2 (24B dense, Apache-2.0). Co-designed with All-Hands for OpenHands agentic loops; the Aider community's top local pick for multi-file refactors in 2026.

2/20 tasks passed · 10% pass rate

2 pass · 5 grader fail · 12 timeout · 1 no tool calls
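As a quick sanity check on the headline figure, the status counts in the summary line can be tallied against the reported pass rate. This is only an illustrative sketch: the dictionary keys are labels copied from the summary line, and the variable names are made up for the example.

```python
# Status counts copied from the summary line above.
counts = {"pass": 2, "grader fail": 5, "timeout": 12, "no tool calls": 1}

total = sum(counts.values())          # number of tasks in the run
pass_rate = counts["pass"] / total    # fraction of tasks that passed

print(f"{counts['pass']}/{total} tasks passed · {pass_rate:.0%} pass rate")
```

The four buckets add up to the 20 tasks in the headline, so every task falls into exactly one status.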
vLLM metrics · 1705 samples over 47m
gen tokens/s: 37.1
prompt tokens/s: 2189
mean TTFT: 1.01s
mean TPOT: (no value shown)
peak concurrency: 8
peak KV cache: (no value shown)
total gen tokens: 104,302
total requests: 503
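The totals above can be cross-checked against the reported generation throughput. The sketch below assumes that "gen tokens/s" is an average over the whole sampling window, which the page does not state, and it treats "47m" as exactly 47 minutes even though the duration is presumably rounded. Variable names are illustrative.

```python
# Figures copied from the vLLM metrics panel.
total_gen_tokens = 104_302
total_requests = 503
window_seconds = 47 * 60  # "47m" in the panel header; assumed to be rounded

# Assumption: throughput is a plain average over the sampling window.
avg_gen_tokens_per_s = total_gen_tokens / window_seconds
avg_gen_tokens_per_request = total_gen_tokens / total_requests

print(f"{avg_gen_tokens_per_s:.1f} gen tokens/s")
print(f"{avg_gen_tokens_per_request:.1f} gen tokens/request")
```

Under these assumptions the average comes out at about 37.0 generated tokens per second, close to the reported 37.1, and roughly 207 generated tokens per request.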
Task | Category | Status | Turns | Time | Tool calls | What the model ended with
application-debug | debugging | ✓ pass | 28 | | 27 | Perfect! All requirements have been met. Let me provide a final summary: ## Summary I have successfully completed the Java application error log analysis and generated the required incident reports. Here's what was accomplished: ### ✅ **…
broken-python | software-engineering | ✓ pass | 30 | | 29 | Perfect! Let me create a summary of what was done: ## Summary Your pip installation was corrupted. The issue was that the pip package files were incorrectly stored inside `/usr/local/lib/python3.13/site-packages/pip-25.3.dist-info/license…
acl-permissions-inheritance | system-administration | ✗ fail | 60 | | 60 | The setgid bit is still not showing. Let me check the actual permissions:
grader output
grader verdict: FAIL (exit 0)
tests/test_outputs.py::test_bob_can_create_and_alice_can_access
PASSED ../tests/test_outputs.py::test_subdirectory_inheritance
PASSED ../tests/test_outputs.py::test_outside_users_denied_access
PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts
FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions
FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio...
========================= 2 failed, 7 passed in 0.45s ==========================
amuse-install | scientific-computing | ✗ fail | 18 | | 17 | I see that we need to install system dependencies. Let me install the required build tools and dependencies:
grader output
grader verdict: FAIL (exit 0)
1f2bdd7c84c45f38c825f9549d2bc941de9fdc85
Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85
=========================== short test summary info ============================
PASSED ../../tests/test_outputs.py::test_virtual_environment_created
PASSED ../../tests/test_outputs.py::test_test_script_unchanged
FAILED ../../tests/test_outputs.py::test_test_script_executes - AssertionErro...
========================= 1 failed, 2 passed in 0.12s ==========================
anomaly-detection-ranking | general | ✗ fail | 24 | | 23 | {"output": "Task 1: Sensor anomaly detection\nSensor predictions: 100 predictions made\n\nTask 2: Network anomaly detection\nNetwork predictions: 150 predictions made\n\nTask 3: Movie ranking\nMovie rankings: 3 queries ranked\n\nTask 4: New…
grader output
grader verdict: FAIL (exit 0)
============
____________________________ test_grader_validation ____________________________
----------------------------- Captured stdout call -----------------------------
Feedback: results.json not found
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.25s ===============================
api-endpoint-permission-canonicalizer | software-engineering | ⏱ timeout | 22 | 1145s | 22 | Excellent! Let me test with an empty PLAN:
auth_token_race_condition | backend-engineering | ✗ fail | 17 | | 16 | Perfect! I've successfully created the token_service.py file with the race condition fix. Let me verify that the implementation includes both parts of the fix: 1. ✅ **Lock Acquisition**: After checking `user_id` and `username` (lines 48-49…
grader output
grader verdict: FAIL (exit 1)
Starting test suite for concurrent token refresh...
Killing any existing uvicorn processes...
Waiting for port 8000 to be available...
Waiting for Redis to be ready...
/tests/test.sh: line 27: /workspace/start.sh: No such file or directory
Redis failed to start within 60 seconds
bandit-delayed-feedback | machine-learning | ⏱ timeout | 18 | 931s | 18 | Let me check the end of the predictions file to make sure all 2000 contexts are covered:
bash-log-processor-fix | bash_scripting | ✗ fail | 56 | | 55 | Perfect! The script is working correctly. Let me create a final summary of what was implemented: ## Summary The script has been successfully fixed to address all the requirements: ### ✅ Requirements Met: 1. **Fixed race condition**: Add…
basic-message-queue | backend | ⏱ timeout | 26 | 1029s | 26 | The server seems to be running but not showing output. Let me create a test script to verify the functionality:
battery-charging-optimization | general | ⏱ timeout | 0 | | | no assistant output
bloom-filter-cache-penetration-prevention | software-engineering | ⏱ timeout | 15 | 1100s | 15 | Great! Now let me create a simple test file to verify the basic functionality:
book-portfolio-analysis | data_engineering | ✗ fail | 58 | | 57 | Let me continue fixing these issues one by one:
grader output
grader verdict: FAIL (exit 0)
___
----------------------------- Captured stdout call -----------------------------
Subscores: {'all_criteria_met': 0.0}
Feedback: WRONG: Solution file /workdir/sol.csv does not exist; Failed constraints: 1
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.37s ===============================
bracket-sequence-restoration | general | ⏱ timeout | 19 | 1092s | 19 | Ah! I see the issue. At position 3, we have balance=3, and we can place both '(' (balance would be 4, which is <= total_open=4) and ')' (balance would be 2, which is >= 0). So my algorithm correctly detects a choice. But when I manually enu…
breast-cancer-mlflow | machine-learning | ⏱ timeout | 19 | 1137s | 19 | Great! Now let me verify the JSON files were created correctly:
build-merkle-tree-cli-sha512 | software-engineering | ⏱ timeout | 19 | 1141s | 19 | Perfect! Now let's test the verify command. First, let's save the proof to a file:
build-system-task-ordering | build-and-dependency-management | ⏱ timeout | 8 | 913s | 8 | I see the issue! In the alias collection, I'm using `current_target` which is from the parsing loop, but when processing overrides, `current_target` might not be set correctly. Let me trace through the logic more carefully and fix the alias…
california-housing-api | machine-learning | ⏱ timeout | 29 | 869s | 29 | Perfect! Now let me create the script for Step 6 to test the API with the validation set:
chained-forensic-extraction_20260101_011957 | security | ⏱ timeout | 22 | 1133s | 22 | Let me check what the actual encoded data looks like and try a more systematic approach.
competitive-programming-solver | CP | ⏱ timeout | 17 | 1099s | 26 | Ah! So from 6 with k=4, we can only reach numbers that are either 6 or 2 mod 8. And both of these have the same equivalence class (min(6, 2) = 2, min(2, 6) = 2). So the equivalence class is consistent! Let me also check what happens if we …