mistralai/Devstral-Small-2-24B-Instruct-2512 · BF16 · 24B/24B · parser mistral · smoke probes · live
devstral-small-2
Mistral's Devstral-Small-2 (24 B dense, Apache-2.0). Co-designed with All-Hands for OpenHands agentic loops; Aider community's top local pick for multi-file refactors in 2026.
2/20 tasks passed · 10% pass rate
2 pass · 5 grader fail · 12 timeout · 1 no tool calls
vLLM metrics · 1705 samples over 47m
| Metric | Value |
|---|---|
| gen tokens/s | 37.1 |
| prompt tokens/s | 2189 |
| mean TTFT | 1.01s |
| mean TPOT | — |
| peak concurrency | 8 |
| peak KV cache | — |
| total gen tokens | 104,302 |
| total requests | 503 |

| Task | Category | Status | Turns | Time | Tool calls | What the model ended with |
|---|---|---|---|---|---|---|
| application-debug | debugging | ✓ pass | 28 | — | 27 | Perfect! All requirements have been met. Let me provide a final summary: ## Summary I have successfully completed the Java application error log analysis and generated the required incident reports. Here's what was accomplished: ### ✅ **… |
| broken-python | software-engineering | ✓ pass | 30 | — | 29 | Perfect! Let me create a summary of what was done: ## Summary Your pip installation was corrupted. The issue was that the pip package files were incorrectly stored inside `/usr/local/lib/python3.13/site-packages/pip-25.3.dist-info/license… |
| acl-permissions-inheritance | system-administration | ✗ fail | 60 | — | 60 | The setgid bit is still not showing. Let me check the actual permissions: Grader output: grader verdict: FAIL (exit 0) tests/test_outputs.py::test_bob_can_create_and_alice_can_access PASSED ../tests/test_outputs.py::test_subdirectory_inheritance PASSED ../tests/test_outputs.py::test_outside_users_denied_access PASSED ../tests/test_outputs.py::test_execute_permissions_on_scripts FAILED ../tests/test_outputs.py::test_directory_exists_and_basic_permissions FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - Assertio... ========================= 2 failed, 7 passed in 0.45s ========================== |
| amuse-install | scientific-computing | ✗ fail | 18 | — | 17 | I see that we need to install system dependencies. Let me install the required build tools and dependencies: Grader output: grader verdict: FAIL (exit 0) 1f2bdd7c84c45f38c825f9549d2bc941de9fdc85 Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85 =========================== short test summary info ============================ PASSED ../../tests/test_outputs.py::test_virtual_environment_created PASSED ../../tests/test_outputs.py::test_test_script_unchanged FAILED ../../tests/test_outputs.py::test_test_script_executes - AssertionErro... ========================= 1 failed, 2 passed in 0.12s ========================== |
| anomaly-detection-ranking | general | ✗ fail | 24 | — | 23 | {"output": "Task 1: Sensor anomaly detection\nSensor predictions: 100 predictions made\n\nTask 2: Network anomaly detection\nNetwork predictions: 150 predictions made\n\nTask 3: Movie ranking\nMovie rankings: 3 queries ranked\n\nTask 4: New… Grader output: grader verdict: FAIL (exit 0) ============ ____________________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Feedback: results.json not found Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.25s =============================== |
| api-endpoint-permission-canonicalizer | software-engineering | ⏱ timeout | 22 | 1145s | 22 | Excellent! Let me test with an empty PLAN: |
| auth_token_race_condition | backend-engineering | ✗ fail | 17 | — | 16 | Perfect! I've successfully created the token_service.py file with the race condition fix. Let me verify that the implementation includes both parts of the fix: 1. ✅ **Lock Acquisition**: After checking `user_id` and `username` (lines 48-49… Grader output: grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds |
| bandit-delayed-feedback | machine-learning | ⏱ timeout | 18 | 931s | 18 | Let me check the end of the predictions file to make sure all 2000 contexts are covered: |
| bash-log-processor-fix | bash_scripting | ✗ fail | 56 | — | 55 | Perfect! The script is working correctly. Let me create a final summary of what was implemented: ## Summary The script has been successfully fixed to address all the requirements: ### ✅ Requirements Met: 1. **Fixed race condition**: Add… |
| basic-message-queue | backend | ⏱ timeout | 26 | 1029s | 26 | The server seems to be running but not showing output. Let me create a test script to verify the functionality: |
| battery-charging-optimization | general | ⏱ timeout | — | — | 0 | no assistant output |
| bloom-filter-cache-penetration-prevention | software-engineering | ⏱ timeout | 15 | 1100s | 15 | Great! Now let me create a simple test file to verify the basic functionality: |
| book-portfolio-analysis | data_engineering | ✗ fail | 58 | — | 57 | Let me continue fixing these issues one by one: Grader output: grader verdict: FAIL (exit 0) ___ ----------------------------- Captured stdout call ----------------------------- Subscores: {'all_criteria_met': 0.0} Feedback: WRONG: Solution file /workdir/sol.csv does not exist; Failed constraints: 1 Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.37s =============================== |
| bracket-sequence-restoration | general | ⏱ timeout | 19 | 1092s | 19 | Ah! I see the issue. At position 3, we have balance=3, and we can place both '(' (balance would be 4, which is <= total_open=4) and ')' (balance would be 2, which is >= 0). So my algorithm correctly detects a choice. But when I manually enu… |
| breast-cancer-mlflow | machine-learning | ⏱ timeout | 19 | 1137s | 19 | Great! Now let me verify the JSON files were created correctly: |
| build-merkle-tree-cli-sha512 | software-engineering | ⏱ timeout | 19 | 1141s | 19 | Perfect! Now let's test the verify command. First, let's save the proof to a file: |
| build-system-task-ordering | build-and-dependency-management | ⏱ timeout | 8 | 913s | 8 | I see the issue! In the alias collection, I'm using `current_target` which is from the parsing loop, but when processing overrides, `current_target` might not be set correctly. Let me trace through the logic more carefully and fix the alias… |
| california-housing-api | machine-learning | ⏱ timeout | 29 | 869s | 29 | Perfect! Now let me create the script for Step 6 to test the API with the validation set: |
| chained-forensic-extraction_20260101_011957 | security | ⏱ timeout | 22 | 1133s | 22 | Let me check what the actual encoded data looks like and try a more systematic approach. |
| competitive-programming-solver | CP | ⏱ timeout | 17 | 1099s | 26 | Ah! So from 6 with k=4, we can only reach numbers that are either 6 or 2 mod 8. And both of these have the same equivalence class (min(6, 2) = 2, min(2, 6) = 2). So the equivalence class is consistent! Let me also check what happens if we … |
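
The headline tallies can be re-derived from the Status column of the table. The following is a minimal sketch, assuming the table is available as well-formed Markdown text. The helper `tally_statuses` and the abbreviated `sample` rows (with `...` standing in for the final column) are hypothetical names introduced for illustration and are not part of the benchmark harness.

```python
from collections import Counter


def tally_statuses(markdown_table: str) -> Counter:
    """Count Status values ("pass", "fail", "timeout") in a Markdown table."""
    counts: Counter = Counter()
    status_idx = None
    for raw in markdown_table.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if status_idx is None:
            if "Status" in cells:
                status_idx = cells.index("Status")  # header row found
            continue
        if all(set(cell) <= set("-:") for cell in cells):
            continue  # separator row such as |---|---|
        if len(cells) > status_idx and cells[status_idx]:
            # "✓ pass" -> "pass": keep only the word after the status icon
            counts[cells[status_idx].split()[-1]] += 1
    return counts


# Hypothetical, abbreviated sample: three rows with the last column elided.
sample = """
| Task | Category | Status | Turns | Time | Tool calls | What the model ended with |
|---|---|---|---|---|---|---|
| application-debug | debugging | ✓ pass | 28 | — | 27 | ... |
| amuse-install | scientific-computing | ✗ fail | 18 | — | 17 | ... |
| bandit-delayed-feedback | machine-learning | ⏱ timeout | 18 | 931s | 18 | ... |
"""

counts = tally_statuses(sample)
pass_rate = counts["pass"] / sum(counts.values())
print(dict(counts), f"pass rate {pass_rate:.0%}")
```

Counting the Status column of the 20 rows above by hand gives 2 pass, 6 fail and 12 timeout. The column does not distinguish the summary line's "grader fail" and "no tool calls" buckets, so a tally like this one can only reproduce their combined total.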