MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16 · W4A16 · 10B/172B · parser minimax_m2 · smoke probes · live
minimax-m2.7-mjpansa
This is the same 172B-total / 10B-active REAP prune as saricles's working variant, but quantised with AutoRound W4A16 instead of NVFP4. The resulting 7.5 GB saving is just enough to clear the hermes-agent 64K context floor; on the saricles variant we had to cut context to 32K, which blocked the toy probes. W4A16 also takes a different kernel path from NVFP4, so this run is a cleaner test of "is the M2.7 REAP quality good, separate from Blackwell-native quant choices?"
4/20 tasks passed · 20% pass rate
4 pass · 4 grader fail · 12 timeout · 0 no tool calls
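
As a quick sanity check, the headline pass rate can be recomputed from the per-status tally. This is only a sketch: the counts are copied from the summary line above, and the dictionary keys are illustrative labels, not identifiers from the benchmark harness.

```python
# Status counts copied from the summary line above.
# The key names are illustrative labels, not harness identifiers.
counts = {"pass": 4, "grader fail": 4, "timeout": 12, "no tool calls": 0}

total = sum(counts.values())
pass_rate = counts["pass"] / total
timeout_share = counts["timeout"] / total

print(f"{counts['pass']}/{total} passed ({pass_rate:.0%})")
print(f"timeouts account for {timeout_share:.0%} of tasks")
```

The tally sums to the 20 tasks listed in the table, and timeouts outnumber grader failures three to one.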
vLLM metrics · 3843 samples over 95m

| Metric | Value |
|---|---|
| gen tokens/s | 28.5 |
| prompt tokens/s | 927 |
| mean TTFT | 9.30s |
| mean TPOT | — |
| peak concurrency | 8 |
| peak KV cache | — |
| total gen tokens | 162,763 |
| total requests | 395 |

| Task | Category | Status | Turns | Time | Tool calls | What the model ended with |
|---|---|---|---|---|---|---|
| anomaly-detection-ranking | general | ✓ pass | 15 | — | 22 | <think>The results.json file is valid and has the correct structure: - All 5 required keys are present - sensor has 100 predictions (matching 100 test rows) - network has 150 predictions (matching 150 test rows) - All ranking queries are pr… |
| basic-message-queue | backend | ✓ pass | 58 | — | 57 | <think>The solution is complete and working. Let me provide a summary of what was implemented. </think> The message queue service is complete and running on port 8000. Here's a summary: ## Implementation Summary **File:** `/workdir/solut… |
| broken-python | software-engineering | ✓ pass | 10 | — | 9 | <think>Both `python3 -m pip` and `pip3` are now working. The issue has been resolved. </think> **Fixed!** The problem was that your pip installation was corrupted - the package metadata existed but the actual pip module files were missing … |
| competitive-programming-solver | CP | ✓ pass | 7 | — | 14 | <think>The solution has been created and the output file is in the correct format. Let me summarize what I did: The key insight is that from any element x, we can only reach values whose residues modulo k are either r or k-r, where r = x m… |
| acl-permissions-inheritance | system-administration | ⏱ timeout | 37 | 2383s | 36 | <think>The configuration is complete. Let me summarize what we have: 1. **Directory /srv/shared**: - Owned by group "research" ✓ - Group ID is 1001 (research) - The setgid bit could not be set due to container filesystem limitatio… |
| amuse-install | scientific-computing | ✗ fail | 17 | — | 34 | <think>I see - bhtree and hermite are compiled C++ extensions that need to be built and installed separately. Let me check if they have Makefiles and what the installation process looks like. Looking at the test_script.py imports again: - … Grader output: grader verdict: FAIL (exit 0) 486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85 Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85 =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_virtual_environment_created PASSED ../tests/test_outputs.py::test_test_script_unchanged FAILED ../tests/test_outputs.py::test_test_script_executes - AssertionError: ... ========================= 1 failed, 2 passed in 0.34s ========================== |
| api-endpoint-permission-canonicalizer | software-engineering | ⏱ timeout | 28 | 2259s | 28 | <think>Now I have: - `/api/users/{userId}/profile` - normalized to `/api/users/{var}/profile` - `/api/users/{id}` - normalized to `/api/users/{var}` - `/api/users/{id}/posts` - normalized to `/api/users/{var}/posts` The PLAN entry `/api/us… |
| application-debug | debugging | ✗ fail | 7 | — | 11 | <think>The analysis is complete. Let me verify the output files exist and have the correct content: 1. **incident_report.json** - Contains: - Summary with total_files (5), total_exceptions (203), critical_failures (203), most_common_exc… Grader output: grader verdict: FAIL (exit 0) SED ../tests/test_outputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_percentage_calculations PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ... ========================= 1 failed, 12 passed in 0.10s ========================= |
| auth_token_race_condition | backend-engineering | ✗ fail | 9 | — | 8 | <think>The file is created with the race condition fix: 1. After the initial check for `user_id` and `username` (line 36-37), the code acquires a lock using `await self.acquire_lock(f"refresh_lock:{refresh_token}")` (line 41) 2. If the loc… Grader output: grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds |
| bandit-delayed-feedback | machine-learning | ⏱ timeout | 9 | 1977s | 15 | <think>The results are better now - the conversion rates look more reasonable (around 0.04 to 0.14). But I notice that all predictions are going to arms 5 and 7, which seems like the model might be too simplistic. Also, the arm statistics s… |
| bash-log-processor-fix | bash_scripting | ⏱ timeout | 18 | 2123s | 25 | <think>The find command is working. The issue must be in the function. Let me test the function directly: </think> |
| battery-charging-optimization | general | ⏱ timeout | 1 | 1839s | 1 | <think>We have a problem: We have two batteries A and B with capacities 753 and 357 units, respectively. Initially both empty. We can perform operations: CHARGE A (fully charge A to capacity, i.e., fill it to 753), CHARGE B (fill to 357), D… |
| bloom-filter-cache-penetration-prevention | software-engineering | ⏱ timeout | 3 | 156s | 5 | <think>The data directory exists but products.txt doesn't exist yet. I'll need to create the full implementation. Let me create the directory structure and implement the solution. Let me create: 1. `/app/data/` directory (already exists) 2… |
| book-portfolio-analysis | data_engineering | ⏱ timeout | 26 | 2291s | 29 | <think>The regex isn't matching. Let me check what the actual content looks like and adjust my patterns. </think> |
| bracket-sequence-restoration | general | ⏱ timeout | 4 | 1629s | 4 | <think>Hmm, the third test case shows "NO" but the expected output is "YES". Let me re-examine. For `((?)`: - Position 0 '(' : mn=1, mx=1 - Position 1 '(' : mn=2, mx=2 - Position 2 '?' : mn=1, mx=3 - Position 3 ')' : mn=0, mx=2 Wait, the … |
| breast-cancer-mlflow | machine-learning | ⏱ timeout | 38 | 2261s | 47 | <think>Let me test if the model loads now. </think> |
| build-merkle-tree-cli-sha512 | software-engineering | ⏱ timeout | 13 | 2051s | 13 | <think>Verification works. Let me also test with a nested file and test failure cases. </think> |
| build-system-task-ordering | build-and-dependency-management | ⏱ timeout | 5 | 2039s | 5 | <think>Let me test again to see the output. The key issue is that I'm working with the universe which may contain non-canonical names. But the output should use canonical names. Wait, I think the issue is more fundamental. Let me re-read t… |
| california-housing-api | machine-learning | ✗ fail | 60 | — | 73 | <think>The split_info.json file looks correct. Let me also verify the API is still running and test the batch endpoint. </think> Let me verify the API is still running and do a final verification: Grader output: grader verdict: FAIL (exit 0) ./tests/test_outputs.py::test_output_directory_structure PASSED ../tests/test_outputs.py::test_model_prediction_on_hidden_test PASSED ../tests/test_outputs.py::test_api_prediction_matches_model PASSED ../tests/test_outputs.py::test_validation_vs_test_metrics FAILED ../tests/test_outputs.py::test_split_indices_reproducible - AssertionE... FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model ================== 2 failed, 28 passed, 11 warnings in 6.04s =================== |
| chained-forensic-extraction_20260101_011957 | security | ⏱ timeout | 24 | 2318s | 24 | <think>Let me run this to understand the data better. </think> |
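
The reported figures can be cross-checked against each other with a few lines of arithmetic. The sketch below assumes that the reported gen tokens/s is an average over the 95-minute metrics window (the page does not say how it is computed) and that the window length is rounded, so the comparison is approximate. All variable names are illustrative; the numbers are copied from the vLLM metrics block and from the Time column of the timed-out rows.

```python
from statistics import median

# Values copied from the vLLM metrics block. The 95-minute window is
# rounded on the page, so the derived rate is only approximate.
total_gen_tokens = 162_763
total_requests = 395
window_s = 95 * 60
reported_gen_tps = 28.5

# Assumption: the reported rate is total tokens averaged over the window.
derived_gen_tps = total_gen_tokens / window_s
tokens_per_request = total_gen_tokens / total_requests

# Wall-clock seconds of the 12 timed-out tasks, in table order.
timeout_s = [2383, 2259, 1977, 2123, 1839, 156,
             2291, 1629, 2261, 2051, 2039, 2318]

print(f"derived gen tokens/s: {derived_gen_tps:.2f} (reported {reported_gen_tps})")
print(f"mean gen tokens per request: {tokens_per_request:.0f}")
print(f"timeouts: n={len(timeout_s)}, median={median(timeout_s)}s, "
      f"min={min(timeout_s)}s, max={max(timeout_s)}s")
```

Under that assumption the derived rate lands within 0.1 tokens/s of the reported value. The timeout durations also show one outlier: bloom-filter-cache-penetration-prevention was cut off after 156s, while every other timed-out task ran for more than 1600s.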