MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16 · W4A16 · 10B/172B · parser minimax_m2 · smoke probes · live

minimax-m2.7-mjpansa

Same 172B / 10B-active REAP prune as saricles's working variant, but with AutoRound W4A16 quantisation instead of NVFP4. The 7.5 GB saving is just enough to clear the hermes-agent 64K context floor (toy probes were blocked on saricles because we had to cut context to 32K). It also runs on a different kernel path than NVFP4, which makes it a cleaner test of "is the M2.7 REAP quality good, separate from Blackwell-native quant choices?"

4/20 tasks passed · 20% pass rate

4 pass · 4 grader fail · 12 timeout · 0 no tool calls
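The headline pass rate follows directly from the four outcome counts. A minimal sketch of that arithmetic, with the dictionary keys chosen here for illustration (they are not names from the benchmark harness):

```python
# Outcome counts copied from the summary line; key names are illustrative.
outcomes = {"pass": 4, "grader fail": 4, "timeout": 12, "no tool calls": 0}

total = sum(outcomes.values())        # every task lands in exactly one bucket
pass_rate = outcomes["pass"] / total  # fraction of tasks the grader accepted

print(f"{outcomes['pass']}/{total} tasks passed · {pass_rate:.0%} pass rate")
# prints: 4/20 tasks passed · 20% pass rate
```

Timeouts account for 12 of the 16 non-passing tasks, so the pass rate here is bounded more by the time budget than by grader rejections.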
vLLM metrics · 3843 samples over 95m
gen tokens/s: 28.5
prompt tokens/s: 927
mean TTFT: 9.30s
mean TPOT: (not captured)
peak concurrency: 8
peak KV cache: (not captured)
total gen tokens: 162,763
total requests: 395
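As a rough cross-check, the reported generation throughput can be compared against total generated tokens divided by the sampling window. This sketch assumes the 95-minute window is the wall-clock span over which those tokens were produced and that "gen tokens/s" is an aggregate across all concurrent requests; neither is stated on the page, and the variable names are illustrative.

```python
# Figures copied from the vLLM metrics panel.
total_gen_tokens = 162_763
total_requests = 395
window_minutes = 95          # "over 95m" is rounded, so expect small drift
reported_gen_tps = 28.5

# Assumption: aggregate throughput = tokens generated / wall-clock seconds.
derived_gen_tps = total_gen_tokens / (window_minutes * 60)
tokens_per_request = total_gen_tokens / total_requests

print(f"derived {derived_gen_tps:.2f} tok/s vs reported {reported_gen_tps} tok/s")
print(f"about {tokens_per_request:.0f} generated tokens per request")
```

Under those assumptions the derived figure lands within about 0.1 tok/s of the reported one, which suggests the panel's numbers are internally consistent.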
Task | Category | Status | Turns | Time | Tool calls | What the model ended with
anomaly-detection-ranking | general | ✓ pass | 15 | | 22 | <think>The results.json file is valid and has the correct structure: - All 5 required keys are present - sensor has 100 predictions (matching 100 test rows) - network has 150 predictions (matching 150 test rows) - All ranking queries are pr…
basic-message-queue | backend | ✓ pass | 58 | | 57 | <think>The solution is complete and working. Let me provide a summary of what was implemented. </think> The message queue service is complete and running on port 8000. Here's a summary: ## Implementation Summary **File:** `/workdir/solut…
broken-python | software-engineering | ✓ pass | 10 | | 9 | <think>Both `python3 -m pip` and `pip3` are now working. The issue has been resolved. </think> **Fixed!** The problem was that your pip installation was corrupted - the package metadata existed but the actual pip module files were missing …
competitive-programming-solver | CP | ✓ pass | 7 | | 14 | <think>The solution has been created and the output file is in the correct format. Let me summarize what I did: The key insight is that from any element x, we can only reach values whose residues modulo k are either r or k-r, where r = x m…
acl-permissions-inheritance | system-administration | ⏱ timeout | 37 | 2383s | 36 | <think>The configuration is complete. Let me summarize what we have: 1. **Directory /srv/shared**: - Owned by group "research" ✓ - Group ID is 1001 (research) - The setgid bit could not be set due to container filesystem limitatio…
amuse-install | scientific-computing | ✗ fail | 17 | | 34 | <think>I see - bhtree and hermite are compiled C++ extensions that need to be built and installed separately. Let me check if they have Makefiles and what the installation process looks like. Looking at the test_script.py imports again: - …
  grader output
  grader verdict: FAIL (exit 0)
  486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85
  Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85
  =========================== short test summary info ============================
  PASSED ../tests/test_outputs.py::test_virtual_environment_created
  PASSED ../tests/test_outputs.py::test_test_script_unchanged
  FAILED ../tests/test_outputs.py::test_test_script_executes - AssertionError: ...
  ========================= 1 failed, 2 passed in 0.34s ==========================
api-endpoint-permission-canonicalizer | software-engineering | ⏱ timeout | 28 | 2259s | 28 | <think>Now I have: - `/api/users/{userId}/profile` - normalized to `/api/users/{var}/profile` - `/api/users/{id}` - normalized to `/api/users/{var}` - `/api/users/{id}/posts` - normalized to `/api/users/{var}/posts` The PLAN entry `/api/us…
application-debug | debugging | ✗ fail | 7 | | 11 | <think>The analysis is complete. Let me verify the output files exist and have the correct content: 1. **incident_report.json** - Contains: - Summary with total_files (5), total_exceptions (203), critical_failures (203), most_common_exc…
  grader output
  grader verdict: FAIL (exit 0)
  SED ../tests/test_outputs.py::test_critical_failures_calculation
  PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions
  PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers
  PASSED ../tests/test_outputs.py::test_percentage_calculations
  PASSED ../tests/test_outputs.py::test_anti_cheat_validation
  FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ...
  ========================= 1 failed, 12 passed in 0.10s =========================
auth_token_race_condition | backend-engineering | ✗ fail | 9 | | 8 | <think>The file is created with the race condition fix: 1. After the initial check for `user_id` and `username` (line 36-37), the code acquires a lock using `await self.acquire_lock(f"refresh_lock:{refresh_token}")` (line 41) 2. If the loc…
  grader output
  grader verdict: FAIL (exit 1)
  Starting test suite for concurrent token refresh...
  Killing any existing uvicorn processes...
  Waiting for port 8000 to be available...
  Waiting for Redis to be ready...
  /tests/test.sh: line 27: /workspace/start.sh: No such file or directory
  Redis failed to start within 60 seconds
bandit-delayed-feedback | machine-learning | ⏱ timeout | 9 | 1977s | 15 | <think>The results are better now - the conversion rates look more reasonable (around 0.04 to 0.14). But I notice that all predictions are going to arms 5 and 7, which seems like the model might be too simplistic. Also, the arm statistics s…
bash-log-processor-fix | bash_scripting | ⏱ timeout | 18 | 2123s | 25 | <think>The find command is working. The issue must be in the function. Let me test the function directly: </think>
battery-charging-optimization | general | ⏱ timeout | 1 | 1839s | 1 | <think>We have a problem: We have two batteries A and B with capacities 753 and 357 units, respectively. Initially both empty. We can perform operations: CHARGE A (fully charge A to capacity, i.e., fill it to 753), CHARGE B (fill to 357), D…
bloom-filter-cache-penetration-prevention | software-engineering | ⏱ timeout | 3 | 156s | 5 | <think>The data directory exists but products.txt doesn't exist yet. I'll need to create the full implementation. Let me create the directory structure and implement the solution. Let me create: 1. `/app/data/` directory (already exists) 2…
book-portfolio-analysis | data_engineering | ⏱ timeout | 26 | 2291s | 29 | <think>The regex isn't matching. Let me check what the actual content looks like and adjust my patterns. </think>
bracket-sequence-restoration | general | ⏱ timeout | 4 | 1629s | 4 | <think>Hmm, the third test case shows "NO" but the expected output is "YES". Let me re-examine. For `((?)`: - Position 0 '(' : mn=1, mx=1 - Position 1 '(' : mn=2, mx=2 - Position 2 '?' : mn=1, mx=3 - Position 3 ')' : mn=0, mx=2 Wait, the …
breast-cancer-mlflow | machine-learning | ⏱ timeout | 38 | 2261s | 47 | <think>Let me test if the model loads now. </think>
build-merkle-tree-cli-sha512 | software-engineering | ⏱ timeout | 13 | 2051s | 13 | <think>Verification works. Let me also test with a nested file and test failure cases. </think>
build-system-task-ordering | build-and-dependency-management | ⏱ timeout | 5 | 2039s | 5 | <think>Let me test again to see the output. The key issue is that I'm working with the universe which may contain non-canonical names. But the output should use canonical names. Wait, I think the issue is more fundamental. Let me re-read t…
california-housing-api | machine-learning | ✗ fail | 60 | | 73 | <think>The split_info.json file looks correct. Let me also verify the API is still running and test the batch endpoint. </think> Let me verify the API is still running and do a final verification:
  grader output
  grader verdict: FAIL (exit 0)
  ./tests/test_outputs.py::test_output_directory_structure
  PASSED ../tests/test_outputs.py::test_model_prediction_on_hidden_test
  PASSED ../tests/test_outputs.py::test_api_prediction_matches_model
  PASSED ../tests/test_outputs.py::test_validation_vs_test_metrics
  FAILED ../tests/test_outputs.py::test_split_indices_reproducible - AssertionE...
  FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model
  ================== 2 failed, 28 passed, 11 warnings in 6.04s ===================
chained-forensic-extraction_20260101_011957 | security | ⏱ timeout | 24 | 2318s | 24 | <think>Let me run this to understand the data better. </think>
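The grader output blocks in the table are tails of pytest runs, so the per-test lines shown are only the last few, while the closing `=====` line carries the full counts. A minimal sketch of reading both, using the california-housing-api tail as input; the `summarize` helper and its regexes are illustrative and are not part of the benchmark harness.

```python
import re
from collections import Counter

# Tail of the california-housing-api grader output, copied from the table.
tail = """\
PASSED ../tests/test_outputs.py::test_model_prediction_on_hidden_test
PASSED ../tests/test_outputs.py::test_api_prediction_matches_model
PASSED ../tests/test_outputs.py::test_validation_vs_test_metrics
FAILED ../tests/test_outputs.py::test_split_indices_reproducible - AssertionE...
FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model
================== 2 failed, 28 passed, 11 warnings in 6.04s ===================
"""

def summarize(text):
    """Return (per-test lines seen, totals from the final summary line)."""
    listed = Counter()
    failed_tests = []
    totals = {}
    for line in text.splitlines():
        m = re.match(r"(PASSED|FAILED)\s+\S+::(\w+)", line)
        if m:
            listed[m.group(1)] += 1
            if m.group(1) == "FAILED":
                failed_tests.append(m.group(2))
        elif line.startswith("=") and " in " in line:
            # The closing banner is authoritative; the listing may be cut off.
            for count, label in re.findall(r"(\d+) (passed|failed|warnings?)", line):
                totals[label] = int(count)
    return listed, failed_tests, totals

listed, failed_tests, totals = summarize(tail)
print(dict(listed))   # lines visible in the truncated tail
print(failed_tests)   # names of the failing tests
print(totals)         # counts from the closing banner
```

Here the tail lists only 3 of the 28 passing tests, so any tally should come from the banner line and not from counting the visible `PASSED` rows.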