MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16 · W4A16 · 10B/172B · parser minimax_m2 · smoke probes · live

minimax-m2.7-mjpansa

Same 172B-total / 10B-active REAP prune as saricles's working variant, but with AutoRound W4A16 quantisation instead of NVFP4. The 7.5 GB saving is exactly enough to clear the hermes-agent 64K context floor (toy probes were blocked on the saricles variant because we had to cut context to 32K). It also runs on a different kernel path from NVFP4, which makes it a cleaner test of "is the M2.7 REAP quality good, separate from Blackwell-native quant choices?"

4/20 tasks passed · 20% pass rate

4 pass · 4 grader fail · 12 timeout · 0 no tool calls
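The headline pass rate follows directly from the status counts above. A minimal sketch of that arithmetic (the variable names are illustrative, not taken from the benchmark harness):

```python
from collections import Counter

# Status counts copied from the summary line above.
counts = Counter({"pass": 4, "grader fail": 4, "timeout": 12, "no tool calls": 0})

total = sum(counts.values())          # every task lands in exactly one bucket
pass_rate = counts["pass"] / total    # fraction of tasks the grader accepted

print(f"{counts['pass']}/{total} tasks passed · {pass_rate:.0%} pass rate")
```

Timeouts make up most of the remainder: 12 of the 16 non-passing tasks never reached the grader.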

vLLM metrics · 3843 samples over 95m

gen tokens/s	28.5
prompt tokens/s	927
mean TTFT	9.30s
mean TPOT	—
peak concurrency
peak KV cache	—
total gen tokens	162,763
total requests	395
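As a rough sanity check, the reported generation throughput can be compared against the totals. The sketch below assumes (this is not stated on the page) that "gen tokens/s" is an aggregate over the whole sampling window, and it treats the rounded "95m" as exactly 95 minutes, so agreement can only be approximate.

```python
# Figures copied from the vLLM metrics above.
total_gen_tokens = 162_763
total_requests = 395
window_minutes = 95          # displayed as "95m"; assumed here to be exact
reported_gen_tps = 28.5

window_seconds = window_minutes * 60

# Aggregate throughput if all generated tokens are spread over the window.
aggregate_gen_tps = total_gen_tokens / window_seconds

# Average completion length per request.
mean_tokens_per_request = total_gen_tokens / total_requests

print(f"aggregate gen tokens/s ~ {aggregate_gen_tps:.2f} (reported {reported_gen_tps})")
print(f"mean gen tokens per request ~ {mean_tokens_per_request:.0f}")
```

Under those assumptions the aggregate comes out within about 0.1 tokens/s of the reported 28.5, so the two figures are at least mutually consistent.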

Task	Category	Status	Turns	Time	Tool calls	What the model ended with
anomaly-detection-ranking	general	✓ pass	15	—	22	<think>The results.json file is valid and has the correct structure: - All 5 required keys are present - sensor has 100 predictions (matching 100 test rows) - network has 150 predictions (matching 150 test rows) - All ranking queries are pr…
basic-message-queue	backend	✓ pass	58	—	57	<think>The solution is complete and working. Let me provide a summary of what was implemented. </think> The message queue service is complete and running on port 8000. Here's a summary: ## Implementation Summary File: `/workdir/solut…
broken-python	software-engineering	✓ pass	10	—	9	<think>Both `python3 -m pip` and `pip3` are now working. The issue has been resolved. </think> Fixed! The problem was that your pip installation was corrupted - the package metadata existed but the actual pip module files were missing …
competitive-programming-solver	CP	✓ pass	7	—	14	<think>The solution has been created and the output file is in the correct format. Let me summarize what I did: The key insight is that from any element x, we can only reach values whose residues modulo k are either r or k-r, where r = x m…
acl-permissions-inheritance	system-administration	⏱ timeout	37	2383s	36	<think>The configuration is complete. Let me summarize what we have: 1. Directory /srv/shared: - Owned by group "research" ✓ - Group ID is 1001 (research) - The setgid bit could not be set due to container filesystem limitatio…
amuse-install	scientific-computing	✗ fail	17	—	34	<think>I see - bhtree and hermite are compiled C++ extensions that need to be built and installed separately. Let me check if they have Makefiles and what the installation process looks like. Looking at the test_script.py imports again: - … [grader output] grader verdict: FAIL (exit 0) 486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85 Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85 =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_virtual_environment_created PASSED ../tests/test_outputs.py::test_test_script_unchanged FAILED ../tests/test_outputs.py::test_test_script_executes - AssertionError: ... ========================= 1 failed, 2 passed in 0.34s ==========================
api-endpoint-permission-canonicalizer	software-engineering	⏱ timeout	28	2259s	28	<think>Now I have: - `/api/users/{userId}/profile` - normalized to `/api/users/{var}/profile` - `/api/users/{id}` - normalized to `/api/users/{var}` - `/api/users/{id}/posts` - normalized to `/api/users/{var}/posts` The PLAN entry `/api/us…
application-debug	debugging	✗ fail	7	—	11	<think>The analysis is complete. Let me verify the output files exist and have the correct content: 1. incident_report.json - Contains: - Summary with total_files (5), total_exceptions (203), critical_failures (203), most_common_exc… [grader output] grader verdict: FAIL (exit 0) SED ../tests/test_outputs.py::test_critical_failures_calculation PASSED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions PASSED ../tests/test_outputs.py::test_stack_traces_include_line_numbers PASSED ../tests/test_outputs.py::test_percentage_calculations PASSED ../tests/test_outputs.py::test_anti_cheat_validation FAILED ../tests/test_outputs.py::test_stack_traces_present - AssertionError: ... ========================= 1 failed, 12 passed in 0.10s =========================
auth_token_race_condition	backend-engineering	✗ fail	9	—	8	<think>The file is created with the race condition fix: 1. After the initial check for `user_id` and `username` (line 36-37), the code acquires a lock using `await self.acquire_lock(f"refresh_lock:{refresh_token}")` (line 41) 2. If the loc… [grader output] grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds
bandit-delayed-feedback	machine-learning	⏱ timeout	9	1977s	15	<think>The results are better now - the conversion rates look more reasonable (around 0.04 to 0.14). But I notice that all predictions are going to arms 5 and 7, which seems like the model might be too simplistic. Also, the arm statistics s…
bash-log-processor-fix	bash_scripting	⏱ timeout	18	2123s	25	<think>The find command is working. The issue must be in the function. Let me test the function directly: </think>
battery-charging-optimization	general	⏱ timeout	1	1839s	1	<think>We have a problem: We have two batteries A and B with capacities 753 and 357 units, respectively. Initially both empty. We can perform operations: CHARGE A (fully charge A to capacity, i.e., fill it to 753), CHARGE B (fill to 357), D…
bloom-filter-cache-penetration-prevention	software-engineering	⏱ timeout	3	156s	5	<think>The data directory exists but products.txt doesn't exist yet. I'll need to create the full implementation. Let me create the directory structure and implement the solution. Let me create: 1. `/app/data/` directory (already exists) 2…
book-portfolio-analysis	data_engineering	⏱ timeout	26	2291s	29	<think>The regex isn't matching. Let me check what the actual content looks like and adjust my patterns. </think>
bracket-sequence-restoration	general	⏱ timeout	4	1629s	4	<think>Hmm, the third test case shows "NO" but the expected output is "YES". Let me re-examine. For `((?)`: - Position 0 '(' : mn=1, mx=1 - Position 1 '(' : mn=2, mx=2 - Position 2 '?' : mn=1, mx=3 - Position 3 ')' : mn=0, mx=2 Wait, the …
breast-cancer-mlflow	machine-learning	⏱ timeout	38	2261s	47	<think>Let me test if the model loads now. </think>
build-merkle-tree-cli-sha512	software-engineering	⏱ timeout	13	2051s	13	<think>Verification works. Let me also test with a nested file and test failure cases. </think>
build-system-task-ordering	build-and-dependency-management	⏱ timeout	5	2039s	5	<think>Let me test again to see the output. The key issue is that I'm working with the universe which may contain non-canonical names. But the output should use canonical names. Wait, I think the issue is more fundamental. Let me re-read t…
california-housing-api	machine-learning	✗ fail	60	—	73	<think>The split_info.json file looks correct. Let me also verify the API is still running and test the batch endpoint. </think> Let me verify the API is still running and do a final verification: [grader output] grader verdict: FAIL (exit 0) ./tests/test_outputs.py::test_output_directory_structure PASSED ../tests/test_outputs.py::test_model_prediction_on_hidden_test PASSED ../tests/test_outputs.py::test_api_prediction_matches_model PASSED ../tests/test_outputs.py::test_validation_vs_test_metrics FAILED ../tests/test_outputs.py::test_split_indices_reproducible - AssertionE... FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model ================== 2 failed, 28 passed, 11 warnings in 6.04s ===================
chained-forensic-extraction_20260101_011957	security	⏱ timeout	24	2318s	24	<think>Let me run this to understand the data better. </think>
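The per-task table can be tallied to cross-check the summary counts and to see how long the timed-out tasks ran. A minimal sketch, with the Status and Time columns transcribed by hand from the rows above (the tuple layout and variable names are illustrative):

```python
from collections import Counter
from statistics import median

# (task, status, seconds) copied from the table above;
# seconds is None where the Time column shows a dash.
ROWS = [
    ("anomaly-detection-ranking", "pass", None),
    ("basic-message-queue", "pass", None),
    ("broken-python", "pass", None),
    ("competitive-programming-solver", "pass", None),
    ("acl-permissions-inheritance", "timeout", 2383),
    ("amuse-install", "fail", None),
    ("api-endpoint-permission-canonicalizer", "timeout", 2259),
    ("application-debug", "fail", None),
    ("auth_token_race_condition", "fail", None),
    ("bandit-delayed-feedback", "timeout", 1977),
    ("bash-log-processor-fix", "timeout", 2123),
    ("battery-charging-optimization", "timeout", 1839),
    ("bloom-filter-cache-penetration-prevention", "timeout", 156),
    ("book-portfolio-analysis", "timeout", 2291),
    ("bracket-sequence-restoration", "timeout", 1629),
    ("breast-cancer-mlflow", "timeout", 2261),
    ("build-merkle-tree-cli-sha512", "timeout", 2051),
    ("build-system-task-ordering", "timeout", 2039),
    ("california-housing-api", "fail", None),
    ("chained-forensic-extraction_20260101_011957", "timeout", 2318),
]

status_counts = Counter(status for _, status, _ in ROWS)
timeout_secs = sorted(secs for _, status, secs in ROWS if status == "timeout")

print(dict(status_counts))
print(f"timeouts: n={len(timeout_secs)}, min={timeout_secs[0]}s, "
      f"median={median(timeout_secs)}s, max={timeout_secs[-1]}s")
```

The tally matches the summary (4 pass, 4 fail, 12 timeout). All but one of the timeouts ran for more than 1,600 s; the exception is bloom-filter-cache-penetration-prevention, recorded at 156 s.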