Firworks/Kimi-Linear-48B-A3B-Instruct-nvfp4 · NVFP4 · 3B/48B · parser pythonic · smoke probes · live
kimi-linear
Hybrid linear-attention MoE using Kimi Delta Attention (KDA). Only 3B of its 48B parameters are active, so inference is very fast, but the capacity to follow long or complex instructions is severely limited.
0/20 tasks passed · 0% pass rate
0 pass · 20 grader fail · 0 timeout · 0 no tool calls
| Task | Category | Status | Turns | Time | Tool calls | What the model ended with |
|---|---|---|---|---|---|---|
| acl-permissions-inheritance | system-administration | ✗ fail | 1 | — | 0 | The user request is to create a shared directory structure with Linux ACLs (Access Control Lists) that demonstrates proper permission inheritance. The user is asking for a solution that shows how to set up a directory structure with specifi… grader output · grader verdict: FAIL (exit 0) _access - s... FAILED ../tests/test_outputs.py::test_bob_can_create_and_alice_can_access - s... FAILED ../tests/test_outputs.py::test_subdirectory_inheritance - subprocess.C... FAILED ../tests/test_outputs.py::test_outside_users_denied_access - Assertion... FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - subproce... FAILED ../tests/test_outputs.py::test_execute_permissions_on_scripts - subpro... ============================== 9 failed in 0.49s =============================== |
| amuse-install | scientific-computing | ✗ fail | 1 | — | 0 | I need to understand the problem better. Let you describe what you're trying to accomplish with these terminal commands and file operations? The instructions seem to be about creating, reading, and modifying files in a terminal environmen… grader output · grader verdict: FAIL (exit 0) 4c45f38c825f9549d2bc941de9fdc85 Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85 =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_test_script_unchanged FAILED ../tests/test_outputs.py::test_virtual_environment_created - Assertion... FAILED ../tests/test_outputs.py::test_test_script_executes - AssertionError: ... ========================= 2 failed, 1 passed in 0.03s ========================== |
| anomaly-detection-ranking | general | ✗ fail | 1 | — | 0 | It looks like you've shared a detailed and structured prompt, but it seems to be about a **textbook problem** or **assignment** involving anomaly detection, sensor networks, and machine learning. The goal is to build a system that can detec… grader output · grader verdict: FAIL (exit 0) ============ ____________________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Feedback: results.json not found Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.34s =============================== |
| api-endpoint-permission-canonicalizer | software-engineering | ✗ fail | 1 | — | 0 | The user is describing a complex, multi-level problem involving a "terminal agent" that must parse, understand, and then re-format a given text block into a specific output format. The user has provided a long, confusing, and meta-level des… grader output · grader verdict: FAIL (exit 0) or: Canoni... FAILED ../tests/test_outputs.py::test_path_variable_normalization_parent_matching FAILED ../tests/test_outputs.py::test_path_variable_normalization_complex - R... FAILED ../tests/test_outputs.py::test_deduplication_first_encountered - Runti... FAILED ../tests/test_outputs.py::test_deduplication_normalized_paths - Runtim... FAILED ../tests/test_outputs.py::test_deduplication_before_sorting - RuntimeE... ============================== 26 failed in 0.34s ============================== |
| application-debug | debugging | ✗ fail | 1 | — | 0 | I need to analyze a Java application error log and create a structured incident report. Let me start by examining the log files to understand the issue. ```bash # First, let's check what log files are available find /app -name "*.log" -typ… grader output · grader verdict: FAIL (exit 0) on - FileNot... FAILED ../tests/test_outputs.py::test_stack_traces_present - FileNotFoundErro... FAILED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions FAILED ../tests/test_outputs.py::test_stack_traces_include_line_numbers - Fil... FAILED ../tests/test_outputs.py::test_percentage_calculations - FileNotFoundE... FAILED ../tests/test_outputs.py::test_anti_cheat_validation - AssertionError:... ============================== 13 failed in 0.15s ============================== |
| auth_token_race_condition | backend-engineering | ✗ fail | 1 | — | 0 | It looks like your message might be incomplete or formatted incorrectly. Could you clarify what you need help with? Are you asking for help with: 1. A **Sshell scripting** task (e.g., writing a script to read files, patch code, or automate… grader output · grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds |
| bandit-delayed-feedback | machine-learning | ✗ fail | 1 | — | 0 | I need to analyze this complex problem step by step. The user has provided a very long, confusing, and somewhat nonsensical text that appears to be mixing together: 1. A story about a "Multi-Armed Bandit with Delayed Feedback" problem 2. So… grader output · grader verdict: FAIL (exit 0) hold ERROR ../tests/test_outputs.py::TestDelayModel::test_conversion_rate_error_threshold ERROR ../tests/test_outputs.py::TestArmStatistics::test_conversion_rate_rmse_threshold FAILED ../tests/test_outputs.py::TestOutputFormat::test_predictions_file_exists FAILED ../tests/test_outputs.py::TestOutputFormat::test_delay_model_file_exists FAILED ../tests/test_outputs.py::TestOutputFormat::test_arm_statistics_file_exists ========================= 3 failed, 17 errors in 1.33s ========================= |
| bash-log-processor-fix | bash_scripting | ✗ fail | 1 | — | 0 | I need to analyze this text for issues. Let me look for patterns that might indicate problems.<\|reserved_token_163595\|><\|reserved_token_163597\|>functions.search_files:0<\|reserved_token_163598\|>{"pattern": "race condition\|deadlock\|concurrent… grader output · grader verdict: FAIL (exit 0) ==== ____________________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Feedback: processed.log was not created. Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.01s =============================== |
| basic-message-queue | backend | ✗ fail | 1 | — | 0 | Looking at the provided text, this appears to be a description of a message queuing service implementation with specific requirements. However, I notice there's a mismatch in the requirements - the user has provided a complex description bu… grader output · grader verdict: FAIL (exit 0) ___________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Subscores: {'acceptance': 0.0} Feedback: solution.py not found Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.10s =============================== |
| battery-charging-optimization | general | ✗ fail | 1 | — | 0 | It looks like you've provided a very cleverly disguised prompt! However, I need to be careful about content that involves trying to get me to produce content that appears to be from a specific copyrighted source, even if it's been reworded.… grader output · grader verdict: FAIL (exit 0) _________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Subscores: {'pass': 0.0} Feedback: answer.txt not found. Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.01s =============================== |
| bloom-filter-cache-penetration-prevention | software-engineering | ✗ fail | 1 | — | 0 | We need to parse the user request. The user says: "Write a Bloom Filter implementation in Python using the standard library. The Bloom Filter should be able to handle large amounts of data and be efficient. The Bloom Filter must be able to… grader output · grader verdict: FAIL (exit 0) ray_operations ERROR ../tests/test_outputs.py::test_double_hashing_formula_verification - Mo... ERROR ../tests/test_outputs.py::test_code_documentation_formulas - ModuleNotF... ERROR ../tests/test_outputs.py::test_redis_client_reuse - ModuleNotFoundError... ERROR ../tests/test_outputs.py::test_performance_bloom_filter_efficiency - Mo... ERROR ../tests/test_outputs.py::test_bloom_filter_false_positive_behavior - M... ======================= 81 warnings, 27 errors in 1.64s ======================== |
| book-portfolio-analysis | data_engineering | ✗ fail | 1 | — | 0 | I need to analyze this problem step by step. This appears to be a complex system design problem disguised as a text-based puzzle. Let me break it down: 1. We're dealing with book portfolio analysis 2. There are constraints about book title… grader output · grader verdict: FAIL (exit 0) ___ ----------------------------- Captured stdout call ----------------------------- Subscores: {'all_criteria_met': 0.0} Feedback: WRONG: Solution file /workdir/sol.csv does not exist; Failed constraints: 1 Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 1.04s =============================== |
| bracket-sequence-restoration | general | ✗ fail | 1 | — | 0 | Looking at this problem, I need to implement a solution that handles regular bracket sequences and restores them to valid bracket sequences. This is a classic problem that can be solved using a stack-based approach. Let me break down the s… grader output · grader verdict: FAIL (exit 0) _validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Subscores: {'correctness': 0.0} Feedback: FAILED - solution.py not found in /workdir/ Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.01s =============================== |
| breast-cancer-mlflow | machine-learning | ✗ fail | 1 | — | 0 | ### Analysis of the Problem The problem presents a complex, multi-layered narrative that blends elements of a story about a data science or machine learning project with a twist of surreal or metaphorical storytelling. The narrative involv… grader output · grader verdict: FAIL (exit 0) eNotFoundError... FAILED ../tests/test_outputs.py::test_test_predictions_precision_recall - Fil... FAILED ../tests/test_outputs.py::test_val_metrics_threshold - FileNotFoundErr... FAILED ../tests/test_outputs.py::test_val_test_performance_consistency - File... FAILED ../tests/test_outputs.py::test_class_prediction_distribution - FileNot... FAILED ../tests/test_outputs.py::test_model_reproducible_from_hyperparameters ========================= 31 failed, 3 passed in 6.86s ========================= |
| broken-python | software-engineering | ✗ fail | 1 | — | 0 | It looks like your message got cut off at the end — could you clarify what you're trying to do with Python packages or package installation? Are you running into an issue with `pip`, `pip3`, or a specific package? Let me know what you're tr… grader output · grader verdict: FAIL (exit 0) Traceback (most recent call last): File "/usr/local/bin/pip", line 3, in <module> from pip._internal.cli.main import main ModuleNotFoundError: No module named 'pip' /tests/test.sh: line 5: pytest: command not found |
| build-merkle-tree-cli-sha512 | software-engineering | ✗ fail | 1 | — | 0 | I'll help you build a CLI tool that generates Merkle proofs. Let me first understand the requirements and then create the tool. <bash> # Create the directory structure for our Merkle CLI tool mkdir -p merkle-cli-tool cd merkle-cli-tool </b… grader output · grader verdict: FAIL (exit 0) oundError: ... FAILED ../tests/test_outputs.py::test_proof_generation - FileNotFoundError: [... FAILED ../tests/test_outputs.py::test_proof_verification - FileNotFoundError:... FAILED ../tests/test_outputs.py::test_verification_detects_tampering - FileNo... FAILED ../tests/test_outputs.py::test_all_files_have_proofs - FileNotFoundErr... FAILED ../tests/test_outputs.py::test_hash_lengths_are_128 - FileNotFoundErro... ============================== 9 failed in 0.15s =============================== |
| build-system-task-ordering | build-and-dependency-management | ✗ fail | 1 | — | 0 | I need to analyze this code to understand what it does and identify any potential issues. This appears to be a complex text processing system with multiple components. Let me start by examining the structure and understanding the core func… grader output · grader verdict: FAIL (exit 0) FoundError:... FAILED ../tests/test_outputs.py::test_08_empty_input - FileNotFoundError: [Er... FAILED ../tests/test_outputs.py::test_09_undefined_referenced_targets - FileN... FAILED ../tests/test_outputs.py::test_10_multiple_target_blocks - FileNotFoun... FAILED ../tests/test_outputs.py::test_11_directive_accumulation - FileNotFoun... FAILED ../tests/test_outputs.py::test_12_all_dependency_types - FileNotFoundE... ============================== 12 failed in 0.20s ============================== |
| california-housing-api | machine-learning | ✗ fail | 1 | — | 0 | We need to parse the problem. The user says: "## Context ### Context Build a complete, self-contained, self-executing FastAPI application that trains a regression model to predict California Housing prices. The application should include… grader output · grader verdict: FAIL (exit 0) served - Fil... FAILED ../tests/test_outputs.py::test_model_types_trained - FileNotFoundError... FAILED ../tests/test_outputs.py::test_model_prediction_on_hidden_test - FileN... FAILED ../tests/test_outputs.py::test_api_prediction_matches_model - FileNotF... FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou... FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model ========================= 29 failed, 1 passed in 4.55s ========================= |
| chained-forensic-extraction_20260101_011957 | security | ✗ fail | 1 | — | 0 | Thank you for the detailed context! Based on your description, you're working with a text processing or content generation task involving structured output formatting, likely for a document or report. To summarize what you’ve shared: ---… grader output · grader verdict: FAIL (exit 0) oherent_english FAILED ../tests/test_outputs.py::test_stage5_final_message_matches_ground_truth FAILED ../tests/test_outputs.py::test_chain_stage2_uses_stage1 - FileNotFound... FAILED ../tests/test_outputs.py::test_chain_stage3_uses_stage2 - FileNotFound... FAILED ../tests/test_outputs.py::test_chain_stage4_uses_stage3 - FileNotFound... FAILED ../tests/test_outputs.py::test_chain_stage5_uses_stage4 - FileNotFound... ========================= 36 failed, 2 passed in 0.39s ========================= |
| competitive-programming-solver | CP | ✗ fail | 1 | — | 0 | I need to understand the problem better. However, I notice that you've provided a lot of setup text about the terminal tool and the programming context, but I haven't seen the actual problem statement yet. Could you please provide: 1. The … grader output · grader verdict: FAIL (exit 0) __________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Feedback: Solution file /workdir/sol.csv does not exist Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.94s =============================== |
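
Most grader outputs in the table end with a pytest summary line such as `31 failed, 3 passed in 6.86s`, and several also report a `Score:` value. The following is a minimal sketch, assuming the grader output keeps that pytest format, of how those numbers could be extracted from a row's text. The helper names `parse_pytest_summary` and `parse_score` are hypothetical and are not part of the benchmark harness.

```python
import re

# Trailing pytest summary, e.g. "===== 31 failed, 3 passed in 6.86s =====".
SUMMARY_RE = re.compile(r"=+ (?P<counts>[^=]*?) in (?P<secs>[\d.]+)s =+")
# Individual outcome counts inside that summary.
COUNT_RE = re.compile(r"(\d+) (failed|passed|errors?|warnings?)")
# Score line printed by some graders, e.g. "Score: 0.0".
SCORE_RE = re.compile(r"Score: ([\d.]+)")


def parse_pytest_summary(text):
    """Return outcome counts from the pytest summary line, or None if absent."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    counts = {}
    for number, kind in COUNT_RE.findall(match.group("counts")):
        # Normalise plurals: "errors" -> "error", "warnings" -> "warning".
        counts[kind.rstrip("s")] = int(number)
    return counts


def parse_score(text):
    """Return the first reported score as a float, or None if absent."""
    match = SCORE_RE.search(text)
    return float(match.group(1)) if match else None


# Summary line copied from the breast-cancer-mlflow row.
line = "========================= 31 failed, 3 passed in 6.86s ========================="
print(parse_pytest_summary(line))  # {'failed': 31, 'passed': 3}
```

A pytest count of `1 passed` does not mean the task passed: in those rows the only passing test is `test_grader_validation`, and the recorded score is 0.0, so the score is the value to compare when one is present.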