Firworks/Kimi-Linear-48B-A3B-Instruct-nvfp4 · NVFP4 · 3B/48B · parser pythonic · smoke probes · live

kimi-linear

Kimi Delta Attention (KDA) hybrid linear-attention MoE. Only 3B of the 48B parameters are active per token, so inference is very fast, but the model has severely limited capacity to follow long or complex instructions.

0/20 tasks passed · 0% pass rate

0 pass · 20 grader fail · 0 timeout · 0 no tool calls

Task	Category	Status	Turns	Time	What the model ended with (followed by grader output)
acl-permissions-inheritance	system-administration	✗ fail	1	—	The user request is to create a shared directory structure with Linux ACLs (Access Control Lists) that demonstrates proper permission inheritance. The user is asking for a solution that shows how to set up a directory structure with specifi… grader output grader verdict: FAIL (exit 0) _access - s... FAILED ../tests/test_outputs.py::test_bob_can_create_and_alice_can_access - s... FAILED ../tests/test_outputs.py::test_subdirectory_inheritance - subprocess.C... FAILED ../tests/test_outputs.py::test_outside_users_denied_access - Assertion... FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - subproce... FAILED ../tests/test_outputs.py::test_execute_permissions_on_scripts - subpro... ============================== 9 failed in 0.49s ===============================
amuse-install	scientific-computing	✗ fail	1	—	I need to understand the problem better. Let you describe what you're trying to accomplish with these terminal commands and file operations? The instructions seem to be about creating, reading, and modifying files in a terminal environmen… grader output grader verdict: FAIL (exit 0) 4c45f38c825f9549d2bc941de9fdc85 Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85 =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_test_script_unchanged FAILED ../tests/test_outputs.py::test_virtual_environment_created - Assertion... FAILED ../tests/test_outputs.py::test_test_script_executes - AssertionError: ... ========================= 2 failed, 1 passed in 0.03s ==========================
anomaly-detection-ranking	general	✗ fail	1	—	It looks like you've shared a detailed and structured prompt, but it seems to be about a textbook problem or assignment involving anomaly detection, sensor networks, and machine learning. The goal is to build a system that can detec… grader output grader verdict: FAIL (exit 0) ============ ____________________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Feedback: results.json not found Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.34s ===============================
api-endpoint-permission-canonicalizer	software-engineering	✗ fail	1	—	The user is describing a complex, multi-level problem involving a "terminal agent" that must parse, understand, and then re-format a given text block into a specific output format. The user has provided a long, confusing, and meta-level des… grader output grader verdict: FAIL (exit 0) or: Canoni... FAILED ../tests/test_outputs.py::test_path_variable_normalization_parent_matching FAILED ../tests/test_outputs.py::test_path_variable_normalization_complex - R... FAILED ../tests/test_outputs.py::test_deduplication_first_encountered - Runti... FAILED ../tests/test_outputs.py::test_deduplication_normalized_paths - Runtim... FAILED ../tests/test_outputs.py::test_deduplication_before_sorting - RuntimeE... ============================== 26 failed in 0.34s ==============================
application-debug	debugging	✗ fail	1	—	I need to analyze a Java application error log and create a structured incident report. Let me start by examining the log files to understand the issue. ```bash # First, let's check what log files are available find /app -name "*.log" -typ… grader output grader verdict: FAIL (exit 0) on - FileNot... FAILED ../tests/test_outputs.py::test_stack_traces_present - FileNotFoundErro... FAILED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions FAILED ../tests/test_outputs.py::test_stack_traces_include_line_numbers - Fil... FAILED ../tests/test_outputs.py::test_percentage_calculations - FileNotFoundE... FAILED ../tests/test_outputs.py::test_anti_cheat_validation - AssertionError:... ============================== 13 failed in 0.15s ==============================
auth_token_race_condition	backend-engineering	✗ fail	1	—	It looks like your message might be incomplete or formatted incorrectly. Could you clarify what you need help with? Are you asking for help with: 1. A Sshell scripting task (e.g., writing a script to read files, patch code, or automate… grader output grader verdict: FAIL (exit 1) Starting test suite for concurrent token refresh... Killing any existing uvicorn processes... Waiting for port 8000 to be available... Waiting for Redis to be ready... /tests/test.sh: line 27: /workspace/start.sh: No such file or directory Redis failed to start within 60 seconds
bandit-delayed-feedback	machine-learning	✗ fail	1	—	I need to analyze this complex problem step by step. The user has provided a very long, confusing, and somewhat nonsensical text that appears to be mixing together: 1. A story about a "Multi-Armed Bandit with Delayed Feedback" problem 2. So… grader output grader verdict: FAIL (exit 0) hold ERROR ../tests/test_outputs.py::TestDelayModel::test_conversion_rate_error_threshold ERROR ../tests/test_outputs.py::TestArmStatistics::test_conversion_rate_rmse_threshold FAILED ../tests/test_outputs.py::TestOutputFormat::test_predictions_file_exists FAILED ../tests/test_outputs.py::TestOutputFormat::test_delay_model_file_exists FAILED ../tests/test_outputs.py::TestOutputFormat::test_arm_statistics_file_exists ========================= 3 failed, 17 errors in 1.33s =========================
bash-log-processor-fix	bash_scripting	✗ fail	1	—	I need to analyze this text for issues. Let me look for patterns that might indicate problems.<\|reserved_token_163595\|><\|reserved_token_163597\|>functions.search_files:0<\|reserved_token_163598\|>{"pattern": "race condition\|deadlock\|concurrent… grader output grader verdict: FAIL (exit 0) ==== ____________________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Feedback: processed.log was not created. Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.01s ===============================
basic-message-queue	backend	✗ fail	1	—	Looking at the provided text, this appears to be a description of a message queuing service implementation with specific requirements. However, I notice there's a mismatch in the requirements - the user has provided a complex description bu… grader output grader verdict: FAIL (exit 0) ___________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Subscores: {'acceptance': 0.0} Feedback: solution.py not found Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.10s ===============================
battery-charging-optimization	general	✗ fail	1	—	It looks like you've provided a very cleverly disguised prompt! However, I need to be careful about content that involves trying to get me to produce content that appears to be from a specific copyrighted source, even if it's been reworded.… grader output grader verdict: FAIL (exit 0) _________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Subscores: {'pass': 0.0} Feedback: answer.txt not found. Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.01s ===============================
bloom-filter-cache-penetration-prevention	software-engineering	✗ fail	1	—	We need to parse the user request. The user says: "Write a Bloom Filter implementation in Python using the standard library. The Bloom Filter should be able to handle large amounts of data and be efficient. The Bloom Filter must be able to… grader output grader verdict: FAIL (exit 0) ray_operations ERROR ../tests/test_outputs.py::test_double_hashing_formula_verification - Mo... ERROR ../tests/test_outputs.py::test_code_documentation_formulas - ModuleNotF... ERROR ../tests/test_outputs.py::test_redis_client_reuse - ModuleNotFoundError... ERROR ../tests/test_outputs.py::test_performance_bloom_filter_efficiency - Mo... ERROR ../tests/test_outputs.py::test_bloom_filter_false_positive_behavior - M... ======================= 81 warnings, 27 errors in 1.64s ========================
book-portfolio-analysis	data_engineering	✗ fail	1	—	I need to analyze this problem step by step. This appears to be a complex system design problem disguised as a text-based puzzle. Let me break it down: 1. We're dealing with book portfolio analysis 2. There are constraints about book title… grader output grader verdict: FAIL (exit 0) ___ ----------------------------- Captured stdout call ----------------------------- Subscores: {'all_criteria_met': 0.0} Feedback: WRONG: Solution file /workdir/sol.csv does not exist; Failed constraints: 1 Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 1.04s ===============================
bracket-sequence-restoration	general	✗ fail	1	—	Looking at this problem, I need to implement a solution that handles regular bracket sequences and restores them to valid bracket sequences. This is a classic problem that can be solved using a stack-based approach. Let me break down the s… grader output grader verdict: FAIL (exit 0) _validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Subscores: {'correctness': 0.0} Feedback: FAILED - solution.py not found in /workdir/ Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.01s ===============================
breast-cancer-mlflow	machine-learning	✗ fail	1	—	### Analysis of the Problem The problem presents a complex, multi-layered narrative that blends elements of a story about a data science or machine learning project with a twist of surreal or metaphorical storytelling. The narrative involv… grader output grader verdict: FAIL (exit 0) eNotFoundError... FAILED ../tests/test_outputs.py::test_test_predictions_precision_recall - Fil... FAILED ../tests/test_outputs.py::test_val_metrics_threshold - FileNotFoundErr... FAILED ../tests/test_outputs.py::test_val_test_performance_consistency - File... FAILED ../tests/test_outputs.py::test_class_prediction_distribution - FileNot... FAILED ../tests/test_outputs.py::test_model_reproducible_from_hyperparameters ========================= 31 failed, 3 passed in 6.86s =========================
broken-python	software-engineering	✗ fail	1	—	It looks like your message got cut off at the end — could you clarify what you're trying to do with Python packages or package installation? Are you running into an issue with `pip`, `pip3`, or a specific package? Let me know what you're tr… grader output grader verdict: FAIL (exit 0) Traceback (most recent call last): File "/usr/local/bin/pip", line 3, in <module> from pip._internal.cli.main import main ModuleNotFoundError: No module named 'pip' /tests/test.sh: line 5: pytest: command not found
build-merkle-tree-cli-sha512	software-engineering	✗ fail	1	—	I'll help you build a CLI tool that generates Merkle proofs. Let me first understand the requirements and then create the tool. <bash> # Create the directory structure for our Merkle CLI tool mkdir -p merkle-cli-tool cd merkle-cli-tool </b… grader output grader verdict: FAIL (exit 0) oundError: ... FAILED ../tests/test_outputs.py::test_proof_generation - FileNotFoundError: [... FAILED ../tests/test_outputs.py::test_proof_verification - FileNotFoundError:... FAILED ../tests/test_outputs.py::test_verification_detects_tampering - FileNo... FAILED ../tests/test_outputs.py::test_all_files_have_proofs - FileNotFoundErr... FAILED ../tests/test_outputs.py::test_hash_lengths_are_128 - FileNotFoundErro... ============================== 9 failed in 0.15s ===============================
build-system-task-ordering	build-and-dependency-management	✗ fail	1	—	I need to analyze this code to understand what it does and identify any potential issues. This appears to be a complex text processing system with multiple components. Let me start by examining the structure and understanding the core func… grader output grader verdict: FAIL (exit 0) FoundError:... FAILED ../tests/test_outputs.py::test_08_empty_input - FileNotFoundError: [Er... FAILED ../tests/test_outputs.py::test_09_undefined_referenced_targets - FileN... FAILED ../tests/test_outputs.py::test_10_multiple_target_blocks - FileNotFoun... FAILED ../tests/test_outputs.py::test_11_directive_accumulation - FileNotFoun... FAILED ../tests/test_outputs.py::test_12_all_dependency_types - FileNotFoundE... ============================== 12 failed in 0.20s ==============================
california-housing-api	machine-learning	✗ fail	1	—	We need to parse the problem. The user says: "## Context ### Context Build a complete, self-contained, self-executing FastAPI application that trains a regression model to predict California Housing prices. The application should include… grader output grader verdict: FAIL (exit 0) served - Fil... FAILED ../tests/test_outputs.py::test_model_types_trained - FileNotFoundError... FAILED ../tests/test_outputs.py::test_model_prediction_on_hidden_test - FileN... FAILED ../tests/test_outputs.py::test_api_prediction_matches_model - FileNotF... FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou... FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model ========================= 29 failed, 1 passed in 4.55s =========================
chained-forensic-extraction_20260101_011957	security	✗ fail	1	—	Thank you for the detailed context! Based on your description, you're working with a text processing or content generation task involving structured output formatting, likely for a document or report. To summarize what you’ve shared: ---… grader output grader verdict: FAIL (exit 0) oherent_english FAILED ../tests/test_outputs.py::test_stage5_final_message_matches_ground_truth FAILED ../tests/test_outputs.py::test_chain_stage2_uses_stage1 - FileNotFound... FAILED ../tests/test_outputs.py::test_chain_stage3_uses_stage2 - FileNotFound... FAILED ../tests/test_outputs.py::test_chain_stage4_uses_stage3 - FileNotFound... FAILED ../tests/test_outputs.py::test_chain_stage5_uses_stage4 - FileNotFound... ========================= 36 failed, 2 passed in 0.39s =========================
competitive-programming-solver	CP	✗ fail	1	—	I need to understand the problem better. However, I notice that you've provided a lot of setup text about the terminal tool and the programming context, but I haven't seen the actual problem statement yet. Could you please provide: 1. The … grader output grader verdict: FAIL (exit 0) __________________ test_grader_validation ____________________________ ----------------------------- Captured stdout call ----------------------------- Feedback: Solution file /workdir/sol.csv does not exist Score: 0.0 ✓ Grader validation completed. Score: 0.0 written to reward.txt =========================== short test summary info ============================ PASSED ../tests/test_outputs.py::test_grader_validation ============================== 1 passed in 0.94s ===============================
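
Most grader outputs in the table end with a pytest summary line such as `31 failed, 3 passed in 6.86s`. The counts in those lines can be tallied mechanically. The following is a minimal sketch, not part of the benchmark harness: the function name `parse_pytest_summary` and the regular expression are illustrative assumptions, and it only handles the outcome words that appear in the rows above (passed, failed, errors, warnings).

```python
import re

# Assumed shape of a pytest summary line: "<N> failed, <M> passed in <T>s",
# optionally padded with "=" characters. Only the outcome words seen in the
# table above are recognised.
COUNT_RE = re.compile(r"(\d+) (passed|failed|errors?|warnings?)")


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Return outcome counts from one pytest summary line (hypothetical helper)."""
    counts: dict[str, int] = {}
    for number, outcome in COUNT_RE.findall(line):
        # Normalise singular forms: "error" -> "errors", "warning" -> "warnings".
        key = outcome if outcome.endswith(("s", "ed")) else outcome + "s"
        counts[key] = int(number)
    return counts


if __name__ == "__main__":
    print(parse_pytest_summary("==== 31 failed, 3 passed in 6.86s ===="))
    print(parse_pytest_summary("==== 81 warnings, 27 errors in 1.64s ===="))
```

A row can show `1 passed` in pytest and still carry a FAIL verdict: in those rows the only test that ran is `test_grader_validation`, and the captured output reports `Score: 0.0 written to reward.txt`, so pytest counts alone do not determine the verdict.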