Firworks/Kimi-Linear-48B-A3B-Instruct-nvfp4 · NVFP4 · 3B/48B · parser pythonic · smoke probes · live

kimi-linear

Kimi Delta Attention (KDA) hybrid linear-attention MoE. The small active-parameter count (3B of 48B total) makes inference very fast but severely limits the model's capacity to follow long or complex instructions.

0/20 tasks passed · 0% pass rate

0 pass · 20 grader fail · 0 timeout · 0 no tool calls
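The headline figures can be reproduced from the four outcome counts. The sketch below is only illustrative: the `RunSummary` class and its field names are hypothetical, and it assumes the four buckets are mutually exclusive and together cover every task, which the page does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Outcome counts as shown in the summary line (field names are hypothetical)."""
    passed: int
    grader_fail: int
    timeout: int
    no_tool_calls: int

    @property
    def total(self) -> int:
        # Assumption: the buckets do not overlap, so their sum is the task count.
        return self.passed + self.grader_fail + self.timeout + self.no_tool_calls

    @property
    def pass_rate(self) -> float:
        # Guard against an empty run to avoid dividing by zero.
        return self.passed / self.total if self.total else 0.0


summary = RunSummary(passed=0, grader_fail=20, timeout=0, no_tool_calls=0)
print(f"{summary.passed}/{summary.total} tasks passed · {summary.pass_rate:.0%} pass rate")
```

With the counts from this run, the sketch prints the same headline as the page: 0/20 tasks passed and a 0% pass rate.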
Task Category Status Turns Time Tool calls What the model ended with
acl-permissions-inheritance system-administration ✗ fail 1 0 The user request is to create a shared directory structure with Linux ACLs (Access Control Lists) that demonstrates proper permission inheritance. The user is asking for a solution that shows how to set up a directory structure with specifi…
grader output
grader verdict: FAIL (exit 0)
_access - s...
FAILED ../tests/test_outputs.py::test_bob_can_create_and_alice_can_access - s...
FAILED ../tests/test_outputs.py::test_subdirectory_inheritance - subprocess.C...
FAILED ../tests/test_outputs.py::test_outside_users_denied_access - Assertion...
FAILED ../tests/test_outputs.py::test_file_permissions_inheritance - subproce...
FAILED ../tests/test_outputs.py::test_execute_permissions_on_scripts - subpro...
============================== 9 failed in 0.49s ===============================
amuse-install scientific-computing ✗ fail 1 0 I need to understand the problem better. Let you describe what you're trying to accomplish with these terminal commands and file operations? The instructions seem to be about creating, reading, and modifying files in a terminal environmen…
grader output
grader verdict: FAIL (exit 0)
4c45f38c825f9549d2bc941de9fdc85
Expected file hash: eb181d4c3ea88b13e6486c731f2bdd7c84c45f38c825f9549d2bc941de9fdc85
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_test_script_unchanged
FAILED ../tests/test_outputs.py::test_virtual_environment_created - Assertion...
FAILED ../tests/test_outputs.py::test_test_script_executes - AssertionError: ...
========================= 2 failed, 1 passed in 0.03s ==========================
anomaly-detection-ranking general ✗ fail 1 0 It looks like you've shared a detailed and structured prompt, but it seems to be about a **textbook problem** or **assignment** involving anomaly detection, sensor networks, and machine learning. The goal is to build a system that can detec…
grader output
grader verdict: FAIL (exit 0)
============
____________________________ test_grader_validation ____________________________
----------------------------- Captured stdout call -----------------------------
Feedback: results.json not found
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.34s ===============================
api-endpoint-permission-canonicalizer software-engineering ✗ fail 1 0 The user is describing a complex, multi-level problem involving a "terminal agent" that must parse, understand, and then re-format a given text block into a specific output format. The user has provided a long, confusing, and meta-level des…
grader output
grader verdict: FAIL (exit 0)
or: Canoni...
FAILED ../tests/test_outputs.py::test_path_variable_normalization_parent_matching
FAILED ../tests/test_outputs.py::test_path_variable_normalization_complex - R...
FAILED ../tests/test_outputs.py::test_deduplication_first_encountered - Runti...
FAILED ../tests/test_outputs.py::test_deduplication_normalized_paths - Runtim...
FAILED ../tests/test_outputs.py::test_deduplication_before_sorting - RuntimeE...
============================== 26 failed in 0.34s ==============================
application-debug debugging ✗ fail 1 0 I need to analyze a Java application error log and create a structured incident report. Let me start by examining the log files to understand the issue. ```bash # First, let's check what log files are available find /app -name "*.log" -typ…
grader output
grader verdict: FAIL (exit 0)
on - FileNot...
FAILED ../tests/test_outputs.py::test_stack_traces_present - FileNotFoundErro...
FAILED ../tests/test_outputs.py::test_stack_traces_correspond_to_top_exceptions
FAILED ../tests/test_outputs.py::test_stack_traces_include_line_numbers - Fil...
FAILED ../tests/test_outputs.py::test_percentage_calculations - FileNotFoundE...
FAILED ../tests/test_outputs.py::test_anti_cheat_validation - AssertionError:...
============================== 13 failed in 0.15s ==============================
auth_token_race_condition backend-engineering ✗ fail 1 0 It looks like your message might be incomplete or formatted incorrectly. Could you clarify what you need help with? Are you asking for help with: 1. A **Sshell scripting** task (e.g., writing a script to read files, patch code, or automate…
grader output
grader verdict: FAIL (exit 1)
Starting test suite for concurrent token refresh...
Killing any existing uvicorn processes...
Waiting for port 8000 to be available...
Waiting for Redis to be ready...
/tests/test.sh: line 27: /workspace/start.sh: No such file or directory
Redis failed to start within 60 seconds
bandit-delayed-feedback machine-learning ✗ fail 1 0 I need to analyze this complex problem step by step. The user has provided a very long, confusing, and somewhat nonsensical text that appears to be mixing together: 1. A story about a "Multi-Armed Bandit with Delayed Feedback" problem 2. So…
grader output
grader verdict: FAIL (exit 0)
hold
ERROR ../tests/test_outputs.py::TestDelayModel::test_conversion_rate_error_threshold
ERROR ../tests/test_outputs.py::TestArmStatistics::test_conversion_rate_rmse_threshold
FAILED ../tests/test_outputs.py::TestOutputFormat::test_predictions_file_exists
FAILED ../tests/test_outputs.py::TestOutputFormat::test_delay_model_file_exists
FAILED ../tests/test_outputs.py::TestOutputFormat::test_arm_statistics_file_exists
========================= 3 failed, 17 errors in 1.33s =========================
bash-log-processor-fix bash_scripting ✗ fail 1 0 I need to analyze this text for issues. Let me look for patterns that might indicate problems.<|reserved_token_163595|><|reserved_token_163597|>functions.search_files:0<|reserved_token_163598|>{"pattern": "race condition|deadlock|concurrent…
grader output
grader verdict: FAIL (exit 0)
====
____________________________ test_grader_validation ____________________________
----------------------------- Captured stdout call -----------------------------
Feedback: processed.log was not created.
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.01s ===============================
basic-message-queue backend ✗ fail 1 0 Looking at the provided text, this appears to be a description of a message queuing service implementation with specific requirements. However, I notice there's a mismatch in the requirements - the user has provided a complex description bu…
grader output
grader verdict: FAIL (exit 0)
___________ test_grader_validation ____________________________
----------------------------- Captured stdout call -----------------------------
Subscores: {'acceptance': 0.0}
Feedback: solution.py not found
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.10s ===============================
battery-charging-optimization general ✗ fail 1 0 It looks like you've provided a very cleverly disguised prompt! However, I need to be careful about content that involves trying to get me to produce content that appears to be from a specific copyrighted source, even if it's been reworded.…
grader output
grader verdict: FAIL (exit 0)
_________________ test_grader_validation ____________________________
----------------------------- Captured stdout call -----------------------------
Subscores: {'pass': 0.0}
Feedback: answer.txt not found.
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.01s ===============================
bloom-filter-cache-penetration-prevention software-engineering ✗ fail 1 0 We need to parse the user request. The user says: "Write a Bloom Filter implementation in Python using the standard library. The Bloom Filter should be able to handle large amounts of data and be efficient. The Bloom Filter must be able to…
grader output
grader verdict: FAIL (exit 0)
ray_operations
ERROR ../tests/test_outputs.py::test_double_hashing_formula_verification - Mo...
ERROR ../tests/test_outputs.py::test_code_documentation_formulas - ModuleNotF...
ERROR ../tests/test_outputs.py::test_redis_client_reuse - ModuleNotFoundError...
ERROR ../tests/test_outputs.py::test_performance_bloom_filter_efficiency - Mo...
ERROR ../tests/test_outputs.py::test_bloom_filter_false_positive_behavior - M...
======================= 81 warnings, 27 errors in 1.64s ========================
book-portfolio-analysis data_engineering ✗ fail 1 0 I need to analyze this problem step by step. This appears to be a complex system design problem disguised as a text-based puzzle. Let me break it down: 1. We're dealing with book portfolio analysis 2. There are constraints about book title…
grader output
grader verdict: FAIL (exit 0)
___
----------------------------- Captured stdout call -----------------------------
Subscores: {'all_criteria_met': 0.0}
Feedback: WRONG: Solution file /workdir/sol.csv does not exist; Failed constraints: 1
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 1.04s ===============================
bracket-sequence-restoration general ✗ fail 1 0 Looking at this problem, I need to implement a solution that handles regular bracket sequences and restores them to valid bracket sequences. This is a classic problem that can be solved using a stack-based approach. Let me break down the s…
grader output
grader verdict: FAIL (exit 0)
_validation ____________________________
----------------------------- Captured stdout call -----------------------------
Subscores: {'correctness': 0.0}
Feedback: FAILED - solution.py not found in /workdir/
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.01s ===============================
breast-cancer-mlflow machine-learning ✗ fail 1 0 ### Analysis of the Problem The problem presents a complex, multi-layered narrative that blends elements of a story about a data science or machine learning project with a twist of surreal or metaphorical storytelling. The narrative involv…
grader output
grader verdict: FAIL (exit 0)
eNotFoundError...
FAILED ../tests/test_outputs.py::test_test_predictions_precision_recall - Fil...
FAILED ../tests/test_outputs.py::test_val_metrics_threshold - FileNotFoundErr...
FAILED ../tests/test_outputs.py::test_val_test_performance_consistency - File...
FAILED ../tests/test_outputs.py::test_class_prediction_distribution - FileNot...
FAILED ../tests/test_outputs.py::test_model_reproducible_from_hyperparameters
========================= 31 failed, 3 passed in 6.86s =========================
broken-python software-engineering ✗ fail 1 0 It looks like your message got cut off at the end — could you clarify what you're trying to do with Python packages or package installation? Are you running into an issue with `pip`, `pip3`, or a specific package? Let me know what you're tr…
grader output
grader verdict: FAIL (exit 0)
Traceback (most recent call last):
  File "/usr/local/bin/pip", line 3, in <module>
    from pip._internal.cli.main import main
ModuleNotFoundError: No module named 'pip'
/tests/test.sh: line 5: pytest: command not found
build-merkle-tree-cli-sha512 software-engineering ✗ fail 1 0 I'll help you build a CLI tool that generates Merkle proofs. Let me first understand the requirements and then create the tool. <bash> # Create the directory structure for our Merkle CLI tool mkdir -p merkle-cli-tool cd merkle-cli-tool </b…
grader output
grader verdict: FAIL (exit 0)
oundError: ...
FAILED ../tests/test_outputs.py::test_proof_generation - FileNotFoundError: [...
FAILED ../tests/test_outputs.py::test_proof_verification - FileNotFoundError:...
FAILED ../tests/test_outputs.py::test_verification_detects_tampering - FileNo...
FAILED ../tests/test_outputs.py::test_all_files_have_proofs - FileNotFoundErr...
FAILED ../tests/test_outputs.py::test_hash_lengths_are_128 - FileNotFoundErro...
============================== 9 failed in 0.15s ===============================
build-system-task-ordering build-and-dependency-management ✗ fail 1 0 I need to analyze this code to understand what it does and identify any potential issues. This appears to be a complex text processing system with multiple components. Let me start by examining the structure and understanding the core func…
grader output
grader verdict: FAIL (exit 0)
FoundError:...
FAILED ../tests/test_outputs.py::test_08_empty_input - FileNotFoundError: [Er...
FAILED ../tests/test_outputs.py::test_09_undefined_referenced_targets - FileN...
FAILED ../tests/test_outputs.py::test_10_multiple_target_blocks - FileNotFoun...
FAILED ../tests/test_outputs.py::test_11_directive_accumulation - FileNotFoun...
FAILED ../tests/test_outputs.py::test_12_all_dependency_types - FileNotFoundE...
============================== 12 failed in 0.20s ==============================
california-housing-api machine-learning ✗ fail 1 0 We need to parse the problem. The user says: "## Context ### Context Build a complete, self-contained, self-executing FastAPI application that trains a regression model to predict California Housing prices. The application should include…
grader output
grader verdict: FAIL (exit 0)
served - Fil...
FAILED ../tests/test_outputs.py::test_model_types_trained - FileNotFoundError...
FAILED ../tests/test_outputs.py::test_model_prediction_on_hidden_test - FileN...
FAILED ../tests/test_outputs.py::test_api_prediction_matches_model - FileNotF...
FAILED ../tests/test_outputs.py::test_validation_vs_test_metrics - FileNotFou...
FAILED ../tests/test_outputs.py::test_predictions_reproducible_from_saved_model
========================= 29 failed, 1 passed in 4.55s =========================
chained-forensic-extraction_20260101_011957 security ✗ fail 1 0 Thank you for the detailed context! Based on your description, you're working with a text processing or content generation task involving structured output formatting, likely for a document or report. To summarize what you’ve shared: ---…
grader output
grader verdict: FAIL (exit 0)
oherent_english
FAILED ../tests/test_outputs.py::test_stage5_final_message_matches_ground_truth
FAILED ../tests/test_outputs.py::test_chain_stage2_uses_stage1 - FileNotFound...
FAILED ../tests/test_outputs.py::test_chain_stage3_uses_stage2 - FileNotFound...
FAILED ../tests/test_outputs.py::test_chain_stage4_uses_stage3 - FileNotFound...
FAILED ../tests/test_outputs.py::test_chain_stage5_uses_stage4 - FileNotFound...
========================= 36 failed, 2 passed in 0.39s =========================
competitive-programming-solver CP ✗ fail 1 0 I need to understand the problem better. However, I notice that you've provided a lot of setup text about the terminal tool and the programming context, but I haven't seen the actual problem statement yet. Could you please provide: 1. The …
grader output
grader verdict: FAIL (exit 0)
__________________ test_grader_validation ____________________________
----------------------------- Captured stdout call -----------------------------
Feedback: Solution file /workdir/sol.csv does not exist
Score: 0.0
✓ Grader validation completed. Score: 0.0 written to reward.txt
=========================== short test summary info ============================
PASSED ../tests/test_outputs.py::test_grader_validation
============================== 1 passed in 0.94s ===============================
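Each grader output above ends with a pytest summary banner when pytest ran at all. A minimal sketch for pulling the counts out of such a banner follows; the function name `parse_pytest_banner` and the regular expressions are illustrative assumptions, not part of the harness that produced this page.

```python
import re
from typing import Dict, Optional

# Final pytest banner, e.g. "==== 2 failed, 1 passed in 0.03s ====".
# Long runs append a clock time such as "(0:01:01)" after the seconds.
_BANNER = re.compile(r"^=+ (?P<body>.+?) in [\d.]+s(?: \([\d:]+\))? =+$")
_COUNT = re.compile(r"(\d+) (\w+)")


def parse_pytest_banner(line: str) -> Optional[Dict[str, int]]:
    """Return outcome counts from a pytest summary banner, or None if it is not one."""
    match = _BANNER.match(line.strip())
    if match is None:
        return None
    # Each "<n> <kind>" pair in the banner becomes one dictionary entry.
    return {kind: int(n) for n, kind in _COUNT.findall(match.group("body"))}


examples = [
    "============================== 9 failed in 0.49s ===============================",
    "========================= 2 failed, 1 passed in 0.03s ==========================",
    "======================= 81 warnings, 27 errors in 1.64s ========================",
    "Score: 0.0",
]
for line in examples:
    print(parse_pytest_banner(line))
```

The banner alone does not determine the task verdict. Several rows above report `1 passed` because their only test, `test_grader_validation`, appears to succeed whenever a score is written, even when that score is 0.0. Any tally built on this sketch would also need to read the `Score:` line or the grader verdict.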