bloom-filter-cache-penetration-prevention
software-engineering fail 2 turns 0s 1 tool call
grader output from /tests/test.sh in the task sandbox
grader verdict: FAIL (exit 0) ray_operations
ERROR ../tests/test_outputs.py::test_double_hashing_formula_verification - Mo...
ERROR ../tests/test_outputs.py::test_code_documentation_formulas - ModuleNotF...
ERROR ../tests/test_outputs.py::test_redis_client_reuse - ModuleNotFoundError...
ERROR ../tests/test_outputs.py::test_performance_bloom_filter_efficiency - Mo...
ERROR ../tests/test_outputs.py::test_bloom_filter_false_positive_behavior - M...
======================= 81 warnings, 27 errors in 1.23s ========================
system · msg 0
You are an expert terminal agent. You MUST use the provided tools to complete tasks. Use the terminal tool to run shell commands, read_file to read files, write_file to write files, search_files to search, and patch to edit files. Do NOT write out solutions as text - execute them using the tools. Always start by exploring the environment with terminal commands.
user · msg 1
You are a backend engineer building a **Bloom Filter-based cache penetration prevention system** for an e-commerce platform.
**Core Objective**: The primary goal of this task is to **reduce database roundtrips** by using a Bloom Filter to pre-filter requests for non-existent product IDs. When a product ID is not in the Bloom Filter, the system can immediately return 404 without querying the database, significantly reducing DB load and improving system performance.
**Problem Context:**
An e-commerce platform is experiencing high database load due to cache penetration attacks:
- Redis cache is in use
- Massive requests for non-existent `productId` values
- Redis miss → DB query → No result found
- DB becomes overloaded from these unnecessary queries
**Core Requirements:**
**1. FastAPI Service Setup**
- The service must be runnable with:
```
uvicorn main:app --host 0.0.0.0 --port 8000
```
- **CRITICAL**: The ASGI application must be importable as `main:app`
- Create `/app/main.py` file containing `app = FastAPI()`
- **CRITICAL**: The database and Bloom Filter must be initialized at module load time (when the module is imported), not in startup events, to work with FastAPI's TestClient
- **Data Preparation**: The test framework will provide `/app/data/products.txt` with initial product IDs. Your code must load data from this file if it exists.
**2. Bloom Filter Implementation**
- Implement a Bloom Filter from scratch (do NOT use external Bloom Filter libraries like `pybloom`, `bloom-filter`, etc.)
- **CRITICAL Design Requirements:**
- Expected data size (n): 1,000,000 product IDs
- False Positive rate (p): Design for p ≤ 0.01 (1%)
- **You must calculate the optimal bit array size (m) and hash function count (k) using standard Bloom Filter formulas**
- **The actual false positive probability must be calculated and reported correctly using the standard formula**
- **You MUST document these calculations in code comments, showing the formulas used and the values calculated**
- **CRITICAL Constraints:**
- False Positive is ALLOWED (expected behavior)
- False Negative is ABSOLUTELY FORBIDDEN (must never occur)
- Bloom Filter must be used ONLY as a pre-filter, not as the source of truth
- **Hash Function Implementation:**
- **CRITICAL**: You MUST implement double hashing technique to efficiently generate k hash indices
Double hashing: Use two base hash functions (h1, h2) and derive the k indices as `(h1 + i * h2) mod m` for i in [0, k-1]
- This is REQUIRED for performance - computing k independent hash functions is too slow for production use
Base hash functions: Use fast, deterministic hash functions, e.g. slices of `hashlib.md5()` / `hashlib.sha256()` digests combined with different seeds (note that Python's built-in `hash()` is randomized per process for strings unless `PYTHONHASHSEED` is fixed, so it only qualifies when the seed is pinned)
- **Hash Function Requirements**: Hash functions must be deterministic (same input always produces same output) and provide good distribution across the bit array
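Taken together, the sizing formulas and the double-hashing scheme can be sketched as follows (a minimal illustration; slicing a single `sha256` digest into h1/h2 and the particular rounding choices are assumptions, not mandated by the spec):

```python
import hashlib
import math

n = 1_000_000  # expected number of product IDs
p = 0.01       # target false positive rate

# Standard Bloom Filter sizing:
#   m = -n * ln(p) / (ln 2)^2   -> optimal bit array size
#   k = (m / n) * ln 2          -> optimal number of hash functions
m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # ~9.59 million bits
k = max(1, round((m / n) * math.log(2)))              # 7 hash functions


def bit_indices(item: str, m: int, k: int) -> list:
    """Derive k bit positions via double hashing: (h1 + i * h2) mod m."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the stride non-zero
    return [(h1 + i * h2) % m for i in range(k)]
```

With these values, the predicted false positive rate `(1 - e^(-k*n/m))^k` comes out at roughly 1%, and membership tests cost two base hashes plus k cheap modular steps rather than k independent hash computations.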
**3. Request Processing Flow**
The system must follow this exact flow:
```
Client Request (productId)
↓
Bloom Filter Check
↓ false → Return 404 immediately (no DB query)
↓ true
Cache (Redis) Check
↓ hit → Return cached data
↓ miss
Database Query
↓
If found: Update cache + Return data
If not found: Cache negative result (with short TTL) + Return 404
```
**Important Notes:**
- **Negative Caching**: When a product is not found in the database, cache the negative result (e.g., `cache:product:{productId}` → `null` or special marker) with a short TTL (e.g., 60-300 seconds) to prevent repeated database queries for the same non-existent keys
- This complements the Bloom Filter by handling false positives that pass through the filter
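The flow chart plus the negative-caching note can be sketched as a single lookup function (the `bloom_contains`, `cache`, and `db_lookup` callables and the `__NULL__` sentinel are stand-ins for illustration, not names from the spec):

```python
NEGATIVE_MARKER = "__NULL__"  # assumed sentinel for "known missing"
NEGATIVE_TTL = 120            # seconds, within the suggested 60-300s range


def get_product(product_id, bloom_contains, cache, db_lookup):
    # 1. Bloom Filter pre-check: a definite "no" short-circuits everything.
    if not bloom_contains(product_id):
        return None  # caller maps this to an immediate 404

    key = f"cache:product:{product_id}"

    # 2. Cache check: a stored negative marker also counts as a hit.
    cached = cache.get(key)
    if cached is not None:
        return None if cached == NEGATIVE_MARKER else cached

    # 3. Database query: only reached for Bloom positives + cache misses.
    row = db_lookup(product_id)
    if row is None:
        # Negative caching: remember the miss briefly so repeated requests
        # for this false positive stop hitting the database.
        cache.set(key, NEGATIVE_MARKER, ex=NEGATIVE_TTL)
        return None
    cache.set(key, row)
    return row
```

Note how step 3's negative-cache write is exactly the mechanism that absorbs Bloom Filter false positives after the first miss.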
**4. Redis Backend Integration**
- Connect to Redis using `REDIS_URL` environment variable (default: `fakeredis://`)
- **CRITICAL**: The code MUST use fakeredis for all Redis operations. When `REDIS_URL=fakeredis://`, use fakeredis library. This simplifies the implementation and eliminate
… (21747 more chars) assistant · msg 2 1 tool call
I'll implement the Bloom Filter-based cache penetration prevention system according to the detailed requirements. Let's break this down step by step. First, I need to create the directory structure and files:
terminal({"command": "mkdir -p /app/data"}) tool · msg 3
{"output": "", "exit_code": 0, "error": null} assistant · msg 4
Now I'll create the necessary files and implement the solution according to the requirements:
<tool_call>
{"name": "write_file", "arguments": {"path": "/app/main.py", "content": "\"\"\"\nBloom Filter-based cache penetration prevention system for e-commerce platform.\n\nThis implementation uses a Bloom Filter to pre-filter non-existent product ID requests\nbefore querying Redis cache or database, significantly reducing database load.\n\nKey design decisions:\n- Expected data size (n): 1,000,000 product IDs\n- False positive rate (p): ≤ 0.01 (1%)\n- Optimal bit array size (m) and hash function count (k) calculated using standard formulas\n- Double hashing technique for efficient hash function generation\n- Bloom Filter state persisted to Redis for persistence across restarts\n- Negative caching for false positives that pass Bloom Filter\n- Metrics tracking for performance monitoring\n\"\"\"\n\nimport os\nimport math\nimport pickle\nimport base64\nimport hashlib\nimport sqlite3\nimport redis\nfrom bitarray import bitarray\nfrom fastapi import FastAPI, HTTPException, Body\nfrom pydantic import BaseModel\nfrom typing import Optional\nimport logging\n\n# Configure logging\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\n# Environment variables\nREDIS_URL = os.getenv(\"REDIS_URL\", \"fakeredis://\")\nBLOOM_STATE_NAMESPACE = os.getenv(\"BLOOM_STATE_NAMESPACE\", \"\")\n\n# Define Bloom Filter state key based on namespace\nif BLOOM_STATE_NAMESPACE:\n BLOOM_STATE_KEY = f\"bloom:state:{BLOOM_STATE_NAMESPACE}\"\nelse:\n BLOOM_STATE_KEY = \"bloom:filter\"\n\n# Constants for Bloom Filter\nEXPECTED_ITEMS = 1000000 # n\nFALSE_POSITIVE_RATE = 0.01 # p\n\n# Calculate optimal bit array size (m) and hash function count (k)\n# Formula for m: m = -n * ln(p) / (ln(2)^2)\n# Formula for k: k = (m/n) * ln(2)\n\n# Calculate m\nm = int(-EXPECTED_ITEMS * math.log(FALSE_POSITIVE_RATE) / (math.log(2)**2))\n\n# Calculate k using the actual integer value of m\nk = int((m / EXPECTED_ITEMS) * math.log(2))\n\n# Verify that the calculated parameters meet requirements\n# The 
actual false positive rate should be approximately: p ≈ (1 - e^(-kn/m))^k\n# We'll calculate this later for the stats endpoint\n\n# Initialize Redis client at module load time\nredis_client = redis.from_url(REDIS_URL, decode_responses=True)\n\n# Initialize database connection and create table\nDB_PATH = \"/app/data/products.db\"\n\n# Ensure data directory exists\nos.makedirs(os.path.dirname(DB_PATH), exist_ok=True)\n\n# Initialize database connection\nconn = sqlite3.connect(DB_PATH)\ncursor = conn.cursor()\n\n# Create products table if not exists\ncursor.execute('''\nCREATE TABLE IF NOT EXISTS products (\n id TEXT PRIMARY KEY,\n name TEXT NOT NULL,\n price REAL NOT NULL\n)\n''')\nconn.commit()\nconn.close()\n\n# Define product file path\nPRODUCTS_FILE = \"/app/data/products.txt\"\n\n# Initialize Bloom Filter\n# We'll create a BloomFilter class\n\nclass BloomFilter:\n \"\"\"\n A Bloom Filter implementation with double hashing.\n \n Uses the following parameters:\n - m: bit array size (calculated from n=1,000,000 and p=0.01)\n - k: number of hash functions (calculated from m and n)\n \n Double hashing technique: h(i) = (h1 + i * h2) mod m for i in [0, k-1]\n \"\"\"\n \n def __init__(self, m=m, k=k):\n \"\"\"\n Initialize Bloom Filter with given parameters.\n \n Args:\n m (int): Size of bit array\n k (int): Number of hash functions\n \"\"\"\n self.m = m\n self.k = k\n self.bit_array = bitarray(m)\n self.bit_array.setall(0)\n self.count = 0 # Number of items added\n \n # Use two base hash functions for double hashing\n # We'll use hashlib with different seeds for h1 and h2\n self.hash_seed_1 = 1\n self.hash_seed_2 = 2\n \n def _hash(self, item, seed):\n \"\"\"\n Generat
… (12718 more chars)