fix: bound memory for large DataFrame/Arrow caching (was OOMing at real sizes) by 27Bslash6 · Pull Request #152 · cachekit-io/cachekit-py

27Bslash6 · 2026-06-06T14:04:30Z

Problem

Large-object (Arrow/DataFrame) caching was brittle and OOM'd at real data sizes. Measured on representative numeric DataFrames:

Metric	Before	After
Write peak RSS / logical	9.4×	~6×
Read peak RSS / logical	7.2×	3.5×
Python-tracked peak (tracemalloc) / logical	5.7× (write)	2.0×
Wire / L1 size / logical	1.33×	~1.0× (less on compressible)
300 MB frame under a 1.5 GB budget	OOM-killed (exit 137)	needs 1.9 GB (was 3.0 GB)
200 MB frame under a 1.5 GB budget	OOM-killed	caches in 1.2 GB

Root cause

Two stacked, independent amplifiers plus a cluster of silent-failure bugs:

SerializationWrapper base64-over-JSON envelope — applied to every serializer's output before storage. base64.b64encode → .decode("ascii") → json.dumps → .encode() forced ~4 full-size copies at peak and a permanent 1.33× wire/L1 inflation. This dominated both directions (the read peak was set during unwrap, not to_pandas).
ArrowSerializer copy chain — to_pybytes() + a checksum + arrow_data concat on write; a full-body data[8:] bytes slice on read; to_pandas() without low-peak flags; preserve_index=True materializing even a RangeIndex.

Separately surfaced: memcached silently dropped any value >1 MB (the set inherited noreply=True, so the server's "object too large" error was never read → permanent silent cache miss); L1 stored a single entry larger than its entire budget (an OOM vector that also evicted everything else); integrity_checking=False gave Arrow zero corruption detection (a 1-bit flip silently returned wrong data).

Changes

SerializationWrapper → length-prefixed binary frame (CK | version | u32 header-len | json header | RAW payload). Payload stored raw, copied once. unwrap() still reads the legacy base64+JSON envelope, so pre-existing cache entries remain readable — no flush required. Python-SDK-internal; the cross-SDK ByteStorage wire format is untouched.
ArrowSerializer → zstd IPC compression written in ~8 MiB record batches (bounds the compressor's working set), checksum hashed over the buffer memoryview (no copy), read envelope sliced with a memoryview (no full-body copy), to_pandas(self_destruct=True, split_blocks=True), preserve_index=None. Always checksums (the silent-corruption knob is gone; 8 bytes is free). Magic-sniffs the envelope on read so checksummed/raw/legacy all decode; normalizes dict-of-scalars to the documented TypeError and malformed input to SerializationError (no more raw OSError leak).
memcached → rejects oversized values with a clear PERMANENT BackendError and sets noreply=False so server errors surface loudly.
L1 → refuses to store an entry larger than the whole budget (value stays in L2) and drops any stale smaller entry for that key.

Backward compatibility

Old entries (base64+JSON wrap; legacy [checksum][IPC] Arrow) still deserialize. New writes use the new formats; old entries age out by TTL.
No protocol-spec / Rust / TS / SaaS changes — the envelope is opaque to backends and other SDKs.
wrap/unwrap and serializer signatures unchanged.

Testing

New: test_serialization_wrapper.py (frame round-trip, non-UTF-8 payloads, legacy back-compat, no-base64/no-inflation), test_l1_memory_bounds.py (oversized-entry rejection), test_large_object_memory.py (deterministic tracemalloc + wire-size regression guards that fail loudly if base64 returns).
Extended test_arrow_serializer.py with a dtype/index fidelity matrix (nullable, categorical ordered/unordered, tz-aware, timedelta, MultiIndex), compression, always-checksum, and exception-hygiene cases.
Full tests/unit + tests/critical suite: 1750 passed, 0 failed. ruff (incl. S) + basedpyright --level error clean.

Not in this PR (follow-up)

The residual ~6× write peak is pyarrow's in-memory IPC materialization (BufferOutputStream doubling-growth + memory-pool retention) — even uncompressed it is ~3.75×. Driving it lower requires streaming serialization straight to the backend (never holding the full blob), a backend-interface change tracked separately.

Summary by CodeRabbit

Release Notes

New Features
- Configurable Arrow compression (zstd, lz4, none) for improved storage efficiency
- Client-side Memcached item size validation before transmission
- L1 cache size enforcement to prevent memory overruns
Improvements
- Memcached now surfaces server errors instead of silently swallowing them
- Arrow serialisation enforces integrity checksums consistently
- Binary frame envelope for more efficient data wrapping
Compatibility
- Fully backwards compatible with legacy serialisation formats

…al sizes) Caching a real-sized DataFrame peaked at ~9.4x its logical size on the write path and ~7.2x on read, OOM-killing a 300MB frame under a 1.5GB budget, and silently inflated every cached value 1.33x on the wire/in L1. Two stacked amplifiers, plus a cluster of silent-failure bugs. Root cause: - SerializationWrapper base64-encoded the payload inside a JSON object for EVERY serializer, forcing ~4 full-size copies at peak (b64-bytes -> ascii-str -> json-str -> utf8-bytes) and a permanent 1.33x wire/L1 inflation. - ArrowSerializer copied the payload several more times (to_pybytes + a checksum-prepend concat on write; a full-body `data[8:]` bytes slice on read) and converted Arrow->pandas without low-peak flags. Changes: - SerializationWrapper: replace base64+JSON with a length-prefixed binary frame (CK magic | version | u32 header-len | json header | RAW payload). The payload is stored raw and copied exactly once. unwrap() still reads the legacy base64+JSON envelope, so pre-existing cache entries stay readable (no flush). Internal to the Python SDK; the cross-SDK ByteStorage wire format is unchanged. - ArrowSerializer: enable zstd IPC compression written in ~8MiB record batches (bounds the compressor's working set), hash over the buffer memoryview (no copy), slice the read envelope with a memoryview (no full-body copy), convert with to_pandas(self_destruct, split_blocks), and use preserve_index=None so a RangeIndex is metadata-only. Always checksum -- the integrity-off path silently returned corrupted DataFrames, and 8 bytes is negligible. Magic-sniff the envelope on read so checksummed/raw/legacy data all decode and malformed input raises SerializationError (no more raw OSError leak); dict-of-scalars now raises the documented TypeError. - memcached: reject values over the item-size cap with a clear PERMANENT BackendError and pass noreply=False, so oversized items fail loudly instead of being silently dropped (noreply previously swallowed the server's error -> a permanent silent cache miss for any real DataFrame). - L1: refuse to store a single entry larger than the whole L1 budget (an OOM vector that also evicted every other entry) and drop any stale smaller entry for that key; the value still lives in L2. Measured: write peak ~9.4x -> ~6x, read ~7.2x -> ~3.5x, Python-tracked peak ~5.7x -> ~2x, wire 1.33x -> ~1x (far less on compressible data). A 200MB frame that OOM'd under a 1.5GB budget now caches in ~1.2GB. The residual write peak is pyarrow's in-memory IPC materialization (BufferOutputStream growth + memory-pool retention); driving it lower needs streaming serialization straight to the backend (separate change).

coderabbitai · 2026-06-06T14:04:36Z

Walkthrough

This PR introduces configurable Arrow IPC compression with always-on xxHash3-64 integrity checking, upgrades the serialization wrapper to a binary frame format (v3), and enforces client-side size limits across Memcached backend and L1 in-memory cache layers. It includes comprehensive unit and performance tests.

Changes

Serialization, Compression, and Size Bounds for Large Object Caching

Layer / File(s)	Summary
Arrow compression config and serializer foundation `src/cachekit/config/settings.py`, `src/cachekit/serializers/arrow_serializer.py`	Adds `arrow_compression` configuration field to `CachekitConfig` with `Literal["zstd", "lz4", "none"]` defaults, and extends `ArrowSerializer.__init__` to accept a `compression` parameter with `_resolve_compression()` logic that normalises modes and resolves `"auto"` from settings.
Arrow IPC serialisation with compression and integrity envelope `src/cachekit/serializers/arrow_serializer.py`	`serialize()` now applies per-record-batch compression via `pa.ipc.IpcWriteOptions`, always computes xxHash3-64 checksum over the compressed IPC bytes, prepends the checksum for an integrity envelope, and records metadata reflecting compression mode. Bounded chunking and memory release optimise large outputs.
Arrow deserialisation with checksum detection and legacy support `src/cachekit/serializers/arrow_serializer.py`	`deserialize()` detects checksum-prefixed vs raw Arrow IPC by sniffing magic bytes, validates checksum when present, handles legacy raw IPC, converts to target format using `self_destruct` and `split_blocks` for buffer efficiency, and broadens exception handling to include `OSError`.
Serialisation wrapper v3 binary frame format `src/cachekit/serializers/wrapper.py`	`SerializationWrapper.wrap()` and `unwrap()` now use v3 binary frame: `MAGIC b"CK" \| VERSION \| HDR_LEN \| HEADER(json) \| PAYLOAD(raw bytes)`. `unwrap()` is backward compatible, detecting v3 by magic prefix and falling back to legacy base64+JSON for pre-existing cached entries.
Memcached backend client-side size enforcement `src/cachekit/backends/memcached/backend.py`, `src/cachekit/backends/memcached/config.py`	`MemcachedBackend.set()` validates payload size against `max_item_size_bytes` configuration, raising `BackendError` with `BackendErrorType.PERMANENT` before sending to Memcached. Underlying `HashClient.set()` call now uses `noreply=False` so server-side rejections surface to the caller.
L1 cache oversized entry rejection `src/cachekit/l1_cache.py`	`L1Cache.put()` checks estimated byte size against `max_memory_bytes` budget, rejecting oversized values without evicting existing entries and logging a debug message when budget would be exceeded.
Arrow serialiser unit tests: compression, integrity, compatibility `tests/unit/test_arrow_serializer.py`	New test classes validate default zstd compression with size reduction, configurable compression modes, always-on integrity checksum regardless of `enable_integrity_checking`, legacy raw and checksum-prefixed payload compatibility, exception hygiene for corrupted inputs, `RangeIndex` preservation, and dtype/index fidelity across complex types and multi-index scenarios.
Serialisation wrapper unit tests: binary frame and backward compatibility `tests/unit/test_serialization_wrapper.py`	New test classes verify binary frame round-tripping, absence of base64 encoding, bounded overhead, non-UTF-8 payload handling, legacy base64+JSON detection, garbage input rejection, and encryption metadata preservation. Integration tests confirm encrypted payloads survive the frame and legacy encrypted entries remain decryptable.
Memcached and integrity error tests `tests/critical/test_memcached_backend_critical.py`, `tests/unit/backends/test_memcached_backend.py`, `tests/unit/test_serializer_integrity.py`, `tests/unit/test_xxhash_integrity.py`	Critical-path tests verify Memcached backend rejects oversized values before reaching the client and passes `noreply=False` so errors surface. Unit tests assert `noreply=False` across all TTL scenarios. Integrity tests updated to expect `"Arrow envelope"` error messages and verify checksum prefix at offset 8.
L1 cache oversized entry unit tests `tests/unit/test_l1_memory_bounds.py`	New `TestOversizedEntryRejection` suite verifies oversized entries are rejected without evicting existing entries, exact-budget entries are stored, normal entries work, and memory budget is never exceeded under mixed workload.
Performance and memory regression tests `tests/performance/test_large_object_memory.py`	New performance module with deterministic memory assertions using `tracemalloc` and on-wire size proxies. Tests verify bounded allocations during store/load paths with correctness checks and end-to-end cache-handler roundtrips including binary-frame prefix detection and DataFrame equality.

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 51.72% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The description comprehensively documents the problem, root causes, changes made across multiple components, backward compatibility, and testing. However, the provided PR description does not follow the exact structure of the template (missing explicit Type of Change checkboxes, Security Checklist, Documentation Validation Checklist formal completion, and Testing/Backward Compatibility section checkboxes).	Whilst the content is thorough, align the description structure with the template by explicitly filling in the Type of Change, Security, Documentation Validation, Testing, and Backward Compatibility checklist sections.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarises the main change: fixing out-of-memory issues in large DataFrame/Arrow caching by bounding memory usage.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/arrow-largeobj-validation

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-06-06T14:26:18Z

Codecov Report

❌ Patch coverage is 88.63636% with 10 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/cachekit/serializers/arrow_serializer.py	88.67%	4 Missing and 2 partials ⚠️
src/cachekit/l1_cache.py	66.66%	1 Missing and 1 partial ⚠️
src/cachekit/serializers/wrapper.py	90.47%	1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Adds a `compression` option to ArrowSerializer plus a CACHEKIT_ARROW_COMPRESSION setting (default "zstd"). compression="auto" (the default) resolves from the setting; "zstd"/"lz4" shrink the stored payload but must be decompressed into the heap on read; None/"none" stores UNCOMPRESSED Arrow IPC, trading a larger payload for the ability to memory-map reads zero-copy (the lowest read-memory path, ~0.3x RSS vs ~1.3x for an in-memory decode). This makes the size-vs-read-memory tradeoff a user/per-backend choice instead of a hardcoded default, and lays the groundwork for a memory-mapped File-backend read path. Also locks in regression tests proving the binary-frame wrapper round-trips ENCRYPTED payloads end-to-end: encryption metadata (encrypted/tenant_id/ AES-256-GCM) survives the frame header, AAD binding still rejects a wrong cache_key (EncryptionError, never silent), ciphertext is not base64-inflated, and legacy base64+JSON encrypted entries still decrypt.

)

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cachekit/serializers/arrow_serializer.py`:
- Around line 165-166: The current broad except around resolving
CACHEKIT_ARROW_COMPRESSION silently masks errors and forces "zstd"; instead
narrow the exception handling and validate values: only treat a missing/unset
setting as a fallback to "zstd", but for invalid values or unexpected failures
(e.g., ValueError, TypeError, or config parsing errors) raise or log the error
so it isn't silently ignored. Locate the try/except where the variable
compression is set (the block that reads CACHEKIT_ARROW_COMPRESSION) and replace
the blanket except Exception with specific handling — check the setting against
an allowed set of compression names, handle a missing setting by returning
"zstd", and surface (re-raise or log and raise) any other exceptions or invalid
values so misconfiguration isn't hidden.

In `@src/cachekit/serializers/wrapper.py`:
- Line 73: The public functions using bare dict (wrap and the function at line
102) should declare a stricter metadata type to improve static analysis: change
metadata: dict to metadata: Mapping[str, Any] (or dict[str, Any] if you prefer
concrete type) in the signatures for wrap and the other public method, and add
the necessary imports (Any and Mapping or dict typing) at the top of wrapper.py;
ensure import style is ruff-compatible (use from typing import Any, Mapping or
from collections.abc import Mapping for Py3.9+).
- Around line 115-123: The code that parses cache envelope frames (check for
_MAGIC, read frame_version, hdr_len, header_end and slices using _PREFIX_LEN)
must validate buffer bounds before slicing to avoid low-level errors: ensure mv
is at least _PREFIX_LEN bytes before accessing mv[len(_MAGIC)] and computing
hdr_len, ensure hdr_len is non-negative and header_end = _PREFIX_LEN + hdr_len
is <= len(mv) before json.loads or extracting payload, and normalize all these
failures to raise ValueError with a clear message (e.g. "truncated or corrupt CK
frame") so functions using the _MAGIC/_FRAME_VERSION/_PREFIX_LEN logic fail
predictably on corrupt entries.

In `@tests/unit/test_arrow_serializer.py`:
- Around line 443-448: The test test_default_is_auto_zstd assumes
CACHEKIT_ARROW_COMPRESSION is unset; make it environment-independent by
explicitly ensuring the env var is removed or set to a known value before
calling reset_settings() and ArrowSerializer().serialize; modify the test
(test_default_is_auto_zstd) to delete or pop "CACHEKIT_ARROW_COMPRESSION" from
os.environ (or set it to an empty value) prior to calling reset_settings() so
the default-compression behavior is deterministic when invoking
ArrowSerializer().serialize and asserting meta.compressed is True.

In `@tests/unit/test_l1_memory_bounds.py`:
- Around line 17-53: Add a unit that verifies an oversized update for the same
key removes the existing L1 entry: in the TestOversizedEntryRejection suite
create test_oversized_update_removes_stale_l1_entry which uses
L1Cache(max_memory_mb=1), puts a small value with cache.put("key", ...), asserts
cache.get("key")[0] is True, then does an oversized cache.put("key", ...) and
asserts cache.get("key")[0] is False and cache._current_memory_bytes == 0; this
ensures the behavior implemented in L1Cache.put (removing the L1 entry on
oversized updates) is covered.

In `@tests/unit/test_serialization_wrapper.py`:
- Around line 120-133: The fixture currently unconditionally pops
CACHEKIT_MASTER_KEY and can clobber an existing process value; modify the
fixture that constructs CacheSerializationHandler so it saves the prior
os.environ.get("CACHEKIT_MASTER_KEY") into a local variable before setting it,
sets os.environ["CACHEKIT_MASTER_KEY"] = "a"*64, yields the handler, and after
yield restores the original value (reassign if prior was not None, otherwise
remove the key). Use the existing reset_settings() calls and the
CacheSerializationHandler instantiation to locate where to add the save/restore
logic.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6fd12047-e3e9-40a4-81fd-d5dffe2bf306

📥 Commits

Reviewing files that changed from the base of the PR and between 5ac62d7 and e72ffff.

📒 Files selected for processing (14)

src/cachekit/backends/memcached/backend.py
src/cachekit/backends/memcached/config.py
src/cachekit/config/settings.py
src/cachekit/l1_cache.py
src/cachekit/serializers/arrow_serializer.py
src/cachekit/serializers/wrapper.py
tests/critical/test_memcached_backend_critical.py
tests/performance/test_large_object_memory.py
tests/unit/backends/test_memcached_backend.py
tests/unit/test_arrow_serializer.py
tests/unit/test_l1_memory_bounds.py
tests/unit/test_serialization_wrapper.py
tests/unit/test_serializer_integrity.py
tests/unit/test_xxhash_integrity.py

coderabbitai · 2026-06-07T00:25:29Z

+            except Exception:  # noqa: BLE001 — settings unavailable: fall back to a sane default
+                compression = "zstd"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not silently mask settings failures during compression auto-resolution.

Line 165 catches every Exception and falls back to "zstd". This can hide invalid CACHEKIT_ARROW_COMPRESSION values or other settings faults, causing silent config drift at runtime.

Suggested fix

if compression == "auto": try: from cachekit.config.singleton import get_settings compression = get_settings().arrow_compression - except Exception: # noqa: BLE001 — settings unavailable: fall back to a sane default + except ImportError: compression = "zstd"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cachekit/serializers/arrow_serializer.py` around lines 165 - 166, The current broad except around resolving CACHEKIT_ARROW_COMPRESSION silently masks errors and forces "zstd"; instead narrow the exception handling and validate values: only treat a missing/unset setting as a fallback to "zstd", but for invalid values or unexpected failures (e.g., ValueError, TypeError, or config parsing errors) raise or log the error so it isn't silently ignored. Locate the try/except where the variable compression is set (the block that reads CACHEKIT_ARROW_COMPRESSION) and replace the blanket except Exception with specific handling — check the setting against an allowed set of compression names, handle a missing setting by returning "zstd", and surface (re-raise or log and raise) any other exceptions or invalid values so misconfiguration isn't hidden.

coderabbitai · 2026-06-07T00:25:29Z

    """

    @staticmethod
    def wrap(data: bytes, metadata: dict, serializer_name: str, version: str = "2.0") -> bytes:


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Tighten public API typing for metadata contracts.

Line 73 and Line 102 use bare dict in public method signatures, which weakens the API contract for static analysis. Use typed mappings (e.g. dict[str, Any]).

As per coding guidelines, "Python code. Enforce ruff compatibility, type hints on public APIs".

Proposed diff

from __future__ import annotations import base64 import json -from typing import Union +from typing import Any, Union @@ - def wrap(data: bytes, metadata: dict, serializer_name: str, version: str = "2.0") -> bytes: + def wrap(data: bytes, metadata: dict[str, Any], serializer_name: str, version: str = "2.0") -> bytes: @@ - def unwrap(wrapped_data: Union[str, bytes]) -> tuple[bytes, dict, str]: + def unwrap(wrapped_data: Union[str, bytes]) -> tuple[bytes, dict[str, Any], str]:

Also applies to: 102-102

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cachekit/serializers/wrapper.py` at line 73, The public functions using bare dict (wrap and the function at line 102) should declare a stricter metadata type to improve static analysis: change metadata: dict to metadata: Mapping[str, Any] (or dict[str, Any] if you prefer concrete type) in the signatures for wrap and the other public method, and add the necessary imports (Any and Mapping or dict typing) at the top of wrapper.py; ensure import style is ruff-compatible (use from typing import Any, Mapping or from collections.abc import Mapping for Py3.9+).

Source: Coding guidelines

coderabbitai · 2026-06-07T00:25:29Z

+            if bytes(mv[: len(_MAGIC)]) == _MAGIC:
+                frame_version = mv[len(_MAGIC)]
+                if frame_version != _FRAME_VERSION:
+                    raise ValueError(f"Unsupported cache envelope frame version {frame_version} (expected {_FRAME_VERSION})")
+                hdr_len = int.from_bytes(mv[len(_MAGIC) + 1 : _PREFIX_LEN], "big")
+                header_end = _PREFIX_LEN + hdr_len
+                header = json.loads(bytes(mv[_PREFIX_LEN:header_end]))
+                payload = bytes(mv[header_end:])  # single copy of the raw payload
+                return payload, header.get("m", {}), header.get("s", "unknown")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate frame bounds before parsing header/payload.

Line 116 and Line 121 can currently fail with low-level exceptions on truncated/corrupt b"CK" frames. Add explicit size guards (<_PREFIX_LEN, header_end > len) and normalise to ValueError so corrupt cache entries fail predictably.

Proposed diff

if isinstance(wrapped_data, (bytes, bytearray, memoryview)): mv = memoryview(wrapped_data) if bytes(mv[: len(_MAGIC)]) == _MAGIC: + if mv.nbytes < _PREFIX_LEN: + raise ValueError( + f"Truncated cache envelope frame: {mv.nbytes} bytes (minimum {_PREFIX_LEN})" + ) frame_version = mv[len(_MAGIC)] if frame_version != _FRAME_VERSION: raise ValueError(f"Unsupported cache envelope frame version {frame_version} (expected {_FRAME_VERSION})") hdr_len = int.from_bytes(mv[len(_MAGIC) + 1 : _PREFIX_LEN], "big") header_end = _PREFIX_LEN + hdr_len - header = json.loads(bytes(mv[_PREFIX_LEN:header_end])) + if header_end > mv.nbytes: + raise ValueError( + f"Invalid cache envelope header length {hdr_len}: frame has only {mv.nbytes} bytes" + ) + header = json.loads(bytes(mv[_PREFIX_LEN:header_end])) payload = bytes(mv[header_end:]) # single copy of the raw payload return payload, header.get("m", {}), header.get("s", "unknown")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cachekit/serializers/wrapper.py` around lines 115 - 123, The code that parses cache envelope frames (check for _MAGIC, read frame_version, hdr_len, header_end and slices using _PREFIX_LEN) must validate buffer bounds before slicing to avoid low-level errors: ensure mv is at least _PREFIX_LEN bytes before accessing mv[len(_MAGIC)] and computing hdr_len, ensure hdr_len is non-negative and header_end = _PREFIX_LEN + hdr_len is <= len(mv) before json.loads or extracting payload, and normalize all these failures to raise ValueError with a clear message (e.g. "truncated or corrupt CK frame") so functions using the _MAGIC/_FRAME_VERSION/_PREFIX_LEN logic fail predictably on corrupt entries.

coderabbitai · 2026-06-07T00:25:29Z

+    def test_default_is_auto_zstd(self):
+        from cachekit.config.singleton import reset_settings
+
+        reset_settings()  # no env override -> default zstd
+        _, meta = ArrowSerializer().serialize(pd.DataFrame({"a": [1] * 1000}))
+        assert meta.compressed is True


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make the default-compression test environment-independent.

Line 446 assumes CACHEKIT_ARROW_COMPRESSION is not set. If it is set externally, this test becomes flaky.

Suggested fix

- def test_default_is_auto_zstd(self): + def test_default_is_auto_zstd(self, monkeypatch): from cachekit.config.singleton import reset_settings + monkeypatch.delenv("CACHEKIT_ARROW_COMPRESSION", raising=False) reset_settings() # no env override -> default zstd _, meta = ArrowSerializer().serialize(pd.DataFrame({"a": [1] * 1000})) assert meta.compressed is True

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_arrow_serializer.py` around lines 443 - 448, The test test_default_is_auto_zstd assumes CACHEKIT_ARROW_COMPRESSION is unset; make it environment-independent by explicitly ensuring the env var is removed or set to a known value before calling reset_settings() and ArrowSerializer().serialize; modify the test (test_default_is_auto_zstd) to delete or pop "CACHEKIT_ARROW_COMPRESSION" from os.environ (or set it to an empty value) prior to calling reset_settings() so the default-compression behavior is deterministic when invoking ArrowSerializer().serialize and asserting meta.compressed is True.

coderabbitai · 2026-06-07T00:25:29Z

+@pytest.mark.unit
+class TestOversizedEntryRejection:
+    def test_entry_larger_than_budget_is_not_stored(self):
+        cache = L1Cache(max_memory_mb=1)
+        cache.put("big", b"\x00" * (2 * MB), redis_ttl=300)
+
+        found, _ = cache.get("big")
+        assert found is False
+        assert cache._current_memory_bytes == 0
+
+    def test_rejected_oversized_put_does_not_evict_existing_entries(self):
+        """A doomed oversized put must not evict good entries on its way to failing."""
+        cache = L1Cache(max_memory_mb=1)
+        cache.put("keep", b"\x00" * (512 * 1024), redis_ttl=300)  # fits
+
+        cache.put("toobig", b"\x00" * (5 * MB), redis_ttl=300)  # cannot ever fit
+
+        assert cache.get("keep")[0] is True  # survivor
+        assert cache.get("toobig")[0] is False
+        assert cache._current_memory_bytes <= cache.max_memory_bytes
+
+    def test_entry_equal_to_budget_is_stored(self):
+        cache = L1Cache(max_memory_mb=1)
+        cache.put("exact", b"\x00" * (1 * MB), redis_ttl=300)
+        assert cache.get("exact")[0] is True
+
+    def test_normal_entry_still_stored(self):
+        cache = L1Cache(max_memory_mb=10)
+        cache.put("k", b"value", redis_ttl=300)
+        assert cache.get("k") == (True, b"value")
+
+    def test_memory_never_exceeds_budget_under_mixed_load(self):
+        cache = L1Cache(max_memory_mb=2)
+        for i in range(20):
+            cache.put(f"k{i}", b"\x00" * (300 * 1024), redis_ttl=300)  # 300KB each
+        cache.put("huge", b"\x00" * (50 * MB), redis_ttl=300)  # rejected
+        assert cache._current_memory_bytes <= cache.max_memory_bytes


🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider adding test for same-key oversized update scenario.

The implementation at src/cachekit/l1_cache.py:294-295 removes an existing L1 entry when an oversized put happens for the same key (to prevent serving stale data from L1 when L2 has the new oversized value). This critical behaviour isn't explicitly tested.

🧪 Suggested test to add

def test_oversized_update_removes_stale_l1_entry(self): """When updating a key with an oversized value, remove the old L1 entry.""" cache = L1Cache(max_memory_mb=1) cache.put("key", b"\x00" * (100 * 1024), redis_ttl=300) # 100 KB, fits assert cache.get("key")[0] is True # initially present cache.put("key", b"\x00" * (10 * MB), redis_ttl=300) # oversized update # Old entry must be removed so L1 doesn't serve stale data # (new oversized value lives in L2 only) assert cache.get("key")[0] is False assert cache._current_memory_bytes == 0

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_l1_memory_bounds.py` around lines 17 - 53, Add a unit that verifies an oversized update for the same key removes the existing L1 entry: in the TestOversizedEntryRejection suite create test_oversized_update_removes_stale_l1_entry which uses L1Cache(max_memory_mb=1), puts a small value with cache.put("key", ...), asserts cache.get("key")[0] is True, then does an oversized cache.put("key", ...) and asserts cache.get("key")[0] is False and cache._current_memory_bytes == 0; this ensures the behavior implemented in L1Cache.put (removing the L1 entry on oversized updates) is covered.

coderabbitai · 2026-06-07T00:25:29Z

+        reset_settings()
+        os.environ["CACHEKIT_MASTER_KEY"] = "a" * 64
+        from cachekit.cache_handler import CacheSerializationHandler
+
+        handler = CacheSerializationHandler(
+            serializer_name="default",
+            encryption=True,
+            single_tenant_mode=True,
+            deployment_uuid="00000000-0000-0000-0000-000000000001",
+        )
+        yield handler
+        reset_settings()
+        os.environ.pop("CACHEKIT_MASTER_KEY", None)
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Restore pre-existing CACHEKIT_MASTER_KEY to avoid cross-test state leakage.

Line 132 always removes the variable, so a pre-existing process value is lost after this fixture runs. Preserve and restore the prior value to keep tests isolated.

Proposed diff

`@pytest.fixture` def enc_handler(self): import os @@ - reset_settings() - os.environ["CACHEKIT_MASTER_KEY"] = "a" * 64 + reset_settings() + previous_master_key = os.environ.get("CACHEKIT_MASTER_KEY") + os.environ["CACHEKIT_MASTER_KEY"] = "a" * 64 from cachekit.cache_handler import CacheSerializationHandler handler = CacheSerializationHandler( serializer_name="default", encryption=True, single_tenant_mode=True, deployment_uuid="00000000-0000-0000-0000-000000000001", ) - yield handler - reset_settings() - os.environ.pop("CACHEKIT_MASTER_KEY", None) + try: + yield handler + finally: + reset_settings() + if previous_master_key is None: + os.environ.pop("CACHEKIT_MASTER_KEY", None) + else: + os.environ["CACHEKIT_MASTER_KEY"] = previous_master_key

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_serialization_wrapper.py` around lines 120 - 133, The fixture currently unconditionally pops CACHEKIT_MASTER_KEY and can clobber an existing process value; modify the fixture that constructs CacheSerializationHandler so it saves the prior os.environ.get("CACHEKIT_MASTER_KEY") into a local variable before setting it, sets os.environ["CACHEKIT_MASTER_KEY"] = "a"*64, yields the handler, and after yield restores the original value (reassign if prior was not None, otherwise remove the key). Use the existing reset_settings() calls and the CacheSerializationHandler instantiation to locate where to add the save/restore logic.

…172) * fix: harden cache-envelope framing and compression config resolution Addresses review feedback on #152 (merged before these landed): - wrapper.py: validate v3 frame bounds before parsing — truncated/corrupt b"CK" frames now raise a clean ValueError instead of a low-level IndexError/JSON error. Also tighten wrap()/unwrap() public signatures to dict[str, Any]. - arrow_serializer.py: narrow the compression auto-resolution except to ImportError. Invalid CACHEKIT_ARROW_COMPRESSION values now surface as a pydantic ValidationError instead of silently falling back to zstd. - tests: make the default-compression test env-independent (delenv + reset_settings); restore pre-existing CACHEKIT_MASTER_KEY via monkeypatch to stop cross-test env leakage; add same-key oversized-update L1 test. * test: cover frame-bounds guards and compression-config fallback Restores patch coverage for the hardening added in the previous commit: - wrapper.py: tests for truncated frame, unsupported frame version, and oversized header-length (the three ValueError guards) -> 100% coverage. - arrow_serializer.py: test that auto compression falls back to zstd when the settings module is unimportable (the except ImportError branch).

27Bslash6 added 2 commits June 7, 2026 00:31

merge: resolve arrow_serializer cross_sdk_compatible + compression (#153

e72ffff

)

coderabbitai Bot requested changes Jun 7, 2026

View reviewed changes

27Bslash6 merged commit ccd32c5 into main Jun 7, 2026
32 checks passed

27Bslash6 deleted the perf/arrow-largeobj-validation branch June 7, 2026 00:32

cachekit-release-bot Bot mentioned this pull request Jun 7, 2026

chore(main): release 0.9.0 #150

Open

27Bslash6 mentioned this pull request Jun 7, 2026

fix: harden cache-envelope framing and compression config resolution #172

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: bound memory for large DataFrame/Arrow caching (was OOMing at real sizes)#152

fix: bound memory for large DataFrame/Arrow caching (was OOMing at real sizes)#152
27Bslash6 merged 3 commits into
mainfrom
perf/arrow-largeobj-validation

27Bslash6 commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

codecov Bot commented Jun 6, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 7, 2026

Uh oh!

coderabbitai Bot Jun 7, 2026

Uh oh!

coderabbitai Bot Jun 7, 2026

Uh oh!

coderabbitai Bot Jun 7, 2026

Uh oh!

coderabbitai Bot Jun 7, 2026

Uh oh!

coderabbitai Bot Jun 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		except Exception: # noqa: BLE001 — settings unavailable: fall back to a sane default
		compression = "zstd"

Conversation

27Bslash6 commented Jun 6, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Changes

Backward compatibility

Testing

Not in this PR (follow-up)

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

codecov Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

27Bslash6 commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading

codecov Bot commented Jun 6, 2026 •

edited

Loading