Skip to content

fix: bound memory for large DataFrame/Arrow caching (was OOMing at real sizes)#152

Merged
27Bslash6 merged 3 commits into
mainfrom
perf/arrow-largeobj-validation
Jun 7, 2026
Merged

fix: bound memory for large DataFrame/Arrow caching (was OOMing at real sizes)#152
27Bslash6 merged 3 commits into
mainfrom
perf/arrow-largeobj-validation

Conversation

@27Bslash6

@27Bslash6 27Bslash6 commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Problem

Large-object (Arrow/DataFrame) caching was brittle and OOM'd at real data sizes. Measured on representative numeric DataFrames:

Metric Before After
Write peak RSS / logical 9.4× ~6×
Read peak RSS / logical 7.2× 3.5×
Python-tracked peak (tracemalloc) / logical 5.7× (write) 2.0×
Wire / L1 size / logical 1.33× ~1.0× (less on compressible)
300 MB frame under a 1.5 GB budget OOM-killed (exit 137) needs 1.9 GB (was 3.0 GB)
200 MB frame under a 1.5 GB budget OOM-killed caches in 1.2 GB

Root cause

Two stacked, independent amplifiers plus a cluster of silent-failure bugs:

  1. SerializationWrapper base64-over-JSON envelope — applied to every serializer's output before storage. base64.b64encode → .decode("ascii") → json.dumps → .encode() forced ~4 full-size copies at peak and a permanent 1.33× wire/L1 inflation. This dominated both directions (the read peak was set during unwrap, not to_pandas).
  2. ArrowSerializer copy chainto_pybytes() + a checksum + arrow_data concat on write; a full-body data[8:] bytes slice on read; to_pandas() without low-peak flags; preserve_index=True materializing even a RangeIndex.

Separately surfaced: memcached silently dropped any value >1 MB (the set inherited noreply=True, so the server's "object too large" error was never read → permanent silent cache miss); L1 stored a single entry larger than its entire budget (an OOM vector that also evicted everything else); integrity_checking=False gave Arrow zero corruption detection (a 1-bit flip silently returned wrong data).

Changes

  • SerializationWrapper → length-prefixed binary frame (CK | version | u32 header-len | json header | RAW payload). Payload stored raw, copied once. unwrap() still reads the legacy base64+JSON envelope, so pre-existing cache entries remain readable — no flush required. Python-SDK-internal; the cross-SDK ByteStorage wire format is untouched.
  • ArrowSerializer → zstd IPC compression written in ~8 MiB record batches (bounds the compressor's working set), checksum hashed over the buffer memoryview (no copy), read envelope sliced with a memoryview (no full-body copy), to_pandas(self_destruct=True, split_blocks=True), preserve_index=None. Always checksums (the silent-corruption knob is gone; 8 bytes is free). Magic-sniffs the envelope on read so checksummed/raw/legacy all decode; normalizes dict-of-scalars to the documented TypeError and malformed input to SerializationError (no more raw OSError leak).
  • memcached → rejects oversized values with a clear PERMANENT BackendError and sets noreply=False so server errors surface loudly.
  • L1 → refuses to store an entry larger than the whole budget (value stays in L2) and drops any stale smaller entry for that key.

Backward compatibility

  • Old entries (base64+JSON wrap; legacy [checksum][IPC] Arrow) still deserialize. New writes use the new formats; old entries age out by TTL.
  • No protocol-spec / Rust / TS / SaaS changes — the envelope is opaque to backends and other SDKs.
  • wrap/unwrap and serializer signatures unchanged.

Testing

  • New: test_serialization_wrapper.py (frame round-trip, non-UTF-8 payloads, legacy back-compat, no-base64/no-inflation), test_l1_memory_bounds.py (oversized-entry rejection), test_large_object_memory.py (deterministic tracemalloc + wire-size regression guards that fail loudly if base64 returns).
  • Extended test_arrow_serializer.py with a dtype/index fidelity matrix (nullable, categorical ordered/unordered, tz-aware, timedelta, MultiIndex), compression, always-checksum, and exception-hygiene cases.
  • Full tests/unit + tests/critical suite: 1750 passed, 0 failed. ruff (incl. S) + basedpyright --level error clean.

Not in this PR (follow-up)

The residual ~6× write peak is pyarrow's in-memory IPC materialization (BufferOutputStream doubling-growth + memory-pool retention) — even uncompressed it is ~3.75×. Driving it lower requires streaming serialization straight to the backend (never holding the full blob), a backend-interface change tracked separately.

Summary by CodeRabbit

Release Notes

  • New Features

    • Configurable Arrow compression (zstd, lz4, none) for improved storage efficiency
    • Client-side Memcached item size validation before transmission
    • L1 cache size enforcement to prevent memory overruns
  • Improvements

    • Memcached now surfaces server errors instead of silently swallowing them
    • Arrow serialisation enforces integrity checksums consistently
    • Binary frame envelope for more efficient data wrapping
  • Compatibility

    • Fully backwards compatible with legacy serialisation formats

…al sizes)

Caching a real-sized DataFrame peaked at ~9.4x its logical size on the write
path and ~7.2x on read, OOM-killing a 300MB frame under a 1.5GB budget, and
silently inflated every cached value 1.33x on the wire/in L1. Two stacked
amplifiers, plus a cluster of silent-failure bugs.

Root cause:
- SerializationWrapper base64-encoded the payload inside a JSON object for
  EVERY serializer, forcing ~4 full-size copies at peak (b64-bytes -> ascii-str
  -> json-str -> utf8-bytes) and a permanent 1.33x wire/L1 inflation.
- ArrowSerializer copied the payload several more times (to_pybytes + a
  checksum-prepend concat on write; a full-body `data[8:]` bytes slice on read)
  and converted Arrow->pandas without low-peak flags.

Changes:
- SerializationWrapper: replace base64+JSON with a length-prefixed binary frame
  (CK magic | version | u32 header-len | json header | RAW payload). The payload
  is stored raw and copied exactly once. unwrap() still reads the legacy
  base64+JSON envelope, so pre-existing cache entries stay readable (no flush).
  Internal to the Python SDK; the cross-SDK ByteStorage wire format is unchanged.
- ArrowSerializer: enable zstd IPC compression written in ~8MiB record batches
  (bounds the compressor's working set), hash over the buffer memoryview (no
  copy), slice the read envelope with a memoryview (no full-body copy), convert
  with to_pandas(self_destruct, split_blocks), and use preserve_index=None so a
  RangeIndex is metadata-only. Always checksum -- the integrity-off path silently
  returned corrupted DataFrames, and 8 bytes is negligible. Magic-sniff the
  envelope on read so checksummed/raw/legacy data all decode and malformed input
  raises SerializationError (no more raw OSError leak); dict-of-scalars now raises
  the documented TypeError.
- memcached: reject values over the item-size cap with a clear PERMANENT
  BackendError and pass noreply=False, so oversized items fail loudly instead of
  being silently dropped (noreply previously swallowed the server's error -> a
  permanent silent cache miss for any real DataFrame).
- L1: refuse to store a single entry larger than the whole L1 budget (an OOM
  vector that also evicted every other entry) and drop any stale smaller entry
  for that key; the value still lives in L2.

Measured: write peak ~9.4x -> ~6x, read ~7.2x -> ~3.5x, Python-tracked peak
~5.7x -> ~2x, wire 1.33x -> ~1x (far less on compressible data). A 200MB frame
that OOM'd under a 1.5GB budget now caches in ~1.2GB.

The residual write peak is pyarrow's in-memory IPC materialization
(BufferOutputStream growth + memory-pool retention); driving it lower needs
streaming serialization straight to the backend (separate change).
@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

This PR introduces configurable Arrow IPC compression with always-on xxHash3-64 integrity checking, upgrades the serialization wrapper to a binary frame format (v3), and enforces client-side size limits across Memcached backend and L1 in-memory cache layers. It includes comprehensive unit and performance tests.

Changes

Serialization, Compression, and Size Bounds for Large Object Caching

Layer / File(s) Summary
Arrow compression config and serializer foundation
src/cachekit/config/settings.py, src/cachekit/serializers/arrow_serializer.py
Adds arrow_compression configuration field to CachekitConfig with Literal["zstd", "lz4", "none"] defaults, and extends ArrowSerializer.__init__ to accept a compression parameter with _resolve_compression() logic that normalises modes and resolves "auto" from settings.
Arrow IPC serialisation with compression and integrity envelope
src/cachekit/serializers/arrow_serializer.py
serialize() now applies per-record-batch compression via pa.ipc.IpcWriteOptions, always computes xxHash3-64 checksum over the compressed IPC bytes, prepends the checksum for an integrity envelope, and records metadata reflecting compression mode. Bounded chunking and memory release optimise large outputs.
Arrow deserialisation with checksum detection and legacy support
src/cachekit/serializers/arrow_serializer.py
deserialize() detects checksum-prefixed vs raw Arrow IPC by sniffing magic bytes, validates checksum when present, handles legacy raw IPC, converts to target format using self_destruct and split_blocks for buffer efficiency, and broadens exception handling to include OSError.
Serialisation wrapper v3 binary frame format
src/cachekit/serializers/wrapper.py
SerializationWrapper.wrap() and unwrap() now use v3 binary frame: MAGIC b"CK" | VERSION | HDR_LEN | HEADER(json) | PAYLOAD(raw bytes). unwrap() is backward compatible, detecting v3 by magic prefix and falling back to legacy base64+JSON for pre-existing cached entries.
Memcached backend client-side size enforcement
src/cachekit/backends/memcached/backend.py, src/cachekit/backends/memcached/config.py
MemcachedBackend.set() validates payload size against max_item_size_bytes configuration, raising BackendError with BackendErrorType.PERMANENT before sending to Memcached. Underlying HashClient.set() call now uses noreply=False so server-side rejections surface to the caller.
L1 cache oversized entry rejection
src/cachekit/l1_cache.py
L1Cache.put() checks estimated byte size against max_memory_bytes budget, rejecting oversized values without evicting existing entries and logging a debug message when budget would be exceeded.
Arrow serialiser unit tests: compression, integrity, compatibility
tests/unit/test_arrow_serializer.py
New test classes validate default zstd compression with size reduction, configurable compression modes, always-on integrity checksum regardless of enable_integrity_checking, legacy raw and checksum-prefixed payload compatibility, exception hygiene for corrupted inputs, RangeIndex preservation, and dtype/index fidelity across complex types and multi-index scenarios.
Serialisation wrapper unit tests: binary frame and backward compatibility
tests/unit/test_serialization_wrapper.py
New test classes verify binary frame round-tripping, absence of base64 encoding, bounded overhead, non-UTF-8 payload handling, legacy base64+JSON detection, garbage input rejection, and encryption metadata preservation. Integration tests confirm encrypted payloads survive the frame and legacy encrypted entries remain decryptable.
Memcached and integrity error tests
tests/critical/test_memcached_backend_critical.py, tests/unit/backends/test_memcached_backend.py, tests/unit/test_serializer_integrity.py, tests/unit/test_xxhash_integrity.py
Critical-path tests verify Memcached backend rejects oversized values before reaching the client and passes noreply=False so errors surface. Unit tests assert noreply=False across all TTL scenarios. Integrity tests updated to expect "Arrow envelope" error messages and verify checksum prefix at offset 8.
L1 cache oversized entry unit tests
tests/unit/test_l1_memory_bounds.py
New TestOversizedEntryRejection suite verifies oversized entries are rejected without evicting existing entries, exact-budget entries are stored, normal entries work, and memory budget is never exceeded under mixed workload.
Performance and memory regression tests
tests/performance/test_large_object_memory.py
New performance module with deterministic memory assertions using tracemalloc and on-wire size proxies. Tests verify bounded allocations during store/load paths with correctness checks and end-to-end cache-handler roundtrips including binary-frame prefix detection and DataFrame equality.

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 51.72% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The description comprehensively documents the problem, root causes, changes made across multiple components, backward compatibility, and testing. However, the provided PR description does not follow the exact structure of the template (missing explicit Type of Change checkboxes, Security Checklist, Documentation Validation Checklist formal completion, and Testing/Backward Compatibility section checkboxes). Whilst the content is thorough, align the description structure with the template by explicitly filling in the Type of Change, Security, Documentation Validation, Testing, and Backward Compatibility checklist sections.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarises the main change: fixing out-of-memory issues in large DataFrame/Arrow caching by bounding memory usage.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/arrow-largeobj-validation

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Jun 6, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.63636% with 10 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/cachekit/serializers/arrow_serializer.py 88.67% 4 Missing and 2 partials ⚠️
src/cachekit/l1_cache.py 66.66% 1 Missing and 1 partial ⚠️
src/cachekit/serializers/wrapper.py 90.47% 1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

27Bslash6 added 2 commits June 7, 2026 00:31
Adds a `compression` option to ArrowSerializer plus a CACHEKIT_ARROW_COMPRESSION
setting (default "zstd"). compression="auto" (the default) resolves from the
setting; "zstd"/"lz4" shrink the stored payload but must be decompressed into the
heap on read; None/"none" stores UNCOMPRESSED Arrow IPC, trading a larger payload
for the ability to memory-map reads zero-copy (the lowest read-memory path, ~0.3x
RSS vs ~1.3x for an in-memory decode). This makes the size-vs-read-memory tradeoff
a user/per-backend choice instead of a hardcoded default, and lays the groundwork
for a memory-mapped File-backend read path.

Also locks in regression tests proving the binary-frame wrapper round-trips
ENCRYPTED payloads end-to-end: encryption metadata (encrypted/tenant_id/
AES-256-GCM) survives the frame header, AAD binding still rejects a wrong cache_key
(EncryptionError, never silent), ciphertext is not base64-inflated, and legacy
base64+JSON encrypted entries still decrypt.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cachekit/serializers/arrow_serializer.py`:
- Around line 165-166: The current broad except around resolving
CACHEKIT_ARROW_COMPRESSION silently masks errors and forces "zstd"; instead
narrow the exception handling and validate values: only treat a missing/unset
setting as a fallback to "zstd", but for invalid values or unexpected failures
(e.g., ValueError, TypeError, or config parsing errors) raise or log the error
so it isn't silently ignored. Locate the try/except where the variable
compression is set (the block that reads CACHEKIT_ARROW_COMPRESSION) and replace
the blanket except Exception with specific handling — check the setting against
an allowed set of compression names, handle a missing setting by returning
"zstd", and surface (re-raise or log and raise) any other exceptions or invalid
values so misconfiguration isn't hidden.

In `@src/cachekit/serializers/wrapper.py`:
- Line 73: The public functions using bare dict (wrap and the function at line
102) should declare a stricter metadata type to improve static analysis: change
metadata: dict to metadata: Mapping[str, Any] (or dict[str, Any] if you prefer
concrete type) in the signatures for wrap and the other public method, and add
the necessary imports (Any and Mapping or dict typing) at the top of wrapper.py;
ensure import style is ruff-compatible (use from typing import Any, Mapping or
from collections.abc import Mapping for Py3.9+).
- Around line 115-123: The code that parses cache envelope frames (check for
_MAGIC, read frame_version, hdr_len, header_end and slices using _PREFIX_LEN)
must validate buffer bounds before slicing to avoid low-level errors: ensure mv
is at least _PREFIX_LEN bytes before accessing mv[len(_MAGIC)] and computing
hdr_len, ensure hdr_len is non-negative and header_end = _PREFIX_LEN + hdr_len
is <= len(mv) before json.loads or extracting payload, and normalize all these
failures to raise ValueError with a clear message (e.g. "truncated or corrupt CK
frame") so functions using the _MAGIC/_FRAME_VERSION/_PREFIX_LEN logic fail
predictably on corrupt entries.

In `@tests/unit/test_arrow_serializer.py`:
- Around line 443-448: The test test_default_is_auto_zstd assumes
CACHEKIT_ARROW_COMPRESSION is unset; make it environment-independent by
explicitly ensuring the env var is removed or set to a known value before
calling reset_settings() and ArrowSerializer().serialize; modify the test
(test_default_is_auto_zstd) to delete or pop "CACHEKIT_ARROW_COMPRESSION" from
os.environ (or set it to an empty value) prior to calling reset_settings() so
the default-compression behavior is deterministic when invoking
ArrowSerializer().serialize and asserting meta.compressed is True.

In `@tests/unit/test_l1_memory_bounds.py`:
- Around line 17-53: Add a unit that verifies an oversized update for the same
key removes the existing L1 entry: in the TestOversizedEntryRejection suite
create test_oversized_update_removes_stale_l1_entry which uses
L1Cache(max_memory_mb=1), puts a small value with cache.put("key", ...), asserts
cache.get("key")[0] is True, then does an oversized cache.put("key", ...) and
asserts cache.get("key")[0] is False and cache._current_memory_bytes == 0; this
ensures the behavior implemented in L1Cache.put (removing the L1 entry on
oversized updates) is covered.

In `@tests/unit/test_serialization_wrapper.py`:
- Around line 120-133: The fixture currently unconditionally pops
CACHEKIT_MASTER_KEY and can clobber an existing process value; modify the
fixture that constructs CacheSerializationHandler so it saves the prior
os.environ.get("CACHEKIT_MASTER_KEY") into a local variable before setting it,
sets os.environ["CACHEKIT_MASTER_KEY"] = "a"*64, yields the handler, and after
yield restores the original value (reassign if prior was not None, otherwise
remove the key). Use the existing reset_settings() calls and the
CacheSerializationHandler instantiation to locate where to add the save/restore
logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6fd12047-e3e9-40a4-81fd-d5dffe2bf306

📥 Commits

Reviewing files that changed from the base of the PR and between 5ac62d7 and e72ffff.

📒 Files selected for processing (14)
  • src/cachekit/backends/memcached/backend.py
  • src/cachekit/backends/memcached/config.py
  • src/cachekit/config/settings.py
  • src/cachekit/l1_cache.py
  • src/cachekit/serializers/arrow_serializer.py
  • src/cachekit/serializers/wrapper.py
  • tests/critical/test_memcached_backend_critical.py
  • tests/performance/test_large_object_memory.py
  • tests/unit/backends/test_memcached_backend.py
  • tests/unit/test_arrow_serializer.py
  • tests/unit/test_l1_memory_bounds.py
  • tests/unit/test_serialization_wrapper.py
  • tests/unit/test_serializer_integrity.py
  • tests/unit/test_xxhash_integrity.py

Comment on lines +165 to +166
except Exception: # noqa: BLE001 — settings unavailable: fall back to a sane default
compression = "zstd"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not silently mask settings failures during compression auto-resolution.

Line 165 catches every Exception and falls back to "zstd". This can hide invalid CACHEKIT_ARROW_COMPRESSION values or other settings faults, causing silent config drift at runtime.

Suggested fix
         if compression == "auto":
             try:
                 from cachekit.config.singleton import get_settings

                 compression = get_settings().arrow_compression
-            except Exception:  # noqa: BLE001 — settings unavailable: fall back to a sane default
+            except ImportError:
                 compression = "zstd"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cachekit/serializers/arrow_serializer.py` around lines 165 - 166, The
current broad except around resolving CACHEKIT_ARROW_COMPRESSION silently masks
errors and forces "zstd"; instead narrow the exception handling and validate
values: only treat a missing/unset setting as a fallback to "zstd", but for
invalid values or unexpected failures (e.g., ValueError, TypeError, or config
parsing errors) raise or log the error so it isn't silently ignored. Locate the
try/except where the variable compression is set (the block that reads
CACHEKIT_ARROW_COMPRESSION) and replace the blanket except Exception with
specific handling — check the setting against an allowed set of compression
names, handle a missing setting by returning "zstd", and surface (re-raise or
log and raise) any other exceptions or invalid values so misconfiguration isn't
hidden.

"""

@staticmethod
def wrap(data: bytes, metadata: dict, serializer_name: str, version: str = "2.0") -> bytes:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Tighten public API typing for metadata contracts.

Line 73 and Line 102 use bare dict in public method signatures, which weakens the API contract for static analysis. Use typed mappings (e.g. dict[str, Any]).

As per coding guidelines, "Python code. Enforce ruff compatibility, type hints on public APIs".

Proposed diff
 from __future__ import annotations
 
 import base64
 import json
-from typing import Union
+from typing import Any, Union
@@
-    def wrap(data: bytes, metadata: dict, serializer_name: str, version: str = "2.0") -> bytes:
+    def wrap(data: bytes, metadata: dict[str, Any], serializer_name: str, version: str = "2.0") -> bytes:
@@
-    def unwrap(wrapped_data: Union[str, bytes]) -> tuple[bytes, dict, str]:
+    def unwrap(wrapped_data: Union[str, bytes]) -> tuple[bytes, dict[str, Any], str]:

Also applies to: 102-102

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cachekit/serializers/wrapper.py` at line 73, The public functions using
bare dict (wrap and the function at line 102) should declare a stricter metadata
type to improve static analysis: change metadata: dict to metadata: Mapping[str,
Any] (or dict[str, Any] if you prefer concrete type) in the signatures for wrap
and the other public method, and add the necessary imports (Any and Mapping or
dict typing) at the top of wrapper.py; ensure import style is ruff-compatible
(use from typing import Any, Mapping or from collections.abc import Mapping for
Py3.9+).

Source: Coding guidelines

Comment on lines +115 to +123
if bytes(mv[: len(_MAGIC)]) == _MAGIC:
frame_version = mv[len(_MAGIC)]
if frame_version != _FRAME_VERSION:
raise ValueError(f"Unsupported cache envelope frame version {frame_version} (expected {_FRAME_VERSION})")
hdr_len = int.from_bytes(mv[len(_MAGIC) + 1 : _PREFIX_LEN], "big")
header_end = _PREFIX_LEN + hdr_len
header = json.loads(bytes(mv[_PREFIX_LEN:header_end]))
payload = bytes(mv[header_end:]) # single copy of the raw payload
return payload, header.get("m", {}), header.get("s", "unknown")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate frame bounds before parsing header/payload.

Line 116 and Line 121 can currently fail with low-level exceptions on truncated/corrupt b"CK" frames. Add explicit size guards (<_PREFIX_LEN, header_end > len) and normalise to ValueError so corrupt cache entries fail predictably.

Proposed diff
         if isinstance(wrapped_data, (bytes, bytearray, memoryview)):
             mv = memoryview(wrapped_data)
             if bytes(mv[: len(_MAGIC)]) == _MAGIC:
+                if mv.nbytes < _PREFIX_LEN:
+                    raise ValueError(
+                        f"Truncated cache envelope frame: {mv.nbytes} bytes (minimum {_PREFIX_LEN})"
+                    )
                 frame_version = mv[len(_MAGIC)]
                 if frame_version != _FRAME_VERSION:
                     raise ValueError(f"Unsupported cache envelope frame version {frame_version} (expected {_FRAME_VERSION})")
                 hdr_len = int.from_bytes(mv[len(_MAGIC) + 1 : _PREFIX_LEN], "big")
                 header_end = _PREFIX_LEN + hdr_len
-                header = json.loads(bytes(mv[_PREFIX_LEN:header_end]))
+                if header_end > mv.nbytes:
+                    raise ValueError(
+                        f"Invalid cache envelope header length {hdr_len}: frame has only {mv.nbytes} bytes"
+                    )
+                header = json.loads(bytes(mv[_PREFIX_LEN:header_end]))
                 payload = bytes(mv[header_end:])  # single copy of the raw payload
                 return payload, header.get("m", {}), header.get("s", "unknown")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cachekit/serializers/wrapper.py` around lines 115 - 123, The code that
parses cache envelope frames (check for _MAGIC, read frame_version, hdr_len,
header_end and slices using _PREFIX_LEN) must validate buffer bounds before
slicing to avoid low-level errors: ensure mv is at least _PREFIX_LEN bytes
before accessing mv[len(_MAGIC)] and computing hdr_len, ensure hdr_len is
non-negative and header_end = _PREFIX_LEN + hdr_len is <= len(mv) before
json.loads or extracting payload, and normalize all these failures to raise
ValueError with a clear message (e.g. "truncated or corrupt CK frame") so
functions using the _MAGIC/_FRAME_VERSION/_PREFIX_LEN logic fail predictably on
corrupt entries.

Comment on lines +443 to +448
def test_default_is_auto_zstd(self):
from cachekit.config.singleton import reset_settings

reset_settings() # no env override -> default zstd
_, meta = ArrowSerializer().serialize(pd.DataFrame({"a": [1] * 1000}))
assert meta.compressed is True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make the default-compression test environment-independent.

Line 446 assumes CACHEKIT_ARROW_COMPRESSION is not set. If it is set externally, this test becomes flaky.

Suggested fix
-    def test_default_is_auto_zstd(self):
+    def test_default_is_auto_zstd(self, monkeypatch):
         from cachekit.config.singleton import reset_settings

+        monkeypatch.delenv("CACHEKIT_ARROW_COMPRESSION", raising=False)
         reset_settings()  # no env override -> default zstd
         _, meta = ArrowSerializer().serialize(pd.DataFrame({"a": [1] * 1000}))
         assert meta.compressed is True
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_arrow_serializer.py` around lines 443 - 448, The test
test_default_is_auto_zstd assumes CACHEKIT_ARROW_COMPRESSION is unset; make it
environment-independent by explicitly ensuring the env var is removed or set to
a known value before calling reset_settings() and ArrowSerializer().serialize;
modify the test (test_default_is_auto_zstd) to delete or pop
"CACHEKIT_ARROW_COMPRESSION" from os.environ (or set it to an empty value) prior
to calling reset_settings() so the default-compression behavior is deterministic
when invoking ArrowSerializer().serialize and asserting meta.compressed is True.

Comment on lines +17 to +53
@pytest.mark.unit
class TestOversizedEntryRejection:
def test_entry_larger_than_budget_is_not_stored(self):
cache = L1Cache(max_memory_mb=1)
cache.put("big", b"\x00" * (2 * MB), redis_ttl=300)

found, _ = cache.get("big")
assert found is False
assert cache._current_memory_bytes == 0

def test_rejected_oversized_put_does_not_evict_existing_entries(self):
"""A doomed oversized put must not evict good entries on its way to failing."""
cache = L1Cache(max_memory_mb=1)
cache.put("keep", b"\x00" * (512 * 1024), redis_ttl=300) # fits

cache.put("toobig", b"\x00" * (5 * MB), redis_ttl=300) # cannot ever fit

assert cache.get("keep")[0] is True # survivor
assert cache.get("toobig")[0] is False
assert cache._current_memory_bytes <= cache.max_memory_bytes

def test_entry_equal_to_budget_is_stored(self):
cache = L1Cache(max_memory_mb=1)
cache.put("exact", b"\x00" * (1 * MB), redis_ttl=300)
assert cache.get("exact")[0] is True

def test_normal_entry_still_stored(self):
cache = L1Cache(max_memory_mb=10)
cache.put("k", b"value", redis_ttl=300)
assert cache.get("k") == (True, b"value")

def test_memory_never_exceeds_budget_under_mixed_load(self):
cache = L1Cache(max_memory_mb=2)
for i in range(20):
cache.put(f"k{i}", b"\x00" * (300 * 1024), redis_ttl=300) # 300KB each
cache.put("huge", b"\x00" * (50 * MB), redis_ttl=300) # rejected
assert cache._current_memory_bytes <= cache.max_memory_bytes

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider adding test for same-key oversized update scenario.

The implementation at src/cachekit/l1_cache.py:294-295 removes an existing L1 entry when an oversized put happens for the same key (to prevent serving stale data from L1 when L2 has the new oversized value). This critical behaviour isn't explicitly tested.

🧪 Suggested test to add
def test_oversized_update_removes_stale_l1_entry(self):
    """When updating a key with an oversized value, remove the old L1 entry."""
    cache = L1Cache(max_memory_mb=1)
    cache.put("key", b"\x00" * (100 * 1024), redis_ttl=300)  # 100 KB, fits
    
    assert cache.get("key")[0] is True  # initially present
    
    cache.put("key", b"\x00" * (10 * MB), redis_ttl=300)  # oversized update
    
    # Old entry must be removed so L1 doesn't serve stale data
    # (new oversized value lives in L2 only)
    assert cache.get("key")[0] is False
    assert cache._current_memory_bytes == 0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_l1_memory_bounds.py` around lines 17 - 53, Add a unit that
verifies an oversized update for the same key removes the existing L1 entry: in
the TestOversizedEntryRejection suite create
test_oversized_update_removes_stale_l1_entry which uses
L1Cache(max_memory_mb=1), puts a small value with cache.put("key", ...), asserts
cache.get("key")[0] is True, then does an oversized cache.put("key", ...) and
asserts cache.get("key")[0] is False and cache._current_memory_bytes == 0; this
ensures the behavior implemented in L1Cache.put (removing the L1 entry on
oversized updates) is covered.

Comment on lines +120 to +133
reset_settings()
os.environ["CACHEKIT_MASTER_KEY"] = "a" * 64
from cachekit.cache_handler import CacheSerializationHandler

handler = CacheSerializationHandler(
serializer_name="default",
encryption=True,
single_tenant_mode=True,
deployment_uuid="00000000-0000-0000-0000-000000000001",
)
yield handler
reset_settings()
os.environ.pop("CACHEKIT_MASTER_KEY", None)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Restore pre-existing CACHEKIT_MASTER_KEY to avoid cross-test state leakage.

Line 132 always removes the variable, so a pre-existing process value is lost after this fixture runs. Preserve and restore the prior value to keep tests isolated.

Proposed diff
     `@pytest.fixture`
     def enc_handler(self):
         import os
@@
-        reset_settings()
-        os.environ["CACHEKIT_MASTER_KEY"] = "a" * 64
+        reset_settings()
+        previous_master_key = os.environ.get("CACHEKIT_MASTER_KEY")
+        os.environ["CACHEKIT_MASTER_KEY"] = "a" * 64
         from cachekit.cache_handler import CacheSerializationHandler
 
         handler = CacheSerializationHandler(
             serializer_name="default",
             encryption=True,
             single_tenant_mode=True,
             deployment_uuid="00000000-0000-0000-0000-000000000001",
         )
-        yield handler
-        reset_settings()
-        os.environ.pop("CACHEKIT_MASTER_KEY", None)
+        try:
+            yield handler
+        finally:
+            reset_settings()
+            if previous_master_key is None:
+                os.environ.pop("CACHEKIT_MASTER_KEY", None)
+            else:
+                os.environ["CACHEKIT_MASTER_KEY"] = previous_master_key
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_serialization_wrapper.py` around lines 120 - 133, The fixture
currently unconditionally pops CACHEKIT_MASTER_KEY and can clobber an existing
process value; modify the fixture that constructs CacheSerializationHandler so
it saves the prior os.environ.get("CACHEKIT_MASTER_KEY") into a local variable
before setting it, sets os.environ["CACHEKIT_MASTER_KEY"] = "a"*64, yields the
handler, and after yield restores the original value (reassign if prior was not
None, otherwise remove the key). Use the existing reset_settings() calls and the
CacheSerializationHandler instantiation to locate where to add the save/restore
logic.

@27Bslash6 27Bslash6 merged commit ccd32c5 into main Jun 7, 2026
32 checks passed
@27Bslash6 27Bslash6 deleted the perf/arrow-largeobj-validation branch June 7, 2026 00:32
27Bslash6 added a commit that referenced this pull request Jun 7, 2026
…172)

* fix: harden cache-envelope framing and compression config resolution

Addresses review feedback on #152 (merged before these landed):

- wrapper.py: validate v3 frame bounds before parsing — truncated/corrupt
  b"CK" frames now raise a clean ValueError instead of a low-level
  IndexError/JSON error. Also tighten wrap()/unwrap() public signatures to
  dict[str, Any].
- arrow_serializer.py: narrow the compression auto-resolution except to
  ImportError. Invalid CACHEKIT_ARROW_COMPRESSION values now surface as a
  pydantic ValidationError instead of silently falling back to zstd.
- tests: make the default-compression test env-independent (delenv +
  reset_settings); restore pre-existing CACHEKIT_MASTER_KEY via monkeypatch
  to stop cross-test env leakage; add same-key oversized-update L1 test.

* test: cover frame-bounds guards and compression-config fallback

Restores patch coverage for the hardening added in the previous commit:
- wrapper.py: tests for truncated frame, unsupported frame version, and
  oversized header-length (the three ValueError guards) -> 100% coverage.
- arrow_serializer.py: test that auto compression falls back to zstd when the
  settings module is unimportable (the except ImportError branch).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant