fix: bound memory for large DataFrame/Arrow caching (was OOMing at real sizes)#152
Conversation
…al sizes) Caching a real-sized DataFrame peaked at ~9.4x its logical size on the write path and ~7.2x on read, OOM-killing a 300MB frame under a 1.5GB budget, and silently inflated every cached value 1.33x on the wire/in L1. Two stacked amplifiers, plus a cluster of silent-failure bugs. Root cause: - SerializationWrapper base64-encoded the payload inside a JSON object for EVERY serializer, forcing ~4 full-size copies at peak (b64-bytes -> ascii-str -> json-str -> utf8-bytes) and a permanent 1.33x wire/L1 inflation. - ArrowSerializer copied the payload several more times (to_pybytes + a checksum-prepend concat on write; a full-body `data[8:]` bytes slice on read) and converted Arrow->pandas without low-peak flags. Changes: - SerializationWrapper: replace base64+JSON with a length-prefixed binary frame (CK magic | version | u32 header-len | json header | RAW payload). The payload is stored raw and copied exactly once. unwrap() still reads the legacy base64+JSON envelope, so pre-existing cache entries stay readable (no flush). Internal to the Python SDK; the cross-SDK ByteStorage wire format is unchanged. - ArrowSerializer: enable zstd IPC compression written in ~8MiB record batches (bounds the compressor's working set), hash over the buffer memoryview (no copy), slice the read envelope with a memoryview (no full-body copy), convert with to_pandas(self_destruct, split_blocks), and use preserve_index=None so a RangeIndex is metadata-only. Always checksum -- the integrity-off path silently returned corrupted DataFrames, and 8 bytes is negligible. Magic-sniff the envelope on read so checksummed/raw/legacy data all decode and malformed input raises SerializationError (no more raw OSError leak); dict-of-scalars now raises the documented TypeError. - memcached: reject values over the item-size cap with a clear PERMANENT BackendError and pass noreply=False, so oversized items fail loudly instead of being silently dropped (noreply previously swallowed the server's error -> a permanent silent cache miss for any real DataFrame). - L1: refuse to store a single entry larger than the whole L1 budget (an OOM vector that also evicted every other entry) and drop any stale smaller entry for that key; the value still lives in L2. Measured: write peak ~9.4x -> ~6x, read ~7.2x -> ~3.5x, Python-tracked peak ~5.7x -> ~2x, wire 1.33x -> ~1x (far less on compressible data). A 200MB frame that OOM'd under a 1.5GB budget now caches in ~1.2GB. The residual write peak is pyarrow's in-memory IPC materialization (BufferOutputStream growth + memory-pool retention); driving it lower needs streaming serialization straight to the backend (separate change).
WalkthroughThis PR introduces configurable Arrow IPC compression with always-on xxHash3-64 integrity checking, upgrades the serialization wrapper to a binary frame format (v3), and enforces client-side size limits across Memcached backend and L1 in-memory cache layers. It includes comprehensive unit and performance tests. ChangesSerialization, Compression, and Size Bounds for Large Object Caching
🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Adds a `compression` option to ArrowSerializer plus a CACHEKIT_ARROW_COMPRESSION setting (default "zstd"). compression="auto" (the default) resolves from the setting; "zstd"/"lz4" shrink the stored payload but must be decompressed into the heap on read; None/"none" stores UNCOMPRESSED Arrow IPC, trading a larger payload for the ability to memory-map reads zero-copy (the lowest read-memory path, ~0.3x RSS vs ~1.3x for an in-memory decode). This makes the size-vs-read-memory tradeoff a user/per-backend choice instead of a hardcoded default, and lays the groundwork for a memory-mapped File-backend read path. Also locks in regression tests proving the binary-frame wrapper round-trips ENCRYPTED payloads end-to-end: encryption metadata (encrypted/tenant_id/ AES-256-GCM) survives the frame header, AAD binding still rejects a wrong cache_key (EncryptionError, never silent), ciphertext is not base64-inflated, and legacy base64+JSON encrypted entries still decrypt.
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cachekit/serializers/arrow_serializer.py`:
- Around line 165-166: The current broad except around resolving
CACHEKIT_ARROW_COMPRESSION silently masks errors and forces "zstd"; instead
narrow the exception handling and validate values: only treat a missing/unset
setting as a fallback to "zstd", but for invalid values or unexpected failures
(e.g., ValueError, TypeError, or config parsing errors) raise or log the error
so it isn't silently ignored. Locate the try/except where the variable
compression is set (the block that reads CACHEKIT_ARROW_COMPRESSION) and replace
the blanket except Exception with specific handling — check the setting against
an allowed set of compression names, handle a missing setting by returning
"zstd", and surface (re-raise or log and raise) any other exceptions or invalid
values so misconfiguration isn't hidden.
In `@src/cachekit/serializers/wrapper.py`:
- Line 73: The public functions using bare dict (wrap and the function at line
102) should declare a stricter metadata type to improve static analysis: change
metadata: dict to metadata: Mapping[str, Any] (or dict[str, Any] if you prefer
concrete type) in the signatures for wrap and the other public method, and add
the necessary imports (Any and Mapping or dict typing) at the top of wrapper.py;
ensure import style is ruff-compatible (use from typing import Any, Mapping or
from collections.abc import Mapping for Py3.9+).
- Around line 115-123: The code that parses cache envelope frames (check for
_MAGIC, read frame_version, hdr_len, header_end and slices using _PREFIX_LEN)
must validate buffer bounds before slicing to avoid low-level errors: ensure mv
is at least _PREFIX_LEN bytes before accessing mv[len(_MAGIC)] and computing
hdr_len, ensure hdr_len is non-negative and header_end = _PREFIX_LEN + hdr_len
is <= len(mv) before json.loads or extracting payload, and normalize all these
failures to raise ValueError with a clear message (e.g. "truncated or corrupt CK
frame") so functions using the _MAGIC/_FRAME_VERSION/_PREFIX_LEN logic fail
predictably on corrupt entries.
In `@tests/unit/test_arrow_serializer.py`:
- Around line 443-448: The test test_default_is_auto_zstd assumes
CACHEKIT_ARROW_COMPRESSION is unset; make it environment-independent by
explicitly ensuring the env var is removed or set to a known value before
calling reset_settings() and ArrowSerializer().serialize; modify the test
(test_default_is_auto_zstd) to delete or pop "CACHEKIT_ARROW_COMPRESSION" from
os.environ (or set it to an empty value) prior to calling reset_settings() so
the default-compression behavior is deterministic when invoking
ArrowSerializer().serialize and asserting meta.compressed is True.
In `@tests/unit/test_l1_memory_bounds.py`:
- Around line 17-53: Add a unit that verifies an oversized update for the same
key removes the existing L1 entry: in the TestOversizedEntryRejection suite
create test_oversized_update_removes_stale_l1_entry which uses
L1Cache(max_memory_mb=1), puts a small value with cache.put("key", ...), asserts
cache.get("key")[0] is True, then does an oversized cache.put("key", ...) and
asserts cache.get("key")[0] is False and cache._current_memory_bytes == 0; this
ensures the behavior implemented in L1Cache.put (removing the L1 entry on
oversized updates) is covered.
In `@tests/unit/test_serialization_wrapper.py`:
- Around line 120-133: The fixture currently unconditionally pops
CACHEKIT_MASTER_KEY and can clobber an existing process value; modify the
fixture that constructs CacheSerializationHandler so it saves the prior
os.environ.get("CACHEKIT_MASTER_KEY") into a local variable before setting it,
sets os.environ["CACHEKIT_MASTER_KEY"] = "a"*64, yields the handler, and after
yield restores the original value (reassign if prior was not None, otherwise
remove the key). Use the existing reset_settings() calls and the
CacheSerializationHandler instantiation to locate where to add the save/restore
logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6fd12047-e3e9-40a4-81fd-d5dffe2bf306
📒 Files selected for processing (14)
src/cachekit/backends/memcached/backend.pysrc/cachekit/backends/memcached/config.pysrc/cachekit/config/settings.pysrc/cachekit/l1_cache.pysrc/cachekit/serializers/arrow_serializer.pysrc/cachekit/serializers/wrapper.pytests/critical/test_memcached_backend_critical.pytests/performance/test_large_object_memory.pytests/unit/backends/test_memcached_backend.pytests/unit/test_arrow_serializer.pytests/unit/test_l1_memory_bounds.pytests/unit/test_serialization_wrapper.pytests/unit/test_serializer_integrity.pytests/unit/test_xxhash_integrity.py
| except Exception: # noqa: BLE001 — settings unavailable: fall back to a sane default | ||
| compression = "zstd" |
There was a problem hiding this comment.
Do not silently mask settings failures during compression auto-resolution.
Line 165 catches every Exception and falls back to "zstd". This can hide invalid CACHEKIT_ARROW_COMPRESSION values or other settings faults, causing silent config drift at runtime.
Suggested fix
if compression == "auto":
try:
from cachekit.config.singleton import get_settings
compression = get_settings().arrow_compression
- except Exception: # noqa: BLE001 — settings unavailable: fall back to a sane default
+ except ImportError:
compression = "zstd"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cachekit/serializers/arrow_serializer.py` around lines 165 - 166, The
current broad except around resolving CACHEKIT_ARROW_COMPRESSION silently masks
errors and forces "zstd"; instead narrow the exception handling and validate
values: only treat a missing/unset setting as a fallback to "zstd", but for
invalid values or unexpected failures (e.g., ValueError, TypeError, or config
parsing errors) raise or log the error so it isn't silently ignored. Locate the
try/except where the variable compression is set (the block that reads
CACHEKIT_ARROW_COMPRESSION) and replace the blanket except Exception with
specific handling — check the setting against an allowed set of compression
names, handle a missing setting by returning "zstd", and surface (re-raise or
log and raise) any other exceptions or invalid values so misconfiguration isn't
hidden.
| """ | ||
|
|
||
| @staticmethod | ||
| def wrap(data: bytes, metadata: dict, serializer_name: str, version: str = "2.0") -> bytes: |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Tighten public API typing for metadata contracts.
Line 73 and Line 102 use bare dict in public method signatures, which weakens the API contract for static analysis. Use typed mappings (e.g. dict[str, Any]).
As per coding guidelines, "Python code. Enforce ruff compatibility, type hints on public APIs".
Proposed diff
from __future__ import annotations
import base64
import json
-from typing import Union
+from typing import Any, Union
@@
- def wrap(data: bytes, metadata: dict, serializer_name: str, version: str = "2.0") -> bytes:
+ def wrap(data: bytes, metadata: dict[str, Any], serializer_name: str, version: str = "2.0") -> bytes:
@@
- def unwrap(wrapped_data: Union[str, bytes]) -> tuple[bytes, dict, str]:
+ def unwrap(wrapped_data: Union[str, bytes]) -> tuple[bytes, dict[str, Any], str]:Also applies to: 102-102
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cachekit/serializers/wrapper.py` at line 73, The public functions using
bare dict (wrap and the function at line 102) should declare a stricter metadata
type to improve static analysis: change metadata: dict to metadata: Mapping[str,
Any] (or dict[str, Any] if you prefer concrete type) in the signatures for wrap
and the other public method, and add the necessary imports (Any and Mapping or
dict typing) at the top of wrapper.py; ensure import style is ruff-compatible
(use from typing import Any, Mapping or from collections.abc import Mapping for
Py3.9+).
Source: Coding guidelines
| if bytes(mv[: len(_MAGIC)]) == _MAGIC: | ||
| frame_version = mv[len(_MAGIC)] | ||
| if frame_version != _FRAME_VERSION: | ||
| raise ValueError(f"Unsupported cache envelope frame version {frame_version} (expected {_FRAME_VERSION})") | ||
| hdr_len = int.from_bytes(mv[len(_MAGIC) + 1 : _PREFIX_LEN], "big") | ||
| header_end = _PREFIX_LEN + hdr_len | ||
| header = json.loads(bytes(mv[_PREFIX_LEN:header_end])) | ||
| payload = bytes(mv[header_end:]) # single copy of the raw payload | ||
| return payload, header.get("m", {}), header.get("s", "unknown") |
There was a problem hiding this comment.
Validate frame bounds before parsing header/payload.
Line 116 and Line 121 can currently fail with low-level exceptions on truncated/corrupt b"CK" frames. Add explicit size guards (<_PREFIX_LEN, header_end > len) and normalise to ValueError so corrupt cache entries fail predictably.
Proposed diff
if isinstance(wrapped_data, (bytes, bytearray, memoryview)):
mv = memoryview(wrapped_data)
if bytes(mv[: len(_MAGIC)]) == _MAGIC:
+ if mv.nbytes < _PREFIX_LEN:
+ raise ValueError(
+ f"Truncated cache envelope frame: {mv.nbytes} bytes (minimum {_PREFIX_LEN})"
+ )
frame_version = mv[len(_MAGIC)]
if frame_version != _FRAME_VERSION:
raise ValueError(f"Unsupported cache envelope frame version {frame_version} (expected {_FRAME_VERSION})")
hdr_len = int.from_bytes(mv[len(_MAGIC) + 1 : _PREFIX_LEN], "big")
header_end = _PREFIX_LEN + hdr_len
- header = json.loads(bytes(mv[_PREFIX_LEN:header_end]))
+ if header_end > mv.nbytes:
+ raise ValueError(
+ f"Invalid cache envelope header length {hdr_len}: frame has only {mv.nbytes} bytes"
+ )
+ header = json.loads(bytes(mv[_PREFIX_LEN:header_end]))
payload = bytes(mv[header_end:]) # single copy of the raw payload
return payload, header.get("m", {}), header.get("s", "unknown")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cachekit/serializers/wrapper.py` around lines 115 - 123, The code that
parses cache envelope frames (check for _MAGIC, read frame_version, hdr_len,
header_end and slices using _PREFIX_LEN) must validate buffer bounds before
slicing to avoid low-level errors: ensure mv is at least _PREFIX_LEN bytes
before accessing mv[len(_MAGIC)] and computing hdr_len, ensure hdr_len is
non-negative and header_end = _PREFIX_LEN + hdr_len is <= len(mv) before
json.loads or extracting payload, and normalize all these failures to raise
ValueError with a clear message (e.g. "truncated or corrupt CK frame") so
functions using the _MAGIC/_FRAME_VERSION/_PREFIX_LEN logic fail predictably on
corrupt entries.
| def test_default_is_auto_zstd(self): | ||
| from cachekit.config.singleton import reset_settings | ||
|
|
||
| reset_settings() # no env override -> default zstd | ||
| _, meta = ArrowSerializer().serialize(pd.DataFrame({"a": [1] * 1000})) | ||
| assert meta.compressed is True |
There was a problem hiding this comment.
Make the default-compression test environment-independent.
Line 446 assumes CACHEKIT_ARROW_COMPRESSION is not set. If it is set externally, this test becomes flaky.
Suggested fix
- def test_default_is_auto_zstd(self):
+ def test_default_is_auto_zstd(self, monkeypatch):
from cachekit.config.singleton import reset_settings
+ monkeypatch.delenv("CACHEKIT_ARROW_COMPRESSION", raising=False)
reset_settings() # no env override -> default zstd
_, meta = ArrowSerializer().serialize(pd.DataFrame({"a": [1] * 1000}))
assert meta.compressed is True🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/test_arrow_serializer.py` around lines 443 - 448, The test
test_default_is_auto_zstd assumes CACHEKIT_ARROW_COMPRESSION is unset; make it
environment-independent by explicitly ensuring the env var is removed or set to
a known value before calling reset_settings() and ArrowSerializer().serialize;
modify the test (test_default_is_auto_zstd) to delete or pop
"CACHEKIT_ARROW_COMPRESSION" from os.environ (or set it to an empty value) prior
to calling reset_settings() so the default-compression behavior is deterministic
when invoking ArrowSerializer().serialize and asserting meta.compressed is True.
| @pytest.mark.unit | ||
| class TestOversizedEntryRejection: | ||
| def test_entry_larger_than_budget_is_not_stored(self): | ||
| cache = L1Cache(max_memory_mb=1) | ||
| cache.put("big", b"\x00" * (2 * MB), redis_ttl=300) | ||
|
|
||
| found, _ = cache.get("big") | ||
| assert found is False | ||
| assert cache._current_memory_bytes == 0 | ||
|
|
||
| def test_rejected_oversized_put_does_not_evict_existing_entries(self): | ||
| """A doomed oversized put must not evict good entries on its way to failing.""" | ||
| cache = L1Cache(max_memory_mb=1) | ||
| cache.put("keep", b"\x00" * (512 * 1024), redis_ttl=300) # fits | ||
|
|
||
| cache.put("toobig", b"\x00" * (5 * MB), redis_ttl=300) # cannot ever fit | ||
|
|
||
| assert cache.get("keep")[0] is True # survivor | ||
| assert cache.get("toobig")[0] is False | ||
| assert cache._current_memory_bytes <= cache.max_memory_bytes | ||
|
|
||
| def test_entry_equal_to_budget_is_stored(self): | ||
| cache = L1Cache(max_memory_mb=1) | ||
| cache.put("exact", b"\x00" * (1 * MB), redis_ttl=300) | ||
| assert cache.get("exact")[0] is True | ||
|
|
||
| def test_normal_entry_still_stored(self): | ||
| cache = L1Cache(max_memory_mb=10) | ||
| cache.put("k", b"value", redis_ttl=300) | ||
| assert cache.get("k") == (True, b"value") | ||
|
|
||
| def test_memory_never_exceeds_budget_under_mixed_load(self): | ||
| cache = L1Cache(max_memory_mb=2) | ||
| for i in range(20): | ||
| cache.put(f"k{i}", b"\x00" * (300 * 1024), redis_ttl=300) # 300KB each | ||
| cache.put("huge", b"\x00" * (50 * MB), redis_ttl=300) # rejected | ||
| assert cache._current_memory_bytes <= cache.max_memory_bytes |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Consider adding test for same-key oversized update scenario.
The implementation at src/cachekit/l1_cache.py:294-295 removes an existing L1 entry when an oversized put happens for the same key (to prevent serving stale data from L1 when L2 has the new oversized value). This critical behaviour isn't explicitly tested.
🧪 Suggested test to add
def test_oversized_update_removes_stale_l1_entry(self):
"""When updating a key with an oversized value, remove the old L1 entry."""
cache = L1Cache(max_memory_mb=1)
cache.put("key", b"\x00" * (100 * 1024), redis_ttl=300) # 100 KB, fits
assert cache.get("key")[0] is True # initially present
cache.put("key", b"\x00" * (10 * MB), redis_ttl=300) # oversized update
# Old entry must be removed so L1 doesn't serve stale data
# (new oversized value lives in L2 only)
assert cache.get("key")[0] is False
assert cache._current_memory_bytes == 0🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/test_l1_memory_bounds.py` around lines 17 - 53, Add a unit that
verifies an oversized update for the same key removes the existing L1 entry: in
the TestOversizedEntryRejection suite create
test_oversized_update_removes_stale_l1_entry which uses
L1Cache(max_memory_mb=1), puts a small value with cache.put("key", ...), asserts
cache.get("key")[0] is True, then does an oversized cache.put("key", ...) and
asserts cache.get("key")[0] is False and cache._current_memory_bytes == 0; this
ensures the behavior implemented in L1Cache.put (removing the L1 entry on
oversized updates) is covered.
| reset_settings() | ||
| os.environ["CACHEKIT_MASTER_KEY"] = "a" * 64 | ||
| from cachekit.cache_handler import CacheSerializationHandler | ||
|
|
||
| handler = CacheSerializationHandler( | ||
| serializer_name="default", | ||
| encryption=True, | ||
| single_tenant_mode=True, | ||
| deployment_uuid="00000000-0000-0000-0000-000000000001", | ||
| ) | ||
| yield handler | ||
| reset_settings() | ||
| os.environ.pop("CACHEKIT_MASTER_KEY", None) | ||
|
|
There was a problem hiding this comment.
Restore pre-existing CACHEKIT_MASTER_KEY to avoid cross-test state leakage.
Line 132 always removes the variable, so a pre-existing process value is lost after this fixture runs. Preserve and restore the prior value to keep tests isolated.
Proposed diff
`@pytest.fixture`
def enc_handler(self):
import os
@@
- reset_settings()
- os.environ["CACHEKIT_MASTER_KEY"] = "a" * 64
+ reset_settings()
+ previous_master_key = os.environ.get("CACHEKIT_MASTER_KEY")
+ os.environ["CACHEKIT_MASTER_KEY"] = "a" * 64
from cachekit.cache_handler import CacheSerializationHandler
handler = CacheSerializationHandler(
serializer_name="default",
encryption=True,
single_tenant_mode=True,
deployment_uuid="00000000-0000-0000-0000-000000000001",
)
- yield handler
- reset_settings()
- os.environ.pop("CACHEKIT_MASTER_KEY", None)
+ try:
+ yield handler
+ finally:
+ reset_settings()
+ if previous_master_key is None:
+ os.environ.pop("CACHEKIT_MASTER_KEY", None)
+ else:
+ os.environ["CACHEKIT_MASTER_KEY"] = previous_master_key🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/test_serialization_wrapper.py` around lines 120 - 133, The fixture
currently unconditionally pops CACHEKIT_MASTER_KEY and can clobber an existing
process value; modify the fixture that constructs CacheSerializationHandler so
it saves the prior os.environ.get("CACHEKIT_MASTER_KEY") into a local variable
before setting it, sets os.environ["CACHEKIT_MASTER_KEY"] = "a"*64, yields the
handler, and after yield restores the original value (reassign if prior was not
None, otherwise remove the key). Use the existing reset_settings() calls and the
CacheSerializationHandler instantiation to locate where to add the save/restore
logic.
…172) * fix: harden cache-envelope framing and compression config resolution Addresses review feedback on #152 (merged before these landed): - wrapper.py: validate v3 frame bounds before parsing — truncated/corrupt b"CK" frames now raise a clean ValueError instead of a low-level IndexError/JSON error. Also tighten wrap()/unwrap() public signatures to dict[str, Any]. - arrow_serializer.py: narrow the compression auto-resolution except to ImportError. Invalid CACHEKIT_ARROW_COMPRESSION values now surface as a pydantic ValidationError instead of silently falling back to zstd. - tests: make the default-compression test env-independent (delenv + reset_settings); restore pre-existing CACHEKIT_MASTER_KEY via monkeypatch to stop cross-test env leakage; add same-key oversized-update L1 test. * test: cover frame-bounds guards and compression-config fallback Restores patch coverage for the hardening added in the previous commit: - wrapper.py: tests for truncated frame, unsupported frame version, and oversized header-length (the three ValueError guards) -> 100% coverage. - arrow_serializer.py: test that auto compression falls back to zstd when the settings module is unimportable (the except ImportError branch).
Problem
Large-object (Arrow/DataFrame) caching was brittle and OOM'd at real data sizes. Measured on representative numeric DataFrames:
Root cause
Two stacked, independent amplifiers plus a cluster of silent-failure bugs:
SerializationWrapperbase64-over-JSON envelope — applied to every serializer's output before storage.base64.b64encode → .decode("ascii") → json.dumps → .encode()forced ~4 full-size copies at peak and a permanent 1.33× wire/L1 inflation. This dominated both directions (the read peak was set during unwrap, notto_pandas).ArrowSerializercopy chain —to_pybytes()+ achecksum + arrow_dataconcat on write; a full-bodydata[8:]bytes slice on read;to_pandas()without low-peak flags;preserve_index=Truematerializing even aRangeIndex.Separately surfaced: memcached silently dropped any value >1 MB (the
setinheritednoreply=True, so the server's "object too large" error was never read → permanent silent cache miss); L1 stored a single entry larger than its entire budget (an OOM vector that also evicted everything else);integrity_checking=Falsegave Arrow zero corruption detection (a 1-bit flip silently returned wrong data).Changes
SerializationWrapper→ length-prefixed binary frame (CK | version | u32 header-len | json header | RAW payload). Payload stored raw, copied once.unwrap()still reads the legacy base64+JSON envelope, so pre-existing cache entries remain readable — no flush required. Python-SDK-internal; the cross-SDK ByteStorage wire format is untouched.ArrowSerializer→ zstd IPC compression written in ~8 MiB record batches (bounds the compressor's working set), checksum hashed over the buffermemoryview(no copy), read envelope sliced with amemoryview(no full-body copy),to_pandas(self_destruct=True, split_blocks=True),preserve_index=None. Always checksums (the silent-corruption knob is gone; 8 bytes is free). Magic-sniffs the envelope on read so checksummed/raw/legacy all decode; normalizesdict-of-scalars to the documentedTypeErrorand malformed input toSerializationError(no more rawOSErrorleak).PERMANENTBackendErrorand setsnoreply=Falseso server errors surface loudly.Backward compatibility
[checksum][IPC]Arrow) still deserialize. New writes use the new formats; old entries age out by TTL.wrap/unwrapand serializer signatures unchanged.Testing
test_serialization_wrapper.py(frame round-trip, non-UTF-8 payloads, legacy back-compat, no-base64/no-inflation),test_l1_memory_bounds.py(oversized-entry rejection),test_large_object_memory.py(deterministic tracemalloc + wire-size regression guards that fail loudly if base64 returns).test_arrow_serializer.pywith a dtype/index fidelity matrix (nullable, categorical ordered/unordered, tz-aware, timedelta, MultiIndex), compression, always-checksum, and exception-hygiene cases.tests/unit+tests/criticalsuite: 1750 passed, 0 failed. ruff (incl.S) +basedpyright --level errorclean.Not in this PR (follow-up)
The residual ~6× write peak is pyarrow's in-memory IPC materialization (
BufferOutputStreamdoubling-growth + memory-pool retention) — even uncompressed it is ~3.75×. Driving it lower requires streaming serialization straight to the backend (never holding the full blob), a backend-interface change tracked separately.Summary by CodeRabbit
Release Notes
New Features
Improvements
Compatibility