[MP][Maru] Maru CXL shared L1 backend for MP mode #20

Draft
seohui-XCENA wants to merge 3 commits into dev from maru-mp-l1

@seohui-XCENA

No description provided.

- l1_protocol.py: structural runtime_checkable Protocol mirroring L1Manager's
  17-method surface, with per-method listener/lock contract docstrings
- config.py: MaruL1Config + maru_config field, __post_init__ DRAM-clamp skip,
  maru CLI args (--maru-server-url/--maru-pool-size-gb/--maru-instance-id),
  parse_args_to_config wiring
- tests: interface<->L1Manager method-set + signature conformance, maru
  config parsing
- thin wrapper over external maru_lmcache.CxlMemoryAdapter, lazy-imported
- two-phase init_layout: build MaruHandler + CxlMemoryAdapter on first layout
  (single-model; layout mismatch rejected)
- free/batched_free no-op (page lifecycle owned by MaruServer); abort_alloc
  discards an allocated-but-unregistered page
- MaruL1Config -> maru.MaruConfig mapping; maru-only get_by_location /
  create_store_handle / handler surface for MaruL1Manager
- tests use mocked maru runtime (no CXL required)
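
The interface<->L1Manager conformance idea from the first list can be sketched roughly as follows. This is a minimal illustration, not the PR's test code: `L1ManagerInterface`, `ToyL1Manager`, `public_methods`, and `conformance_gaps` are hypothetical names, and the two methods stand in for the real 17-method surface. Because `isinstance` against a `runtime_checkable` Protocol only checks that the methods exist, the sketch compares signatures separately.

```python
import inspect
from typing import Protocol, runtime_checkable


@runtime_checkable
class L1ManagerInterface(Protocol):
    """Hypothetical two-method stand-in for the 17-method surface."""

    def reserve_read(self, keys: list[str], extra_count: int = 0) -> dict[str, bool]: ...

    def delete(self, keys: list[str]) -> dict[str, bool]: ...


class ToyL1Manager:
    """Hypothetical implementation used only to exercise the check."""

    def reserve_read(self, keys: list[str], extra_count: int = 0) -> dict[str, bool]:
        return {key: True for key in keys}

    def delete(self, keys: list[str]) -> dict[str, bool]:
        return {key: True for key in keys}


def public_methods(cls) -> dict[str, inspect.Signature]:
    # Only functions defined directly on the class, skipping private names.
    return {
        name: inspect.signature(fn)
        for name, fn in vars(cls).items()
        if inspect.isfunction(fn) and not name.startswith("_")
    }


def conformance_gaps(protocol, impl) -> list[str]:
    # Report method-set differences first, then signature drift.
    want, have = public_methods(protocol), public_methods(impl)
    gaps = [f"missing: {name}" for name in want if name not in have]
    gaps += [f"extra: {name}" for name in have if name not in want]
    gaps += [
        f"signature differs: {name}"
        for name in want
        if name in have and want[name] != have[name]
    ]
    return gaps
```

An empty list from `conformance_gaps(L1ManagerInterface, ToyL1Manager)` means the method sets and signatures agree. A class that renames a parameter still passes `isinstance`, but shows up here as a signature difference.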
…ifecycle)

- sibling of L1Manager over the maru shared CXL pool: membership and read
  protection live in the MaruServer directory (pin_count); only in-flight
  staging is tracked locally (_pending_read refcount, _pending_write)
- reserve_read: per-key independent pins (1+extra_count via one RPC),
  rollback on partial pin / retrieve failure / unresolvable page
- reserve_write mode=new: local staged check + batch_exists dedup
  (cross-instance), all-or-nothing OOM; finish_write: batch_store,
  dup-skip is success, definitive False reclaims the page, unknown
  server state never recycles
- delete: staged keys and pinned keys refuse with KEY_IS_LOCKED (exists()
  disambiguates the handler's pinned/missing conflation)
- clear(force=False) keeps locked staging (stock parity); close drains
- PARITY/MARU provenance comments; RPC reply length guards
- tests: stateful fake maru runtime with fault-injection knobs,
  failure-path coverage, and a conformance suite parametrized over both
  L1Manager (CUDA) and MaruL1Manager
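
The reserve_read bullet above (per-key pins of 1+extra_count, with rollback when a later step fails) might look roughly like the sketch below. Everything here is an assumption made for illustration: `FakePinDirectory`, its `pin`/`unpin`/`retrieve` methods, and the per-key result dictionary are hypothetical and do not reflect the actual MaruServer RPCs or MaruL1Manager signatures. The real code pins via one RPC, whereas this sketch loops key by key for clarity.

```python
class FakePinDirectory:
    """Hypothetical in-memory stand-in for a server-side pin_count directory."""

    def __init__(self, keys, broken=()):
        self.pin_count = {key: 0 for key in keys}
        self.broken = set(broken)  # keys whose retrieve step should fail

    def pin(self, key, count):
        # Unknown keys cannot be pinned.
        if key not in self.pin_count:
            return False
        self.pin_count[key] += count
        return True

    def unpin(self, key, count):
        self.pin_count[key] -= count

    def retrieve(self, key):
        # Simulated fault injection: a pinned key can still fail to resolve.
        return key not in self.broken


def reserve_read(directory, pending_read, keys, extra_count=0):
    """Pin each key independently and undo the pin if retrieval fails."""
    count = 1 + extra_count
    results = {}
    for key in keys:
        if not directory.pin(key, count):
            results[key] = False
            continue
        if not directory.retrieve(key):
            directory.unpin(key, count)  # rollback: leave no stray pins behind
            results[key] = False
            continue
        # Only successful reservations are tracked as local in-flight state.
        pending_read[key] = pending_read.get(key, 0) + count
        results[key] = True
    return results
```

In this sketch a failed key leaves both the directory's `pin_count` and the local `pending_read` map untouched, which is the property the failure-path tests would presumably assert.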