feat(types): UUID Arrow type migration — pa.binary(16) for all UUID storage (PLT-1162) by kurodo3[bot] · Pull Request #161 · nauticalab/orcapod-python

kurodo3 · 2026-06-11T23:33:56Z

Summary

Replace pa.large_string() UUID storage with pa.binary(16) (fixed_size_binary[16]) across the entire OrcaPod codebase, introducing a semantic struct converter and updating all UUID generation and storage sites.

Changes

Area	What changed
`types.py`	No new constants added — each call site uses `pa.binary(16)` directly. Adding module-level lazy constants would trigger PyArrow import at `types` import time, defeating the `LazyModule` pattern already used there.
`semantic_struct_converters.py`	`UUIDStructConverter` with full `SemanticStructConverterProtocol` compliance
`versioned_hashers.py`	`UUIDStructConverter` registered under `"uuid"` in the default semantic type registry
`builtin_handlers.py`	`UUIDHandler.handle()` returns `obj.bytes` (16 raw bytes) instead of `str(obj)`
`arrow_utils.py`	System record_id tag column type → `pa.binary(16)`; cascading fixes to `universal_converter.py`, `join.py`, `merge_join.py`, `stream_builder.py`, `cached_source.py`
`datagram.py`, `data_function.py`	`datagram_id` now returns `uuid.UUID` (normalized from `uuid_utils.uuid7`); source-info UUID segment uses `.hex`
`logging_observer.py`, `status_observer.py`	`uuid7().bytes` (raw 16 bytes) instead of `str(uuid7())` for `_log_id` / `_status_id` Arrow columns
`postgresql_connector.py`	`uuid` PG type → `pa.binary(16)`; `_coerce_pg_value()` converts driver `uuid.UUID` objects to `.bytes`; TODO comment removed
`databases/utils.py`	`coerce_record_id()` helper: transparently encodes `str` record IDs to `bytes` at DB method boundaries
`superpowers/specs/`, `superpowers/plans/`	Design spec and implementation plan

Test plan

3616 tests pass, 17 skipped (live DB integration), 0 failures
Full UUIDStructConverter suite (19 tests) including Arrow array round-trip, bytearray handling, protocol method coverage
UUIDHandler returns bytes test
PostgreSQL type mapping unit test (pa.binary(16) for uuid)
Semantic registry registration verified end-to-end

Notes

Breaking changes (pre-v0.1.0, intentional per spec):

datagram_id is now uuid.UUID (returned by Datagram.datagram_id, Tag.datagram_id, Data.datagram_id). Any code assuming it was str will break.
record_id (DB pipeline entry IDs), _log_id, _status_id are now bytes (16 bytes). Any code assuming these were str will break.

Polars write_json() limitation: pola-rs/polars#15410 — OrcaPod has zero write_json() calls, unaffected in practice.

SQLite: No change required — TEXT columns remain pa.large_string(). Arrow type round-trip fidelity for loosely-typed backends is tracked in PLT-1615.

run_id: Intentionally unchanged — it's a transient Python string passed to observer callbacks, never stored in Arrow.

Fixes PLT-1162

🤖 Generated with Claude Code

codecov · 2026-06-11T23:35:08Z

Codecov Report

❌ Patch coverage is 96.61017% with 4 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/orcapod/databases/postgresql_connector.py	77.77%	2 Missing ⚠️
src/orcapod/pipeline/status_observer.py	50.00%	1 Missing ⚠️
...capod/semantic_types/semantic_struct_converters.py	97.05%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Formalises pa.binary(16) as the canonical Arrow type for all UUID values in OrcaPod — both system identifiers (UUID_ARROW_TYPE) and user-facing semantic round-trips (UUID_STRUCT_ARROW_TYPE). Documents the two-layer design, no-truncation rule, known Polars write_json limitation, and full implementation scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…y via module __getattr__ (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…uct round-trips (PLT-1162) Implements UUIDStructConverter in semantic_struct_converters.py following the existing PythonPathStructConverter/UPathStructConverter pattern. Stores UUIDs as fixed 16-byte binary in a single-field Arrow struct. Duck-typed to also accept uuid_utils.UUID objects (uuid7 etc.) which do not inherit from uuid.UUID but expose .bytes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ter (PLT-1162) Adds `can_handle_python_type`, `can_handle_struct_type`, and `hash_struct_dict` to `UUIDStructConverter` so it fully satisfies `SemanticStructConverterProtocol`. Also adds 7 new tests covering the new methods.

…c registry (PLT-1162) Register UUIDStructConverter in get_versioned_semantic_arrow_hasher so that uuid.UUID values are handled by the default semantic type registry. Add three tests verifying the registration by python type, struct signature, and semantic type name.

…th binary Arrow storage (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tring (PLT-1162)

…LT-1162) - Change record_id system tag column type from pa.large_string() to UUID_ARROW_TYPE (pa.binary(16)) in arrow_utils.add_system_tag_columns - Import UUID_ARROW_TYPE from orcapod.types in arrow_utils.py - Update stream_builder.py to generate deterministic UUID v5 bytes for record_id values (derived from source_id + provenance token), replacing the previous string-based provenance tokens passed to the system tag column - Add fixed_size_binary support to UniversalTypeConverter.arrow_type_to_python_type so UUID_ARROW_TYPE columns map to Python bytes - Fix _predict_system_tag_schema in Join and MergeJoin operators to carry through the actual column type (bytes for record_id) rather than hardcoding str - Fix CachedSource to exclude random system-tag record_id columns from the row hash used for deduplication (record_ids are now deterministic UUIDs per source, so dedup is stable) - Update tests that passed string record_ids to pass UUID bytes instead Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…D binary migration (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- datagram.py: datagram_id property returns bytes (uuid7().bytes), update record_id parameter and internal method signatures from str|None to bytes|None - data_function.py: generate uuid7() once, use .bytes for record_id Arrow storage and .hex for source_info provenance strings (source_info stays large_string) - logging_observer.py: log_id uses uuid7().bytes with pa.binary(16) column type - status_observer.py: status_id uses uuid7().bytes with pa.binary(16) column type in both _ContextualizedStatusObserver and _ScopedStatusObserver - Cascade updates to maintain consistency: - protocols/core_protocols/datagrams.py: datagram_id protocol return type bytes - tag_data.py: record_id parameter annotations bytes|None - function_node.py: data_record_id bytes, DATA_RECORD_ID stored as pa.binary(16) - databases: _ensure_record_id_column, add_record accept str|bytes in all impls - database_protocols.py: add_record protocol updated to str|bytes - Fix two tests broken by the type change: source_info UUID format (dashes→hex) and data_record_id column type (str→bytes with len==16) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nary(16)) (PLT-1162) - Move uuid out of the grouped string branch in _pg_type_to_arrow so it returns UUID_ARROW_TYPE = pa.binary(16) instead of pa.large_string() - Add _coerce_pg_value helper to convert psycopg uuid.UUID objects to bytes before Arrow array construction in iter_batches - Update test_uuid in TestPgTypeToArrow to assert pa.binary(16) - Import _coerce_pg_value in test file for completeness

…T-1162)

eywalker

Reviewing through, I don't think that the use of UUID TYPE constant is actually helping with anything -- let's just make that fixed to be defined as binary type and drop use of constants as we as the getattr trick.

eywalker · 2026-06-11T23:52:09Z

+            c for c in live_table.column_names
+            if c.startswith(constants.SYSTEM_TAG_RECORD_ID_PREFIX)
+        ]
+        hash_table = (


was this some kind of a bug that was present before?

Yes, this was a latent pre-existing bug exposed by the migration.

Before this PR, was str(uuid7()) — a fresh random string on every run. Since the dedup hash included the record_id column and that value changed every invocation, the cached-source dedup never actually recognised repeat rows as duplicates. It was silently broken.

The migration made it visible: deterministic UUID v5 record_ids meant the hash would be stable across runs, which is correct — but we also don't need to include the record_id in the hash at all, since it's derived from the source_id + provenance token already captured in the hash. So the fix excludes record_id tag columns from the dedup hash (removing redundant information) while making the dedup work correctly for the first time.

eywalker · 2026-06-11T23:53:08Z


+# Namespace UUID for source record IDs — all source record IDs are derived
+# from this namespace and the row's source_id + provenance token.
+_SOURCE_RECORD_ID_NAMESPACE = uuid.UUID("7f8e9d0c-1b2a-3c4d-5e6f-7a8b9c0d1e2f")


how is this value chosen?

It was arbitrary — chosen once by the implementer with no principled derivation, which is why it looked opaque.

Fixed: replaced with a UUID v5 derived from uuid.NAMESPACE_URL + a stable OrcaPod-specific URL string, so the value is reproducible from its inputs rather than being a magic hex literal:

_SOURCE_RECORD_ID_NAMESPACE = uuid.uuid5( uuid.NAMESPACE_URL, "https://orcapod.io/namespaces/source-record-id", )

Added a doc comment noting this value must never change once deployed, since changing it would invalidate all existing stored record IDs.

eywalker · 2026-06-11T23:55:22Z


    @property
-    def datagram_id(self) -> str:
+    def datagram_id(self) -> bytes:


wouldn't this be cleaner to return an actual UUID object instead?

Agreed — done. datagram_id now returns uuid.UUID throughout the protocol, Datagram, and Tag/Data. Generation uses uuid7() normalised to stdlib uuid.UUID (for cross-library equality). Arrow array construction calls .bytes at the point of actual serialisation, keeping the Python interface ergonomic.

eywalker · 2026-06-11T23:57:23Z

+
+
+def __getattr__(name: str) -> object:
+    """Lazy resolution for module-level constants that depend on pyarrow.


which constant would have triggered the invocation of pyarrow and why?

Accessing either constant triggered it. The __getattr__ in types.py called import pyarrow as _pa to construct the value on first access. So any from orcapod.types import UUID_ARROW_TYPE or from orcapod import UUID_ARROW_TYPE call would load the full pyarrow C-extension immediately — which made the __init__.py eager import of the constant just as non-lazy. The __init__.py __getattr__ was needed to defer even that.

This whole mechanism is removed now per the overall review direction: the constants are gone, both __getattr__ functions are removed, and pa.binary(16) is used inline at each site.

eywalker · 2026-06-11T23:58:35Z

            assert len(values) == result_table.num_rows
            for val in values:
-                assert isinstance(val, str)
+                # source_id columns are large_string (str), record_id columns are


perform a more thorough test based on the target column

Done. The assertion now checks per column: record_id columns (matching SYSTEM_TAG_RECORD_ID_PREFIX) assert isinstance(val, bytes) and len(val) == 16; all other system tag columns assert isinstance(val, str). This catches any accidental type regression in either direction.

…atagram_id, per-column tests (PLT-1162) - Remove UUID_ARROW_TYPE and UUID_STRUCT_ARROW_TYPE constants and their lazy __getattr__ resolvers from orcapod.types and orcapod.__init__; use pa.binary(16) / pa.struct([...]) inline everywhere - Replace _SOURCE_RECORD_ID_NAMESPACE opaque hex literal with a principled uuid.uuid5 derivation with doc comment explaining it must never change - Change datagram_id property return type from bytes to uuid.UUID throughout (protocol, Datagram, Tag, Data); generation via uuid7() normalised to stdlib uuid.UUID to ensure cross-library equality; call sites that build pa.binary(16) arrays use .bytes - Strengthen system-tag column type assertion in test_operators to per-column checks: record_id columns asserted as bytes+len==16, source_id columns asserted as str Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 · 2026-06-12T00:16:47Z

Review round 1 — addressed in commit `13a1d5a`

All comments addressed. Summary:

Drop UUID_ARROW_TYPE / UUID_STRUCT_ARROW_TYPE constants + __getattr__ trick (overall review + question on __init__.py)
Removed both named constants and both __getattr__ functions (types.py and __init__.py). Every site now uses pa.binary(16) or pa.struct([pa.field("uuid", pa.binary(16))]) inline. The __getattr__ mechanism was needed because importing either constant immediately triggered pyarrow loading — that whole layer is gone.

cached_source.py dedup hash — yes, pre-existing latent bug
The old code included the record_id column in the dedup hash. Since record_id was str(uuid7()) (fresh random on every run), the hash was always different, so cached-source dedup was silently never matching. The migration made this visible. The fix excludes record_id columns from the hash (they're derived from data already in the hash) and makes the dedup work correctly for the first time.

_SOURCE_RECORD_ID_NAMESPACE — now principled
Replaced the opaque hex literal with uuid.uuid5(uuid.NAMESPACE_URL, "https://orcapod.io/namespaces/source-record-id") plus a doc comment that this value must never change once deployed.

datagram_id returns uuid.UUID
Accepted. Protocol, Datagram, Tag, Data all return uuid.UUID. Arrow array construction calls .bytes at the serialisation point.

Per-column system-tag type assertions in test_operators.py
Replaced the weak isinstance(val, (str, bytes)) check with per-column assertions: record_id columns → bytes + len == 16; all others → str.

eywalker · 2026-06-12T00:19:11Z

        """Return (or lazily generate) the datagram's unique ID."""
        if self._datagram_id is None:
-            self._datagram_id = str(uuid7())
+            # uuid7() returns a uuid_utils.UUID; normalise to stdlib uuid.UUID


Is there a good reason to mix uuid_utils and stdlib for handling uuid? Does uuid_utils offer richer set of functions for uuid generation/handling?

Agreed — is only here for UUIDv7 generation (time-ordered, monotonic), which isn't in the stdlib before Python 3.12. However and are distinct types that don't compare equal even for identical bit patterns, so we normalise immediately to on creation. The comment has been expanded to make this explicit.

eywalker · 2026-06-12T00:21:02Z

                        f"{stream.pipeline_hash().to_hex(n_char)}:{idx}"
                    )
-                    system_tag_fields[new_name] = str
+                    system_tag_fields[new_name] = col_type


We should include tests to ensure that the removal force convresion into str is not creating a new error?

Done. Added in . It iterates over all system-tag columns in the predicted schema after a two-way join and asserts that record_id columns map to (from ) and source_id columns map to — verifying that is carried through unchanged.

eywalker · 2026-06-12T00:26:58Z

        )

-        # Compute per-row record hashes for dedup: hash(full row)
+        # Compute per-row record hashes for dedup: hash(full row excluding


I acutally don't understand the logic here. For system tag, the source_id is the unique specifier of the nature/identity of the source and the record_id is a unique specifier about the origin of the specific record. If we are deeming record_id to be truly redundant with record_hash, then we should use record_id as the record_hash. Shouldn't it be the case that we'd actually include everything that's part of the identity of the row as part of the record hash so that in case two entries with everything identical BUT the record_id, suggesting that they originated from two distinct entries in the source that happens to carry identical values, we can correctly distinguish them? Otherwise, of course the inclusion of the rest of the fields are important to detect situation where same record_id entries end up having distinct values but we still want to track it.

Agreed — I reverted the exclusion. The record_id system tag encodes row origin (source identity + provenance token), so two rows that are byte-identical in their data columns but originate from different positions in the source will have distinct record_id values and must be stored as separate entries. Hashing the full live_table (including the record_id column) is correct. The dedup logic now hashes all system tag columns.

eywalker · 2026-06-12T00:28:56Z

+# record IDs.
+_SOURCE_RECORD_ID_NAMESPACE = uuid.uuid5(
+    uuid.NAMESPACE_URL,
+    "https://orcapod.io/namespaces/source-record-id",


What is this typically set to? Also be sure to use Orcapod instead of OrcaPod throughout the codebase. If this is meant to be a namespace URL that we would maintain, make it https://orcapod.org/namespaces/source-record-id

Fixed:

URL corrected to https://orcapod.org/namespaces/source-record-id (was http://orcapod.ai/...)

Capitalisation corrected to "Orcapod"

Added the pre-computed value as a comment: uuid.UUID('877ec89b-6645-5852-ba37-a94604043f5e')

eywalker · 2026-06-12T00:29:32Z

    return f"row_{row_index}"


+def _make_record_id_bytes(source_id: str, provenance_token: str) -> bytes:


let's instead return uuid.UUID for consistency?

Fixed — _make_record_id_bytes renamed to _make_record_id and now returns a uuid.UUID rather than raw bytes. The call site does .bytes to get the 16-byte value for Arrow storage. This gives callers a real UUID object (with full UUID API surface) instead of an opaque byte string.

eywalker · 2026-06-12T00:33:01Z

        self._scoped_path = _scoped_path
        self._pending_batches: dict[str, pa.Table] = _shared_pending_batches if _shared_pending_batches is not None else {}
-        self._pending_record_ids: dict[str, set[str]] = _shared_pending_record_ids if _shared_pending_record_ids is not None else defaultdict(set)
+        self._pending_record_ids: dict[str, set[str | bytes]] = _shared_pending_record_ids if _shared_pending_record_ids is not None else defaultdict(set)


why do we still have to take in str?

Dropped str entirely. compute_pipeline_entry_id now returns bytes (a method-prefixed digest via ContentHash.to_prefixed_digest()), and UUID-based record IDs are stored as 16-byte raw values — both are bytes. There is no longer a caller path that passes a str record ID, so the str union arm was just dead code.

eywalker · 2026-06-12T00:33:46Z

+        self, arrow_data: pa.Table, record_id: str | bytes
    ) -> pa.Table:
        if self.RECORD_ID_COLUMN not in arrow_data.column_names:
+            arrow_type = pa.binary(16) if isinstance(record_id, bytes) else pa.large_string()


why is this necessary? What currently can yield incompatible str record_id?

Cleaned up — record_id is now bytes only and _ensure_record_id_column always creates a pa.large_binary() column (no conditional branch on type). pa.large_binary() is used uniformly for all database record ID columns: it accommodates both fixed-size UUID bytes (16 bytes) and variable-length method-prefixed pipeline entry digests without requiring a schema change if sizes shift.

eywalker · 2026-06-12T00:34:07Z

        if self.RECORD_ID_COLUMN not in arrow_data.column_names:
            # Add record_id column at the beginning
-            key_array = pa.array([record_id] * len(arrow_data), type=pa.large_string())
+            arrow_type = pa.binary(16) if isinstance(record_id, bytes) else pa.large_string()


we should drop support for large_string

Dropped large_string support entirely. _ensure_record_id_column now always creates a pa.large_binary() column unconditionally, and all str | bytes / str type annotations in the Delta Lake DB layer have been updated to bytes only — matching the protocol and the other database implementations.

…o_prefixed_digest() Addresses second round of PR review comments on the UUID Arrow type migration. Key changes: - Add `ContentHash.to_prefixed_digest()` → `b"{method}:{digest}"` for human-readable method-prefixed binary record IDs - `compute_pipeline_entry_id()` now returns `bytes` via `to_prefixed_digest()`; update all `dict[str, ...]` caches in FunctionNode to `dict[bytes, ...]` - Migrate all database record-ID storage from `pa.binary(16)` → `pa.large_binary()` so UUID-based and hash-based IDs share a single consistent Arrow type - Drop `str | bytes` unions from `ArrowDatabaseProtocol`, all DB implementations, and `_pending_record_ids`; record IDs are now uniformly `bytes` - `_make_record_id_bytes` → `_make_record_id` returning `uuid.UUID` (caller calls `.bytes`) - Namespace URL fixed: `orcapod.io` → `orcapod.org`, "Orcapod" capitalisation, computed UUID value added as comment - Revert C3: include all system tag columns (including record_id) in dedup hash so rows with identical data values but different provenance are not collapsed - Add test for `col_type` preservation through `_predict_system_tag_schema` - Improve datagram.py comment explaining uuid_utils vs stdlib normalisation - Update all DB and FunctionNode tests to use `bytes` record IDs Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

kurodo3 · 2026-06-12T01:41:29Z

Round 2 review — summary of changes

All review comments addressed in commit fix(uuid): migrate record IDs to pa.large_binary(); add ContentHash.to_prefixed_digest().

C1 — `uuid_utils` vs stdlib UUID (`datagram.py`)

Expanded the comment to explain both why uuid_utils is used (UUIDv7 generation, not yet in stdlib before 3.12) and why we normalise to uuid.UUID immediately (cross-library type inequality even for identical bit patterns).

C2 — `col_type` propagation through `_predict_system_tag_schema` (`join.py`)

Added test_output_schema_preserves_system_tag_col_types to TestJoinOutputSchemaSystemTags. The test verifies that record_id system-tag columns carry bytes and source_id columns carry str through output_schema().

C3 — `record_id` in dedup hash (`cached_source.py`)

Reverted the exclusion. record_id encodes row origin (source identity + provenance token), so two identical-data rows from different source positions must produce distinct hash entries. The dedup loop now hashes the full live_table including all system tag columns.

C4 — Namespace URL / capitalisation (`stream_builder.py`)

URL corrected to https://orcapod.org/namespaces/source-record-id
Capitalisation corrected to "Orcapod"
Pre-computed UUID value added as a comment: uuid.UUID('877ec89b-6645-5852-ba37-a94604043f5e')

C5 — `_make_record_id` return type (`stream_builder.py`)

Renamed _make_record_id_bytes → _make_record_id; function now returns uuid.UUID. Call site uses .bytes for Arrow storage.

C6 / C7 — Drop `str` from `connector_arrow_database.py`

record_id is now bytes only throughout. _ensure_record_id_column always creates pa.large_binary() unconditionally. No caller passes str record IDs — the union was dead code.

C8 — Drop `large_string` from `delta_lake_databases.py`

Same change applied to the Delta Lake DB layer: _ensure_record_id_column always uses pa.large_binary(), and all str | bytes / str type annotations updated to bytes.

Broader scope (discussed and agreed during review)

pa.large_binary() for all database record ID columns — replaces pa.binary(16) in arrow_utils.py. Accommodates both fixed-size UUID bytes and variable-length method-prefixed pipeline entry digests without a future schema migration.
ContentHash.to_prefixed_digest() — new method returning b"{method}:{digest}" bytes. Used by compute_pipeline_entry_id(), which now returns bytes instead of a hex string.
Protocol and all DB implementations — add_record, get_record_by_id, get_records_by_ids all updated to bytes-only signatures across database_protocols.py, noop_database.py, in_memory_databases.py, connector_arrow_database.py, delta_lake_databases.py.
All tests passing (3616 tests, 56 skipped).

…ID mapping - test-objective/unit/test_databases.py: update all string record IDs to bytes literals (b"...") to match the bytes-only record_id API introduced in the large_binary migration. String IDs were silently encoded by PyArrow but then not found during set-membership lookup (str != bytes). - postgresql_connector.py: map PostgreSQL uuid type to pa.large_string() instead of pa.binary(16), completing the TODO: PLT-1162 marker in the integration test. psycopg returns uuid.UUID objects; the value coercion now calls str(value) to produce the canonical "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" form. - tests/test_databases/test_postgresql_connector.py: update test_uuid assertion to expect pa.large_string(). - tests/test_databases/test_postgresql_connector_integration.py: remove the # TODO: PLT-1162 comment now that the assertion is fulfilled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add a shared ``coerce_record_id(str | bytes) -> bytes`` helper in ``databases/_utils.py`` and apply it at the entry of every public method that accepts a ``record_id`` parameter: - ``add_record`` / ``get_record_by_id`` / ``get_records_by_ids`` in ``InMemoryArrowDatabase``, ``ConnectorArrowDatabase``, ``DeltaTableDatabase``, and ``NoOpArrowDatabase`` - ``_ensure_record_id_column`` and ``_create_record_id_filter`` in ``DeltaTableDatabase`` The protocol signatures are updated to ``str | bytes`` to reflect that both are accepted. Internally all paths remain ``bytes`` after coercion; the Arrow storage type stays ``pa.large_binary()``. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…l only The str → bytes coercion is a convenience feature of the concrete database implementations, not part of the ArrowDatabaseProtocol contract. The protocol signatures stay bytes-only so that callers coding to the protocol get a clear type contract; the leniency lives in the implementing classes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

No leading underscore; the file is not private to the package, just a shared utility module. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eywalker · 2026-06-12T01:44:03Z

 import logging
 import re
 import sys
+import uuid as _uuid_module


What's the need to name this _uuid_module?

Fixed — removed the alias. Changed to plain import uuid and updated the usage to uuid.UUID(bytes=uuid7().bytes).

Copilot

Pull request overview

Migrates OrcaPod’s UUID handling from string-based Arrow storage toward binary representations, updating semantic type conversion, hashing/caching IDs, and database record ID APIs to work with bytes and fixed-size binary types.

Changes:

Introduces a UUIDStructConverter and registers it in the default semantic registry; updates UUIDHandler to hash UUIDs via raw bytes.
Switches many internal identifiers (system tags, pipeline entry IDs, observers, DB record IDs) from str to bytes and updates tests accordingly.
Expands type conversion support for fixed_size_binary → bytes, and updates multiple DB implementations/connectors to store record IDs in binary columns.

Reviewed changes

Copilot reviewed 46 out of 46 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
tests/test_utils/test_arrow_utils.py	Updates system tag record_id test input to use UUID bytes.
tests/test_semantic_types/test_uuid_struct_converter.py	Adds full unit coverage for `UUIDStructConverter`.
tests/test_semantic_types/test_semantic_registry.py	Verifies UUID converter registration and reverse lookup.
tests/test_hashing/test_uuid_handler.py	Adds tests asserting UUID handler returns 16-byte form.
tests/test_databases/test_postgresql_connector.py	Updates tests around PG type mapping/coercion (UUID mapping still asserted as string).
tests/test_databases/test_postgresql_connector_integration.py	Updates integration expectations (UUID mapping still asserted as string).
tests/test_databases/test_noop_database.py	Updates record_id arguments to bytes.
tests/test_databases/test_in_memory_database.py	Updates record_id arguments/expectations to bytes.
tests/test_databases/test_delta_table_database.py	Updates record_id arguments/expectations to bytes.
tests/test_databases/test_delta_table_database_s3.py	Updates record_id arguments/expectations to bytes.
tests/test_databases/test_connector_arrow_database.py	Updates record_id arguments/expectations to bytes.
tests/test_core/operators/test_operators.py	Adds assertions that system tag schemas preserve bytes vs str types.
tests/test_core/nodes/test_function_node_iteration.py	Updates cached entry-id key type to bytes.
tests/test_core/function_pod/test_function_pod_node.py	Updates data record-id assertions to 16-byte values.
tests/test_core/function_pod/test_function_node_caching.py	Updates pipeline entry-id type assertions to bytes.
tests/test_core/data_function/test_data_function.py	Adjusts source-info UUID segment expectations to hex form.
test-objective/unit/test_databases.py	Updates objective tests to use bytes record IDs and binary Arrow arrays.
test-objective/unit/test_arrow_data_utils.py	Updates system-tag tests to pass UUID bytes as record IDs.
superpowers/specs/2026-06-11-plt-1162-uuid-arrow-type-design.md	Adds design spec for UUID Arrow type migration.
superpowers/plans/2026-06-11-plt-1162-uuid-arrow-type.md	Adds implementation plan/checklist for migration steps.
src/orcapod/utils/arrow_utils.py	Changes system tag record_ids typing to bytes; updates record_id column storage type.
src/orcapod/types.py	Adds method-prefixed raw digest bytes API (`to_prefixed_digest`).
src/orcapod/semantic_types/universal_converter.py	Maps `fixed_size_binary` Arrow types to Python `bytes`.
src/orcapod/semantic_types/semantic_struct_converters.py	Adds `UUIDStructConverter` implementation.
src/orcapod/protocols/database_protocols.py	Updates Arrow database protocol record_id types to bytes.
src/orcapod/protocols/core_protocols/datagrams.py	Changes datagram_id protocol type to `uuid.UUID`.
src/orcapod/pipeline/status_observer.py	Stores `_status_id` as `binary(16)` UUID bytes.
src/orcapod/pipeline/logging_observer.py	Stores `_log_id` as `binary(16)` UUID bytes.
src/orcapod/hashing/versioned_hashers.py	Registers `UUIDStructConverter` in default semantic registry.
src/orcapod/hashing/semantic_hashing/builtin_handlers.py	Updates `UUIDHandler` to return UUID bytes.
src/orcapod/databases/utils.py	Adds helper to coerce `record_id` from str→bytes.
src/orcapod/databases/postgresql_connector.py	Adds value coercion helper; updates batch conversion pipeline; UUID mapping remains string.
src/orcapod/databases/noop_database.py	Broadens record_id parameter types to accept bytes.
src/orcapod/databases/in_memory_databases.py	Stores record IDs in binary Arrow columns; coerces input IDs to bytes.
src/orcapod/databases/delta_lake_databases.py	Stores record IDs in binary Arrow columns; coerces input IDs to bytes.
src/orcapod/databases/connector_arrow_database.py	Stores record IDs in binary Arrow columns; coerces input IDs to bytes.
src/orcapod/core/sources/stream_builder.py	Makes deterministic per-row UUID record IDs and stores bytes in system tags.
src/orcapod/core/sources/cached_source.py	Updates dedup hashing comment to reflect system-tag semantics.
src/orcapod/core/result_cache.py	Reads/writes DB record IDs as UUID bytes; converts back to `uuid.UUID` on lookup.
src/orcapod/core/operators/merge_join.py	Preserves original system-tag Python types (bytes vs str) in predicted schema.
src/orcapod/core/operators/join.py	Preserves original system-tag Python types; updates sort key defaults for bytes record IDs.
src/orcapod/core/nodes/function_node.py	Switches pipeline entry IDs and caches from str→bytes using `to_prefixed_digest()`.
src/orcapod/core/datagrams/tag_data.py	Changes datagram record_id parameter type to `uuid.UUID
src/orcapod/core/datagrams/datagram.py	Normalizes UUIDv7 IDs to stdlib `uuid.UUID` on generation.
src/orcapod/core/data_function.py	Stores record_id as `uuid.UUID` and source_info embeds UUID hex.
src/orcapod/init.py	Minor formatting-only change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

kurodo3 · 2026-06-12T05:33:40Z

    source_id_array = pa.array(source_ids, type=pa.large_string())
-    record_id_array = pa.array(record_ids, type=pa.large_string())
+    record_id_array = pa.array(record_ids, type=pa.large_binary())



Fixed — changed add_system_tag_columns() to use pa.binary(16) for the record_id column. Record IDs in system tags are always UUID v5 bytes (exactly 16 bytes), so pa.binary(16) is the precise type. The docstring and the implementation now agree.

Fixed in commit e0bbc37. add_system_tag_columns() now uses pa.binary(16) for the record_id column — both the field declaration and the pa.array() call — matching the docstring.

kurodo3 · 2026-06-12T05:33:45Z

    table = table.append_column(
-        pa.field(record_id_col_name, pa.large_string(), nullable=False), record_id_array
+        pa.field(record_id_col_name, pa.large_binary(), nullable=False), record_id_array
    )


Fixed — same change as above. add_system_tag_columns() now appends the record_id column as pa.binary(16) (both the field definition and the pa.array() call). The docstring was already correct; the implementation is now consistent with it.

Fixed in commit e0bbc37 (same change as above — both the field and the array call now use pa.binary(16)).

kurodo3 · 2026-06-12T05:33:50Z

+    if t == "uuid":
+        return _pa.large_string()
+


Fixed — reverted _pg_type_to_arrow("uuid") back to pa.binary(16), and updated _coerce_pg_value to extract .bytes from psycopg's uuid.UUID objects instead of calling str(). Unit test (test_uuid) and integration test (test_all_pg_types) updated accordingly.

Fixed in commit e0bbc37. _pg_type_to_arrow("uuid") now returns pa.binary(16).

kurodo3 · 2026-06-12T05:38:01Z

+    # psycopg returns uuid columns as uuid.UUID objects; large_string() needs str
+    if arrow_type == _pa.large_string() and hasattr(value, "hex") and not isinstance(value, str):
+        return str(value)
+    return value


Fixed in commit e0bbc37. _coerce_pg_value() now checks arrow_type == pa.binary(16) and returns value.bytes (raw 16 bytes) instead of str(value).

kurodo3 · 2026-06-12T05:38:05Z

    def test_uuid(self):
-        assert _pg_type_to_arrow("uuid", "uuid") == pa.large_string()
+        result = _pg_type_to_arrow("uuid", "uuid")
+        assert result == pa.large_string()



Fixed in commit e0bbc37. test_uuid now asserts pa.binary(16).

kurodo3 · 2026-06-12T05:38:09Z

        assert infos["g"].arrow_type == pa.large_string()
        assert infos["h"].arrow_type == pa.large_binary()
-        assert infos["i"].arrow_type == pa.large_string()  # TODO: PLT-1162
+        assert infos["i"].arrow_type == pa.large_string()
        assert infos["j"].arrow_type == pa.large_string()


Fixed in commit e0bbc37. The integration test now asserts infos["i"].arrow_type == pa.binary(16) for the UUID column.

kurodo3 · 2026-06-12T05:38:25Z

+    def __init__(self) -> None:
+        super().__init__("uuid")
+        self._python_type = _uuid_module.UUID
+        self._arrow_struct_type = pa.struct([pa.field("uuid", pa.binary(16))])
+


No code change needed — updated the PR description instead. Adding module-level constants to types.py would trigger PyArrow import at types import time, defeating the LazyModule pattern already used there. Each call site keeping its own pa.binary(16) literal is clear and avoids that footgun. The PR description table row for types.py now reflects this.

kurodo3 · 2026-06-12T05:38:33Z

    @property
-    def datagram_id(self) -> str:
+    def datagram_id(self) -> uuid.UUID:
        """Return (or lazily generate) the datagram's unique ID."""
        if self._datagram_id is None:
-            self._datagram_id = str(uuid7())
+            # uuid_utils is used here because it provides UUIDv7 (time-ordered,
+            # monotonic) generation, which is not available in the stdlib before
+            # Python 3.12.  However, uuid_utils.UUID and stdlib uuid.UUID are
+            # distinct types that do not compare equal even for identical bit
+            # patterns.  We therefore normalise to stdlib uuid.UUID immediately
+            # so that the public API always returns a consistent type and
+            # equality / hashing work correctly regardless of how a UUID was
+            # originally produced.
+            self._datagram_id = uuid.UUID(bytes=uuid7().bytes)
        return self._datagram_id


The code is correct — datagram_id intentionally returns uuid.UUID (not bytes), as agreed in a prior review round. The PR description's Breaking Change note was wrong. Updated it to correctly state: datagram_id → uuid.UUID; record_id/_log_id/_status_id → bytes (16 bytes).

…op uuid module alias - Remove `import uuid as _uuid_module` alias in data_function.py — use plain `import uuid` instead - Change system-tag record_id column from pa.large_binary() to pa.binary(16) in arrow_utils.py: record IDs are always UUID v5 bytes (exactly 16 bytes), so a fixed-size type is more precise - Revert PostgreSQL UUID → Arrow mapping from pa.large_string() back to pa.binary(16), and coerce uuid.UUID values via .bytes instead of str() - Update unit and integration test assertions to match pa.binary(16) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 · 2026-06-12T05:34:00Z

Review round 3 — changes made

Three fixes in commit e0bbc37:

import uuid as _uuid_module alias removed (src/orcapod/core/data_function.py)
Plain import uuid now used throughout; call site updated to uuid.UUID(bytes=uuid7().bytes).

System-tag record_id column type changed to pa.binary(16) (src/orcapod/utils/arrow_utils.py)
add_system_tag_columns() now uses pa.binary(16) for both the field declaration and the pa.array() call. Record IDs in system tags are always UUID v5 bytes (exactly 16 bytes), so a fixed-size binary type is precise and prevents out-of-spec values.

PostgreSQL UUID mapping reverted to pa.binary(16) (src/orcapod/databases/postgresql_connector.py)
_pg_type_to_arrow("uuid") returns pa.binary(16) again. _coerce_pg_value extracts .bytes from psycopg's uuid.UUID objects (instead of calling str()). Unit test test_uuid and integration test test_all_pg_types updated to assert pa.binary(16).

All 3616 tests pass (excluding live-postgres integration tests, which require a running instance).

kurodo3 · 2026-06-12T05:38:44Z

Copilot review round — changes made

Commit e0bbc37 addressed comments 1–6 (all were the same class of issue: pa.large_binary() / pa.large_string() where pa.binary(16) should be used):

arrow_utils.py — add_system_tag_columns() record_id column changed from pa.large_binary() to pa.binary(16) (both the field declaration and the pa.array() call).
postgresql_connector.py — _pg_type_to_arrow("uuid") reverted to pa.binary(16); _coerce_pg_value() updated to extract .bytes from psycopg uuid.UUID objects rather than calling str().
tests/test_databases/test_postgresql_connector.py — test_uuid now asserts pa.binary(16).
tests/test_databases/test_postgresql_connector_integration.py — test_all_pg_types UUID column assertion updated to pa.binary(16).

No code change for comments 7–8 — PR description corrections only:

Constants (UUID_ARROW_TYPE / UUID_STRUCT_ARROW_TYPE): Not added to types.py. Adding module-level constants would eagerly trigger PyArrow import at types import time, defeating the LazyModule pattern already used there. Each call site uses pa.binary(16) directly. PR description table updated to reflect this.
datagram_id type: The code is correct — datagram_id returns uuid.UUID (decided in a prior review round). The Breaking Change note in the PR description was wrong; updated to distinguish datagram_id → uuid.UUID from record_id/_log_id/_status_id → bytes.

The property returns a uuid.UUID, so the name datagram_uuid is more descriptive. Renames the public property and protocol method, the private backing field (_datagram_uuid), and all call sites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 Bot and others added 14 commits June 11, 2026 23:45

docs(plans): add UUID Arrow type implementation plan (PLT-1162)

1cca130

feat(types): add UUID_ARROW_TYPE and UUID_STRUCT_ARROW_TYPE constants…

045803c

… (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(types): make UUID_ARROW_TYPE and UUID_STRUCT_ARROW_TYPE truly laz…

eaa27c2

…y via module __getattr__ (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(hashing): UUIDHandler returns bytes instead of str, consistent wi…

503b730

…th binary Arrow storage (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(hashing): fix stale UUIDHandler docstring; add missing test docs…

7b3101f

…tring (PLT-1162)

fix(core): correct stale comment and rid_val bytes fallback after UUI…

e391011

…D binary migration (PLT-1162) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(types): add PLT-1615 cross-reference in UUID Arrow type spec (PL…

0c922b7

…T-1162)

kurodo3 Bot force-pushed the eywalker/plt-1162-design-spike-uuid-arrow-type-mapping-large_string-vs branch from 414d87e to 0c922b7 Compare June 11, 2026 23:47

eywalker requested changes Jun 11, 2026

View reviewed changes

eywalker requested changes Jun 12, 2026

View reviewed changes

kurodo3 Bot and others added 4 commits June 12, 2026 01:53

refactor(databases): rename _utils.py → utils.py

246e5d5

No leading underscore; the file is not private to the package, just a shared utility module. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eywalker requested a review from Copilot June 12, 2026 05:09

Copilot started reviewing on behalf of eywalker June 12, 2026 05:09 View session

eywalker requested changes Jun 12, 2026

View reviewed changes

Copilot AI reviewed Jun 12, 2026

View reviewed changes



		def __getattr__(name: str) -> object:
		"""Lazy resolution for module-level constants that depend on pyarrow.

		return f"row_{row_index}"


		def _make_record_id_bytes(source_id: str, provenance_token: str) -> bytes:

Conversation

kurodo3 Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Notes

Uh oh!

codecov Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

eywalker left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot commented Jun 12, 2026

Review round 1 — addressed in commit 13a1d5a

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot commented Jun 11, 2026 •

edited

Loading

codecov Bot commented Jun 11, 2026 •

edited

Loading

Review round 1 — addressed in commit `13a1d5a`

C1 — `uuid_utils` vs stdlib UUID (`datagram.py`)

C2 — `col_type` propagation through `_predict_system_tag_schema` (`join.py`)

C3 — `record_id` in dedup hash (`cached_source.py`)

C4 — Namespace URL / capitalisation (`stream_builder.py`)

C5 — `_make_record_id` return type (`stream_builder.py`)

C6 / C7 — Drop `str` from `connector_arrow_database.py`

C8 — Drop `large_string` from `delta_lake_databases.py`