Adds lossless corpus tooling + bead.interop.layers (0.6.0) by aaronstevenwhite · Pull Request #7 · FACTSlab/bead

aaronstevenwhite · 2026-05-29T17:21:28Z

Description

Releases 0.6.0, adding lossless corpus tooling and a complete, law-verified
mapping between bead and the layers
linguistic-annotation schema.

- Streaming corpus ingestion (bead.corpus) is now lossless by default:
  JsonlCorpusSource/CsvCorpusSource retain every field and round-trip
  non-scalar values through JSON, so nothing is dropped at ingestion.
- Buffering graph tier (bead.corpus.graph, bead.corpus.assemble): an
  opt-in CorpusGraph typed directed multidigraph plus assemble_graph, which
  reconstructs thread structure (e.g. Reddit reply trees from parent_id) on
  top of the untouched streaming pipeline.
- Lossless layers interop (bead.interop.layers): faithful mirror models
  for the layers shared defs and record types with a generic lossless
  MirrorIso, plus bridge lenses (CorpusRecord to expression, CorpusGraph
  to a property graph, ParsedSentence to tokenization + annotation layers,
  and resource-overlap lenses). Mappings are didactic dx.Iso/dx.Lens so
  every round-trip is exact and law-verified.
- Lexicon validation: every mapping's output is validated against the layers
  lexicons (vendored as the vendor/layers git submodule) using the ATProto
  lexicon validator (@atproto/lexicon), proving the mappings produce
  schema-valid layers.

See CHANGELOG.md for the full 0.6.0 entry.

Motivation

bead needs to ingest raw corpora without discarding source structure and to
interoperate losslessly with the layers annotation schema. The mapping is
expressed as verified lenses and checked against the canonical layers lexicons
so schema drift surfaces as a failing test.

Fixes #

Type of Change

New feature (non-breaking change that adds functionality)
Documentation update
Tests (adding or updating tests)

Checklist

I have read the CONTRIBUTING guidelines
My code follows the project's style guidelines
I have run uv run ruff check . and uv run ruff format .
I have run uv run pyright with no errors
I have added tests that prove my fix/feature works
All tests pass (uv run pytest tests/)
I have updated documentation as needed

Testing

- uv run pytest (full suite green; the one known spaCy-on-3.14 case skips).
- uv run pyright reports 0 errors / 0 warnings; uv run ruff check clean.
- Every layers mapping round-trips under the didactic lens laws and validates
  against the vendored layers lexicons via @atproto/lexicon.
- CI's Python test job checks out submodules so lexicon validation runs there.

panproto is a transitive dependency of didactic (never imported directly in bead); didactic 0.7.2 requires panproto>=0.48.3. Both are compatible with the existing requires-python >=3.14. The full suite passes (3330) with no source changes; the narrow didactic.api (dx) surface is unaffected by the 0.6 -> 0.7 jump. Callable field types remain unsupported in 0.7.2.

Phase 1 of corpus integration. Adds bead/tokenization/parsers.py: ParsedToken/ParsedSentence models, SpacyParser/StanzaParser/create_parser, and parse_to_spans, which projects a dependency parse onto Span + SpanRelation (one single-token Span per token carrying head_index, and one directed head -> dependent SpanRelation per arc labeled with the deprel). Field placement is aligned with the layers annotation model (formalism, tool, tokenization_id, upos/xpos/lemma/deprel/morph, char offsets) so a parse stored on an Item maps losslessly to a layers dependency AnnotationLayer. Adds STRUCTURE_FUNCTIONS to the DSL stdlib (upos, deprel, head, dependents, has_relation, root, subtree, path_to_root, tokens_with_*, any_deprel, filter_upos), enabling structural constraints like 'upos(self, root(self)) == "VERB" and len(dependents(self, root(self), "obj")) > 0' with no grammar change. Widens the DSLEvaluator.evaluate context to Mapping and the DslFunction return alias to include list[int]. Lifts the shared spaCy/Stanza space_after extraction to a single canonical site reused by both the tokenizers and the new parsers (no redundancy). Includes a layers no-drop smoke test asserting that every field a layers dependency annotation needs is reconstructable from a parsed Item.

Phase 2 of corpus integration. Adds bead/corpus/: CorpusRecord (provenance keyed to the layers AnnotationMetadata shape), a CorpusSource Protocol, and lazy sources JsonlCorpusSource (plain + Zstandard .zst) and CsvCorpusSource. The pipeline (parse_records, filter_by_structure, sample_corpus, record_to_item) streams records, dependency-parses them, and keeps only those whose parse satisfies a structural DSL constraint, producing Items with layers-aligned spans/relations/provenance. Fully lazy so multi-gigabyte corpora never load into memory. Lifts a shared iter_jsonl_lines helper into bead/data/serialization.py, reused by read_jsonlines, stream_jsonlines, and JsonlCorpusSource (the only corpus-specific addition is a decompressing open_fn). Adds a DependencyParser Protocol with a tool identifier on SpacyParser/StanzaParser for provenance. Adds the 'corpus' optional-dependency extra (zstandard) and zstandard to dev.
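The lazy streaming idea can be sketched with generators alone. The names iter_jsonl, keep_matching, and fake_open below are hypothetical and only illustrate the pattern of a line iterator parameterized by an open function (where a decompressing opener could be swapped in); they are not the signatures of bead's iter_jsonl_lines or pipeline functions.

```python
import io
import json
from collections.abc import Callable, Iterable, Iterator
from typing import IO


def iter_jsonl(path: str, open_fn: Callable[[str], IO[str]]) -> Iterator[dict]:
    """Yield one parsed object per non-blank line without reading the whole file."""
    with open_fn(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def keep_matching(records: Iterable[dict], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Lazily filter a record stream; nothing is consumed until iteration."""
    return (record for record in records if predicate(record))


def fake_open(_path: str) -> IO[str]:
    # In-memory stand-in for a file so the sketch runs without fixtures.
    return io.StringIO('{"text": "Dogs chase cats."}\n\n{"text": "Hello."}\n')


stream = keep_matching(
    iter_jsonl("corpus.jsonl", fake_open),
    lambda record: len(record["text"].split()) >= 3,
)
kept = list(stream)  # the file is only read here, one line at a time
```

Because every stage is a generator, memory use is bounded by a single record rather than by the corpus size.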

Phase 3 of corpus integration. Adds a TextGenerator Protocol to the adapter base and generate_completion to the OpenAI and Anthropic adapters (reusing their existing authenticated clients, no parallel client). Adds CompletionCorpusSource, which treats any TextGenerator as a corpus source, recording the model and prompt as layers-aligned provenance. Adds MarkdownStripTransform and RedditCleanupTransform (SpanTextTransform callables, registered in the default registry) and split_sentences (parser-backed when a spacy/stanza config is given, regex fallback otherwise). RedditCleanupTransform reuses MarkdownStripTransform rather than duplicating markup stripping.
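For a sense of what a regex fallback for sentence splitting can look like, here is a minimal sketch. The function name split_sentences_regex and the boundary pattern are assumptions for illustration; the commit does not state which pattern bead's fallback uses.

```python
import re

# Split on whitespace that follows sentence-final punctuation (sketch assumption).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences_regex(text: str) -> list[str]:
    """Naive splitter: good enough without a parser, wrong on abbreviations."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


sentences = split_sentences_regex("I agree. Does anyone else? Thanks!  ")
```

A pattern like this will split after abbreviations such as "Dr.", which is one reason a parser-backed path is preferable when a model is available.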

Stanza is an installed tokenization/dev dependency and the English model (tokenize,pos,lemma,depparse) is available, so these tests run for real. The guard now skips only if the model genuinely cannot be downloaded (no network); once present, parse and projection errors surface as failures rather than being swallowed by a broad skip. Adds an end-to-end pipeline test that runs a real StanzaParser through sample_corpus and asserts only transitive sentences are kept.

…code Replaces lazy optional-dependency imports (spaCy, Stanza, zstandard) with importlib.import_module so the import-outside-top-level lint no longer needs a noqa and the messy/partial third-party stubs no longer force type: ignore. Adds docstrings to structural-typing Protocol stubs instead of suppressing the docstring lint, and makes the spaCy token Protocol read-only so a real spaCy Token satisfies it. Moves the core pandas import to module top and the internal create_parser import out of split_sentences. Replaces the object/Any-typed corpus scalar coercion with a precise recursive JSON value type, and widens DSLEvaluator.evaluate to accept a Mapping. Also clears pre-existing dead type: ignore comments in the DSL evaluator and the adapter base. pyright (strict) and ruff both pass clean with no suppressions anywhere in the changed code.
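The optional-dependency pattern described here can be sketched as follows. The helper name load_optional and the error wording are hypothetical; only importlib.import_module itself is taken from the commit message.

```python
import importlib
from types import ModuleType


def load_optional(name: str, extra: str) -> ModuleType:
    """Import a module at call time, pointing at the extra that provides it."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{name!r} is required for this feature; install the {extra!r} extra"
        ) from exc


# A stdlib module stands in for an installed optional dependency.
json_module = load_optional("json", "none")

try:
    load_optional("definitely_not_installed_pkg_xyz", "corpus")
except ImportError as exc:
    message = str(exc)
```

Since the import is an ordinary function call rather than an import statement inside a function body, there is no import-outside-top-level finding to suppress.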

…ransforms Adds API reference pages for bead.corpus and bead.transforms, a Dependency Parsing section to the tokenization reference, and a structural-query note to the DSL reference. Adds a Corpus Ingestion user guide with end-to-end examples (sources, structural sampling, text cleanup, generated corpora) and wires all new pages into the mkdocs nav. Documents the corpus and tokenization extras in the installation guide and records the new functionality plus the didactic/panproto version bumps in the changelog.

Replaces every Any in bead/dsl/evaluator.py with a recursive DslValue union (scalars, collections, bead models, JsonValue). The operator dispatch now narrows operands before ordering/arithmetic/membership/subscript instead of relying on a broad Any plus try/except TypeError, so the evaluator type-checks cleanly even with the dsl/ pyright exclude lifted (verified) and gives clearer EvaluationError messages. DSLEvaluator.evaluate now takes Mapping[str, DslValue] and returns DslValue; the checked callers (corpus pipeline, list partitioner, template resolver) consume it unchanged. Updates the operator type-error test to the clearer message. Makes the corpus user-guide source/cleanup examples execute against new fixtures, and extends the api-docs code-block test to skip examples needing an optional NLP parser model or model API (mirroring the existing glazing-data skip). pyright (strict) and ruff pass with no warnings and no suppressions anywhere in the changed code.
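Narrowing operands before an ordering comparison, instead of catching TypeError, might look like the sketch below. The JsonValue alias, the EvaluationError class, and less_than are hypothetical reductions for illustration; the real DslValue union also covers collections and bead models.

```python
from typing import Union

# Recursive value type: forward references let the alias mention itself.
JsonValue = Union[None, bool, int, float, str, list["JsonValue"], dict[str, "JsonValue"]]


class EvaluationError(Exception):
    """Hypothetical error type for this sketch."""


def _is_number(value: JsonValue) -> bool:
    # bool is a subclass of int, so exclude it before numeric ordering.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def less_than(left: JsonValue, right: JsonValue) -> bool:
    """Order two operands only after checking that ordering is meaningful."""
    if _is_number(left) and _is_number(right):
        return left < right
    if isinstance(left, str) and isinstance(right, str):
        return left < right
    raise EvaluationError(
        f"cannot order {type(left).__name__} and {type(right).__name__}"
    )
```

The explicit branches are what let a type checker see the operand types at each comparison, and they produce an error message that names the offending types.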

Drops an unnecessary noqa: ANN202 in a DSL test (tests already ignore ANN) and hoists a pre-existing lazy shutil import to module top in the api-docs test, leaving no type/lint suppressions anywhere in the changed code.

JsonlCorpusSource and CsvCorpusSource now retain ALL source fields by default (provenance_fields/columns=None keeps every field except the text field), so nothing - including Reddit thread edges parent_id/link_id - is silently dropped; an explicit tuple still selects a subset. Non-scalar values are JSON-serialized (json.dumps) rather than str()-ified, so they round-trip via json.loads. This guarantees corpus structure is recoverable downstream even on the fast streaming path.
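The difference between str() and json.dumps for non-scalar values is easy to demonstrate. In this sketch, encode_field and collect_fields are hypothetical helpers that mimic the described behavior (keep every field except the text field unless a subset is given); they are not bead's implementation.

```python
import json

SCALARS = (str, int, float, bool, type(None))


def encode_field(value):
    """Keep scalars as-is; serialize anything else so json.loads can restore it."""
    return value if isinstance(value, SCALARS) else json.dumps(value)


def collect_fields(row: dict, text_field: str, keep=None) -> dict:
    """keep=None retains every field except the text; a tuple selects a subset."""
    names = [k for k in row if k != text_field] if keep is None else list(keep)
    return {k: encode_field(row[k]) for k in names if k in row}


row = {"body": "hi", "parent_id": "t1_abc", "link_id": "t3_xyz", "awards": [{"n": 1}]}
fields = collect_fields(row, text_field="body")
restored = json.loads(fields["awards"])

# str() produces a Python repr with single quotes, which is not valid JSON.
try:
    json.loads(str(row["awards"]))
    str_round_trips = True
except json.JSONDecodeError:
    str_round_trips = False
```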

On top of the streaming sources, adds bead/corpus/graph.py (CorpusNode, CorpusEdge, CorpusGraph) - a directed, typed multigraph over expressions with traversal helpers (out/in_edges, successors/predecessors, roots, descendants, reverse). Reddit reply trees are the single-edge-type special case; arbitrary typed relations between expressions are the general case. bead/corpus/assemble.py adds EdgeSpec (declarative field-to-edge rule with prefix stripping for Reddit fullnames) and assemble_graph, which buffers a record stream and reconstructs the graph from EdgeSpecs and/or a runtime edge_fn. Dangling edge targets are preserved, not dropped. The model is aligned with layers' graphNode/graphEdgeSet for lossless mapping (next phase).
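A buffering assembly step with prefix stripping and preserved dangling targets can be sketched in a few lines. The Edge and Graph dataclasses, the assemble and strip_prefix functions, and the "reply_to" edge kind are all hypothetical; they only illustrate the idea and do not reproduce the CorpusGraph, EdgeSpec, or assemble_graph APIs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str


@dataclass
class Graph:
    nodes: dict[str, dict] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)


def strip_prefix(value: str, prefixes: tuple[str, ...]) -> str:
    """Drop a Reddit-style fullname prefix such as 't1_' if one is present."""
    for prefix in prefixes:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def assemble(records, id_field, edge_field, kind, prefixes=()):
    graph = Graph()
    buffered = list(records)  # buffering tier: hold the stream before linking
    for record in buffered:
        graph.nodes[record[id_field]] = record
    for record in buffered:
        raw = record.get(edge_field)
        if raw is None:
            continue
        # The edge is kept even when its target never appeared in the stream.
        graph.edges.append(Edge(record[id_field], strip_prefix(raw, prefixes), kind))
    return graph


records = [
    {"id": "a", "parent_id": "t3_post"},  # parent post is not in the stream
    {"id": "b", "parent_id": "t1_a"},
    {"id": "c", "parent_id": "t1_a"},
]
graph = assemble(records, "id", "parent_id", "reply_to", prefixes=("t1_", "t3_"))
dangling = [e.target for e in graph.edges if e.target not in graph.nodes]
children_of_a = [e.source for e in graph.edges if e.target == "a"]
```

Keeping the dangling edge means a partial dump of a thread still records which missing post each top-level comment answered.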

First bead<->layers interop lens, establishing the law-verified template. The CorpusGraph lens projects to a faithful, standalone layers view (expression records, graph nodes, a graphEdgeSet of typed objectRef edges) and keeps a complement holding what layers' graph cannot express (bead framework identity, edge directedness, exact float confidence). Together they reconstruct the graph exactly: the didactic GetPut/PutGet laws hold. Adds bead/interop/layers/_convert.py with the shared, reversible conversions (featureMap with insertion-order + tuple preservation, objectRef, identity capture/restore via .with_, typed JsonValue accessors). Adds hypothesis (dev) and rigorous tests: deterministic round-trips over Reddit threads, abstract typed multidigraphs, and provenance-bearing expressions, plus a property-based check of the GetPut law over generated graphs. pyright strict + ruff clean; no Any/object/ignores.

A dx.Lens projecting a CorpusRecord to a faithful layers expression view (kind/text/features) with the bead-only remainder (identity, source_name, record_index) in the complement. GetPut/PutGet verified by example and property tests.
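The view-plus-complement shape and the two round-trip laws can be illustrated without didactic. Everything below (Record, ExpressionView, Complement, get, put, and the fixed kind value "document") is a hypothetical plain-Python sketch, not the dx.Lens API, and the law statements follow this sketch's own formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:  # hypothetical stand-in for a corpus record
    id: str
    source_name: str
    record_index: int
    text: str
    features: tuple[tuple[str, str], ...]


@dataclass(frozen=True)
class ExpressionView:  # what the target schema can express
    kind: str
    text: str
    features: tuple[tuple[str, str], ...]


@dataclass(frozen=True)
class Complement:  # what it cannot express, carried alongside the view
    id: str
    source_name: str
    record_index: int


def get(record: Record) -> tuple[ExpressionView, Complement]:
    view = ExpressionView("document", record.text, record.features)
    return view, Complement(record.id, record.source_name, record.record_index)


def put(view: ExpressionView, complement: Complement) -> Record:
    return Record(
        complement.id,
        complement.source_name,
        complement.record_index,
        view.text,
        view.features,
    )


record = Record("r1", "reddit", 7, "Dogs chase cats.", (("parent_id", "t1_abc"),))
view, complement = get(record)

# GetPut: putting back what was just read reconstructs the source exactly.
get_put_holds = put(view, complement) == record

# PutGet: reading after a write returns exactly what was written.
edited = ExpressionView(view.kind, "Cats chase dogs.", view.features)
put_get_holds = get(put(edited, complement)) == (edited, complement)
```

In this sketch PutGet only holds for views whose kind is "document", since get always emits that value; a property-based test would generate views under the same restriction.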

A true dx.Iso (ParsedToken/ParsedSentence carry no framework identity) mapping a dependency parse to a layers tokenization plus a part-of-speech token-tag layer and a dependency relation layer (root encoded as headIndex -1, morph in the pos features). Round-trip verified by example and property tests; makes the Phase 1 parse/layers alignment executable. Also quiets the hypothesis norecursedirs warning.

…defs Mirrors all 29 pub.layers.defs object definitions as didactic models (anchor union, temporal/spatial expressions, token/text/page/external anchors, knowledgeRef, objectRef, agentRef, alignmentLink, annotationMetadata, constraint, feature map, etc.), structurally faithful to layers so a single generic snake<->camel conversion (bead/interop/layers/_mirror.py) serializes any of them to and from layers JSON losslessly. MirrorIso[T] wraps that as a didactic dx.Iso; SHARED_DEF_ISOS registers one per construct. Tests round-trip every shared def (GetPut + PutGet), guard coverage (every construct has an iso), and verify the GetPut/PutGet laws via didactic's verify_iso on representative flat models. pyright strict + ruff clean; no Any/object/ignores.
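A generic key-renaming conversion of the kind described can be sketched as below. The functions to_camel, to_snake, and convert_keys are hypothetical and deliberately simple: they rename every dictionary key and assume lowercase words without digit-only segments, so this is not a description of what _mirror.py does with free-form keys.

```python
import re


def to_camel(name: str) -> str:
    """byte_start -> byteStart"""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def to_snake(name: str) -> str:
    """byteStart -> byte_start"""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def convert_keys(value, rename):
    """Recursively rename dict keys, leaving values and list order untouched."""
    if isinstance(value, dict):
        return {rename(k): convert_keys(v, rename) for k, v in value.items()}
    if isinstance(value, list):
        return [convert_keys(v, rename) for v in value]
    return value


model = {"text_span": {"byte_start": 0, "byte_end": 4}, "head_index": -1}
wire = convert_keys(model, to_camel)
back = convert_keys(wire, to_snake)
```

Because the renaming is purely structural, one pair of functions covers every model whose field names follow the same convention, which is what makes a single generic iso feasible.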

Mirrors expression, segmentation (token/tokenization), annotation (annotation/argumentRef/cluster), the polymorphic annotationLayer, the property graph (graphNode/graphEdge/graphEdgeSet/graphEdgeEntry), media descriptors (audio/video/document info), and ontology (roleSlot/typeDef) - reusing the shared-def mirrors. The generic MirrorIso serializes each to/from layers JSON losslessly. RECORD_ISOS / ALL_MIRROR_ISOS register them; tests round-trip every record type and guard coverage. pyright strict + ruff clean.

Adds a Layers Interoperability user guide (executable round-trip examples for the corpus graph, dependency parse, and mirror models), a bead.interop API reference page, and a thread/graph reconstruction + losslessness section in the corpus guide; wires both into the mkdocs nav. Adds a coverage test asserting every targeted layers construct has a registered, law-passing mirror iso.

…ayers) Maps bead's existing resource models to layers resource records via dx.Lens: LexicalItem <-> entry, Lexicon <-> collection (+ entries), Template <-> template (slots + DSL constraints). Faithful layers views; the bead-only remainder (framework identity, single language code, tags, DSL constraint context, the bead form/source fields) rides in the lens complement, so the round-trip is exact (GetPut/PutGet, tested). Per a feasibility review, the divergent experiment overlaps (judgment, corpus, persona, changelog) are intentionally not mapped - documented in the package docstring - rather than forced into low-value lenses.

Removes development-note phrasing (comparisons between overlaps, feasibility narration, what was deliberately not built) from the resource_lens, package, _mirror, and graph_lens docstrings. They now describe what each module maps and how, in the present tense, without referencing the development process.

Adds a test suite that runs every layers mapping's output through the ATProto lexicon validator (@atproto/lexicon) and asserts each record validates against its layers lexicon, proving the mappings produce schema-valid layers. Vendors the layers lexicons as the vendor/layers git submodule (layers-pub/layers, shallow, tracking main) so they update with git submodule update --remote. CI's Python test job now checks out submodules so validation runs rather than skips. Fixes conformance bugs the validator surfaced:
- parse token textSpan now emits the required byteStart/byteEnd (UTF-8 byte offsets) alongside the optional char offsets.
- externalTarget.selector serializes as an ATProto $type union member instead of a wrapper object, and round-trips back.
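Character offsets and UTF-8 byte offsets diverge as soon as the text contains a multi-byte character. The helper char_to_byte_span below is a hypothetical sketch of the conversion, not the code in the textSpan fix.

```python
def char_to_byte_span(text: str, char_start: int, char_end: int) -> tuple[int, int]:
    """Convert a half-open character span to a half-open UTF-8 byte span."""
    byte_start = len(text[:char_start].encode("utf-8"))
    byte_end = byte_start + len(text[char_start:char_end].encode("utf-8"))
    return byte_start, byte_end


text = "café au lait"
span = char_to_byte_span(text, 5, 7)  # characters 5..7 are "au"
recovered = text.encode("utf-8")[span[0]:span[1]].decode("utf-8")
```

Here "é" occupies two bytes, so the token that starts at character 5 starts at byte 6.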

Releases 0.6.0: streaming/buffering corpus tiers, the bead.interop.layers lossless layers interop, and lexicon validation. Documents the buffering graph tier, the layers interop subpackage, and the lossless-by-default streaming change, which the changelog had not yet recorded.

- Adds hypothesis to [project.optional-dependencies].dev so the pip-based CI install (`-e .[dev,...]`) provides it; the interop tests import it at module scope. It was previously only in [dependency-groups].
- Applies `ruff format` across the tree and excludes the vendor submodule from ruff, so the Format check passes.

aaronstevenwhite added 23 commits May 28, 2026 21:05

Removes redundant test suppressions

07f7138

Adds CorpusRecord <-> layers expression bridge lens

785ee7a

Updates uv.lock for 0.6.0 version bump

6425481

aaronstevenwhite merged commit e9e14ef into main May 29, 2026
8 checks passed

aaronstevenwhite merged 23 commits into main from feat/corpus-ingestion
