Adds lossless corpus tooling + bead.interop.layers (0.6.0) #7
Merged
panproto is a transitive dependency of didactic (never imported directly in bead); didactic 0.7.2 requires panproto>=0.48.3. Both are compatible with the existing requires-python >=3.14. The full suite passes (3330) with no source changes; the narrow didactic.api (dx) surface is unaffected by the 0.6 -> 0.7 jump. Callable field types remain unsupported in 0.7.2.
Phase 1 of corpus integration. Adds bead/tokenization/parsers.py: ParsedToken/ParsedSentence models, SpacyParser/StanzaParser/create_parser, and parse_to_spans which projects a dependency parse onto Span + SpanRelation (one single-token Span per token carrying head_index, and one directed head -> dependent SpanRelation per arc labeled with the deprel). Field placement is aligned with the layers annotation model (formalism, tool, tokenization_id, upos/xpos/lemma/deprel/morph, char offsets) so a parse stored on an Item maps losslessly to a layers dependency AnnotationLayer. Adds STRUCTURE_FUNCTIONS to the DSL stdlib (upos, deprel, head, dependents, has_relation, root, subtree, path_to_root, tokens_with_*, any_deprel, filter_upos), enabling structural constraints like 'upos(self, root(self)) == "VERB" and len(dependents(self, root(self), "obj")) > 0' with no grammar change. Widens DSLEvaluator.evaluate context to Mapping and the DslFunction return alias to include list[int]. Lifts the shared spaCy/Stanza space_after extraction to a single canonical site reused by both the tokenizers and the new parsers (no redundancy). Includes a layers no-drop smoke test asserting every field a layers dependency annotation needs is reconstructable from a parsed Item.
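To make the structural constraint quoted above concrete, here is a minimal sketch in plain Python. It does not use bead's API: the `Token` class and the `root`, `dependents`, and `is_transitive` helpers are hypothetical stand-ins, and representing the root with `head_index=None` is an assumption made only for this illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a parsed token (not bead's ParsedToken)."""

    index: int
    text: str
    upos: str
    head_index: int | None  # None marks the root in this sketch
    deprel: str


def root(tokens: list[Token]) -> int:
    """Return the index of the first token that has no head."""
    return next(t.index for t in tokens if t.head_index is None)


def dependents(tokens: list[Token], head: int, deprel: str | None = None) -> list[int]:
    """Return indices of tokens attached to `head`, optionally filtered by relation."""
    return [
        t.index
        for t in tokens
        if t.head_index == head and (deprel is None or t.deprel == deprel)
    ]


def is_transitive(tokens: list[Token]) -> bool:
    """Mirror the quoted constraint: the root is a VERB with at least one `obj`."""
    r = root(tokens)
    return tokens[r].upos == "VERB" and len(dependents(tokens, r, "obj")) > 0


# "She reads books": "reads" is the root and "books" is its object.
transitive = [
    Token(0, "She", "PRON", 1, "nsubj"),
    Token(1, "reads", "VERB", None, "root"),
    Token(2, "books", "NOUN", 1, "obj"),
]
# "She sleeps": the root verb has no object.
intransitive = [
    Token(0, "She", "PRON", 1, "nsubj"),
    Token(1, "sleeps", "VERB", None, "root"),
]
```

With these definitions, `is_transitive(transitive)` holds and `is_transitive(intransitive)` does not, which is the kind of filter the DSL expression expresses declaratively.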
Phase 2 of corpus integration. Adds bead/corpus/: CorpusRecord (provenance keyed to the layers AnnotationMetadata shape), a CorpusSource Protocol, and lazy sources JsonlCorpusSource (plain + Zstandard .zst) and CsvCorpusSource. The pipeline (parse_records, filter_by_structure, sample_corpus, record_to_item) streams records, dependency-parses them, and keeps only those whose parse satisfies a structural DSL constraint, producing Items with layers-aligned spans/relations/provenance. Fully lazy so multi-gigabyte corpora never load into memory. Lifts a shared iter_jsonl_lines helper into bead/data/serialization.py, reused by read_jsonlines, stream_jsonlines, and JsonlCorpusSource (the only corpus-specific addition is a decompressing open_fn). Adds a DependencyParser Protocol with a tool identifier on SpacyParser/StanzaParser for provenance. Adds the 'corpus' optional-dependency extra (zstandard) and zstandard to dev.
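The lazy-streaming idea with a pluggable opener can be sketched as follows. This is an illustration only: `iter_jsonl`, `keep_matching`, and `fake_open` are hypothetical names, not the signatures of bead's `iter_jsonl_lines` or `filter_by_structure`, and an in-memory `StringIO` stands in for a (possibly compressed) file.

```python
from __future__ import annotations

import io
import json
from collections.abc import Callable, Iterator
from typing import TextIO


def iter_jsonl(path: str, open_fn: Callable[[str], TextIO]) -> Iterator[dict]:
    """Yield one decoded record per non-blank line without loading the whole file."""
    with open_fn(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def keep_matching(
    records: Iterator[dict], predicate: Callable[[dict], bool]
) -> Iterator[dict]:
    """Lazily keep only the records that satisfy `predicate`."""
    return (r for r in records if predicate(r))


# In-memory stand-in for a file; a decompressing opener would slot in the same way.
_FAKE_FILES = {"demo.jsonl": '{"text": "She reads books"}\n\n{"text": "Hi"}\n'}


def fake_open(path: str) -> TextIO:
    return io.StringIO(_FAKE_FILES[path])


long_texts = list(
    keep_matching(iter_jsonl("demo.jsonl", fake_open), lambda r: len(r["text"]) > 5)
)
```

Because both stages are generators, nothing is read until the final `list(...)` consumes the stream, so memory use depends on the records kept rather than on the file size.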
Phase 3 of corpus integration. Adds a TextGenerator Protocol to the adapter base and generate_completion to the OpenAI and Anthropic adapters (reusing their existing authenticated clients, no parallel client). Adds CompletionCorpusSource, which treats any TextGenerator as a corpus source, recording the model and prompt as layers-aligned provenance. Adds MarkdownStripTransform and RedditCleanupTransform (SpanTextTransform callables, registered in the default registry) and split_sentences (parser-backed when a spacy/stanza config is given, regex fallback otherwise). RedditCleanupTransform reuses MarkdownStripTransform rather than duplicating markup stripping.
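A structural Protocol lets any object with the right method act as a generated-corpus source. The sketch below assumes a one-argument `generate_completion(prompt) -> str` signature, which the commit message does not specify; `EchoGenerator` and `completions_as_records` are hypothetical and exist only to keep the example offline.

```python
from __future__ import annotations

from collections.abc import Iterator
from dataclasses import dataclass
from typing import Protocol


class TextGenerator(Protocol):
    """Structural type: anything with a matching generate_completion method."""

    def generate_completion(self, prompt: str) -> str: ...


@dataclass
class EchoGenerator:
    """Hypothetical offline generator used only for this illustration."""

    model: str = "echo-1"

    def generate_completion(self, prompt: str) -> str:
        return f"Echo: {prompt}"


def completions_as_records(
    generator: TextGenerator, model: str, prompts: list[str]
) -> Iterator[dict[str, str]]:
    """Yield one record per prompt, keeping the model and prompt as provenance."""
    for prompt in prompts:
        yield {
            "text": generator.generate_completion(prompt),
            "model": model,
            "prompt": prompt,
        }


records = list(
    completions_as_records(EchoGenerator(), "echo-1", ["Write a sentence."])
)
```

`EchoGenerator` never inherits from `TextGenerator`; it satisfies the Protocol purely by shape, which is what allows existing adapters to be reused as sources without a parallel class hierarchy.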
Stanza is an installed tokenization/dev dependency and the English model (tokenize,pos,lemma,depparse) is available, so these tests run for real. The guard now skips only if the model genuinely cannot be downloaded (no network); once present, parse and projection errors surface as failures rather than being swallowed by a broad skip. Adds an end-to-end pipeline test that runs a real StanzaParser through sample_corpus and asserts only transitive sentences are kept.
…code Replaces lazy optional-dependency imports (spaCy, Stanza, zstandard) with importlib.import_module so the import-outside-top-level lint no longer needs a noqa and the messy/partial third-party stubs no longer force type: ignore. Adds docstrings to structural-typing Protocol stubs instead of suppressing the docstring lint, and makes the spaCy token Protocol read-only so a real spaCy Token satisfies it. Moves the core pandas import to module top and the internal create_parser import out of split_sentences. Replaces the object/Any-typed corpus scalar coercion with a precise recursive JSON value type, and widens DSLEvaluator.evaluate to accept a Mapping. Also clears pre-existing dead type: ignore comments in the DSL evaluator and the adapter base. pyright (strict) and ruff both pass clean with no suppressions anywhere in the changed code.
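The optional-dependency pattern described above can be sketched with `importlib.import_module`. The helper name `load_optional` and the error wording are assumptions for illustration; a standard-library module stands in for spaCy, Stanza, or zstandard so the example runs anywhere.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency at call time, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install the {extra!r} extra to enable it."
        ) from exc


# A stdlib module stands in for a real optional dependency in this sketch.
json_module = load_optional("json", extra="corpus")
```

Since the import is an ordinary function call, there is no import statement inside a function body for a linter to flag, and the return type is a plain `ModuleType` rather than whatever partial stubs the third-party package ships.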
…ransforms Adds API reference pages for bead.corpus and bead.transforms, a Dependency Parsing section to the tokenization reference, and a structural-query note to the DSL reference. Adds a Corpus Ingestion user guide with end-to-end examples (sources, structural sampling, text cleanup, generated corpora) and wires all new pages into the mkdocs nav. Documents the corpus and tokenization extras in the installation guide and records the new functionality plus the didactic/panproto version bumps in the changelog.
Replaces every Any in bead/dsl/evaluator.py with a recursive DslValue union (scalars, collections, bead models, JsonValue). The operator dispatch now narrows operands before ordering/arithmetic/membership/subscript instead of relying on a broad Any plus try/except TypeError, so the evaluator type-checks cleanly even with the dsl/ pyright exclude lifted (verified) and gives clearer EvaluationError messages. DSLEvaluator.evaluate now takes Mapping[str, DslValue] and returns DslValue; the checked callers (corpus pipeline, list partitioner, template resolver) consume it unchanged. Updates the operator type-error test to the clearer message. Makes the corpus user-guide source/cleanup examples execute against new fixtures, and extends the api-docs code-block test to skip examples needing an optional NLP parser model or model API (mirroring the existing glazing-data skip). pyright (strict) and ruff pass with no warnings and no suppressions anywhere in the changed code.
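Narrowing operands before applying an operator, instead of catching `TypeError` afterwards, might look like the sketch below. The `Value` alias is a deliberately small, hypothetical stand-in for the `DslValue` union, and the error message format is an assumption, not the evaluator's actual text.

```python
from __future__ import annotations

from typing import Union

# Hypothetical and much smaller than the DslValue union the commit describes.
Value = Union[None, bool, int, float, str, list["Value"], dict[str, "Value"]]


class EvaluationError(Exception):
    """Raised when operands do not support the requested operator."""


def _is_number(value: Value) -> bool:
    # bool is a subclass of int; exclude it so `True < 2` is rejected explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def less_than(left: Value, right: Value) -> bool:
    """Order two values only after narrowing both to a comparable type."""
    if _is_number(left) and _is_number(right):
        return left < right
    if isinstance(left, str) and isinstance(right, str):
        return left < right
    raise EvaluationError(
        f"'<' not supported between {type(left).__name__} and {type(right).__name__}"
    )
```

Checking types up front means the error names the offending operand types directly, rather than surfacing whatever message the interpreter happened to produce.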
Drops an unnecessary noqa: ANN202 in a DSL test (tests already ignore ANN) and hoists a pre-existing lazy shutil import to module top in the api-docs test, leaving no type/lint suppressions anywhere in the changed code.
JsonlCorpusSource and CsvCorpusSource now retain ALL source fields by default (provenance_fields/columns=None keeps every field except the text field), so nothing - including Reddit thread edges parent_id/link_id - is silently dropped; an explicit tuple still selects a subset. Non-scalar values are JSON-serialized (json.dumps) rather than str()-ified, so they round-trip via json.loads. This guarantees corpus structure is recoverable downstream even on the fast streaming path.
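The difference between `json.dumps` and `str()` for non-scalar values is easy to see in a small sketch. The `to_provenance` helper and the sample record are hypothetical; only the standard-library `json` calls are real.

```python
import json

record = {
    "text": "hello",
    "parent_id": "t1_abc",
    "tags": ["a", "b"],
    "meta": {"score": 3},
}


def to_provenance(rec: dict, text_field: str = "text") -> dict:
    """Keep every non-text field; JSON-encode anything that is not a scalar."""
    out = {}
    for key, value in rec.items():
        if key == text_field:
            continue
        if value is None or isinstance(value, (str, int, float, bool)):
            out[key] = value
        else:
            out[key] = json.dumps(value)
    return out


provenance = to_provenance(record)
# JSON-encoded values decode back to the original structure.
restored_tags = json.loads(provenance["tags"])
restored_meta = json.loads(provenance["meta"])
```

By contrast, `str(["a", "b"])` produces a Python repr with single quotes, which `json.loads` rejects, so a `str()`-based encoding cannot be reversed with the JSON decoder.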
On top of the streaming sources, adds bead/corpus/graph.py (CorpusNode, CorpusEdge, CorpusGraph) - a directed, typed multigraph over expressions with traversal helpers (out/in_edges, successors/predecessors, roots, descendants, reverse). Reddit reply trees are the single-edge-type special case; arbitrary typed relations between expressions are the general case. bead/corpus/assemble.py adds EdgeSpec (declarative field-to-edge rule with prefix stripping for Reddit fullnames) and assemble_graph, which buffers a record stream and reconstructs the graph from EdgeSpecs and/or a runtime edge_fn. Dangling edge targets are preserved, not dropped. The model is aligned with layers' graphNode/graphEdgeSet for lossless mapping (next phase).
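Reconstructing a reply tree from a declarative field-to-edge rule can be sketched as below. `EdgeRule`, `Edge`, and `assemble` are hypothetical simplifications, not bead's `EdgeSpec`, `CorpusEdge`, or `assemble_graph`; the child-to-parent edge direction and the `id` field name are assumptions. The `t1_` (comment) and `t3_` (link) prefixes are Reddit's fullname type prefixes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeRule:
    """Hypothetical field-to-edge rule, loosely modeled on the EdgeSpec idea."""

    field: str
    edge_type: str
    strip_prefixes: tuple[str, ...] = ()


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str


def strip_prefix(value: str, prefixes: tuple[str, ...]) -> str:
    """Remove the first matching prefix, if any."""
    for prefix in prefixes:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def assemble(
    records: list[dict[str, str]], rules: list[EdgeRule]
) -> tuple[set[str], list[Edge]]:
    """Buffer records, then emit one typed edge per populated rule field."""
    nodes = {r["id"] for r in records}
    edges = []
    for r in records:
        for rule in rules:
            raw = r.get(rule.field)
            if raw:
                target = strip_prefix(raw, rule.strip_prefixes)
                # Targets missing from `nodes` are kept as dangling edges.
                edges.append(Edge(r["id"], target, rule.edge_type))
    return nodes, edges


thread = [
    {"id": "c1", "parent_id": "t3_post"},  # reply to a post that was not ingested
    {"id": "c2", "parent_id": "t1_c1"},  # reply to comment c1
]
nodes, edges = assemble(thread, [EdgeRule("parent_id", "reply_to", ("t1_", "t3_"))])
dangling = [e for e in edges if e.target not in nodes]
```

Here the edge from `c1` to the missing `post` survives as a dangling edge, so a partial dump of a thread still records where each comment was attached.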
First bead<->layers interop lens, establishing the law-verified template. The CorpusGraph lens projects to a faithful, standalone layers view (expression records, graph nodes, a graphEdgeSet of typed objectRef edges) and keeps a complement holding what layers' graph cannot express (bead framework identity, edge directedness, exact float confidence). Together they reconstruct the graph exactly - the didactic GetPut/PutGet laws hold. Adds bead/interop/layers/_convert.py with the shared, reversible conversions (featureMap with insertion-order + tuple preservation, objectRef, identity capture/restore via .with_, typed JsonValue accessors). Adds hypothesis (dev) and rigorous tests: deterministic round-trips over reddit threads, abstract typed multidigraphs, and provenance-bearing expressions, plus a property-based check of the GetPut law over generated graphs. pyright strict + ruff clean; no Any/object/ignores.
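The view-plus-complement idea and the two round-trip laws can be illustrated without didactic. Everything below is a hypothetical plain-Python sketch: it does not use `dx.Lens`, and the `get`/`put` pair follows a complement-based formulation assumed for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Source-side edge carrying details the target view cannot express."""

    source: str
    target: str
    directed: bool
    confidence: float


@dataclass(frozen=True)
class EdgeView:
    """What the target schema can represent in this sketch."""

    source: str
    target: str


@dataclass(frozen=True)
class Complement:
    """Everything the view drops, kept so the original can be rebuilt."""

    directed: bool
    confidence: float


def get(edge: Edge) -> tuple[EdgeView, Complement]:
    return EdgeView(edge.source, edge.target), Complement(edge.directed, edge.confidence)


def put(view: EdgeView, complement: Complement) -> Edge:
    return Edge(view.source, view.target, complement.directed, complement.confidence)


original = Edge("c2", "c1", directed=True, confidence=0.1 + 0.2)
view, complement = get(original)

# GetPut: projecting and then rebuilding returns the original value.
get_put_holds = put(view, complement) == original
# PutGet: rebuilding and then projecting returns the same view and complement.
put_get_holds = get(put(view, complement)) == (view, complement)
```

The float is carried in the complement as-is rather than being formatted and re-parsed, which is why a value like `0.1 + 0.2` survives the round trip exactly.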
A dx.Lens projecting a CorpusRecord to a faithful layers expression view (kind/text/features) with the bead-only remainder (identity, source_name, record_index) in the complement. GetPut/PutGet verified by example and property tests.
A true dx.Iso (ParsedToken/ParsedSentence carry no framework identity) mapping a dependency parse to a layers tokenization plus a part-of-speech token-tag layer and a dependency relation layer (root encoded as headIndex -1, morph in the pos features). Round-trip verified by example and property tests; makes the Phase 1 parse/layers alignment executable. Also quiets the hypothesis norecursedirs warning.
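The root encoding mentioned above is a small reversible mapping. The sketch assumes, purely for illustration, that the source side marks the root with `None`; the commit message only states that the layers side uses headIndex -1.

```python
from __future__ import annotations

ROOT_HEAD_INDEX = -1  # sentinel used on the layers side, per the commit message


def encode_head(head_index: int | None) -> int:
    """Map a source-side head (None for root, an assumption here) to headIndex."""
    return ROOT_HEAD_INDEX if head_index is None else head_index


def decode_head(value: int) -> int | None:
    """Invert encode_head: the sentinel becomes None again."""
    return None if value == ROOT_HEAD_INDEX else value


heads = [1, None, 1]  # "She reads books": token 1 is the root
encoded = [encode_head(h) for h in heads]
decoded = [decode_head(v) for v in encoded]
```

Because valid head indices are never negative, the sentinel cannot collide with a real head, so the two functions are inverses of each other over well-formed parses.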
…defs Mirrors all 29 pub.layers.defs object definitions as didactic models (anchor union, temporal/spatial expressions, token/text/page/external anchors, knowledgeRef, objectRef, agentRef, alignmentLink, annotationMetadata, constraint, feature map, etc.), structurally faithful to layers so a single generic snake<->camel conversion (bead/interop/layers/_mirror.py) serializes any of them to and from layers JSON losslessly. MirrorIso[T] wraps that as a didactic dx.Iso; SHARED_DEF_ISOS registers one per construct. Tests round-trip every shared def (GetPut + PutGet), guard coverage (every construct has an iso), and verify the GetPut/PutGet laws via didactic's verify_iso on representative flat models. pyright strict + ruff clean; no Any/object/ignores.
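A generic snake<->camel key conversion of the kind described might be sketched as follows. These helpers are hypothetical and are not the contents of bead/interop/layers/_mirror.py; they assume field names without digits or consecutive capitals, and they rename every dict key, so free-form maps with user-supplied keys would need separate handling.

```python
import re


def snake_to_camel(name: str) -> str:
    """byte_start -> byteStart"""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camel_to_snake(name: str) -> str:
    """byteStart -> byte_start"""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def convert_keys(value, rename):
    """Recursively rename dict keys; lists are walked, scalars pass through."""
    if isinstance(value, dict):
        return {rename(k): convert_keys(v, rename) for k, v in value.items()}
    if isinstance(value, list):
        return [convert_keys(v, rename) for v in value]
    return value


python_side = {"text_span": {"byte_start": 0, "byte_end": 3}, "head_index": -1}
json_side = convert_keys(python_side, snake_to_camel)
round_tripped = convert_keys(json_side, camel_to_snake)
```

Keeping the mirror models structurally identical to the schema is what lets one conversion like this cover every construct, instead of a hand-written serializer per type.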
Mirrors expression, segmentation (token/tokenization), annotation (annotation/argumentRef/cluster), the polymorphic annotationLayer, the property graph (graphNode/graphEdge/graphEdgeSet/graphEdgeEntry), media descriptors (audio/video/document info), and ontology (roleSlot/typeDef) - reusing the shared-def mirrors. The generic MirrorIso serializes each to/from layers JSON losslessly. RECORD_ISOS / ALL_MIRROR_ISOS register them; tests round-trip every record type and guard coverage. pyright strict + ruff clean.
Adds a Layers Interoperability user guide (executable round-trip examples for the corpus graph, dependency parse, and mirror models), a bead.interop API reference page, and a thread/graph reconstruction + losslessness section in the corpus guide; wires both into the mkdocs nav. Adds a coverage test asserting every targeted layers construct has a registered, law-passing mirror iso.
…ayers) Maps bead's existing resource models to layers resource records via dx.Lens: LexicalItem <-> entry, Lexicon <-> collection (+ entries), Template <-> template (slots + DSL constraints). Faithful layers views; the bead-only remainder (framework identity, single language code, tags, DSL constraint context, the bead form/source fields) rides in the lens complement, so the round-trip is exact (GetPut/PutGet, tested). Per a feasibility review, the divergent experiment overlaps (judgment, corpus, persona, changelog) are intentionally not mapped - documented in the package docstring - rather than forced into low-value lenses.
Removes development-note phrasing (comparisons between overlaps, feasibility narration, what was deliberately not built) from the resource_lens, package, _mirror, and graph_lens docstrings. They now describe what each module maps and how, in the present tense, without referencing the development process.
Adds a test suite that runs every layers mapping's output through the ATProto lexicon validator (@atproto/lexicon) and asserts each record validates against its layers lexicon, proving the mappings produce schema-valid layers. Vendors the layers lexicons as the vendor/layers git submodule (layers-pub/layers, shallow, tracking main) so they update with git submodule update --remote. CI's Python test job now checks out submodules so validation runs rather than skips. Fixes conformance bugs the validator surfaced:
- parse token textSpan now emits the required byteStart/byteEnd (UTF-8 byte offsets) alongside the optional char offsets.
- externalTarget.selector serializes as an ATProto $type union member instead of a wrapper object, and round-trips back.
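The byteStart/byteEnd fix rests on the fact that character offsets and UTF-8 byte offsets diverge as soon as the text contains a non-ASCII character. A minimal sketch of the conversion, with a hypothetical helper name:

```python
def char_span_to_byte_span(text: str, char_start: int, char_end: int) -> tuple[int, int]:
    """Convert a [start, end) character span to UTF-8 byte offsets."""
    byte_start = len(text[:char_start].encode("utf-8"))
    byte_end = len(text[:char_end].encode("utf-8"))
    return byte_start, byte_end


text = "caf\u00e9 au lait"  # the accented e occupies two bytes in UTF-8
char_span = (5, 7)  # the token "au"
byte_span = char_span_to_byte_span(text, *char_span)
```

Slicing the encoded bytes with `byte_span` recovers the same token that slicing the string with `char_span` does, even though the two pairs of numbers differ.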
Releases 0.6.0: streaming/buffering corpus tiers, the bead.interop.layers lossless layers interop, and lexicon validation. Documents the buffering graph tier, the layers interop subpackage, and the lossless-by-default streaming change, which the changelog had not yet recorded.
- Adds hypothesis to [project.optional-dependencies].dev so the pip-based CI install (`-e .[dev,...]`) provides it; the interop tests import it at module scope. It was previously only in [dependency-groups].
- Applies `ruff format` across the tree and excludes the vendor submodule from ruff, so the Format check passes.
Description
Releases 0.6.0, adding lossless corpus tooling and a complete, law-verified
mapping between bead and the layers
linguistic-annotation schema.
- Streaming tier (bead.corpus) is now lossless by default: JsonlCorpusSource/CsvCorpusSource retain every field and round-trip non-scalar values through JSON, so nothing is dropped at ingestion.
- Buffering tier (bead.corpus.graph, bead.corpus.assemble): an opt-in CorpusGraph typed directed multigraph plus assemble_graph, which reconstructs thread structure (e.g. Reddit reply trees from parent_id) on top of the untouched streaming pipeline.
- Layers interop (bead.interop.layers): faithful mirror models for the layers shared defs and record types with a generic lossless MirrorIso, plus bridge lenses (CorpusRecord to expression, CorpusGraph to a property graph, ParsedSentence to tokenization + annotation layers, and resource-overlap lenses). Mappings are didactic dx.Iso/dx.Lens so every round-trip is exact and law-verified.
- Lexicon validation: every mapping's output is validated against the layers lexicons (vendored as the vendor/layers git submodule) using the ATProto lexicon validator (@atproto/lexicon), proving the mappings produce schema-valid layers.
See CHANGELOG.md for the full 0.6.0 entry.
Motivation
bead needs to ingest raw corpora without discarding source structure and to
interoperate losslessly with the layers annotation schema. The mapping is
expressed as verified lenses and checked against the canonical layers lexicons
so schema drift surfaces as a failing test.
Fixes #
Type of Change
Checklist
- uv run ruff check . and uv run ruff format .
- uv run pyright with no errors
- Tests pass (uv run pytest tests/)
Testing
- uv run pytest (full suite green; the one known spaCy-on-3.14 case skips).
- uv run pyright reports 0 errors / 0 warnings; uv run ruff check clean.
- Layers mapping outputs validate against the vendored layers lexicons via @atproto/lexicon.