Adds lossless corpus tooling + bead.interop.layers (0.6.0) #7

Merged
aaronstevenwhite merged 23 commits into main from feat/corpus-ingestion on May 29, 2026

@aaronstevenwhite

Collaborator

Description

Releases 0.6.0, adding lossless corpus tooling and a complete, law-verified
mapping between bead and the layers
linguistic-annotation schema.

  • Streaming corpus ingestion (bead.corpus) is now lossless by default:
    JsonlCorpusSource/CsvCorpusSource retain every field and round-trip
    non-scalar values through JSON, so nothing is dropped at ingestion.
  • Buffering graph tier (bead.corpus.graph, bead.corpus.assemble): an
    opt-in CorpusGraph (a typed, directed multigraph) plus assemble_graph,
    which reconstructs thread structure (e.g. Reddit reply trees from
    parent_id) on top of the untouched streaming pipeline.
  • Lossless layers interop (bead.interop.layers): faithful mirror models
    for the layers shared defs and record types with a generic lossless
    MirrorIso, plus bridge lenses (CorpusRecord to expression, CorpusGraph
    to a property graph, ParsedSentence to tokenization + annotation layers,
    and resource-overlap lenses). Mappings are didactic dx.Iso/dx.Lens so
    every round-trip is exact and law-verified.
  • Lexicon validation: every mapping's output is validated against the layers
    lexicons (vendored as the vendor/layers git submodule) using the ATProto
    lexicon validator (@atproto/lexicon), proving the mappings produce
    schema-valid layers.

See CHANGELOG.md for the full 0.6.0 entry.

Motivation

bead needs to ingest raw corpora without discarding source structure and to
interoperate losslessly with the layers annotation schema. The mapping is
expressed as verified lenses and checked against the canonical layers lexicons
so schema drift surfaces as a failing test.

Fixes #

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update
  • Tests (adding or updating tests)

Checklist

  • I have read the CONTRIBUTING guidelines
  • My code follows the project's style guidelines
  • I have run uv run ruff check . and uv run ruff format .
  • I have run uv run pyright with no errors
  • I have added tests that prove my fix/feature works
  • All tests pass (uv run pytest tests/)
  • I have updated documentation as needed

Testing

  • uv run pytest (full suite green; the one known spaCy-on-Python-3.14 case is skipped).
  • uv run pyright reports 0 errors / 0 warnings; uv run ruff check clean.
  • Every layers mapping round-trips under the didactic lens laws and validates
    against the vendored layers lexicons via @atproto/lexicon.
  • CI's Python test job checks out submodules so lexicon validation runs there.

panproto is a transitive dependency of didactic (never imported directly in
bead); didactic 0.7.2 requires panproto>=0.48.3. Both are compatible with the
existing requires-python >=3.14. The full suite passes (3330) with no source
changes; the narrow didactic.api (dx) surface is unaffected by the 0.6 -> 0.7
jump. Callable field types remain unsupported in 0.7.2.

Phase 1 of corpus integration. Adds bead/tokenization/parsers.py: ParsedToken/
ParsedSentence models, SpacyParser/StanzaParser/create_parser, and
parse_to_spans which projects a dependency parse onto Span + SpanRelation
(one single-token Span per token carrying head_index, and one directed
head -> dependent SpanRelation per arc labeled with the deprel). Field
placement is aligned with the layers annotation model (formalism, tool,
tokenization_id, upos/xpos/lemma/deprel/morph, char offsets) so a parse stored
on an Item maps losslessly to a layers dependency AnnotationLayer.
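
The projection described above can be sketched in plain Python. This is an illustration only, not bead's implementation: the `Token`, `Span`, and `SpanRelation` dataclasses, their field names, and `project_parse` are hypothetical stand-ins for the real models and for parse_to_spans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical parsed token; head is None for the root."""
    index: int
    text: str
    deprel: str
    head: Optional[int]


@dataclass(frozen=True)
class Span:
    """Hypothetical single-token span that remembers its head."""
    token_index: int
    head_index: Optional[int]


@dataclass(frozen=True)
class SpanRelation:
    source: int  # head token index
    target: int  # dependent token index
    label: str   # dependency relation (deprel)


def project_parse(tokens: list[Token]) -> tuple[list[Span], list[SpanRelation]]:
    """Emit one span per token and one head -> dependent relation per arc."""
    spans = [Span(t.index, t.head) for t in tokens]
    relations = [
        SpanRelation(source=t.head, target=t.index, label=t.deprel)
        for t in tokens
        if t.head is not None  # the root has no incoming arc
    ]
    return spans, relations


tokens = [
    Token(0, "Dogs", "nsubj", 1),
    Token(1, "chase", "root", None),
    Token(2, "cats", "obj", 1),
]
spans, relations = project_parse(tokens)
```

Every token yields a span, but only non-root tokens yield a relation, so a sentence with n tokens and a single root produces n spans and n - 1 relations.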

Adds STRUCTURE_FUNCTIONS to the DSL stdlib (upos, deprel, head, dependents,
has_relation, root, subtree, path_to_root, tokens_with_*, any_deprel,
filter_upos), enabling structural constraints like
'upos(self, root(self)) == "VERB" and len(dependents(self, root(self), "obj")) > 0'
with no grammar change. Widens DSLEvaluator.evaluate context to Mapping and
the DslFunction return alias to include list[int].
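
To make the quoted constraint concrete, here is a rough plain-Python reading of what it checks. The token dictionaries and the three helper functions are assumptions made for illustration; the real stdlib functions take `self` and may differ in signature and behavior.

```python
from typing import Optional

# Hypothetical parse: one dict per token; head is None for the root.
Sentence = list[dict]


def root(sentence: Sentence) -> int:
    """Index of the token that has no head."""
    return next(i for i, tok in enumerate(sentence) if tok["head"] is None)


def upos(sentence: Sentence, index: int) -> str:
    """Universal part-of-speech tag of the token at index."""
    return sentence[index]["upos"]


def dependents(sentence: Sentence, index: int, deprel: Optional[str] = None) -> list[int]:
    """Indices of tokens headed by index, optionally restricted to one deprel."""
    return [
        i
        for i, tok in enumerate(sentence)
        if tok["head"] == index and deprel in (None, tok["deprel"])
    ]


def has_verb_root_with_object(sentence: Sentence) -> bool:
    """Plain-Python equivalent of the structural constraint quoted above."""
    r = root(sentence)
    return upos(sentence, r) == "VERB" and len(dependents(sentence, r, "obj")) > 0


transitive = [
    {"text": "Dogs", "upos": "NOUN", "head": 1, "deprel": "nsubj"},
    {"text": "chase", "upos": "VERB", "head": None, "deprel": "root"},
    {"text": "cats", "upos": "NOUN", "head": 1, "deprel": "obj"},
]
intransitive = [
    {"text": "Dogs", "upos": "NOUN", "head": 1, "deprel": "nsubj"},
    {"text": "bark", "upos": "VERB", "head": None, "deprel": "root"},
]
```

Under these assumptions the constraint holds for `transitive` and fails for `intransitive`, because only the first root has an `obj` dependent.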

Lifts the shared spaCy/Stanza space_after extraction to a single canonical
site reused by both the tokenizers and the new parsers (no redundancy).

Includes a layers no-drop smoke test asserting every field a layers
dependency annotation needs is reconstructable from a parsed Item.

Phase 2 of corpus integration. Adds bead/corpus/: CorpusRecord (provenance
keyed to the layers AnnotationMetadata shape), a CorpusSource Protocol, and
lazy sources JsonlCorpusSource (plain + Zstandard .zst) and CsvCorpusSource.
The pipeline (parse_records, filter_by_structure, sample_corpus, record_to_item)
streams records, dependency-parses them, and keeps only those whose parse
satisfies a structural DSL constraint, producing Items with layers-aligned
spans/relations/provenance. Fully lazy so multi-gigabyte corpora never load
into memory.
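
The laziness claim can be illustrated with ordinary generators. The sketch below is not the bead pipeline: `read_records` and `keep_matching` are hypothetical names, and a string test stands in for parsing plus a structural constraint. It only shows that sampling a few matches from a large stream reads no more of the stream than needed.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def read_records(lines: Iterable[str], consumed: list[str]) -> Iterator[str]:
    """Yield records one at a time, logging each record actually read."""
    for line in lines:
        consumed.append(line)
        yield line


def keep_matching(records: Iterable[str], predicate: Callable[[str], bool]) -> Iterator[str]:
    """Lazily keep only the records that satisfy the predicate."""
    return (record for record in records if predicate(record))


consumed: list[str] = []
corpus = (f"sentence {i}" for i in range(1_000_000))  # never materialized
matching = keep_matching(read_records(corpus, consumed), lambda s: s.endswith("0"))
sample = list(islice(matching, 3))  # stop after three matches
```

Because each stage is a generator, `consumed` ends up holding only the records read up to the third match rather than the full million.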

Lifts a shared iter_jsonl_lines helper into bead/data/serialization.py, reused
by read_jsonlines, stream_jsonlines, and JsonlCorpusSource (the only
corpus-specific addition is a decompressing open_fn). Adds a DependencyParser
Protocol with a tool identifier on SpacyParser/StanzaParser for provenance.
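
A minimal sketch of the shared-helper idea follows: one line iterator, with the way the file is opened passed in as a function. The names `iter_jsonl`, `open_text`, and `open_gzip` are hypothetical, and gzip from the standard library is used as a stand-in for Zstandard so the example needs no extra dependency.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import IO, Callable, Iterator


def open_text(path: Path) -> IO[str]:
    """Default opener: plain UTF-8 text."""
    return open(path, "r", encoding="utf-8")


def open_gzip(path: Path) -> IO[str]:
    """Decompressing opener (gzip here; a Zstandard opener would play the same role)."""
    return gzip.open(path, "rt", encoding="utf-8")


def iter_jsonl(path: Path, open_fn: Callable[[Path], IO[str]] = open_text) -> Iterator[dict]:
    """Lazily yield one parsed JSON object per non-empty line."""
    with open_fn(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


records = [{"id": 1, "text": "plain"}, {"id": 2, "text": "compressed"}]
payload = "".join(json.dumps(r) + "\n" for r in records)
with tempfile.TemporaryDirectory() as tmp:
    plain_path = Path(tmp) / "corpus.jsonl"
    gzip_path = Path(tmp) / "corpus.jsonl.gz"
    plain_path.write_text(payload, encoding="utf-8")
    with gzip.open(gzip_path, "wt", encoding="utf-8") as out:
        out.write(payload)
    from_plain = list(iter_jsonl(plain_path))
    from_gzip = list(iter_jsonl(gzip_path, open_fn=open_gzip))
```

Only the opener differs between the two reads; the line handling is shared, which is the point of lifting the helper.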

Adds the 'corpus' optional-dependency extra (zstandard) and zstandard to dev.

Phase 3 of corpus integration. Adds a TextGenerator Protocol to the adapter
base and generate_completion to the OpenAI and Anthropic adapters (reusing
their existing authenticated clients, no parallel client). Adds
CompletionCorpusSource, which treats any TextGenerator as a corpus source,
recording the model and prompt as layers-aligned provenance.
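
The structural-typing idea can be sketched as follows. The Protocol members, the `model` attribute, the stub generator, and `completion_records` are all assumptions made for illustration; the actual generate_completion signature and the CompletionCorpusSource record shape are not shown in this PR description.

```python
from typing import Iterable, Iterator, Protocol


class TextGenerator(Protocol):
    """Hypothetical structural type: anything that can complete a prompt."""

    model: str

    def generate_completion(self, prompt: str) -> str: ...


class UppercaseGenerator:
    """Stand-in for a real model adapter; makes no network calls."""

    model = "uppercase-stub"

    def generate_completion(self, prompt: str) -> str:
        return prompt.upper()


def completion_records(generator: TextGenerator, prompts: Iterable[str]) -> Iterator[dict]:
    """Treat a generator as a corpus source, recording model and prompt as provenance."""
    for prompt in prompts:
        yield {
            "text": generator.generate_completion(prompt),
            "provenance": {"model": generator.model, "prompt": prompt},
        }


records = list(completion_records(UppercaseGenerator(), ["write a sentence"]))
```

Because the Protocol is structural, the stub never inherits from `TextGenerator`; any adapter exposing the same members would be accepted.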

Adds MarkdownStripTransform and RedditCleanupTransform (SpanTextTransform
callables, registered in the default registry) and split_sentences (parser-
backed when a spacy/stanza config is given, regex fallback otherwise).
RedditCleanupTransform reuses MarkdownStripTransform rather than duplicating
markup stripping.

Stanza is an installed tokenization/dev dependency and the English model
(tokenize,pos,lemma,depparse) is available, so these tests run for real. The
guard now skips only if the model genuinely cannot be downloaded (no network);
once present, parse and projection errors surface as failures rather than
being swallowed by a broad skip. Adds an end-to-end pipeline test that runs a
real StanzaParser through sample_corpus and asserts only transitive sentences
are kept.
…code

Replaces lazy optional-dependency imports (spaCy, Stanza, zstandard) with
importlib.import_module so the import-outside-top-level lint no longer needs a
noqa and the messy/partial third-party stubs no longer force type: ignore. Adds
docstrings to structural-typing Protocol stubs instead of suppressing the
docstring lint, and makes the spaCy token Protocol read-only so a real spaCy
Token satisfies it. Moves the core pandas import to module top and the internal
create_parser import out of split_sentences.
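
For readers unfamiliar with the pattern, a minimal sketch of loading an optional dependency through importlib.import_module is shown here. The helper name `require_optional` and the error wording are hypothetical; only `importlib.import_module` itself is the standard-library call named above.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency by name, with an actionable error if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install the {extra!r} extra to enable it."
        ) from exc


# A module that is always available stands in for an installed optional dependency.
json_module = require_optional("json", extra="corpus")

try:
    require_optional("definitely_not_installed_module", extra="corpus")
    missing_raised = False
except ImportError:
    missing_raised = True
```

Since the import is an ordinary function call rather than an `import` statement inside a function body, no import-placement lint rule applies to it.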

Replaces the object/Any-typed corpus scalar coercion with a precise recursive
JSON value type, and widens DSLEvaluator.evaluate to accept a Mapping. Also
clears pre-existing dead type: ignore comments in the DSL evaluator and the
adapter base. pyright (strict) and ruff both pass clean with no suppressions
anywhere in the changed code.
…ransforms

Adds API reference pages for bead.corpus and bead.transforms, a Dependency
Parsing section to the tokenization reference, and a structural-query note to
the DSL reference. Adds a Corpus Ingestion user guide with end-to-end examples
(sources, structural sampling, text cleanup, generated corpora) and wires all
new pages into the mkdocs nav. Documents the corpus and tokenization extras in
the installation guide and records the new functionality plus the didactic/
panproto version bumps in the changelog.

Replaces every Any in bead/dsl/evaluator.py with a recursive DslValue union
(scalars, collections, bead models, JsonValue). The operator dispatch now
narrows operands before ordering/arithmetic/membership/subscript instead of
relying on a broad Any plus try/except TypeError, so the evaluator type-checks
cleanly even with the dsl/ pyright exclude lifted (verified) and gives clearer
EvaluationError messages. DSLEvaluator.evaluate now takes Mapping[str, DslValue]
and returns DslValue; the checked callers (corpus pipeline, list partitioner,
template resolver) consume it unchanged.
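
The narrowing approach can be sketched on a small scale. This is not the evaluator's code: `JsonValue` here is a simplified stand-in for the recursive value type, and `less_than` and this local `EvaluationError` class are hypothetical. The point is that operands are checked with isinstance before the comparison instead of catching TypeError afterwards.

```python
from typing import Union

# Recursive alias: a JSON value is a scalar, a list of values, or a string-keyed dict.
JsonValue = Union[None, bool, int, float, str, list["JsonValue"], dict[str, "JsonValue"]]


class EvaluationError(Exception):
    """Raised when operands cannot be combined by an operator."""


def less_than(left: JsonValue, right: JsonValue) -> bool:
    """Order two values only after narrowing them to comparable types."""
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return left < right
    if isinstance(left, str) and isinstance(right, str):
        return left < right
    raise EvaluationError(
        f"'<' is not supported between {type(left).__name__} and {type(right).__name__}"
    )


try:
    less_than(1, "a")
    error_message = ""
except EvaluationError as exc:
    error_message = str(exc)
```

After each isinstance check a type checker knows both operands are numbers or both are strings, so the comparison type-checks without Any, and the error names the offending operand types.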

Updates the operator type-error test to the clearer message. Makes the corpus
user-guide source/cleanup examples execute against new fixtures, and extends the
api-docs code-block test to skip examples needing an optional NLP parser model
or model API (mirroring the existing glazing-data skip).

pyright (strict) and ruff pass with no warnings and no suppressions anywhere in
the changed code.

Drops an unnecessary noqa: ANN202 in a DSL test (tests already ignore ANN) and
hoists a pre-existing lazy shutil import to module top in the api-docs test,
leaving no type/lint suppressions anywhere in the changed code.

JsonlCorpusSource and CsvCorpusSource now retain ALL source fields by default
(provenance_fields/columns=None keeps every field except the text field), so
nothing - including Reddit thread edges parent_id/link_id - is silently dropped;
an explicit tuple still selects a subset. Non-scalar values are JSON-serialized
(json.dumps) rather than str()-ified, so they round-trip via json.loads. This
guarantees corpus structure is recoverable downstream even on the fast streaming
path.
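
The difference between the two encodings is easy to demonstrate. In the sketch below, `encode_field` is a hypothetical helper, not the sources' actual coercion code; it keeps scalars unchanged and serializes anything else with json.dumps.

```python
import json


def encode_field(value):
    """Keep scalars as they are; serialize everything else as JSON text."""
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    return json.dumps(value)


nested = {"parent_id": "t1_abc", "flags": [True, None]}

# JSON text can be parsed back into an equal value.
encoded = encode_field(nested)
restored = json.loads(encoded)

# str() produces a Python repr (single quotes, True, None), which is not JSON.
try:
    json.loads(str(nested))
    str_round_trips = True
except json.JSONDecodeError:
    str_round_trips = False
```

The JSON-encoded field parses back to the original dictionary, while the str() form fails to parse, so the structure would be unrecoverable downstream.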

On top of the streaming sources, adds bead/corpus/graph.py (CorpusNode,
CorpusEdge, CorpusGraph) - a directed, typed multigraph over expressions with
traversal helpers (out/in_edges, successors/predecessors, roots, descendants,
reverse). Reddit reply trees are the single-edge-type special case; arbitrary
typed relations between expressions are the general case.
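
A toy version of a typed, directed multigraph makes the traversal helpers concrete. The `Edge` and `Graph` classes below are hypothetical and much smaller than CorpusGraph; the parent-to-child edge direction is also an assumption chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str


@dataclass
class Graph:
    """Toy multigraph: parallel edges between the same pair of nodes are allowed."""
    nodes: set[str] = field(default_factory=set)
    edges: list[Edge] = field(default_factory=list)

    def add_edge(self, source: str, target: str, edge_type: str) -> None:
        self.nodes.update((source, target))
        self.edges.append(Edge(source, target, edge_type))

    def successors(self, node: str, edge_type: Optional[str] = None) -> list[str]:
        """Targets of edges leaving node, optionally restricted to one edge type."""
        return [
            e.target
            for e in self.edges
            if e.source == node and edge_type in (None, e.edge_type)
        ]

    def roots(self) -> list[str]:
        """Nodes with no incoming edge."""
        return sorted(self.nodes - {e.target for e in self.edges})

    def descendants(self, node: str) -> set[str]:
        """All nodes reachable from node by following edges forward."""
        seen: set[str] = set()
        stack = [node]
        while stack:
            for child in self.successors(stack.pop()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


graph = Graph()
graph.add_edge("a", "b", "reply")
graph.add_edge("a", "b", "quote")  # second, differently typed edge between the same nodes
graph.add_edge("b", "c", "reply")
```

With a single edge type this reduces to a reply tree; the second edge between `a` and `b` is what makes it a multigraph.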

bead/corpus/assemble.py adds EdgeSpec (declarative field-to-edge rule with
prefix stripping for Reddit fullnames) and assemble_graph, which buffers a
record stream and reconstructs the graph from EdgeSpecs and/or a runtime
edge_fn. Dangling edge targets are preserved, not dropped. The model is aligned
with layers' graphNode/graphEdgeSet for lossless mapping (next phase).
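
The reconstruction step can be sketched with plain dictionaries. `strip_prefix` and `assemble_edges` are hypothetical helpers, not the EdgeSpec or assemble_graph API; the `t1_` (comment) and `t3_` (post) prefixes are Reddit's fullname type prefixes.

```python
from typing import Iterable


def strip_prefix(fullname: str, prefixes: tuple[str, ...] = ("t1_", "t3_")) -> str:
    """Drop a Reddit fullname type prefix so the value matches a bare record id."""
    for prefix in prefixes:
        if fullname.startswith(prefix):
            return fullname[len(prefix):]
    return fullname


def assemble_edges(records: Iterable[dict], field: str = "parent_id") -> list[tuple[str, str]]:
    """Buffer the stream and emit one (parent, child) edge per record with a parent."""
    buffered = list(records)  # the graph tier buffers; the streaming tier does not
    return [
        (strip_prefix(record[field]), record["id"])
        for record in buffered
        if record.get(field) is not None
    ]


records = [
    {"id": "c1", "parent_id": "t3_post"},  # top-level comment on a post not in the stream
    {"id": "c2", "parent_id": "t1_c1"},    # reply to c1
]
edges = assemble_edges(records)
known_ids = {record["id"] for record in records}
dangling = {parent for parent, _ in edges if parent not in known_ids}
```

The edge to `post` is kept even though no record with that id was seen, mirroring the choice to preserve dangling targets rather than drop them.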

First bead<->layers interop lens, establishing the law-verified template. The
CorpusGraph lens projects to a faithful, standalone layers view (expression
records, graph nodes, a graphEdgeSet of typed objectRef edges) and keeps a
complement holding what layers' graph cannot express (bead framework identity,
edge directedness, exact float confidence). Together they reconstruct the graph
exactly - the didactic GetPut/PutGet laws hold.
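
The view-plus-complement idea and the two laws can be shown with a toy lens. This sketch does not use didactic's dx.Lens API; the dataclasses and the `get`/`put` functions are hypothetical and exist only to make the laws concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Source-side model (hypothetical)."""
    source: str
    target: str
    directed: bool
    confidence: float


@dataclass(frozen=True)
class EdgeView:
    """What the target schema can express."""
    source: str
    target: str


@dataclass(frozen=True)
class EdgeComplement:
    """What the target schema cannot express, kept so nothing is lost."""
    directed: bool
    confidence: float


def get(edge: Edge) -> tuple[EdgeView, EdgeComplement]:
    """Split a source value into its view and its complement."""
    return EdgeView(edge.source, edge.target), EdgeComplement(edge.directed, edge.confidence)


def put(view: EdgeView, complement: EdgeComplement) -> Edge:
    """Rebuild the source value from a view and a complement."""
    return Edge(view.source, view.target, complement.directed, complement.confidence)


original = Edge("c1", "c2", directed=True, confidence=0.1 + 0.2)
view, complement = get(original)

get_put_holds = put(view, complement) == original            # GetPut
put_get_holds = get(put(view, complement)) == (view, complement)  # PutGet
```

The confidence value is stored as-is in the complement, so the round-trip is exact even for floats that have no short decimal form.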

Adds bead/interop/layers/_convert.py with the shared, reversible conversions
(featureMap with insertion-order + tuple preservation, objectRef, identity
capture/restore via .with_, typed JsonValue accessors). Adds hypothesis (dev)
and rigorous tests: deterministic round-trips over reddit threads, abstract
typed multidigraphs, and provenance-bearing expressions, plus a property-based
check of the GetPut law over generated graphs. pyright strict + ruff clean; no
Any/object/ignores.

A dx.Lens projecting a CorpusRecord to a faithful layers expression view
(kind/text/features) with the bead-only remainder (identity, source_name,
record_index) in the complement. GetPut/PutGet verified by example and
property tests.

A true dx.Iso (ParsedToken/ParsedSentence carry no framework identity) mapping
a dependency parse to a layers tokenization plus a part-of-speech token-tag
layer and a dependency relation layer (root encoded as headIndex -1, morph in
the pos features). Round-trip verified by example and property tests; makes the
Phase 1 parse/layers alignment executable. Also quiets the hypothesis
norecursedirs warning.
…defs

Mirrors all 29 pub.layers.defs object definitions as didactic models (anchor
union, temporal/spatial expressions, token/text/page/external anchors,
knowledgeRef, objectRef, agentRef, alignmentLink, annotationMetadata,
constraint, feature map, etc.), structurally faithful to layers so a single
generic snake<->camel conversion (bead/interop/layers/_mirror.py) serializes any
of them to and from layers JSON losslessly. MirrorIso[T] wraps that as a
didactic dx.Iso; SHARED_DEF_ISOS registers one per construct.
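
A generic key conversion of this kind can be sketched in a few lines. The functions below are hypothetical and are not the contents of _mirror.py; they assume field names are plain lower snake_case, and, unlike a real mapping, they would also rewrite keys inside free-form maps.

```python
import re


def snake_to_camel(name: str) -> str:
    """byte_start -> byteStart."""
    head, *rest = name.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def camel_to_snake(name: str) -> str:
    """byteStart -> byte_start."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def convert_keys(value, convert):
    """Apply a key conversion recursively through dicts and lists."""
    if isinstance(value, dict):
        return {convert(key): convert_keys(item, convert) for key, item in value.items()}
    if isinstance(value, list):
        return [convert_keys(item, convert) for item in value]
    return value


model = {
    "text_span": {"byte_start": 0, "byte_end": 4},
    "tokens": [{"head_index": -1}],
}
as_layers = convert_keys(model, snake_to_camel)
round_tripped = convert_keys(as_layers, camel_to_snake)
```

Because the two name conversions are inverses on lower snake_case names, a single pair of functions serves every mirrored model, which is what lets one generic iso cover all of them.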

Tests round-trip every shared def (GetPut + PutGet), guard coverage (every
construct has an iso), and verify the GetPut/PutGet laws via didactic's
verify_iso on representative flat models. pyright strict + ruff clean; no
Any/object/ignores.

Mirrors expression, segmentation (token/tokenization), annotation
(annotation/argumentRef/cluster), the polymorphic annotationLayer, the property
graph (graphNode/graphEdge/graphEdgeSet/graphEdgeEntry), media descriptors
(audio/video/document info), and ontology (roleSlot/typeDef) - reusing the
shared-def mirrors. The generic MirrorIso serializes each to/from layers JSON
losslessly. RECORD_ISOS / ALL_MIRROR_ISOS register them; tests round-trip every
record type and guard coverage. pyright strict + ruff clean.

Adds a Layers Interoperability user guide (executable round-trip examples for
the corpus graph, dependency parse, and mirror models), a bead.interop API
reference page, and a thread/graph reconstruction + losslessness section in the
corpus guide; wires both into the mkdocs nav. Adds a coverage test asserting
every targeted layers construct has a registered, law-passing mirror iso.
…ayers)

Maps bead's existing resource models to layers resource records via dx.Lens:
LexicalItem <-> entry, Lexicon <-> collection (+ entries), Template <-> template
(slots + DSL constraints). Faithful layers views; the bead-only remainder
(framework identity, single language code, tags, DSL constraint context, the
bead form/source fields) rides in the lens complement, so the round-trip is
exact (GetPut/PutGet, tested).

Per a feasibility review, the divergent experiment overlaps (judgment, corpus,
persona, changelog) are intentionally not mapped - documented in the package
docstring - rather than forced into low-value lenses.

Removes development-note phrasing (comparisons between overlaps, feasibility
narration, what was deliberately not built) from the resource_lens, package,
_mirror, and graph_lens docstrings. They now describe what each module maps and
how, in the present tense, without referencing the development process.

Adds a test suite that runs every layers mapping's output through the
ATProto lexicon validator (@atproto/lexicon) and asserts each record
validates against its layers lexicon, proving the mappings produce
schema-valid layers.

Vendors the layers lexicons as the vendor/layers git submodule
(layers-pub/layers, shallow, tracking main) so they update with
git submodule update --remote. CI's Python test job now checks out
submodules so validation runs rather than skips.

Fixes conformance bugs the validator surfaced:
- parse token textSpan now emits the required byteStart/byteEnd (UTF-8
  byte offsets) alongside the optional char offsets.
- externalTarget.selector serializes as an ATProto $type union member
  instead of a wrapper object, and round-trips back.
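
For the first fix, the relationship between character offsets and UTF-8 byte offsets can be sketched as below. `char_span_to_byte_span` is a hypothetical helper, not the function used in bead.

```python
def char_span_to_byte_span(text: str, char_start: int, char_end: int) -> tuple[int, int]:
    """Convert a [start, end) character span into UTF-8 byte offsets."""
    byte_start = len(text[:char_start].encode("utf-8"))
    byte_end = byte_start + len(text[char_start:char_end].encode("utf-8"))
    return byte_start, byte_end


text = "caf\u00e9 noir"  # the accented letter occupies two bytes in UTF-8
byte_start, byte_end = char_span_to_byte_span(text, 5, 9)  # the token "noir"
token_bytes = text.encode("utf-8")[byte_start:byte_end]
```

The two offset systems agree only for ASCII text; any multi-byte character before a token shifts its byte offsets, which is why both need to be emitted.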

Releases 0.6.0: streaming/buffering corpus tiers, the bead.interop.layers
lossless layers interop, and lexicon validation. Documents the buffering
graph tier, the layers interop subpackage, and the lossless-by-default
streaming change, which the changelog had not yet recorded.

- Adds hypothesis to [project.optional-dependencies].dev so the pip-based
  CI install (`-e .[dev,...]`) provides it; the interop tests import it at
  module scope. It was previously only in [dependency-groups].
- Applies `ruff format` across the tree and excludes the vendor submodule
  from ruff, so the Format check passes.
aaronstevenwhite merged commit e9e14ef into main on May 29, 2026
8 checks passed