Pivot plugin to simple XML element parser #1

Merged: rjrodger merged 15 commits into main from claude/xml-parser-elements-bNQB2 on Apr 19, 2026
@rjrodger
Contributor

Replace the CSV grammar and plugin with an XML element-only variant.
Parses a single rooted XML document into `{ name, children }` where
`children` is an array of strings (text nodes) and nested elements.

Handles open/close tags, self-closing tags (`<tag/>`), nested and
mixed content, and reports mismatched close tags as an error. A
custom lexer matcher tokenizes `<tag>`, `</tag>`, and `<tag/>` as
single tokens; whitespace and JSON structural tokens are disabled
so text is preserved verbatim between tags.
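
To make the `{ name, children }` shape concrete, here is a minimal standalone sketch of an element-only parse. It is not the plugin's code: the plugin goes through a Jsonic grammar and custom matcher, whereas this uses a plain regex loop and a stack. The names `XmlNode` and `parseElementsSketch` and the error messages are assumptions made for illustration.

```typescript
// Illustrative only: names and error messages here are hypothetical,
// and the plugin itself uses a Jsonic grammar rather than this loop.
type XmlNode = { name: string; children: (string | XmlNode)[] }

function parseElementsSketch(src: string): XmlNode {
  // One token per <tag>, </tag>, <tag/>, or run of text between tags.
  const tokenRe = /<(\/?)([A-Za-z_][\w.-]*)(\/?)>|([^<]+)/g
  const holder: XmlNode = { name: '', children: [] } // collects the root
  const stack: XmlNode[] = [holder]
  let pos = 0
  let m: RegExpExecArray | null
  while ((m = tokenRe.exec(src)) !== null) {
    if (m.index !== pos) break // something before this token did not lex
    pos = tokenRe.lastIndex
    const top = stack[stack.length - 1]
    if (m[4] !== undefined) {
      top.children.push(m[4]) // text is preserved verbatim
    } else if (m[1] === '/') {
      if (m[3] === '/' || top.name !== m[2]) {
        throw new Error('mismatched close tag </' + m[2] + '>')
      }
      stack.pop()
    } else {
      const node: XmlNode = { name: m[2], children: [] }
      top.children.push(node)
      if (m[3] !== '/') stack.push(node) // <tag/> has no children to collect
    }
  }
  if (pos !== src.length) throw new Error('unexpected input at offset ' + pos)
  if (stack.length !== 1) throw new Error('unclosed element')
  const root = holder.children[0]
  if (holder.children.length !== 1 || typeof root === 'string') {
    throw new Error('expected a single root element')
  }
  return root
}
```

Under these assumptions, `parseElementsSketch('<a>hi<b/>, there</a>')` yields an `a` node whose children are the string `hi`, an empty `b` node, and the string `, there`, while `<a><b></a></b>` throws on the first mismatched close tag.
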

claude added 15 commits April 18, 2026 22:36
Extend the simple element parser to cover the commonly used parts of
the XML specification beyond bare elements.

Lexer changes:
  - The custom tag matcher now parses attributes (including single- and
    double-quoted values with entity decoding) and recognises comments
    (<!-- ... -->), CDATA sections (<![CDATA[ ... ]]>), processing
    instructions (<? ... ?>, including the XML declaration) and
    DOCTYPE declarations with optional internal subsets. Comments, PIs
    and DOCTYPEs are emitted as an #XIG token and dropped via IGNORE;
    CDATA is emitted verbatim as #TX with no entity processing.
  - A text modifier decodes the five predefined entities (amp, lt, gt,
    quot, apos) plus numeric character references (&#N; and &#xN;)
    from text nodes; attribute values are decoded inline.

Data structure changes:
  - Each element now has `attributes`, `localName` and optional
    `prefix`/`namespace` fields in addition to `name` and `children`.
  - A post-parse walk resolves namespace URIs from xmlns/xmlns:*
    declarations across nested scopes with proper inheritance and
    override semantics.

Options:
  - `namespaces` (default true)     - enable namespace resolution
  - `entities`   (default true)     - enable entity decoding
  - `customEntities`                - additional named entities

Grammar:
  - `xml` rule skips whitespace text nodes between the document
    prolog (declaration, DOCTYPE, comments) and the root element, and
    after the root element, so real-world documents with blank lines
    parse cleanly.
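
The entity decoding described above (five predefined entities, numeric character references, optional custom names) can be sketched as a single replace pass. This is an illustration under assumptions, not the plugin's text modifier: `decodeXmlTextSketch` and `PREDEFINED_SKETCH` are hypothetical names, and unknown names are simply left in place here.

```typescript
// Hypothetical helper; the plugin's own decoder may differ in detail.
const PREDEFINED_SKETCH: Record<string, string> = {
  amp: '&', lt: '<', gt: '>', quot: '"', apos: "'",
}

function decodeXmlTextSketch(
  text: string,
  custom: Record<string, string> = {}
): string {
  const has = (o: object, k: string) => Object.prototype.hasOwnProperty.call(o, k)
  // A single pass means "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(
    /&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z_][\w.-]*);/g,
    (whole: string, ref: string) => {
      if (ref.charAt(0) === '#') {
        const hex = ref.charAt(1) === 'x'
        const cp = parseInt(ref.slice(hex ? 2 : 1), hex ? 16 : 10)
        return cp <= 0x10ffff ? String.fromCodePoint(cp) : whole
      }
      // Predefined names are checked first so they cannot be overridden.
      if (has(PREDEFINED_SKETCH, ref)) return PREDEFINED_SKETCH[ref]
      if (has(custom, ref)) return custom[ref]
      return whole // unknown names are left untouched in this sketch
    }
  )
}
```

For example, `decodeXmlTextSketch('&#65;&#x42; &lt; &copy;', { copy: '(c)' })` gives `AB < (c)`.
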
- go/xml.go: Go port of the XML plugin, with the same data shape and
  feature set as the TypeScript version (elements + attributes +
  mixed content, predefined + numeric entity decoding, namespace
  resolution, comments/CDATA/PI/DOCTYPE handling). Goes through
  jsonic/go's declarative GrammarSpec with auto-wired @xml-bc /
  @child-bc state actions.
- go/xml_test.go: Go test suite driven by the shared TSV spec files
  plus an explicit jsonic-embedded-XML test case.
- test/spec/*.tsv: shared parse fixtures with four columns
  (name, input, expected, opts). Input uses escape sequences
  (\n \r \t \\); expected is raw JSON or `ERROR`/`ERROR:code`; opts
  is optional plugin options JSON. Splits cases across basic,
  attributes, entities, namespaces, structure, errors, and a w3c
  spec of standardised/real-world XML documents (Atom, SOAP, SVG,
  RSS, XHTML, DOCTYPEs, not-well-formed).
- test/xml.test.ts: the TypeScript test suite now auto-discovers
  and runs every TSV spec file, and adds the jsonic-embedded-XML
  test case.
- Remove the leftover CSV Go package, CSV docs, CSV fixtures,
  and coverage artifact from the original repo layout.
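
For readers unfamiliar with the fixture layout, the following sketch shows one way a row of the four-column TSV format could be read. The column order and escape sequences come from the description above; the names `SpecCaseSketch`, `unescapeSpecInput` and `parseSpecLineSketch` are hypothetical, and the actual test runners may differ.

```typescript
// Column layout (name, input, expected, opts) is from the description above;
// the type and function names are hypothetical.
type SpecCaseSketch = {
  name: string
  input: string
  expectError: boolean
  errorCode: string | null
  expected: unknown
  opts: unknown
}

// Turns the escape sequences \n \r \t \\ into the characters they stand for.
function unescapeSpecInput(s: string): string {
  return s.replace(/\\([nrt\\])/g, (_m: string, c: string) =>
    c === 'n' ? '\n' : c === 'r' ? '\r' : c === 't' ? '\t' : '\\')
}

function parseSpecLineSketch(line: string): SpecCaseSketch {
  const cols = line.split('\t')
  const expectedRaw = cols[2] || ''
  const hasCode = expectedRaw.indexOf('ERROR:') === 0
  const expectError = expectedRaw === 'ERROR' || hasCode
  return {
    name: cols[0],
    input: unescapeSpecInput(cols[1] || ''),
    expectError,
    errorCode: hasCode ? expectedRaw.slice('ERROR:'.length) : null,
    expected: expectError ? null : JSON.parse(expectedRaw),
    opts: cols[3] ? JSON.parse(cols[3]) : {}, // opts column is optional
  }
}
```

A runner would then parse `input` with the plugin configured from `opts`, and compare the result with `expected` or check that an error with `errorCode` was raised.
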
embed mode
----------
Adds an `embed: true` plugin option that extends Jsonic's own grammar so
a literal XML element (`<tag>…</tag>` or `<tag/>`) can appear wherever
a Jsonic value is expected — inside maps, lists, or at the top level.
Default behaviour (`embed: false`) remains pure-XML parsing with the
JSON rules stripped.

When `embed: true` the plugin:
  - keeps the full JSON/JSONIC grammar in place, including the
    structural fixed tokens `{ } [ ] : ,`;
  - splices two alternates into the `val` rule so `#XOP`/`#XSC` tokens
    dispatch to the `element` rule;
  - tracks XML nesting depth in `ctx.u.xmlDepth`: while depth > 0 the
    custom matcher also claims any run of non-`<` characters as a
    single `#TX` node (optionally entity-decoded), so Jsonic's lexer
    can't reinterpret a comma or colon inside XML text as a JSON
    separator;
  - resolves namespaces on close of an `element` rule that sits
    directly under a `val` rule.

The embedded-XML tests in both `test/xml.test.ts` and `go/xml_test.go`
now use real literal XML in Jsonic source instead of stuffing the
document inside a string.

W3C XML Conformance Test Suite
------------------------------
Added `scripts/fetch-xml-suite.sh` to download the 2013-09-23 snapshot
of the W3C XML Test Suite (xmltest) into `test/xmlconf/` (gitignored —
the suite is owned by W3C and contributors and is not redistributed).
Both languages pick it up automatically when present:

  - `go/xmlconf_test.go` iterates `xmltest/valid/sa/*.xml` and
    `xmltest/not-wf/sa/*.xml`, counting successful parses and
    expected rejections, and asserts each count stays above a
    regression floor.
  - An equivalent `describe(..., { skip: ... })` block in
    `test/xml.test.ts` does the same for Node.

Current numbers (regression guard in parentheses):
  - valid/sa      : 116 / 120 parsed   (floor 110)
  - not-wf/sa     :  39 / 186 rejected (floor 30)

The four `valid/sa` misses are UTF-16 BOM files and tests that
use non-Latin tag names; both are out of scope for the current parser.
Many `not-wf/sa` tests hinge on character-level WF constraints our
structural parser doesn't enforce, hence the conservative floor.

Add a batch of XML 1.0 well-formedness constraints that the structural
matcher can enforce without DTD support. The custom matcher now also
owns text-token emission whenever the parser is inside an open
element (depth > 0) in both pure and embed modes, so the same
validation path applies to all character data.

New errors raised at lex time:

  comment_double_dash       -- inside a comment body
  cdata_terminator_in_text  ]]> in non-CDATA character data
  pi_target_invalid         <? ?> with missing/invalid target
  lt_in_attr_value          literal `<` in an attribute value
  bad_entity_ref            malformed `&...;` reference
                              (in text or attribute values)
  duplicate_attribute       same attribute name twice in one tag
  xml_invalid_tag           </> empty close tag

W3C xmltest/not-wf/sa rejection rate climbs from 39/186 to 54/186 with
no regression in valid/sa (116/120). All 87 TS tests and the Go suite
still pass.
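
A few of the lex-time checks listed above are simple enough to sketch in isolation. The error codes are the ones named in this commit; the function names and the "return a code or null" convention are assumptions for illustration, not the matcher's actual structure.

```typescript
// Each check returns one of the error codes listed above, or null if clean.
// Function names are hypothetical.
type LexErrorCode = string | null

function checkCommentBodySketch(body: string): LexErrorCode {
  // "--" may not appear in a comment, and the body may not end in "-"
  // (which would produce "--->").
  const bad = body.indexOf('--') >= 0 || body.charAt(body.length - 1) === '-'
  return bad ? 'comment_double_dash' : null
}

function checkCharDataSketch(text: string): LexErrorCode {
  return text.indexOf(']]>') >= 0 ? 'cdata_terminator_in_text' : null
}

function checkAttributesSketch(attrs: [string, string][]): LexErrorCode {
  const seen: Record<string, boolean> = Object.create(null)
  for (const [name, value] of attrs) {
    if (seen[name]) return 'duplicate_attribute'
    seen[name] = true
    if (value.indexOf('<') >= 0) return 'lt_in_attr_value'
  }
  return null
}
```

For instance, `checkAttributesSketch([['id', '1'], ['id', '2']])` returns `duplicate_attribute`.
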
XML 1.0 §2.2 only allows tab (#x09), LF (#x0A), CR (#x0D) and code
points >= #x20 (excluding surrogates, #xFFFE and #xFFFF) in document
content. Other C0 controls (form feed, ESC, ...) make a document not
well-formed.

Apply a `checkChars` validation in every place the matcher emits
text-like content: character data inside an open element, CDATA
section bodies, comment bodies, processing instruction bodies, and
attribute values. The new error code is `invalid_xml_char`.

W3C xmltest/not-wf/sa rejection rate moves from 54/186 to 58/186 with
no regression in valid/sa.
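
The character check can be sketched directly from the `Char` production in XML 1.0 §2.2. This is an illustration, not the plugin's `checkChars`: the names `isXmlCharSketch` and `firstInvalidXmlChar` are hypothetical, and returning an index rather than raising `invalid_xml_char` is a choice made only for this example.

```typescript
// Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isXmlCharSketch(cp: number): boolean {
  return cp === 0x9 || cp === 0xa || cp === 0xd ||
    (cp >= 0x20 && cp <= 0xd7ff) ||
    (cp >= 0xe000 && cp <= 0xfffd) ||
    (cp >= 0x10000 && cp <= 0x10ffff)
}

// Returns the index of the first disallowed character, or -1 if none.
function firstInvalidXmlChar(s: string): number {
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i)
    let cp = hi
    // Combine a valid surrogate pair into one code point.
    if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1)
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000
      }
    }
    if (!isXmlCharSketch(cp)) return i // lone surrogates land here too
    if (cp > 0xffff) i++ // skip the low half of the pair
  }
  return -1
}
```

A matcher could call such a helper on each text, CDATA, comment, PI and attribute-value span, and raise `invalid_xml_char` when the result is not -1.
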
XML 1.0 §2.1 requires exactly one root (document) element. The xml
rule previously allowed `r: xml` to skip trailing whitespace and then
re-attempt `{ p: element }`, which let documents with multiple
top-level elements parse with the last one winning.

Track a per-parse `ctx.u.rootSeen` flag, set in `@xml-bc` once the
root element's node has been hoisted. Add a new `@no-root-yet`
condition gating the `{ p: element }` alternate so subsequent attempts
fail with "unexpected" rather than silently producing a wrong tree.

W3C xmltest/not-wf/sa rejection rate moves from 58/186 to 60/186.

Replace the ASCII-only NameStartChar/NameChar regex with the full XML
1.0 Fifth Edition character set:

  NameStartChar = ':' | [A-Z] | '_' | [a-z] | the Unicode letter and
                  ideograph blocks listed in §2.3 [4]
  NameChar      = NameStartChar | '-' | '.' | [0-9] | #xB7 |
                  combining-mark blocks (§2.3 [4a])

Surrogate pairs in the JS implementation and multibyte UTF-8 sequences
in the Go implementation are now read as single characters via a
shared `readName` helper. The same predicates also gate entity-
reference name validation.

W3C xmltest/valid/sa pass count moves from 116/120 to 117/120 (the
Thai-named element test); the remaining three misses are UTF-16/
UTF-32 BOM files. Two new TSV cases (`tag-name-unicode-thai`,
`tag-name-unicode-greek`) lock in the behaviour for both runtimes.
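
The two name predicates can be written as range tables taken from XML 1.0 Fifth Edition §2.3, productions [4] and [4a]. The sketch below is an assumption about shape only: `readXmlNameSketch` is a hypothetical stand-in for the shared `readName` helper mentioned above, and it handles surrogate pairs via `codePointAt`.

```typescript
// Ranges from XML 1.0 (Fifth Edition) productions [4] and [4a].
const NAME_START_RANGES: [number, number][] = [
  [0x3a, 0x3a], [0x41, 0x5a], [0x5f, 0x5f], [0x61, 0x7a],
  [0xc0, 0xd6], [0xd8, 0xf6], [0xf8, 0x2ff], [0x370, 0x37d],
  [0x37f, 0x1fff], [0x200c, 0x200d], [0x2070, 0x218f],
  [0x2c00, 0x2fef], [0x3001, 0xd7ff], [0xf900, 0xfdcf],
  [0xfdf0, 0xfffd], [0x10000, 0xeffff],
]
// Additional characters allowed after the first: - . 0-9 #xB7 and two blocks.
const NAME_EXTRA_RANGES: [number, number][] = [
  [0x2d, 0x2e], [0x30, 0x39], [0xb7, 0xb7], [0x300, 0x36f], [0x203f, 0x2040],
]

function inRangesSketch(cp: number, ranges: [number, number][]): boolean {
  return ranges.some((r) => cp >= r[0] && cp <= r[1])
}
function isNameStartSketch(cp: number): boolean {
  return inRangesSketch(cp, NAME_START_RANGES)
}
function isNameCharSketch(cp: number): boolean {
  return isNameStartSketch(cp) || inRangesSketch(cp, NAME_EXTRA_RANGES)
}

// Reads a Name starting at index i; returns '' if none starts there.
function readXmlNameSketch(s: string, i: number): string {
  let j = i
  while (j < s.length) {
    const cp = s.codePointAt(j) as number
    if (!(j === i ? isNameStartSketch(cp) : isNameCharSketch(cp))) break
    j += cp > 0xffff ? 2 : 1 // astral code points occupy two UTF-16 units
  }
  return s.slice(i, j)
}
```

With these tables a Thai or Greek tag name is read as a whole, while a name starting with a digit or `-` yields the empty string.
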
The custom matcher now strips a UTF-8 BOM (raw bytes EF BB BF or the
single character U+FEFF) at sI=0 in both runtimes, so files saved with
a BOM parse cleanly without the caller having to massage the input.

For UTF-16 / UTF-32 encoded input the runtime can't sniff bytes from
inside an already-decoded JS / Go string. The plugin therefore exposes
a public `decodeBOM(src)` helper:

  - Go:  `xml.DecodeBOM(string)` — accepts a string of raw bytes,
          detects UTF-8/16/32 BOMs and transcodes to a UTF-8 Go
          string (with `unicode/utf16` for the UTF-16 paths).
  - TS:  `decodeBOM(src)` — accepts either a Node Buffer / Uint8Array
          or a "binary" JS string, detects UTF-8/16/32 BOMs, falls
          back to UTF-8 when no BOM is present, and returns a decoded
          Unicode string. A leading U+FEFF is stripped if the input
          is already a Unicode string.

The W3C conformance test runners pass file contents through this
helper before parsing, so the three UTF-16 documents in
xmltest/valid/sa now parse:

  valid/sa: 120 / 120 parsed successfully
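
The BOM detection step can be sketched on raw bytes as follows. This is not the plugin's `decodeBOM`: `sniffBomSketch` and `decodeUtf16Sketch` are hypothetical helpers, only the UTF-16 transcoding path is shown, and the UTF-8 and UTF-32 decoding paths are left out for brevity.

```typescript
type BomInfoSketch = {
  encoding: 'utf-8' | 'utf-16le' | 'utf-16be' | 'utf-32le' | 'utf-32be'
  bomLength: number // bytes to skip before the document content
}

function sniffBomSketch(b: Uint8Array): BomInfoSketch {
  // UTF-32LE is tested before UTF-16LE because both begin with FF FE.
  if (b.length >= 4 && b[0] === 0xff && b[1] === 0xfe && b[2] === 0 && b[3] === 0) {
    return { encoding: 'utf-32le', bomLength: 4 }
  }
  if (b.length >= 4 && b[0] === 0 && b[1] === 0 && b[2] === 0xfe && b[3] === 0xff) {
    return { encoding: 'utf-32be', bomLength: 4 }
  }
  if (b.length >= 3 && b[0] === 0xef && b[1] === 0xbb && b[2] === 0xbf) {
    return { encoding: 'utf-8', bomLength: 3 }
  }
  if (b.length >= 2 && b[0] === 0xff && b[1] === 0xfe) {
    return { encoding: 'utf-16le', bomLength: 2 }
  }
  if (b.length >= 2 && b[0] === 0xfe && b[1] === 0xff) {
    return { encoding: 'utf-16be', bomLength: 2 }
  }
  return { encoding: 'utf-8', bomLength: 0 } // no BOM: fall back to UTF-8
}

// Builds a JS string from UTF-16 code units; surrogate pairs pass through
// unchanged because JS strings are themselves UTF-16.
function decodeUtf16Sketch(b: Uint8Array, start: number, littleEndian: boolean): string {
  let out = ''
  for (let i = start; i + 1 < b.length; i += 2) {
    const unit = littleEndian ? b[i] | (b[i + 1] << 8) : (b[i] << 8) | b[i + 1]
    out += String.fromCharCode(unit)
  }
  return out
}
```

Testing the four-byte patterns first is safe for XML input because a UTF-16LE document cannot start with a NUL character after its BOM.
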
XML 1.0 §2.11 requires that any literal CR (#xD) or CR-LF (#xD #xA)
be replaced with a single LF (#xA) before parsing. §3.3.3 further
requires that for CDATA-typed attributes (the default in the absence
of a DTD) every TAB / LF / CR in the source be replaced with a single
SPACE before entity references are decoded.

Add `normaliseLineEndings` and `normaliseAttrWhitespace` helpers in
both runtimes and apply them in the matcher's text, CDATA and
attribute paths. Five new TSV cases (`text-crlf-normalised`,
`text-cr-normalised`, `attr-tab-normalised`, `attr-newline-normalised`,
`attr-crlf-normalised`) cover the new behaviour.
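
Both normalisation steps reduce to one-line replacements. The sketch below borrows the helper names from this commit with a `Sketch` suffix to mark them as illustrative; the real helpers may be structured differently.

```typescript
// §2.11: CR-LF and lone CR both become a single LF.
function normaliseLineEndingsSketch(s: string): string {
  return s.replace(/\r\n?/g, '\n')
}

// §3.3.3 for CDATA-typed attributes: each literal TAB, LF or CR becomes one
// SPACE. CR-LF counts once because line endings are normalised first.
// This runs on the raw value, before references are decoded, so a
// character reference such as &#10; is not turned into a space.
function normaliseAttrWhitespaceSketch(raw: string): string {
  return raw.replace(/\r\n|[\t\n\r]/g, ' ')
}
```

For example, the raw attribute value `a<TAB>b<CR><LF>c` normalises to `a b c`, while `a&#10;b` is left alone and later decodes to a value containing a real line feed.
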
XML 1.0 §2.10 (xml:space) and §2.12 (xml:lang) define two special
attributes that an element may use to signal whitespace handling and
language identification. Both inherit down through descendants,
just like xmlns declarations.

Fold xml:space and xml:lang propagation into the existing namespace
walk via a shared `xmlScope` value that carries the active prefix
map plus the current `space` and `lang`. Each element gains a
`space` field when the active xml:space value is not "default" (in
practice, "preserve"), and a `lang` field when it or any ancestor
specifies xml:lang.

A new test/spec/xmlspace-lang.tsv covers the inherited-and-overridden
cases for both runtimes.
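
The inheritance rule can be sketched as a recursive walk that passes a scope value down the tree. This is a simplified illustration: the plugin folds this into its namespace walk via `xmlScope`, whereas `ScopedElementSketch`, `InheritScopeSketch` and `applyXmlScopeSketch` are hypothetical names and the prefix map is omitted.

```typescript
type ScopedElementSketch = {
  name: string
  attributes: Record<string, string>
  children: (string | ScopedElementSketch)[]
  space?: string
  lang?: string
}
type InheritScopeSketch = { space: string; lang: string | undefined }

function applyXmlScopeSketch(
  el: ScopedElementSketch,
  parent: InheritScopeSketch = { space: 'default', lang: undefined }
): void {
  const has = (k: string) => Object.prototype.hasOwnProperty.call(el.attributes, k)
  // An element's own attribute overrides whatever it inherited.
  const scope: InheritScopeSketch = {
    space: has('xml:space') ? el.attributes['xml:space'] : parent.space,
    lang: has('xml:lang') ? el.attributes['xml:lang'] : parent.lang,
  }
  if (scope.space !== 'default') el.space = scope.space
  if (scope.lang !== undefined) el.lang = scope.lang
  for (const child of el.children) {
    if (typeof child !== 'string') applyXmlScopeSketch(child, scope)
  }
}
```

A child of an element with `xml:space="preserve"` thus gets `space: 'preserve'` unless it declares `xml:space="default"` itself, in which case no `space` field is set.
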
Per Namespaces in XML 1.0 §3 (constraint "Reserved Prefixes and Namespace Names") and §5 (constraint "Prefix Declared"):

  - The "xml" prefix is fixed to http://www.w3.org/XML/1998/namespace.
    It MAY be redeclared but only to that exact URI; redeclaring it
    to any other namespace name is an error.
  - The "xmlns" prefix is fixed to http://www.w3.org/2000/xmlns/ and
    MUST NOT be declared.
  - Neither URI may be bound to any other prefix or used as the
    default namespace.
  - A prefixed element or attribute name that has no in-scope binding
    is an error (xmlns and xml are implicitly bound).

The namespace resolver now pre-binds the xml prefix, validates each
xmlns/xmlns:* declaration, and walks the tree checking that every
prefixed element and attribute has a binding. Errors short-circuit
resolution and surface as parse errors via two new codes:

  reserved_namespace
  unbound_prefix

Six new TSV cases under test/spec/errors.tsv exercise the
reserved-prefix and unbound-prefix paths; two positive cases under
test/spec/namespaces.tsv lock in the implicit `xml:` binding and the
correct explicit `xmlns:xml=` redeclaration.
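
The reserved-prefix and unbound-prefix rules translate into two small predicates. The error codes below are the ones introduced by this commit; the function names and the "code or null" return convention are assumptions for illustration.

```typescript
const XML_NS_URI = 'http://www.w3.org/XML/1998/namespace'
const XMLNS_NS_URI = 'http://www.w3.org/2000/xmlns/'

// prefix is '' for a default-namespace declaration (xmlns="...").
function checkNsDeclSketch(prefix: string, uri: string): string | null {
  if (prefix === 'xmlns') return 'reserved_namespace' // must not be declared
  if (prefix === 'xml') return uri === XML_NS_URI ? null : 'reserved_namespace'
  // Neither reserved URI may be bound to another prefix or used as default.
  if (uri === XML_NS_URI || uri === XMLNS_NS_URI) return 'reserved_namespace'
  return null
}

// bindings maps in-scope prefixes to namespace URIs.
function checkPrefixBoundSketch(
  qname: string,
  bindings: Record<string, string>
): string | null {
  const colon = qname.indexOf(':')
  if (colon < 0) return null // unprefixed names need no binding
  const prefix = qname.slice(0, colon)
  if (prefix === 'xml' || prefix === 'xmlns') return null // implicitly bound
  return Object.prototype.hasOwnProperty.call(bindings, prefix)
    ? null
    : 'unbound_prefix'
}
```

For example, `checkNsDeclSketch('xml', 'urn:other')` returns `reserved_namespace`, and `checkPrefixBoundSketch('p:a', {})` returns `unbound_prefix`.
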
Extract `<!ENTITY name "value">` general internal entity declarations
from the DOCTYPE internal subset and use them when resolving entity
references in text and attribute values. Parameter entity
declarations (`<!ENTITY % name ...>`) and external entity
declarations (`<!ENTITY name SYSTEM "...">` / `PUBLIC "..."`) are
recognised but skipped — we don't fetch external resources.

Implementation:

  - The DOCTYPE matcher path now records the byte range of the `[ ]`
    internal subset and runs `parseDoctypeEntities` over it. The
    extracted map is stored on the per-parse context
    (`ctx.u.dtdEntities` in TS, `ctx.U["dtdEntities"]` in Go).
  - The entity decoder is now a closure that takes an optional `dtd`
    map. The matcher's text and attribute paths look the map up
    via `lex.ctx` and pass it through. The five predefined
    entities and any plugin-time `customEntities` always take
    precedence, matching the XML 1.0 rule that the predefined
    entities are always available.
  - Recursive entity expansion is supported, with cycle detection
    via a `seen` set: a cyclic reference breaks the cycle (the
    original `&name;` is left in place) instead of looping.

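Entity values are stored verbatim. Character and entity references
inside an entity value are expanded only when the outer entity is
referenced. For general entity references this matches the XML 1.0
§4.4.7 "Bypassed" treatment; the spec (§4.5) replaces character
references when the declaration is read, so deferring those is a
simplification.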

A new test/spec/dtd-entities.tsv covers the basic, recursive, and
edge cases (single-quoted values, parameter-entity skip, external
entity skip, predefined-entity precedence) for both runtimes. All
118 TS tests and the Go suite pass.
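
The extraction and cycle-safe expansion described above can be sketched in two small functions. This is not `parseDoctypeEntities` or the plugin's decoder: the names are hypothetical, a regex stands in for the real scanner, and predefined and numeric references are deliberately left for the ordinary decoder.

```typescript
// Collects <!ENTITY name "value"> declarations from an internal subset.
// Parameter entities (%) and SYSTEM/PUBLIC declarations do not match the
// pattern and are therefore skipped.
function parseEntityDeclsSketch(subset: string): Record<string, string> {
  const out: Record<string, string> = Object.create(null)
  const re = /<!ENTITY\s+([A-Za-z_][\w.-]*)\s+(?:"([^"]*)"|'([^']*)')\s*>/g
  let m: RegExpExecArray | null
  while ((m = re.exec(subset)) !== null) {
    const value = m[2] !== undefined ? m[2] : m[3]
    // XML 1.0 §4.2: the first declaration of a name is the binding one.
    if (!(m[1] in out)) out[m[1]] = value
  }
  return out
}

// Expands &name; recursively; `seen` holds the names currently being
// expanded, so a cyclic reference is left in place instead of looping.
function expandDtdEntitySketch(
  name: string,
  dtd: Record<string, string>,
  seen: string[] = []
): string {
  const raw = '&' + name + ';'
  const known = Object.prototype.hasOwnProperty.call(dtd, name)
  if (!known || seen.indexOf(name) >= 0) return raw
  const inner = seen.concat(name)
  return dtd[name].replace(
    /&([A-Za-z_][\w.-]*);/g,
    (_whole: string, ref: string) => expandDtdEntitySketch(ref, dtd, inner)
  )
}
```

Given declarations `x` = `x&y;` and `y` = `y&x;`, expanding `x` produces `xy&x;`: the inner `&x;` is left as written once the cycle is detected.
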
XML 1.0 §4.1 requires every named entity reference to resolve to a
declared entity (predefined, custom, or DOCTYPE-declared). Add a new
`strictEntities` option (default `true`) that enforces this in
`checkEntityRefs`. When set to `false`, references to unknown names
pass through unexpanded (legacy behaviour useful for templating).

While testing the new check, the DOCTYPE depth tracker was found to
treat `]` and `>` characters inside quoted entity values as if they
ended the internal subset. A declaration like `<!ENTITY rsqb "]">`
therefore cut the subset short, and any subsequent `&rsqb;` reference
reached the validator as undeclared. The tracker now skips over
single- and double-quoted strings while walking the DOCTYPE, restoring
the W3C valid/sa pass count to 120/120.

Conformance changes:
  - valid/sa     : 120/120 (unchanged)
  - not-wf/sa    :  60/186 -> 64/186 (+4 strict-entity catches)

The legacy "unknown-passthrough" test was renamed to
"unknown-rejected" with a new "unknown-passthrough-lenient" variant
that opts in via `{strictEntities: false}`.
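
The quote-skipping fix can be illustrated with a small scanner that finds the end of a DOCTYPE declaration. It is a sketch under assumptions, not the plugin's tracker: `findDoctypeEndSketch` is a hypothetical name, and comments inside the internal subset (which may contain stray quotes) are not handled here.

```typescript
// Returns the index just past the '>' that closes the DOCTYPE beginning at
// `start`, or -1 if it never closes. Brackets and '>' inside quoted
// strings are ignored.
function findDoctypeEndSketch(src: string, start: number): number {
  let depth = 0 // nesting of the [ ... ] internal subset
  let quote = '' // the active quote character, or '' when outside a string
  for (let i = start; i < src.length; i++) {
    const c = src.charAt(i)
    if (quote !== '') {
      if (c === quote) quote = ''
      continue
    }
    if (c === '"' || c === "'") quote = c
    else if (c === '[') depth++
    else if (c === ']') depth--
    else if (c === '>' && depth === 0) return i + 1
  }
  return -1
}
```

With this scanner, the `]` in `<!ENTITY rsqb "]">` no longer closes the subset early, so the declaration is still visible when `&rsqb;` is validated.
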
Parse `<!ATTLIST element attr type defaultDecl>` declarations from
the DOCTYPE internal subset and use them to fill in attributes that
are missing from element instances. Both literal defaults and the
`#FIXED "value"` form are honoured; `#REQUIRED` and `#IMPLIED`
declarations contribute nothing because they have no default value.

Implementation:

  - `parseDoctypeAttlists` scans for each `<!ATTLIST>` declaration,
    skips the AttType (a bare uppercase identifier, an enumeration
    `( ... )`, or `NOTATION ( ... )`), and collects the default
    value. The result is keyed by element name then attribute
    name and stored on the per-parse context as
    `dtdAttrDefaults`.
  - The `@element-open` and `@element-selfclose` actions consult
    that map via `applyAttrDefaults` and merge in any defaults
    that the parsed element does not already provide.

A new test/spec/dtd-attlist.tsv exercises basic defaults, override
by an instance attribute, multiple declarations on one element,
`#FIXED`, enumeration types, the no-default `#REQUIRED`/`#IMPLIED`
forms, and per-element scoping. All 126 TS tests and the Go suite
pass; W3C conformance numbers are unchanged.
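
The merge step is small enough to show in full. The sketch below assumes the defaults map is keyed by element name and then attribute name, as described above; `AttrDefaultsSketch` and `applyAttrDefaultsSketch` are hypothetical names rather than the plugin's `applyAttrDefaults`.

```typescript
// element name -> attribute name -> default value
type AttrDefaultsSketch = Record<string, Record<string, string>>

function applyAttrDefaultsSketch(
  elementName: string,
  attrs: Record<string, string>,
  defaults: AttrDefaultsSketch
): Record<string, string> {
  if (!Object.prototype.hasOwnProperty.call(defaults, elementName)) {
    return attrs // no ATTLIST defaults declared for this element
  }
  const forElement = defaults[elementName]
  const merged: Record<string, string> = {}
  for (const k of Object.keys(forElement)) merged[k] = forElement[k]
  // Attributes present on the instance win over declared defaults.
  for (const k of Object.keys(attrs)) merged[k] = attrs[k]
  return merged
}
```

Given a default of `kind="plain"` and `lang="en"` for `item`, an instance written as `<item kind="bold"/>` ends up with `kind="bold"` and `lang="en"`, while other elements are unaffected.
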
@rjrodger merged commit d34ff56 into main on Apr 19, 2026
7 of 9 checks passed